[APACHE DOCUMENTATION]

Cross Site Scripting Info: Encoding Examples

Introduction:

We trust you are already familiar with the Cross Site Scripting security problem and the concept behind how it works. If not, see the CERT Advisory CA-2000-02 that has been released on this issue for details before continuing.

This document focuses on how you can safely encode data before it is output to the client. The main method of doing this is through entity encoding, as described in the CERT advisory, using entities such as "&lt;".
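
As a rough illustration of what entity encoding does, here is a minimal Python sketch. The helper name entity_encode is hypothetical and is not taken from the CERT advisory; it covers only the three characters that are special in ordinary body text.

```python
def entity_encode(text: str) -> str:
    """Replace the characters that are special in HTML body text with entities."""
    # "&" has to be handled first; otherwise the "&" that starts "&lt;"
    # would itself be encoded again on a later pass.
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )


print(entity_encode("foo<b>bar"))  # foo&lt;b&gt;bar
```

Note that this sketch leaves quotation marks untouched, so it is only suitable for text placed outside attribute values.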

General Comments on Encoding:

Note that, in general, many functions that perform entity encoding do so in a way that is suitable only for use outside attribute values, in normal block level elements such as a paragraph of text. Many of the functions referenced below are in this category. This means they may not encode characters such as the double or single quote. If you don't use quotation marks around an attribute value supplied from user input, then you need to encode even more characters, because an unquoted value can be ended by whitespace as well. Always use quotes and you won't have to worry about that particular issue.
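
The difference between the two kinds of encoding function can be seen with Python's standard html.escape(), whose quote parameter controls whether quotation marks are encoded. This is only a sketch; the INPUT tag and the sample value are made up for illustration.

```python
from html import escape

# A value an attacker might supply for an attribute (illustrative only).
user_value = '" onmouseover="alert(1)'

# Encodes only &, < and >: fine for body text, not for attribute values.
body_only = escape(user_value, quote=False)

# Also encodes double and single quotes, so the value stays inside the quotes.
attr_safe = escape(user_value, quote=True)

print('<INPUT VALUE="%s">' % body_only)  # the quote ends the attribute early
print('<INPUT VALUE="%s">' % attr_safe)  # the quote is output as &quot;
```

In the first line of output the user-supplied quote closes the VALUE attribute and the rest is parsed as a new attribute; in the second it remains part of the value.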

Unfortunately, the situation for encoding data within attribute values or within the body of scripts (e.g. within "<SCRIPT>" tags) is more complex and less well understood. If you are in this situation, you may be wise to consider filtering special characters (as described in the CERT Tech Tip) instead of encoding them. In general, however, encoding is recommended, because it does not require you to decide which characters could legitimately be entered and need to be passed through, and because it has less of an impact on existing functionality.
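
Filtering, as opposed to encoding, might look something like the following Python sketch. The helper name filter_special and the set of permitted characters are assumptions made for this example; choosing that set is exactly the decision that encoding lets you avoid.

```python
import re

# Assumption: only letters, digits, spaces and a little punctuation are
# legitimate in this field. Everything else is removed.
_DISALLOWED = re.compile(r"[^A-Za-z0-9 .,_-]")


def filter_special(text: str) -> str:
    """Remove every character that is not on the allow-list."""
    return _DISALLOWED.sub("", text)


print(filter_special("foo<b>bar"))  # foobbar
```

The output shows the cost of filtering: the angle brackets are gone, but so is any legitimate use the user may have had for them.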

Safely encoding data within attribute values is difficult because some characters that are not considered special characters can be arranged to have unexpected effects in certain attribute values. This is very specific to the tag the attribute is associated with and to how the client interprets it. For example, if you let the user enter the value for an HREF attribute, and you encode it properly, you could end up outputting a tag such as:

<A HREF="javascript:document.writeln(document.cookie + &quot;&lt;BR&gt;&quot;)">

Even though you have properly encoded special characters, many popular browsers will interpret a "javascript:" URL as containing JavaScript to execute in the context of the current document.
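
One way to guard against this is to check the scheme of a user-supplied URL in addition to encoding it. The following Python sketch assumes a hypothetical helper named is_safe_href and an allow-list of schemes chosen purely for illustration; it is not a complete defence, and the right list depends on your site.

```python
from urllib.parse import urlsplit

# Assumption: the only schemes this site wants to link to.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_safe_href(url: str) -> bool:
    """Return True if the URL is relative or uses an allowed scheme."""
    # Remove whitespace and control characters first, since some browsers
    # ignore them inside a scheme name (e.g. "java<TAB>script:").
    cleaned = "".join(ch for ch in url if ch > " " and ch != "\x7f")
    scheme = urlsplit(cleaned).scheme.lower()
    return scheme == "" or scheme in ALLOWED_SCHEMES


print(is_safe_href("javascript:document.writeln(document.cookie)"))  # False
print(is_safe_href("http://www.apache.org/"))                        # True
```

A value that passes this check still needs to be encoded before it is written into the HREF attribute.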

One of the issues that is still unresolved is exactly what HTML tags are "safe" to allow through, and what the algorithm for doing so is like. Many sites wish to allow users to enter a limited subset of "safe" HTML. This is still very much an open issue. It has been an issue for quite some time, and it is our hope that this Cross Site Scripting problem will help prompt more work into addressing it.

If you are encoding user entered data in a URL, then URL encoding (also known as percent encoding) is appropriate. Unfortunately, this can be a complex thing to get right because the special characters in "http://", for example, must remain unencoded because they are part of the syntax of the URL. Better solutions to deal with this are necessary.

Also note that some URL encoding functions encode a space into a "+" for historical reasons. This will only work in the query string for CGIs, and will not properly encode a space in other parts of the URL.
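
Both points can be sketched with Python's standard urllib.parse functions. The base URL below is hypothetical. Only the user-supplied file name is encoded, so the "http://" syntax is left alone.

```python
from urllib.parse import quote, quote_plus

base = "http://www.example.com/docs/"  # hypothetical; part of the URL syntax
file_name = "foo<b>bar.html"           # user-supplied; must be encoded

# Encode only the user-supplied component. safe="" means "/" is encoded too,
# so the value cannot add extra path segments.
link = base + quote(file_name, safe="")
print(link)  # http://www.example.com/docs/foo%3Cb%3Ebar.html

# quote() encodes a space as %20, which is valid anywhere in a URL.
print(quote("my file.html", safe=""))  # my%20file.html

# quote_plus() encodes a space as "+", which is only meaningful in a
# query string.
print(quote_plus("my file.html"))      # my+file.html
```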

We realize that all these special situations, together with the lack of a single bulletproof set of steps for encoding user data wherever it may occur on the page, make the task of fixing this problem quite challenging in some cases. We wish we had a better answer, and are working on filling in the fuzzy areas.

PHP Example:

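<?php
$Text = "foo<b>bar";
$URL = "foo<b>bar.html";
// Entity-encode text that is output in the body of the page.
echo htmlspecialchars($Text), "<BR>";
// Percent-encode the file name before placing it in the HREF attribute.
echo "<A HREF=\"", rawurlencode($URL), "\">link</A>";
?>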

Note that PHP also has a strip_tags() function that will remove all HTML tags from a string. Using this function in a manner such as:

	echo strip_tags($Text);

will strip all HTML from the input. However, if you use it in the form:

	echo strip_tags($Text, "<B>");

which only allows the "<B>" tag through, you are still often vulnerable to users inserting script code. By design, this function does not strip attributes from the tags. This means it is often possible to include things such as JavaScript event attributes. An example of a tag that would be allowed by the above strip_tags() call is:

	<B onmouseover="document.location='http://www.cert.org/'">

Some clients accept such attributes on tags that are otherwise benign.
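
To show what dropping the attributes as well would involve, here is a minimal Python sketch built on the standard html.parser module. It is not how strip_tags() works, and the class and function names are hypothetical. It also does not settle the open question of which tags are safe. It only demonstrates re-emitting an allowed tag without any of its attributes and entity-encoding everything else.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b"}  # assumption: only <B> may pass through


class AllowBoldOnly(HTMLParser):
    """Rebuild the input, keeping allowed tags but none of their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append("<%s>" % tag)  # attrs are deliberately discarded

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))  # text is entity-encoded again


def keep_bold_only(text: str) -> str:
    parser = AllowBoldOnly()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


tag = "<B onmouseover=\"document.location='http://www.cert.org/'\">hi</B>"
print(keep_bold_only(tag))  # <b>hi</b>
```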

Apache Module Example:

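/* r is the request_rec * for the current request */
const char *Text = "foo<b>bar";
const char *URL = "foo<b>bar.html";
/* Entity-encode text that is output in the body of the page */
ap_rvputs(r, ap_escape_html(r->pool, Text), "<BR>", NULL);
/* Percent-encode the file name before placing it in the HREF attribute */
ap_rvputs(r, "<A HREF=\"", ap_escape_uri(r->pool, URL), "\">link</A>", NULL);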

mod_perl Example:

use Apache::Util ();
# $r is the Apache request object
$Text = "foo<b>bar";
$URL = "foo<b>bar.html";
$r->print(Apache::Util::escape_html($Text), "<BR>");
$r->print("<A HREF=\"", Apache::Util::escape_uri($URL), "\">link</A>");

This uses the same functions as in the Apache Module Example, called from Perl instead of directly from C.

Perl Example:

use CGI ();
$Text = "foo<b>bar";
$URL = "foo<b>bar.html";
print CGI::escapeHTML($Text), "<BR>";
print qq(<A HREF="), CGI::escape($URL), qq(">link</A>);

Note that if you use the CGI.pm module in its full intended role, instead of just using helper functions from it, it will automatically encode special characters in many places. Unfortunately, this is yet again likely not sufficient in all situations. See the documentation at http://stein.cshl.org/WWW/software/CGI/ for more details on what this module can do.