Canonicalization
A difficulty with input
validation and output encoding is ensuring that the data being
evaluated or transformed is in the form in which it will ultimately be
interpreted by whatever consumes that input. A common technique for evading
input validation and output encoding controls is to encode the input
before it is sent to the application in such a way that it is then
decoded and interpreted to suit the attacker's aims. For example, Table 1 lists alternative ways to encode the single-quote character.
Table 1. Example Single-Quote Representations
Representation | Type of encoding
---|---
%27 | URL encoding
%2527 | Double URL encoding
%%317 | Nested double URL encoding
%u0027 | Unicode representation
%u02b9 | Unicode representation
%ca%b9 | Unicode representation
&apos; | HTML entity
&#39; | Decimal HTML entity
&#x27; | Hexadecimal HTML entity
%26apos; | Mixed URL/HTML encoding
In some cases, these are alternative encodings of the character (%27
is the URL-encoded representation of the single quote), and in other
cases these are double-encoded on the assumption that the data will be
explicitly decoded by the application (%2527, when URL-decoded, becomes %27, as shown in Table 1, as will %%317),
or are various Unicode representations, either valid or invalid. Not
all of these representations will be interpreted as a single quote
normally; in most cases, they will rely on certain conditions being in
place (such as decoding at the application, application server, WAF, or
Web server level), and therefore it will be very difficult to predict
whether your application will interpret them this way.
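To see how one of these layered encodings unwinds, the following sketch (a minimal example, assuming UTF-8 throughout and using java.net.URLDecoder) decodes the double-encoded value %2527 twice:

import java.net.URLDecoder;

public class DoubleDecodeDemo {
    public static void main(String[] args) throws Exception {
        String submitted = "%2527";                           // double URL-encoded single quote
        String once = URLDecoder.decode(submitted, "UTF-8");  // first pass yields "%27"
        String twice = URLDecoder.decode(once, "UTF-8");      // second pass yields "'"
        System.out.println(once + " -> " + twice);
    }
}

A component that decodes only once sees the harmless-looking %27, while a later component that decodes again receives the live single quote.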
For these reasons, it is
important to consider canonicalization as part of your input validation
approach. Canonicalization is the process of reducing input to a
standard or simple form. For the single-quote examples in Table 1, this would normally be a single-quote character (').
Canonicalization Approaches
So, what alternatives
for handling unusual input should you consider? One method, which is
often the easiest to implement, is to reject all input that is not
already in a canonical format. For example, you can reject all HTML- and
URL-encoded input outright. This is one of
the most reliable methods in situations where you are not expecting
encoded input. This is also the approach that is often adopted by
default when you do whitelist input validation, as you may not accept
unusual forms of characters when validating for known good input. At the
very least, this could involve not accepting the characters used to
encode data (such as %, &, and # from the examples in Table 1), and therefore not allowing these characters to be input.
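As a minimal sketch of this approach (the character list and method name here are illustrative, not from the original text), you could simply refuse any value containing the characters that introduce the encodings in Table 1:

public class EncodedInputFilter {

    // Characters that introduce the encoded forms listed in Table 1.
    private static final String ENCODING_CHARS = "%&#";

    // Returns true if the input contains any character that could begin a
    // URL- or HTML-encoded sequence, so the caller can reject it outright.
    public static boolean containsEncodingCharacters(String input) {
        for (char c : input.toCharArray()) {
            if (ENCODING_CHARS.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }
}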
If
rejecting input that can contain encoded forms is not possible, you
need to look at ways to decode or otherwise make safe the input that you
receive. This may include several decoding steps, such as URL decoding
and HTML decoding, potentially repeated several times. This approach can
be error-prone, however, as you will need to perform a check after each
decoding step to determine whether the input still contains encoded
data. A more realistic approach may be to decode the input once, and
then reject the data if it still contains encoded characters. This
approach assumes that genuine input will not contain double-encoded
values, which should be a valid assumption in most cases.
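A sketch of the decode-once-then-reject approach might look like the following, assuming URL encoding only and reusing the same encoded-character check described above (the method names are illustrative):

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class SingleDecodeValidator {

    // Decode the input exactly once; if characters that introduce an
    // encoded form are still present, treat the value as double-encoded
    // and reject it.
    public static String decodeOnceOrReject(String input)
            throws UnsupportedEncodingException {
        String decoded = URLDecoder.decode(input, "UTF-8");
        if (containsEncodingCharacters(decoded)) {
            throw new IllegalArgumentException(
                "Input still contains encoded data after one decoding pass");
        }
        return decoded;
    }

    private static boolean containsEncodingCharacters(String value) {
        return value.indexOf('%') >= 0
            || value.indexOf('&') >= 0
            || value.indexOf('#') >= 0;
    }
}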
Working with Unicode
When working with
Unicode input such as UTF-8, one approach is normalization of the input.
This converts the Unicode input into its simplest form, following a
defined set of rules. Unicode normalization differs from
canonicalization in that there may be multiple normal forms of a Unicode
character according to which set of rules is followed. The recommended
form of normalization for input validation purposes is NFKC
(Normalization Form KC – Compatibility Decomposition followed by
Canonical Composition). You can find more information on normalization
forms at www.unicode.org/reports/tr15.
The
normalization process will decompose the Unicode character into its
representative components, and then reassemble the character in its
simplest form. In most cases, it will transform double-width and other
Unicode encodings into their ASCII equivalents, where they exist.
You can normalize input in Java with the Normalizer class (since Java 6) as follows:
String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
You can normalize input in C# with the Normalize method of the String class as follows:
string normalized = input.Normalize(NormalizationForm.FormKC);
You can normalize input in PHP with the PEAR::I18N_UnicodeNormalizer package from the PEAR repository, as follows:
$normalized = I18N_UnicodeNormalizer::toNFKC($input, 'UTF-8');
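As a quick illustration of the effect (a minimal sketch using the Java Normalizer call shown above), NFKC reduces compatibility characters such as the full-width apostrophe (U+FF07) to the plain ASCII single quote (U+0027), which your validation rules can then recognize:

import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        String input = "\uFF07";   // FULLWIDTH APOSTROPHE
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
        // Prints "0027": the full-width form has been reduced to a plain single quote
        System.out.printf("%04x%n", (int) normalized.charAt(0));
    }
}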
Another approach is to
first check that the Unicode is valid (and is not an invalid
representation), and then to convert the data into a predictable
format—for example, a Western European character set such as ISO-8859-1.
The input would then be used in that format within the application from
that point on. This is a deliberately lossy approach, as Unicode
characters that cannot be represented in the character set converted to
will normally be lost. However, for the purposes of making input
validation decisions, it can be useful in situations where the
application is not localized into languages outside Western Europe.
You can check the validity of UTF-8 encoded input by applying the set of regular expressions shown in Table 2. The input is valid UTF-8 only if every byte sequence in it matches one of these forms; if any sequence does not match, the input is not valid UTF-8 and
should be rejected. For other types of Unicode encoding, you should consult the
documentation for the framework you are using to determine whether
functionality is available for testing the validity of input.
Table 2. UTF-8 Parsing Regular Expressions
Regular expression | Description
---|---
[\x00-\x7F] | ASCII
[\xC2-\xDF][\x80-\xBF] | Two-byte representation
\xE0[\xA0-\xBF][\x80-\xBF] | Three-byte representation (excluding overlong forms)
[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | Three-byte representation
\xED[\x80-\x9F][\x80-\xBF] | Three-byte representation (excluding surrogate code points)
\xF0[\x90-\xBF][\x80-\xBF]{2} | Four-byte representation (planes 1 through 3)
[\xF1-\xF3][\x80-\xBF]{3} | Four-byte representation (planes 4 through 15)
\xF4[\x80-\x8F][\x80-\xBF]{2} | Four-byte representation (plane 16)
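As one way to apply Table 2 in practice (a sketch, not from the original text), you can decode the raw bytes as ISO-8859-1 so that each byte maps to exactly one character, and then require the whole sequence to be a run of matches against the table's expressions:

import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class Utf8Validator {

    // One alternative per row of Table 2; a valid UTF-8 byte stream is
    // simply a (possibly empty) run of these sequences.
    private static final Pattern UTF8_PATTERN = Pattern.compile(
        "(?:"
        + "[\\x00-\\x7F]"                            // ASCII
        + "|[\\xC2-\\xDF][\\x80-\\xBF]"              // two-byte representation
        + "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"         // three-byte representation
        + "|[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}" // three-byte representation
        + "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"         // three-byte representation
        + "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"      // planes 1 through 3
        + "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"           // planes 4 through 15
        + "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}"      // plane 16
        + ")*");

    // Decode the raw bytes as ISO-8859-1 so that each byte maps to exactly
    // one char, then require the whole sequence to match the pattern.
    public static boolean isValidUtf8(byte[] rawBytes) {
        String asChars = new String(rawBytes, StandardCharsets.ISO_8859_1);
        return UTF8_PATTERN.matcher(asChars).matches();
    }
}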
Now that you have
checked that the input is validly formed, you can convert it to a
predictable format—for example, converting a Unicode UTF-8 string to
another character set such as ISO-8859-1 (Latin 1).
In Java, you can use the CharsetEncoder class, or the simpler String method getBytes() as follows:
byte[] ascii = utf8.getBytes("ISO-8859-1");
In C#, you can use the Encoding.Convert method as follows:
ASCIIEncoding ascii = new ASCIIEncoding();
UTF8Encoding utf8 = new UTF8Encoding();
byte[] utf8Bytes = utf8.GetBytes(input);  // obtain the raw UTF-8 bytes of the input string
byte[] asciiBytes = Encoding.Convert(utf8, ascii, utf8Bytes);
In PHP, you can do this with utf8_decode as follows:
$ascii = utf8_decode($utf8string);
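Note that all three conversions above are silently lossy. If you would rather detect characters that cannot be represented in the target character set, the Java CharsetEncoder class mentioned earlier can be configured to report them instead of substituting or dropping them. A minimal sketch, assuming the input is already a decoded Java String:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class StrictLatin1Converter {

    // Convert a string to ISO-8859-1, throwing CharacterCodingException
    // rather than silently substituting characters that have no Latin-1
    // representation.
    public static byte[] toLatin1(String input) throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName("ISO-8859-1").newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(input));
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }
}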