Escapes

UnicodeChecker provides escaping and unescaping functions for various formats. You can use either the corresponding “String Utility” (from the “File” menu) or AppleScript commands for (un-)escaping.

The default set of characters that will not be escaped for each format is specified in the following list of formats. The AppleScript commands allow you to specify exactly which characters should be preserved when escaping a string.

Note that UnicodeChecker only (un-)escapes Unicode escapes; it will not touch other escapes like \" etc. Also note that unescaping and then re-escaping a string will not always reproduce the original string, because the specifications often allow more than one spelling of the same character. However, escaping and then unescaping should always reproduce the original string. There are quite a lot of different escape formats for different programming languages, etc.

If you require a specific format not supported by UnicodeChecker yet, please send an e-mail.

Some details on the different supported formats:

CSS 1

The CSS 1 Recommendation states that a backslash followed by at most four hexadecimal digits (0–9, A–F) stands for the Unicode character with that number. This is also expressed by the CSS 1 grammar. This means that codepoints outside the Basic Multilingual Plane (i.e. greater than 0xFFFF) must be represented using the corresponding surrogate pair.
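The four-digit limit and the resulting surrogate pairs can be sketched in Python as follows. This is only an illustration, not UnicodeChecker's implementation: the function name `css1_escape` and the `keep` parameter are hypothetical, and always emitting exactly four digits is an assumption made here so that a following hexadecimal digit cannot be absorbed into the escape.

```python
def css1_escape(text, keep=lambda cp: cp <= 0x7F):
    """Escape text CSS 1 style: a backslash plus four hex digits per UTF-16 code unit.

    `keep` decides which codepoints stay unescaped (assumed default: U+0000..007F).
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if keep(cp):
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\%04X" % cp)
        else:
            # Codepoints above U+FFFF do not fit into four digits,
            # so they are split into a UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            out.append("\\%04X\\%04X" % (high, low))
    return "".join(out)


print(css1_escape("10\u6502"))       # 10\6502
print(css1_escape("\U0001D11E"))     # \D834\DD1E
```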

Default set of characters that will not be escaped: U+0000..007F.

Examples:

  • H\65llo → Hello
  • 10\6502 → 10攂
  • 10\006502 → 10e02

CSS 2

The CSS 2 Recommendation extends Unicode escapes to at most six hexadecimal characters (grammar) so higher plane codepoints can be represented using a single escape. For simplicity UnicodeChecker will always use all 6 digits when escaping.

The first whitespace character after a Unicode escape is ignored according to CSS 2. UnicodeChecker therefore inserts a space character after each escaped character in order to preserve whitespace from the original string (although it might not always be strictly necessary). CSS 2 considers ' ', '\t', '\n', '\f' and '\r' as whitespace (they seemingly forgot the combination '\r\n' and added it in CSS 2.1).
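A minimal Python sketch of these two rules (six digits plus a trailing space when escaping, one optional whitespace character swallowed when unescaping) might look like this. The names `css2_escape` and `css2_unescape` are hypothetical, and leaving escapes above U+10FFFF untouched is an assumption of this sketch, not documented behaviour.

```python
import re


def css2_escape(text, keep=lambda cp: cp <= 0x7F):
    """Escape with six hex digits and a trailing space after every escape."""
    return "".join(
        ch if keep(ord(ch)) else "\\%06X " % ord(ch)
        for ch in text
    )


# One to six hex digits, then at most one CSS 2 whitespace character.
_CSS2_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\n\f\r]?")


def _decode(match):
    cp = int(match.group(1), 16)
    # Assumption: values beyond the Unicode range are left as they are.
    return chr(cp) if cp <= 0x10FFFF else match.group(0)


def css2_unescape(text):
    return _CSS2_ESCAPE.sub(_decode, text)


escaped = css2_escape("\u00e9 b")
print(repr(escaped))                  # '\\0000E9  b' (escape, separator space, original space)
print(css2_unescape(escaped))         # é b
print(css2_unescape("\\41  BC"))      # A BC
```

The trailing space is what keeps the original space in "é b" alive: without it, the unescaper would swallow the only space that follows the escape.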

Default set of characters that will not be escaped: U+0000..007F.

Examples:

  • \000041BC → ABC
  • \000041␣BC → ABC
  • \41BC → U+41BC (all four digits are read as a single escape)
  • \41 BC → ABC
  • \41  BC → A BC

CSS 2.1

The only difference from CSS 2 is that '\r\n' is also considered a single whitespace after a Unicode escape (Recommendation, grammar).
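The effect of this one change can be illustrated by comparing two regular expressions in Python. Both patterns and the helper `unescape` are hypothetical names chosen for this sketch and only model the whitespace rule described here.

```python
import re

# CSS 2: a single whitespace character may follow the escape.
CSS2 = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\n\f\r]?")
# CSS 2.1: '\r\n' is additionally accepted as one whitespace unit.
CSS21 = re.compile(r"\\([0-9A-Fa-f]{1,6})(?:\r\n|[ \t\n\f\r])?")


def unescape(pattern, text):
    def decode(match):
        cp = int(match.group(1), 16)
        # Assumption: out-of-range values are left untouched.
        return chr(cp) if cp <= 0x10FFFF else match.group(0)
    return pattern.sub(decode, text)


source = "\\41\r\nB"
print(repr(unescape(CSS2, source)))    # 'A\nB'  (only '\r' is swallowed)
print(repr(unescape(CSS21, source)))   # 'AB'    ('\r\n' is swallowed as a unit)
```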

Default set of characters that will not be escaped: U+0000..007F.

C99

The Programming Language C international standard ISO/IEC 9899:1999 (or C99 for short) in its current form is usually not publicly available, but you should be able to download a PDF file containing the standard including recent technical corrigenda.

Therein, Unicode escapes are defined as \unnnn and \Unnnnnnnn (note the case-sensitivity of “u” vs. “U”), where escapes “shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive” (Section 6.4.3 Universal character names). U+D800..DFFF contains the surrogate pair blocks, therefore codepoints outside the Basic Multilingual Plane will always be represented using one \U escape instead of a surrogate pair. However, UnicodeChecker will also unescape any surrogate pairs in the input string. If fewer than the required number of hexadecimal digits are found when unescaping \u or \U, UnicodeChecker will leave the escape sequence untouched.
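The behaviour described above can be sketched in Python roughly as follows. The names `c99_escape` and `c99_unescape` are hypothetical, and the details (uppercase hex digits, leaving values above U+10FFFF untouched) are assumptions of this sketch rather than a description of UnicodeChecker's code.

```python
import re


def c99_escape(text, keep=lambda cp: cp <= 0x9F):
    """Use \\uXXXX inside the BMP and a single \\UXXXXXXXX above it."""
    out = []
    for ch in text:
        cp = ord(ch)
        if keep(cp):
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


# Two adjacent \u escapes forming a high/low surrogate pair.
_PAIR = re.compile(
    r"\\u([Dd][89ABab][0-9A-Fa-f]{2})\\u([Dd][C-Fc-f][0-9A-Fa-f]{2})"
)
# Exactly four digits after \u, exactly eight after \U (case-sensitive).
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def c99_unescape(text):
    def join_pair(m):
        high, low = int(m.group(1), 16), int(m.group(2), 16)
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

    def single(m):
        cp = int(m.group(1) or m.group(2), 16)
        # Assumption: values outside the Unicode range stay untouched.
        return chr(cp) if cp <= 0x10FFFF else m.group(0)

    return _UCN.sub(single, _PAIR.sub(join_pair, text))


print(c99_escape("a\u00e9\U0001D11E"))                          # a\u00E9\U0001D11E
print(c99_unescape("\\uD834\\uDD1E") == "\U0001D11E")           # True
print(c99_unescape("\\u46"))                                    # \u46 (too few digits)
```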

Default set of characters that will not be escaped: U+0000..009F.

C (octal UTF-8)

In addition to real Unicode escapes, the Programming Language C specifies octal escape sequences in the forms \n, \nn and \nnn, where each n is an octal digit (0–7). When using UTF-8 encoding for such escape sequences, Cocoa developers can use NSString’s –initWithUTF8String: or +stringWithUTF8String: methods to programmatically create Unicode strings.
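As an illustration, the following Python sketch writes each UTF-8 byte of a non-ASCII character as an octal escape and reverses the process. The function names are hypothetical, and always emitting three octal digits is an assumption made so that a following digit cannot be misread as part of the escape.

```python
import re


def c_octal_escape(text, keep=lambda cp: cp <= 0x7F):
    """Write every UTF-8 byte of an escaped character as \\nnn (octal)."""
    out = []
    for ch in text:
        if keep(ord(ch)):
            out.append(ch)
        else:
            out.extend("\\%03o" % byte for byte in ch.encode("utf-8"))
    return "".join(out)


_OCTAL = re.compile(rb"\\([0-7]{1,3})")


def c_octal_unescape(text):
    def to_byte(m):
        value = int(m.group(1), 8)
        # Three octal digits can exceed one byte; leave such sequences alone.
        return bytes([value]) if value <= 0xFF else m.group(0)

    raw = _OCTAL.sub(to_byte, text.encode("utf-8"))
    # Invalid UTF-8 is replaced with U+FFFD instead of raising an error.
    return raw.decode("utf-8", errors="replace")


print(c_octal_escape("caf\u00e9"))             # caf\303\251
print(c_octal_unescape("caf\\303\\251"))       # café
```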

Default set of characters that will not be escaped: U+0000..007F.

Java

According to the Java Language Specification, Java translates “the ASCII characters \u followed by four hexadecimal digits to the Unicode character with the indicated hexadecimal value”.

As exactly four hexadecimal digits are required, UnicodeChecker will ignore sequences with fewer than four digits when unescaping. The specification further states that multiple “u” are also allowed and that Java “specifies a standard way of transforming a program written in Unicode into ASCII” where an extra “u” is added for each escape already found in the input string. UnicodeChecker will not add this extra “u” when escaping, and escape sequences with multiple “u” will always be unescaped to the corresponding Unicode codepoint.
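These rules can be sketched in Python as shown below. The names `java_escape` and `java_unescape` are hypothetical. Because Java escapes denote UTF-16 code units, the sketch writes codepoints above U+FFFF as two escapes; it also ignores the specification's rule about preceding backslashes, so it is a simplification rather than a full implementation.

```python
import re


def java_escape(text, keep=lambda cp: cp <= 0x7F):
    """Write each UTF-16 code unit of an escaped character as \\uXXXX."""
    out = []
    for ch in text:
        if keep(ord(ch)):
            out.append(ch)
            continue
        units = ch.encode("utf-16-be", "surrogatepass")
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


# One or more "u", then exactly four hexadecimal digits.
_JAVA = re.compile(r"\\u+([0-9A-Fa-f]{4})")


def java_unescape(text):
    units = _JAVA.sub(lambda m: chr(int(m.group(1), 16)), text)
    # A round trip through UTF-16 joins adjacent surrogates into one codepoint.
    return units.encode("utf-16-be", "surrogatepass").decode(
        "utf-16-be", "surrogatepass"
    )


print(java_unescape("\\u0046GH"))       # FGH
print(java_unescape("\\uuu0046GH"))     # FGH
print(java_unescape("\\u46GH"))         # \u46GH (too few digits, left as is)
print(java_escape("\U0001D11E"))        # \uD834\uDD1E
```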

Default set of characters that will not be escaped: U+0000..007F.

Examples:

  • \u0046GH → FGH
  • \uuu0046GH → FGH
  • \u46GH → \u46GH

URL (UTF-8)

URL escapes are defined in RFC 2396 section 2.4.1 “as a character triplet, consisting of the percent character '%' followed by the two hexadecimal digits representing the octet code”. However, RFC 2396 does not specify which encoding to use in actual URIs. UnicodeChecker – as its name implies – always assumes UTF-8 encoding.

Default set of characters that will not be escaped: !$&'()*+,-./0-9:;=?@A-Z_a-z~ (this corresponds to U+0021, U+0024, U+0026..003B, U+003D, U+003F..005A, U+005F, U+0061..007A, U+007E).
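For comparison, the same default set can be reproduced with Python's standard library. This is a sketch of the format, not of UnicodeChecker itself: `url_escape` and `DEFAULT_SAFE` are hypothetical names, while `quote` and `unquote` are the standard `urllib.parse` functions, which always keep ASCII letters, digits and `_.-~` unescaped.

```python
from urllib.parse import quote, unquote

# Punctuation from the default set above; letters and digits are
# preserved by quote() regardless of this argument.
DEFAULT_SAFE = "!$&'()*+,-./:;=?@_~"


def url_escape(text, safe=DEFAULT_SAFE):
    """Percent-escape the UTF-8 bytes of every character outside `safe`."""
    return quote(text, safe=safe, encoding="utf-8")


print(url_escape("A B"))                # A%20B
print(url_escape("caf\u00e9"))          # caf%C3%A9
print(unquote("A%20B"))                 # A B
```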

Example: A%20B → A B