Dreams and Glories!: Everybody Loves Unicode.

【From Wikipedia】
The Unicode Standard consists of a character repertoire, an encoding methodology and a set of standard character encodings, a set of code charts for visual reference, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and rules for normalization, decomposition, collation and rendering.

The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes.

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software.

The standard has been implemented in many recent technologies, including XML, the Java programming language and modern operating systems.

The Unicode code space is divided into 17 planes, each with 65,536 (= 2^16) code points, although currently only a few planes are used:

* Plane 0 (0000-FFFF): Basic Multilingual Plane (BMP)
* Plane 1 (10000-1FFFF): Supplementary Multilingual Plane (SMP)
* Plane 2 (20000-2FFFF): Supplementary Ideographic Plane (SIP)
* Planes 3 to 13 (30000-DFFFF): unassigned
* Plane 14 (E0000-EFFFF): Supplementary Special-purpose Plane (SSP)
* Plane 15 (F0000-FFFFF): reserved for the Private Use Area (PUA)
* Plane 16 (100000-10FFFF): reserved for the Private Use Area (PUA)
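
Since every plane holds exactly 2^16 code points, the plane number is just the code point divided by 65,536. A minimal Python sketch of that arithmetic follows; the helper name `plane_of` is made up for illustration and is not part of any standard library.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point // PLANE_SIZE


print(plane_of(ord("A")))   # 0  (BMP)
print(plane_of(0x1D11E))    # 1  (SMP)
print(plane_of(0x10FFFF))   # 16 (last private-use plane)
```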

Unicode defines two mapping methods:
* the UTF (Unicode Transformation Format) encodings
* the UCS (Universal Character Set) encodings

The encodings include:
* UTF-7 ― a relatively unpopular 7-bit encoding, often considered obsolete
* UTF-8 ― an 8-bit, variable-width encoding, which maximizes compatibility with ASCII
* UTF-EBCDIC ― an 8-bit, variable-width encoding, which maximizes compatibility with EBCDIC
* UCS-2 ― a 16-bit, fixed-width encoding that only supports the BMP, considered obsolete
* UTF-16 ― a 16-bit, variable-width encoding
* UCS-4 and UTF-32 ― functionally identical 32-bit fixed-width encodings
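
To get a feel for the width differences, the sketch below encodes a few sample characters with Python's built-in codecs and counts the bytes. The sample characters and the helper name `encoded_lengths` are my own choices for illustration; the big-endian codec names are used so that no byte order mark is added to the count.

```python
# One character from each UTF-8 length class: ASCII, Latin-1, BMP, supplementary.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001D11E"]
CODECS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_lengths(ch: str) -> dict:
    """Map each codec name to the number of bytes it needs for one character."""
    return {name: len(ch.encode(name)) for name in CODECS}


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-32 always takes 4 bytes, UTF-16 takes 2 or 4, and UTF-8 ranges from 1 to 4.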

Multilingual text-rendering engines
* Uniscribe ― Windows
* Apple Type Services for Unicode Imaging ― new engine for Macintosh
* WorldScript ― old engine for Macintosh
* Pango ― Open Source, used by GTK+ (and hence GNOME)
* ICU Layout Engine ― Open Source
* Graphite ― Open Source renderer from SIL
* Scribe ― Open Source renderer from Trolltech

UTF-8 {1,2,3,4} byte
-----

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, yet its encoding of the first 128 code points is identical to ASCII, so software that handles ASCII but passes other byte values through unchanged needs little or no modification. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.

UTF-8 uses one to four bytes (octets) per character, depending on the Unicode symbol.

Only one byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F).
Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF).
Three bytes are needed for the rest of the Basic Multilingual Plane, which contains virtually all characters in common use (Unicode range U+0800 to U+FFFF).
Four bytes are needed for characters in other planes of Unicode (Unicode range U+10000 to U+10FFFF).
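
The four length classes can be sketched as a small hand-written encoder. This is only an illustration of the bit layout, not a replacement for a real codec; the function name `utf8_encode` is hypothetical, and in practice Python's own `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:          # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x20AC).hex())  # e282ac (EURO SIGN, three bytes)
```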

UTF-16 {2,4} byte
-----

In computing, UTF-16 (16-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire.

The encoding form maps code points (characters) into a sequence of 16-bit words, called code units. For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word.

For characters in the other planes, the encoding will result in a pair of 16-bit words, together called a surrogate pair.
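
The surrogate pair is computed by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two 10-bit halves. A minimal sketch, with the hypothetical helper name `to_surrogate_pair`:

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```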

CESU-8
-----

CESU-8 is a variant of UTF-8 that is described in Unicode Technical Report #26.

A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8.

Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four. Each CESU-8 character code (1, 2, or 3 bytes) can be converted to exactly one UTF-16 code unit (2 bytes).
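
Python has no built-in CESU-8 codec, so the following is a hand-rolled sketch of the rule described above. The function name `cesu8_encode` is hypothetical; the `surrogatepass` error handler is a real Python option that lets the UTF-8 codec emit the three-byte form of a surrogate code point.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: BMP as UTF-8, others as two 3-byte surrogates."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")          # same bytes as UTF-8
        else:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):      # 3 bytes each, 6 in total
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


clef = "\U0001D11E"
print(cesu8_encode(clef).hex())    # eda0b4edb49e (six bytes)
print(clef.encode("utf-8").hex())  # f09d849e (four bytes)
```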

Java
-----

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.

However, Java also supports a non-standard variant of UTF-8 called modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files.

There are two differences between modified and standard UTF-8.

The first difference is that the null character (U+0000) is encoded with two bytes instead of one, specifically as 11000000 10000000. This ensures that there are no embedded nulls in the encoded string, presumably to address the concern that if the encoded string is processed in a language such as C where a null byte signifies the end of a string, an embedded null would cause the string to be truncated.

The second difference is in the way characters outside the Basic Multilingual Plane are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence as in CESU-8. The reason for this modification is more subtle. In Java a character is 16 bits long; therefore some Unicode characters require two Java characters in order to be represented. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change. The modified encoding ensures that an encoded string can be decoded one UTF-16 code unit at a time, rather than one Unicode code point at a time. Unfortunately, this also means that characters requiring four bytes in UTF-8 require six bytes in modified UTF-8.
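
The two differences can be sketched in a few lines of Python. This is an illustration of the byte layout only, under the assumption that the rules are exactly the two stated above; it is not Java's own implementation, and the function name `modified_utf8_encode` is made up.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Sketch of Java-style modified UTF-8 (no length prefix is added)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"                 # difference 1: two-byte null
        elif cp < 0x10000:
            # BMP characters use the normal UTF-8 bytes.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Difference 2: surrogate pair, each half encoded in 3 bytes.
            offset = cp - 0x10000
            for surrogate in (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


encoded = modified_utf8_encode("a\x00b\U0001D11E")
print(encoded.hex())        # 61c08062eda0b4edb49e
print(b"\x00" in encoded)   # False: no embedded null byte
```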

Because modified UTF-8 is not UTF-8, one needs to be very careful to avoid mislabelling data in modified UTF-8 as UTF-8 when interchanging information over the Internet.

Tuesday, April 10, 2007
