UCS-2 Encoding and Decoding Algorithm

In computer systems, characters are transformed and stored as numbers (sequences of bits) that can be handled by the processor. A code page is an encoding scheme that maps a specific sequence of bits to its character representation. Before Unicode, there were hundreds of different encoding schemes that assigned a number to each letter or character. Many such schemes included code pages that contained only 256 characters – each character requiring 8 bits of storage. While this was relatively compact, it was insufficient to hold ideographic character sets containing thousands of characters such as Chinese and Japanese, and also did not allow the character sets of many languages to co-exist with each other.

Unicode is an attempt to unify all of these schemes in one universal text-encoding standard.

Unicode assigns each character a unique number, called a code point. For example, the character A (Latin Capital Letter A) is code point U+0041, and the Hiragana ふ is U+3075.

An individual Unicode code point is written as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and the uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted unless the code point would otherwise have fewer than four hexadecimal digits. Examples: U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
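The notation rule above can be illustrated with a short Python sketch. The helper name format_code_point is hypothetical (it is not part of any standard library); it simply pads to at least four uppercase hexadecimal digits and rejects values outside the Unicode code space.

```python
def format_code_point(cp: int) -> str:
    """Return the U+n notation for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    # At least four uppercase hex digits; longer values are not padded further.
    return f"U+{cp:04X}"


print(format_code_point(ord("A")))   # U+0041
print(format_code_point(ord("ふ")))  # U+3075
print(format_code_point(0x12345))    # U+12345
```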

The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.

The first plane is called the Basic Multilingual Plane, or BMP. It contains the code points from U+0000 to U+FFFF, which cover the most frequently used characters.
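As a rough sketch of the plane arithmetic, the plane of a code point is its value divided by 65,536 (equivalently, the bits above the low 16). The names PLANE_SIZE, PLANE_COUNT, plane_of and in_bmp are illustrative assumptions, not standard identifiers.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # the BMP plus 16 supplementary planes


def plane_of(cp: int) -> int:
    """Return the plane number (0..16) that contains code point cp."""
    if not 0 <= cp < PLANE_COUNT * PLANE_SIZE:
        raise ValueError("code point outside the Unicode code space")
    return cp // PLANE_SIZE


def in_bmp(cp: int) -> bool:
    """True if cp lies in the Basic Multilingual Plane (plane 0)."""
    return plane_of(cp) == 0


print(PLANE_COUNT * PLANE_SIZE)  # 1114112
print(plane_of(0x3075))          # 0
print(plane_of(0x12345))         # 1
```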

UCS-2 Encoding and Decoding

UCS-2, also known as ISO-10646-UCS-2, represents each Unicode code point as a two-byte unsigned integer between 0 and 65,535.

UCS-2 can only represent the code points of the first plane (U+0000 to U+FFFF).

UCS-2 encoding and decoding are straightforward: each code point is stored directly as its 16-bit value. For example, the capital letter A, code point U+0041 in Unicode, is represented by the two-byte value 0x0041. The capital letter B, code point U+0042, is represented by 0x0042. The two-byte value 0x03A3 represents the capital Greek letter Σ, code point U+03A3.
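A minimal sketch of this mapping, working with 16-bit integer values rather than raw bytes. The function names ucs2_encode and ucs2_decode are hypothetical, and the sketch assumes that code points above U+FFFF should simply be rejected; it does not treat any range inside the BMP specially.

```python
from typing import Iterable, List


def ucs2_encode(text: str) -> List[int]:
    """Map each character to its 16-bit UCS-2 value (hypothetical helper)."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 cannot represent code points outside the first plane.
            raise ValueError(f"U+{cp:04X} is not representable in UCS-2")
        units.append(cp)
    return units


def ucs2_decode(units: Iterable[int]) -> str:
    """Map 16-bit UCS-2 values back to a string (hypothetical helper)."""
    chars = []
    for unit in units:
        if not 0 <= unit <= 0xFFFF:
            raise ValueError(f"{unit!r} is not a 16-bit value")
        chars.append(chr(unit))
    return "".join(chars)


print([f"0x{u:04X}" for u in ucs2_encode("ABΣ")])  # ['0x0041', '0x0042', '0x03A3']
print(ucs2_decode([0x0041, 0x0042, 0x03A3]))       # ABΣ
```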

Byte Order Mark (BOM)

UCS-2 comes in two variations, big endian and little endian. In big-endian UCS-2, the most significant byte of each character comes first. In little-endian UCS-2, the order is reversed. The code point itself does not change; only the order of the two stored bytes does. Thus, in big-endian UCS-2, the letter A (U+0041) is stored as the bytes 0x00 0x41. In little-endian UCS-2, the bytes are swapped, and A is stored as 0x41 0x00. In big-endian UCS-2, the letter B (U+0042) is stored as 0x00 0x42; in little-endian UCS-2, as 0x42 0x00. In big-endian UCS-2, the letter Σ (U+03A3) is stored as 0x03 0xA3; in little-endian UCS-2, as 0xA3 0x03.
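The two byte orders can be sketched with Python's built-in int.to_bytes. The function name ucs2_to_bytes and its byteorder parameter are illustrative assumptions; no byte-order mark is written here.

```python
def ucs2_to_bytes(text: str, byteorder: str = "big") -> bytes:
    """Serialize text as UCS-2 in the given byte order (hypothetical helper)."""
    if byteorder not in ("big", "little"):
        raise ValueError("byteorder must be 'big' or 'little'")
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is not representable in UCS-2")
        # Two bytes per character; byteorder decides which byte comes first.
        out += cp.to_bytes(2, byteorder)
    return bytes(out)


print(ucs2_to_bytes("A", "big").hex())     # 0041
print(ucs2_to_bytes("A", "little").hex())  # 4100
print(ucs2_to_bytes("Σ", "big").hex())     # 03a3
print(ucs2_to_bytes("Σ", "little").hex())  # a303
```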

To distinguish between big-endian and little-endian UCS-2, a program can look at the first two bytes of a UCS-2 encoded document. A document encoded in big-endian UCS-2 begins with the Unicode character U+FEFF, the zero-width no-break space, more commonly called the byte-order mark; its bytes are 0xFE 0xFF. A document encoded in little-endian UCS-2 begins with the same Unicode character (U+FEFF), but the bytes are swapped (0xFF 0xFE). Read in the wrong byte order, these bytes would give U+FFFE, which is not assigned to any character, so the two cases cannot be confused.
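A minimal sketch of BOM-based detection follows. The names detect_byte_order and decode_ucs2_with_bom are hypothetical, and the sketch assumes that input without a byte-order mark should be rejected rather than guessed at.

```python
from typing import Optional


def detect_byte_order(data: bytes) -> Optional[str]:
    """Return 'big', 'little', or None when no byte-order mark is present."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None


def decode_ucs2_with_bom(data: bytes) -> str:
    """Decode UCS-2 bytes that start with a byte-order mark (hypothetical helper)."""
    order = detect_byte_order(data)
    if order is None:
        raise ValueError("no byte-order mark found")  # assumption: do not guess
    body = data[2:]  # skip the two BOM bytes
    if len(body) % 2 != 0:
        raise ValueError("UCS-2 data must contain an even number of bytes")
    return "".join(
        chr(int.from_bytes(body[i:i + 2], order))
        for i in range(0, len(body), 2)
    )


print(decode_ucs2_with_bom(b"\xfe\xff\x00\x41\x03\xa3"))  # AΣ
print(decode_ucs2_with_bom(b"\xff\xfe\x41\x00\xa3\x03"))  # AΣ
```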