In computer systems, characters are transformed and stored as numbers (sequences of bits) that can be handled by the processor. A code page is an encoding scheme that maps a specific sequence of bits to its character representation. Before Unicode, there were hundreds of different encoding schemes that assigned a number to each letter or character. Many such schemes included code pages that contained only 256 characters – each character requiring 8 bits of storage. While this was relatively compact, it was insufficient to hold ideographic character sets containing thousands of characters such as Chinese and Japanese, and also did not allow the character sets of many languages to co-exist with each other.
Unicode is an attempt to include all the different schemes into one universal text-encoding standard.
Unicode represents each character as a unique code point, a number that identifies it unambiguously. For example, the character A (Latin Capital Letter A) is represented as U+0041 and the Hiragana ふ as U+3075.
An individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits. For example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
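As a quick sketch of this notation rule, the following Python snippet formats code points with at least four uppercase hexadecimal digits (the helper name u_notation is ours, not part of any standard library):

```python
def u_notation(code_point: int) -> str:
    """Format a code point in U+n notation: at least four
    uppercase hex digits, with no extra leading zeros."""
    return f"U+{code_point:04X}"

print(u_notation(ord("A")))    # U+0041
print(u_notation(ord("ふ")))   # U+3075
print(u_notation(0x1F600))     # U+1F600
```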
The Unicode code space is divided into seventeen planes (the Basic Multilingual Plane and 16 supplementary planes), each with 65,536 code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112 code points.
The first plane is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters.
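As an illustration, the plane containing a given code point can be computed by integer-dividing by 65,536 (equivalently, shifting right by 16 bits); the function name plane_of below is just for this sketch:

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains a code point."""
    return code_point >> 16  # each plane holds 65,536 code points

print(plane_of(0x0041))   # 0 -> Basic Multilingual Plane
print(plane_of(0x1F600))  # 1 -> a supplementary plane
print(17 * 65_536)        # 1114112, the total code space
```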
Unicode also defines multiple encodings of its single character set: UTF-8, UTF-16, and UTF-32.
UTF-32/UCS-4 Encoding and Decoding
UTF-32/UCS-4 is a character encoding that maps Unicode code points to 32-bit code units: each Unicode character is represented by exactly one 32-bit code unit.
A code unit is the basic unit of storage, 8, 16, or 32 bits wide, into which code points are encoded so that Unicode text can be stored and transmitted efficiently on a computer.
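To make code units concrete, this sketch uses Python's built-in codecs to encode 😀 (U+1F600) in all three forms; each encoding happens to produce four bytes here, but divides them into 8-, 16-, and 32-bit code units respectively:

```python
ch = "😀"  # U+1F600

utf8  = ch.encode("utf-8")      # 4 code units, 8 bits each
utf16 = ch.encode("utf-16-be")  # 2 code units, 16 bits each
utf32 = ch.encode("utf-32-be")  # 1 code unit, 32 bits

print(len(utf8), len(utf16) // 2, len(utf32) // 4)  # 4 2 1
```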
The algorithm to convert a Unicode code point to a UTF-32/UCS-4 sequence is as follows: represent the code point as a 32-bit unsigned integer. For example, U+0041 becomes the UTF-32 sequence 0x00000041.
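A minimal sketch of this encoding step, packing the code point into one big-endian 32-bit code unit with Python's struct module (the function name encode_utf32_be is ours):

```python
import struct

def encode_utf32_be(code_point: int) -> bytes:
    """Encode a Unicode code point as a single big-endian
    32-bit code unit."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    return struct.pack(">I", code_point)

print(encode_utf32_be(0x0041).hex())  # 00000041
```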
The algorithm to convert a UTF-32/UCS-4 sequence to a Unicode code point is the reverse: interpret the 32-bit code unit as a Unicode code point. For example, the UTF-32 sequence 0x00000041 yields the code point 0x41; since leading zeros are omitted unless the code point would have fewer than four hexadecimal digits, this is written U+0041.
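And the reverse direction, a sketch (decode_utf32_be is again our own name) that unpacks one 32-bit code unit and prints the result in U+n notation:

```python
import struct

def decode_utf32_be(data: bytes) -> int:
    """Decode a single big-endian 32-bit code unit back into
    a Unicode code point."""
    (code_point,) = struct.unpack(">I", data)
    return code_point

cp = decode_utf32_be(bytes.fromhex("00000041"))
print(f"U+{cp:04X}")  # U+0041, trimmed to four hex digits
```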
Byte Order Mark (BOM)
UTF-32 comes in two variations: big endian and little endian. In big-endian UTF-32 (UTF-32BE), the most significant byte of each code unit comes first. In little-endian UTF-32 (UTF-32LE), the byte order is reversed. For example:

- The letter A (U+0041) is encoded as 0x00, 0x00, 0x00, 0x41 in big-endian UTF-32, and as 0x41, 0x00, 0x00, 0x00 in little-endian UTF-32.
- The letter B (U+0042) is encoded as 0x00, 0x00, 0x00, 0x42 in big-endian UTF-32, and as 0x42, 0x00, 0x00, 0x00 in little-endian UTF-32.
- The letter Σ (U+03A3) is encoded as 0x00, 0x00, 0x03, 0xA3 in big-endian UTF-32, and as 0xA3, 0x03, 0x00, 0x00 in little-endian UTF-32.
- The emoticon 😀 (U+1F600) is encoded as 0x00, 0x01, 0xF6, 0x00 in big-endian UTF-32, and as 0x00, 0xF6, 0x01, 0x00 in little-endian UTF-32.
- The character 𠂤 (U+200A4) is encoded as 0x00, 0x02, 0x00, 0xA4 in big-endian UTF-32, and as 0xA4, 0x00, 0x02, 0x00 in little-endian UTF-32.
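These byte sequences can be checked against Python's built-in utf-32-be and utf-32-le codecs, which encode without a byte order mark:

```python
for ch in ["A", "B", "Σ", "😀", "𠂤"]:
    be = ch.encode("utf-32-be").hex(" ")
    le = ch.encode("utf-32-le").hex(" ")
    print(f"U+{ord(ch):04X}  BE: {be}  LE: {le}")
# U+0041  BE: 00 00 00 41  LE: 41 00 00 00
# ...
```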
To distinguish between big-endian and little-endian UTF-32, a program can look at the first four bytes of a UTF-32 encoded document. A document encoded in big-endian UTF-32 begins with the Unicode character U+FEFF, the zero-width no-break space, encoded as the bytes 0x00, 0x00, 0xFE, and 0xFF. A document encoded in little-endian UTF-32 begins with the same Unicode character (U+FEFF), but with the bytes swapped: 0xFF, 0xFE, 0x00, and 0x00.
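A sketch of this check, inspecting the first four bytes for the BOM (the function name detect_utf32_byte_order is ours):

```python
def detect_utf32_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-32 data from its BOM (U+FEFF)."""
    if data[:4] == b"\x00\x00\xfe\xff":
        return "big-endian"
    if data[:4] == b"\xff\xfe\x00\x00":
        return "little-endian"
    return "no BOM found"

# Encode U+FEFF explicitly so the BOM appears at the start.
print(detect_utf32_byte_order("\ufeffA".encode("utf-32-be")))  # big-endian
print(detect_utf32_byte_order("\ufeffA".encode("utf-32-le")))  # little-endian
```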