In computer systems, characters are transformed and stored as numbers (sequences of bits) that can be handled by the processor. A code page is an encoding scheme that maps a specific sequence of bits to its character representation. Before Unicode, there were hundreds of different encoding schemes that assigned a number to each letter or character. Many such schemes used code pages that contained only 256 characters, each character requiring 8 bits of storage. While this was relatively compact, it was insufficient to hold ideographic character sets containing thousands of characters, such as Chinese and Japanese, and it did not allow the character sets of different languages to coexist.

Unicode is an attempt to include all the different schemes into one universal text-encoding standard.

Unicode represents each individual character as a unique code point, a unique number. For example, the character A – Latin Capital Letter A – is represented as U+0041, and the Hiragana letter ふ is U+3075.

An individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted unless the code point would otherwise have fewer than four hexadecimal digits. For example: U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
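
As a small illustration, here is a minimal Python sketch of this notation (the helper name format_code_point is purely illustrative):

    def format_code_point(cp):
        # At least four hexadecimal digits, zero-padded, uppercase A-F.
        return f"U+{cp:04X}"

    print(format_code_point(0x41))      # U+0041
    print(format_code_point(0x3075))    # U+3075
    print(format_code_point(0x1F6A9))   # U+1F6A9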

The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.

The first plane is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters.

Unicode also defines multiple encodings of its single character set: UTF-8, UTF-16, and UTF-32.
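
As a quick illustration, the same character comes out as a different byte sequence under each encoding form; this short sketch uses Python's built-in codecs (bytes.hex with a separator needs Python 3.8+):

    ch = "€"  # U+20AC
    print(ch.encode("utf-8").hex(" "))      # e2 82 ac     (three 8-bit code units)
    print(ch.encode("utf-16-be").hex(" "))  # 20 ac        (one 16-bit code unit)
    print(ch.encode("utf-32-be").hex(" "))  # 00 00 20 ac  (one 32-bit code unit)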

UTF-16 Encoding

UTF-16 is a character encoding that maps each Unicode code point to a sequence of one or two 16-bit code units.

A code unit is the basic unit of storage an encoding form uses – 8, 16, or 32 bits – into which code points are encoded so that Unicode text can be stored and transmitted efficiently on a computer.

When representing characters in UTF-16, each code point is encoded as a sequence of one or two 16-bit code units. The number of bytes used depends on the code point being encoded (a minimal code sketch of this rule follows the list). Here's a breakdown of the ranges:

  • code points in the Basic Multilingual Plane (BMP), in the ranges U+0000 – U+D7FF (0–55,295) and U+E000 – U+FFFF (57,344–65,535), are represented by two bytes (one code unit)
  • code points in the other 16 supplementary planes, in the range U+010000 – U+10FFFF (65,536–1,114,111), are represented by four bytes (two code units)
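
Here is the minimal sketch of that rule mentioned above, in Python (the helper name utf16_code_unit_count is illustrative; it counts 16-bit code units, each of which occupies two bytes):

    def utf16_code_unit_count(cp):
        # BMP code points take one 16-bit code unit (two bytes);
        # supplementary-plane code points take a surrogate pair (four bytes).
        return 1 if cp <= 0xFFFF else 2

    print(utf16_code_unit_count(0x0024))   # 1 code unit  -> two bytes
    print(utf16_code_unit_count(0x1F6A9))  # 2 code units -> four bytes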

To encode a character outside the BMP (e.g. U+29E3D, the Japanese kanji for Okhotsk atka mackerel, 𩸽), UTF-16 uses two 16-bit code units. This pair of code units is called a surrogate pair.

The Unicode standard reserves the BMP range from U+D800 to U+DFFF for surrogate pairs (which means those code points are never assigned to any characters).

The range U+D800 – U+DBFF is known as the high surrogates, and U+DC00 – U+DFFF as the low surrogates. The first code unit of a surrogate pair is always a high surrogate, and the second is always a low surrogate.
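
In code, testing whether a 16-bit code unit lies in either surrogate range is a pair of simple range checks; the function names in this Python sketch are only illustrative:

    def is_high_surrogate(unit):
        return 0xD800 <= unit <= 0xDBFF

    def is_low_surrogate(unit):
        return 0xDC00 <= unit <= 0xDFFF

    print(is_high_surrogate(0xD83D))  # True
    print(is_low_surrogate(0xDEA9))   # True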

The algorithm to convert a Unicode code point to a UTF-16 sequence is as follows:

  1. If the code point is between U+0000 and U+D7FF or between U+E000 and U+FFFF, it is encoded as a single 16-bit code unit.
    The code point value itself is the code unit, so no transformation is needed.
    For example, the code point U+4E80 is encoded as the single code unit 0x4E80.
  2. If the code point is between U+010000 and U+10FFFF, it is encoded as two 16-bit code units.
    • Step 1: subtract 0x10000 from the code point
    • Step 2: get the high-surrogate (first) code unit of the UTF-16 sequence
      • Substep 1: shift the result of step 1 right by 10 bits
      • Substep 2: add 0xD800 (0b1101 1000 0000 0000) to the result of substep 1
      • Substep 3: the result of substep 2 is the high-surrogate code unit
    • Step 3: get the low-surrogate (second) code unit of the UTF-16 sequence
      • Substep 1: mask the result of step 1 with 0x3FF (0b0000 0011 1111 1111) using the AND operator to extract the trailing 10 bits of the code point
      • Substep 2: add 0xDC00 (0b1101 1100 0000 0000) to the result of substep 1
      • Substep 3: the result of substep 2 is the low-surrogate code unit
    • Final step: combine the results of step 2 and step 3 to form the UTF-16 sequence.

    For example, encode the code point U+1F6A9 to a UTF-16 sequence:
    • Step 1: 0x1F6A9 - 0x10000 = 0xF6A9
    • Step 2: get the high-surrogate code unit of the UTF-16 sequence
      • Substep 1: 0xF6A9 >> 10 = 0x3D
      • Substep 2: 0x3D + 0xD800 = 0xD83D
      • Substep 3: the high-surrogate code unit = 0xD83D
    • Step 3: get the low-surrogate (second) code unit of the UTF-16 sequence
      • Substep 1: 0xF6A9 & 0x3FF = 0x02A9
      • Substep 2: 0x02A9 + 0xDC00 = 0xDEA9
      • Substep 3: the low-surrogate code unit = 0xDEA9
    • Final step: combining the two code units, the UTF-16 sequence for the code point U+1F6A9 is 0xD83D 0xDEA9.
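
The whole encoding procedure above can be written as a short Python sketch (the function name encode_utf16 is illustrative; it returns the code units as integers rather than bytes):

    def encode_utf16(cp):
        """Return the UTF-16 code units (as integers) for a single code point."""
        if 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFF:
            # BMP code point: the value itself is the single code unit.
            return [cp]
        if 0x10000 <= cp <= 0x10FFFF:
            offset = cp - 0x10000               # step 1
            high = (offset >> 10) + 0xD800      # step 2: top 10 bits of the offset
            low = (offset & 0x3FF) + 0xDC00     # step 3: trailing 10 bits of the offset
            return [high, low]
        raise ValueError("not a code point encodable in UTF-16")

    print([hex(u) for u in encode_utf16(0x4E80)])   # ['0x4e80']
    print([hex(u) for u in encode_utf16(0x1F6A9)])  # ['0xd83d', '0xdea9']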

The following table summarizes the conversion of several example code points to UTF-16 sequences:

Character | Code point | Binary code point         | Binary UTF-16 code units                  | UTF-16 hex code units | UTF-16BE hex bytes | UTF-16LE hex bytes
$         | U+0024     | 0000 0000 0010 0100       | 0000 0000 0010 0100                       | 0024                  | 00 24              | 24 00
€         | U+20AC     | 0010 0000 1010 1100       | 0010 0000 1010 1100                       | 20AC                  | 20 AC              | AC 20
𐐷         | U+10437    | 0001 0000 0100 0011 0111  | 1101 1000 0000 0001  1101 1100 0011 0111  | D801 DC37             | D8 01 DC 37        | 01 D8 37 DC
𤭢         | U+24B62    | 0010 0100 1011 0110 0010  | 1101 1000 0101 0010  1101 1111 0110 0010  | D852 DF62             | D8 52 DF 62        | 52 D8 62 DF
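
The rows above can be reproduced with Python's built-in UTF-16 codecs, which perform the surrogate-pair arithmetic internally (bytes.hex with a separator needs Python 3.8+):

    for ch in "$€𐐷𤭢":
        be = ch.encode("utf-16-be").hex(" ")
        le = ch.encode("utf-16-le").hex(" ")
        print(f"{ch}  U+{ord(ch):04X}  BE: {be}  LE: {le}")
    # $  U+0024   BE: 00 24         LE: 24 00
    # €  U+20AC   BE: 20 ac         LE: ac 20
    # 𐐷  U+10437  BE: d8 01 dc 37   LE: 01 d8 37 dc
    # 𤭢  U+24B62  BE: d8 52 df 62   LE: 52 d8 62 df
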
UTF-16 Decoding

The algorithm to convert a UTF-16 sequence to a Unicode code point is as follows:

  1. If the code unit is between 0x0000 and 0xD7FF or between 0xE000 and 0xFFFF, it encodes a BMP code point on its own.
    The code unit value itself is the code point, so no transformation is needed.
    For example, the code unit 0x4E80 decodes to the code point U+4E80.
  2. If the first code unit is between 0xD800 and 0xDFFF, the sequence is a surrogate pair that consists of a high surrogate and a low surrogate.
    • Step 1: decode the high-surrogate code unit of the UTF-16 sequence, which must be between 0xD800 and 0xDBFF
      • Substep 1: subtract 0xD800 from the high-surrogate code unit
      • Substep 2: shift the result of substep 1 left by 10 bits
      • Substep 3: the result of substep 2 is the decoded value of the high surrogate
    • Step 2: decode the second (low-surrogate) code unit of the UTF-16 sequence, which must be between 0xDC00 and 0xDFFF
      • Substep 1: subtract 0xDC00 from the low-surrogate code unit
      • Substep 2: the result of substep 1 is the decoded value of the low surrogate
    • Step 3: add the result of step 1, the result of step 2, and 0x10000 to get the code point
    • Final step: the code point is the result of step 3

    For example, decode the UTF-16 sequence 0xD83D 0xDEA9:
    • Step 1: decode the high-surrogate code unit of the UTF-16 sequence, which must be between 0xD800 and 0xDBFF
      • Substep 1: 0xD83D - 0xD800 = 0x003D
      • Substep 2: 0x003D << 10 = 0xF400
      • Substep 3: the decoded value of the high surrogate = 0xF400
    • Step 2: decode the second (low-surrogate) code unit of the UTF-16 sequence, which must be between 0xDC00 and 0xDFFF
      • Substep 1: 0xDEA9 - 0xDC00 = 0x02A9
      • Substep 2: the decoded value of the low surrogate = 0x02A9
    • Step 3: 0xF400 + 0x02A9 + 0x10000 = 0x1F6A9
    • Final step: the code point = U+1F6A9
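
The decoding steps above can likewise be written as a short Python sketch (the function name decode_utf16 is illustrative; the input is a list of code-unit integers):

    def decode_utf16(units):
        """Return the code point for a list of one or two UTF-16 code units."""
        if len(units) == 1:
            # A single non-surrogate code unit is the code point itself.
            return units[0]
        high, low = units
        if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
            raise ValueError("not a valid surrogate pair")
        # Recombine the two 10-bit halves and add back the 0x10000 offset.
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000

    print(hex(decode_utf16([0x4E80])))          # 0x4e80
    print(hex(decode_utf16([0xD83D, 0xDEA9])))  # 0x1f6a9
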
Byte Order Mark (BOM)

UTF-16 comes in two variations: big endian and little endian. In big-endian UTF-16, or UTF-16BE, the most significant byte of each code unit comes first. In little-endian UTF-16, or UTF-16LE, the order is reversed. For example:

  • the letter A (U+0041) is encoded as the bytes 00 41 in UTF-16BE and 41 00 in UTF-16LE
  • the letter B (U+0042) is encoded as 00 42 in UTF-16BE and 42 00 in UTF-16LE
  • the letter Σ (U+03A3) is encoded as 03 A3 in UTF-16BE and A3 03 in UTF-16LE
  • the emoticon 😀 (U+1F600) is encoded as the surrogate pair D8 3D DE 00 in UTF-16BE and 3D D8 00 DE in UTF-16LE
  • the character 𠂤 (U+200A4) is encoded as the surrogate pair D8 40 DC A4 in UTF-16BE and 40 D8 A4 DC in UTF-16LE
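
These byte orders can be reproduced by serializing each 16-bit code unit with the desired byte order; a small Python sketch (the code units for 😀 come from the encoding algorithm above):

    # Lay out the code units of Σ (U+03A3) and 😀 (U+1F600) in both byte orders.
    for label, units in [("Σ", [0x03A3]), ("😀", [0xD83D, 0xDE00])]:
        be = b"".join(u.to_bytes(2, "big") for u in units)
        le = b"".join(u.to_bytes(2, "little") for u in units)
        print(label, "BE:", be.hex(" "), " LE:", le.hex(" "))
    # Σ BE: 03 a3  LE: a3 03
    # 😀 BE: d8 3d de 00  LE: 3d d8 00 de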

To distinguish between big-endian and little-endian UTF-16, a program can look at the first two bytes of a UTF-16 encoded document. A document encoded in big-endian UTF-16 typically begins with the Unicode character U+FEFF, the zero-width no-break space, more commonly called the byte-order mark (BOM): the bytes FE FF. A document encoded in little-endian UTF-16 begins with the same character, but with its bytes swapped (FF FE). Read as a big-endian value that would be U+FFFE, a code point that is guaranteed never to be assigned to a character, so the byte order can be detected unambiguously.
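
A minimal Python sketch of such BOM sniffing (the function name detect_utf16_byte_order is illustrative; it only inspects the first two bytes):

    def detect_utf16_byte_order(data):
        # U+FEFF serialized big-endian is FE FF; the same character serialized
        # little-endian appears as FF FE, which never encodes a valid character.
        if data[:2] == b"\xfe\xff":
            return "big-endian"
        if data[:2] == b"\xff\xfe":
            return "little-endian"
        return "unknown (no BOM)"

    print(detect_utf16_byte_order(b"\xfe\xff\x00\x41"))  # big-endian    ("A" with a BOM in UTF-16BE)
    print(detect_utf16_byte_order(b"\xff\xfe\x41\x00"))  # little-endian ("A" with a BOM in UTF-16LE)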