In computer systems, characters are transformed and stored as numbers (sequences of bits) that can be handled by the processor. A code page is an encoding scheme that maps a specific sequence of bits to its character representation. Before Unicode, there were hundreds of different encoding schemes that assigned a number to each letter or character. Many such schemes included code pages that contained only 256 characters – each character requiring 8 bits of storage. While this was relatively compact, it was insufficient to hold ideographic character sets containing thousands of characters such as Chinese and Japanese, and also did not allow the character sets of many languages to co-exist with each other.
Unicode is an attempt to include all the different schemes into one universal text-encoding standard.
Unicode represents each individual character as a unique code point with a unique number. For example, the character A (Latin Capital Letter A) is represented as U+0041 and the Hiragana ふ is U+3075.
An individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits. For example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
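The notation can be produced mechanically. The following is a minimal C sketch, assuming a helper named `format_code_point` (a hypothetical name, not part of any standard library). It relies on `snprintf` with the `%04lX` conversion, which pads short values to four digits and leaves longer ones untouched.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes cp in U+n notation into buf.
   Returns the number of characters the notation needs, excluding the
   terminating '\0' (the usual snprintf return value). */
int format_code_point(uint32_t cp, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    /* %04lX: at least four uppercase hex digits, zero-padded on the left. */
    return snprintf(buf, size, "U+%04lX", (unsigned long)cp);
}
```

With a sufficiently large buffer, passing 0x12 produces `U+0012` and passing 0x102345 produces `U+102345`, matching the examples above.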
Unicode also defines multiple encodings of its single character set: UTF-8, UTF-16, and UTF-32.
UTF-8 Encoding
UTF-8 is a character encoding that maps each Unicode code point to a sequence of one, two, three, or four 8-bit code units.
A code unit is the fixed-size unit (8, 16, or 32 bits, depending on the encoding) in which an encoded code point is stored or transmitted. In UTF-8, a code unit is one byte.
When representing characters in UTF-8, each code point is represented by a sequence of one or more bytes. The number of bytes used depends on the code point being represented by the character. Here's a breakdown of the usage range:
- code points in the ASCII range U+0000 - U+007F (0-127) are represented by a single byte
- code points in the range U+0080 - U+07FF (128-2047) are represented by two bytes
- code points in the range U+0800 - U+FFFF (2048-65535) are represented by three bytes
- code points in the range U+010000 - U+10FFFF (65536-1114111) are represented by four bytes
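The ranges above translate directly into a chain of comparisons. The sketch below uses a hypothetical function name, `utf8_sequence_length`. Rejecting the surrogate range U+D800 to U+DFFF is an addition that the list above does not mention; it is included because surrogates are not valid in UTF-8.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs for cp, or 0 if cp
   cannot be encoded (a surrogate, or a value above U+10FFFF). */
int utf8_sequence_length(uint32_t cp)
{
    if (cp <= 0x7F)
        return 1;                       /* U+0000  .. U+007F   */
    if (cp <= 0x7FF)
        return 2;                       /* U+0080  .. U+07FF   */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates (assumption, see lead-in) */
    if (cp <= 0xFFFF)
        return 3;                       /* U+0800  .. U+FFFF   */
    if (cp <= 0x10FFFF)
        return 4;                       /* U+10000 .. U+10FFFF */
    return 0;                           /* beyond the Unicode range */
}
```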
In the following table, the x characters are replaced by the bits of the code point:
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | | | |
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
U+010000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The algorithm to convert a Unicode code point to a UTF-8 sequence is as follows:
- If the code point is between U+0000 and U+007F (that is, less than U+0080), it is encoded in a single byte.
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 0 bits
- Substep 2: perform masking using AND operator between the result of substep 1 with 0x7F (0b01111111) to extract the trailing 7 bits of the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0
- Substep 4: the first byte of UTF-8 sequence is the result of substep 3
- Final step: the UTF-8 sequence is the result of step 1
For example, encoding the character $ (U+0024):
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: 0x24 >> 0 = 0x24
- Substep 2: 0x24 & 0x7F = 0x24
- Substep 3: 0x24 | 0x0 = 0x24
- Substep 4: first byte of UTF-8 sequence = 0x24
- Final step: the UTF-8 sequence of the character $ (U+0024) is 0x24
As you can see, for a single-byte UTF-8 sequence, the first byte is the code point itself.
- If the code point is between U+0080 and U+07FF, it is encoded in two bytes.
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 6 bits
- Substep 2: perform masking using AND operator between the result of substep 1 with 0x1F (0b00011111) to extract the trailing 5 bits of the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0xC0 (0b11000000) to add the leading 3 bits (i.e. 110) to the result of substep 2
- Substep 4: the first byte of UTF-8 sequence is the result of substep 3
- Step 2: get the second byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 0 bits
- Substep 2: perform masking using AND operator between the result of substep 1 with 0x3F (0b00111111) to extract the trailing 6 bits in the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0x80 (0b10000000) to add the leading 2 bits (i.e. 10) to the result of substep 2
- Substep 4: the second byte of UTF-8 sequence is the result of substep 3
- Final step: combine the result of step 1 and the result of step 2 to form a UTF-8 sequence.
For example, encoding the character ü (U+00FC):
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: 0xfc >> 6 = 0x03
- Substep 2: 0x03 & 0x1f = 0x03
- Substep 3: 0x03 | 0xc0 = 0xc3
- Substep 4: first byte of UTF-8 sequence = 0xc3
- Step 2: get the second byte of UTF-8 sequence
- Substep 1: 0xfc >> 0 = 0xfc
- Substep 2: 0xfc & 0x3f = 0x3c
- Substep 3: 0x3c | 0x80 = 0xbc
- Substep 4: second byte of UTF-8 sequence = 0xbc
- Final step: the UTF-8 sequence of the character ü (U+00FC) is 0xc3bc
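The two-byte steps above can be sketched in C as follows. The function name `utf8_encode_2` is an assumption made for this illustration, and the `assert` simply documents that the helper only handles the two-byte range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encodes cp (U+0080..U+07FF) into out[0..1]. */
void utf8_encode_2(uint32_t cp, unsigned char out[2])
{
    assert(cp >= 0x80 && cp <= 0x7FF);
    /* Step 1: shift right by 6, keep 5 bits, add the 110 prefix. */
    out[0] = (unsigned char)(((cp >> 6) & 0x1F) | 0xC0);
    /* Step 2: shift right by 0, keep 6 bits, add the 10 prefix. */
    out[1] = (unsigned char)(((cp >> 0) & 0x3F) | 0x80);
}
```

Calling it with 0x00FC fills the buffer with C3 BC, the same bytes derived by hand for ü.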
- If the code point is between U+0800 and U+FFFF, it is encoded in three bytes.
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: perform the right shift operation on the code point by 12 bits
- Substep 2: perform masking using AND operator between the result of Substep 1 with 0x0F (0b00001111) to extract the trailing 4 bits of the result of substep 1
- Substep 3: perform masking using OR operator between the result of Substep 2 with 0xE0 (0b11100000) to add the leading 4 bits (i.e. 1110) to the result of substep 2
- Substep 4: the first byte of UTF-8 sequence is the result of substep 3
- Step 2: get the second byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 6 bits
- Substep 2: perform masking using AND operator between the result of Substep 1 with 0x3F (0b00111111) to extract the trailing 6 bits in the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0x80 (0b10000000) to add the leading 2 bits (i.e. 10) to the result of substep 2
- Substep 4: the second byte of UTF-8 sequence is the result of substep 3
- Step 3: get the third byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 0 bits
- Substep 2: perform masking using AND operator between the result of substep 1 with 0x3F (0b00111111) to extract the trailing 6 bits in the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0x80 (0b10000000) to add the leading 2 bits (i.e. 10) to the result of substep 2
- Substep 4: the third byte of UTF-8 sequence is the result of substep 3
- Final step: combine the first byte, second byte and the third byte to form a UTF-8 sequence
For example, encoding the character € (U+20AC):
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: 0x20ac >> 12 = 0x02
- Substep 2: 0x02 & 0xf = 0x02
- Substep 3: 0x02 | 0xe0 = 0xe2
- Substep 4: first byte of UTF-8 sequence = 0xe2
- Step 2: get the second byte of UTF-8 sequence
- Substep 1: 0x20ac >> 6 = 0x82
- Substep 2: 0x82 & 0x3f = 0x02
- Substep 3: 0x02 | 0x80 = 0x82
- Substep 4: second byte of UTF-8 sequence = 0x82
- Step 3: get the third byte of UTF-8 sequence
- Substep 1: 0x20ac >> 0 = 0x20ac
- Substep 2: 0x20ac & 0x3f = 0x2c
- Substep 3: 0x2c | 0x80 = 0xac
- Substep 4: third byte of UTF-8 sequence = 0xac
- Final step: the UTF-8 sequence of the character € (U+20AC) is 0xe282ac
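A C sketch of the three-byte case, under the same assumptions as the hand calculation: the name `utf8_encode_3` is hypothetical, and the helper does not reject surrogate code points, which a production encoder would.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encodes cp (U+0800..U+FFFF) into out[0..2].
   Surrogates (U+D800..U+DFFF) are not filtered out in this sketch. */
void utf8_encode_3(uint32_t cp, unsigned char out[3])
{
    assert(cp >= 0x800 && cp <= 0xFFFF);
    out[0] = (unsigned char)(((cp >> 12) & 0x0F) | 0xE0); /* 1110xxxx */
    out[1] = (unsigned char)(((cp >> 6)  & 0x3F) | 0x80); /* 10xxxxxx */
    out[2] = (unsigned char)(( cp        & 0x3F) | 0x80); /* 10xxxxxx */
}
```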
- If the code point is between U+010000 and U+10FFFF, it is encoded in four bytes.
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 18 bits
- Substep 2: perform masking using AND operator between the result of Substep 1 with 0x07 (0b00000111) to extract the trailing 3 bits of the result of substep 1
- Substep 3: perform masking using OR operator between the result of Substep 2 with 0xF0 (0b11110000) to add the leading 5 bits (i.e. 11110) to the result of substep 2
- Substep 4: the first byte of UTF-8 sequence is the result of substep 3
- Step 2: get the second byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 12 bits
- Substep 2: perform masking using AND operator between the result of Substep 1 with 0x3F (0b00111111) to extract the trailing 6 bits in the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0x80 (0b10000000) to add the leading 2 bits (i.e. 10) to the result of substep 2
- Substep 4: the second byte of UTF-8 sequence is the result of substep 3
- Step 3: get the third byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 6 bits
- Substep 2: perform masking using AND operator between the result of Substep 1 with 0x3F (0b00111111) to extract the trailing 6 bits in the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0x80 (0b10000000) to add the leading 2 bits (i.e. 10) to the result of substep 2
- Substep 4: the third byte of UTF-8 sequence is the result of substep 3
- Step 4: get the fourth byte of UTF-8 sequence
- Substep 1: perform right shift operation on the code point by 0 bits
- Substep 2: perform masking using AND operator between the result of substep 1 with 0x3F (0b00111111) to extract the trailing 6 bits in the result of substep 1
- Substep 3: perform masking using OR operator between the result of substep 2 with 0x80 (0b10000000) to add the leading 2 bits (i.e. 10) to the result of substep 2
- Substep 4: the fourth byte of UTF-8 sequence is the result of substep 3
- Final step: combine the first byte, second byte, third byte, and fourth byte to form a UTF-8 sequence
For example, encoding the character 😀 (U+1F600):
- Step 1: get the first byte of UTF-8 sequence
- Substep 1: 0x1f600 >> 18 = 0x0
- Substep 2: 0x0 & 0x07 = 0x0
- Substep 3: 0x0 | 0xf0 = 0xf0
- Substep 4: first byte of UTF-8 sequence = 0xf0
- Step 2: get the second byte of UTF-8 sequence
- Substep 1: 0x1f600 >> 12 = 0x1f
- Substep 2: 0x1f & 0x3f = 0x1f
- Substep 3: 0x1f | 0x80 = 0x9f
- Substep 4: second byte of UTF-8 sequence = 0x9f
- Step 3: get the third byte of UTF-8 sequence
- Substep 1: 0x1f600 >> 6 = 0x7d8
- Substep 2: 0x7d8 & 0x3f = 0x18
- Substep 3: 0x18 | 0x80 = 0x98
- Substep 4: third byte of UTF-8 sequence = 0x98
- Step 4: get the fourth byte of UTF-8 sequence
- Substep 1: 0x1f600 >> 0 = 0x1f600
- Substep 2: 0x1f600 & 0x3f = 0x0
- Substep 3: 0x0 | 0x80 = 0x80
- Substep 4: fourth byte of UTF-8 sequence = 0x80
- Final step: the UTF-8 sequence of the character 😀 (U+1F600) is 0xf09f9880
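The four-byte case follows the same pattern with shifts of 18, 12, 6, and 0 bits. This is a sketch only, and `utf8_encode_4` is a name chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encodes cp (U+10000..U+10FFFF) into out[0..3]. */
void utf8_encode_4(uint32_t cp, unsigned char out[4])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    out[0] = (unsigned char)(((cp >> 18) & 0x07) | 0xF0); /* 11110xxx */
    out[1] = (unsigned char)(((cp >> 12) & 0x3F) | 0x80); /* 10xxxxxx */
    out[2] = (unsigned char)(((cp >> 6)  & 0x3F) | 0x80); /* 10xxxxxx */
    out[3] = (unsigned char)(( cp        & 0x3F) | 0x80); /* 10xxxxxx */
}
```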
The following table summarizes the conversion of code points to UTF-8 sequences:
Character | Code point | Binary code point | Binary UTF-8 | Hex UTF-8 |
---|---|---|---|---|
$ | U+0024 | 010 0100 | 00100100 | 24 |
ü | U+00FC | 000 1111 1100 | 11000011 10111100 | C3 BC |
€ | U+20AC | 0010 0000 1010 1100 | 11100010 10000010 10101100 | E2 82 AC |
😀 | U+1F600 | 0 0001 1111 0110 0000 0000 | 11110000 10011111 10011000 10000000 | F0 9F 98 80 |
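The four cases can be combined into a single routine that picks the length from the code point's range. The sketch below is one possible arrangement, not the only one; `utf8_encode` is a hypothetical name, and the surrogate check is an addition beyond the steps described above. The lead byte needs no AND mask here because the preceding range check already bounds the shifted value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 form of cp into out (room for
   4 bytes) and returns the byte count, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are rejected (assumption, see lead-in) */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* above U+10FFFF */
}
```

Feeding it the four code points from the table yields 24, C3 BC, E2 82 AC, and F0 9F 98 80 respectively.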
UTF-8 Decoding
The algorithm to convert a UTF-8 sequence to a Unicode code point is as follows:
- If the first byte of the UTF-8 sequence is between 0x00 and 0x7F, the sequence is a single byte long. A single byte holds 8 bits and is written as two hexadecimal digits.
- Step 1: decode the first byte
- Substep 1: represent the first byte of the UTF-8 sequence as two hexadecimal digits
- Substep 2: perform masking using AND operator between the first byte of the UTF-8 sequence with 0x7F (0b01111111) to extract the trailing 7 bits of the first byte
- Substep 3: perform left shift on the result of substep 2 by 0 bits
- Substep 4: the decoded first byte is the result of substep 3
- Final step: the code point of the UTF-8 sequence is the result of step 1
For example, decoding the UTF-8 sequence 0x24:
- Step 1: decode the first byte
- Substep 1: the first byte of the UTF-8 sequence = 0x24
- Substep 2: 0x24 & 0x7f = 0x24
- Substep 3: 0x24 << 0 = 0x24
- Substep 4: the decoded first byte = 0x24
- Final step: the code point of the UTF-8 sequence 0x24 is U+0024
As you can see, for a single-byte UTF-8 sequence, the first byte is the code point itself.
- If the first byte of UTF-8 sequence is between 0xC0 and 0xDF, the length of the UTF-8 sequence is two bytes. The maximum number of bits in two bytes is 16 bits and is represented as four hexadecimal digits.
- Step 1: decode the first byte
- Substep 1: represent the first byte of the UTF-8 sequence as four hexadecimal digits
- Substep 2: perform masking using AND operator between the first byte UTF-8 sequence with 0x1F (0b00011111) to extract the trailing 5 bits of the first byte UTF-8 sequence
- Substep 3: perform left shift on the result of substep 2 by 6 bits
- Substep 4: the decoded first byte is the result of substep 3
- Step 2: decode the second byte
- Substep 1: represent the second byte of the UTF-8 sequence as four hexadecimal digits
- Substep 2: perform masking using AND operator between the second byte of the UTF-8 sequence with 0x3F (0b00111111) to extract the trailing 6 bits of the second byte
- Substep 3: perform left shift on the result of substep 2 by 0 bits
- Substep 4: the decoded second byte is the result of substep 3
- Step 3: perform the OR operation among the result of step 1 and the result of step 2 to get the code point value.
- Final step: the code point of UTF-8 sequence is the result of step 3
For example, decoding the UTF-8 sequence 0xc3bc:
- Step 1: decode the first byte
- Substep 1: the first byte of the UTF-8 sequence = 0x00c3
- Substep 2: 0x00c3 & 0x1f = 0x0003
- Substep 3: 0x0003 << 6 = 0x00c0
- Substep 4: the decoded first byte is 0x00c0
- Step 2: decode the second byte
- Substep 1: the second byte of the UTF-8 sequence = 0x00bc
- Substep 2: 0x00bc & 0x3f = 0x003c
- Substep 3: 0x003c << 0 = 0x003c
- Substep 4: the decoded second byte is 0x003c
- Step 3: 0x00c0 | 0x003c = 0x00fc
- Final step: the code point of UTF-8 sequence 0xc3bc is U+00FC
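The two-byte decoding steps above map onto a single C expression. In this sketch the name `utf8_decode_2` is an assumption, and the `assert` calls only document the expected bit patterns; they are not a complete validity check (overlong forms are not rejected).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decodes a two-byte UTF-8 sequence. */
uint32_t utf8_decode_2(const unsigned char in[2])
{
    assert((in[0] & 0xE0) == 0xC0); /* lead byte looks like 110xxxxx    */
    assert((in[1] & 0xC0) == 0x80); /* continuation looks like 10xxxxxx */
    /* Step 1: 5 payload bits shifted left by 6.
       Step 2: 6 payload bits shifted left by 0.
       Step 3: OR the two partial results together. */
    return ((uint32_t)(in[0] & 0x1F) << 6) | (uint32_t)(in[1] & 0x3F);
}
```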
- If the first byte of UTF-8 sequence is between 0xE0 and 0xEF, the length of the UTF-8 sequence is three bytes. The maximum number of bits in three bytes is 24 bits and is represented as six hexadecimal digits.
- Step 1: decode the first byte
- Substep 1: represent the first byte of the UTF-8 sequence as six hexadecimal digits
- Substep 2: perform masking using AND operator between the first byte UTF-8 sequence with 0x0F (0b00001111) to extract the trailing 4 bits of the first byte UTF-8 sequence
- Substep 3: perform left shift on the result of substep 2 by 12 bits
- Substep 4: the decoded first byte is the result of substep 3
- Step 2: decode the second byte
- Substep 1: represent the second byte of the UTF-8 sequence as six hexadecimal digits
- Substep 2: perform masking using AND operator between the second byte of the UTF-8 sequence with 0x3F (0b00111111) to extract the trailing 6 bits of the second byte
- Substep 3: perform left shift on the result of substep 2 by 6 bits
- Substep 4: the decoded second byte is the result of substep 3
- Step 3: decode the third byte
- Substep 1: represent the third byte of the UTF-8 sequence as six hexadecimal digits
- Substep 2: perform masking using AND operator between the third byte of the UTF-8 sequence with 0x3F (0b00111111) to extract the trailing 6 bits of the third byte
- Substep 3: perform left shift on the result of substep 2 by 0 bits
- Substep 4: the decoded third byte is the result of substep 3
- Step 4: perform the OR operation among the result of step 1, the result of step 2 and the result of the step 3 to get the code point value.
- Final step: the code point of UTF-8 sequence is the result of step 4
For example, decoding the UTF-8 sequence 0xe282ac:
- Step 1: decode the first byte
- Substep 1: the first byte of the UTF-8 sequence = 0x0000e2
- Substep 2: 0x0000e2 & 0x0f = 0x000002
- Substep 3: 0x000002 << 12 = 0x002000
- Substep 4: the decoded first byte is 0x002000
- Step 2: decode the second byte
- Substep 1: the second byte of the UTF-8 sequence = 0x000082
- Substep 2: 0x000082 & 0x3f = 0x000002
- Substep 3: 0x000002 << 6 = 0x000080
- Substep 4: the decoded second byte is 0x000080
- Step 3: decode the third byte
- Substep 1: the third byte of the UTF-8 sequence = 0x0000ac
- Substep 2: 0x0000ac & 0x3f = 0x00002c
- Substep 3: 0x00002c << 0 = 0x00002c
- Substep 4: the decoded third byte is 0x00002c
- Step 4: 0x002000 | 0x000080 | 0x00002c = 0x0020ac
- Final step: the code point of UTF-8 sequence 0xe282ac is U+20AC
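For three bytes, the shifts are 12, 6, and 0 bits. The following sketch uses the hypothetical name `utf8_decode_3` and, as an illustration only, checks the bit prefixes with `assert` rather than performing full validation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decodes a three-byte UTF-8 sequence. */
uint32_t utf8_decode_3(const unsigned char in[3])
{
    assert((in[0] & 0xF0) == 0xE0); /* lead byte looks like 1110xxxx */
    assert((in[1] & 0xC0) == 0x80); /* continuation bytes: 10xxxxxx  */
    assert((in[2] & 0xC0) == 0x80);
    return ((uint32_t)(in[0] & 0x0F) << 12)  /* step 1 */
         | ((uint32_t)(in[1] & 0x3F) << 6)   /* step 2 */
         |  (uint32_t)(in[2] & 0x3F);        /* step 3, shift by 0 */
}
```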
- If the first byte of UTF-8 sequence is between 0xF0 and 0xF7, the length of the UTF-8 sequence is four bytes. The maximum number of bits in four bytes is 32 bits and is represented as eight hexadecimal digits.
- Step 1: decode the first byte
- Substep 1: represent the first byte of the UTF-8 sequence as eight hexadecimal digits
- Substep 2: perform masking using AND operator between the first byte UTF-8 sequence with 0x07 (0b00000111) to extract the trailing 3 bits of the first byte UTF-8 sequence
- Substep 3: perform left shift on the result of substep 2 by 18 bits
- Substep 4: the decoded first byte is the result of substep 3
- Step 2: decode the second byte
- Substep 1: represent the second byte of the UTF-8 sequence as eight hexadecimal digits
- Substep 2: perform masking using AND operator between the second byte UTF-8 sequence with 0x3F (0b00111111) to extract the trailing 6 bits of the second byte UTF-8 sequence
- Substep 3: perform left shift on the result of substep 2 by 12 bits
- Substep 4: the decoded second byte is the result of substep 3
- Step 3: decode the third byte
- Substep 1: represent the third byte of the UTF-8 sequence as eight hexadecimal digits
- Substep 2: perform masking using AND operator between the third byte UTF-8 sequence with 0x3F (0b00111111) to extract the trailing 6 bits of the third byte UTF-8 sequence
- Substep 3: perform left shift on the result of substep 2 by 6 bits
- Substep 4: the decoded third byte is the result of substep 3
- Step 4: decode the fourth byte
- Substep 1: represent the fourth byte of the UTF-8 sequence as eight hexadecimal digits
- Substep 2: perform masking using AND operator between the fourth byte UTF-8 sequence with 0x3F (0b00111111) to extract the trailing 6 bits of the fourth byte UTF-8 sequence
- Substep 3: perform left shift on the result of substep 2 by 0 bits
- Substep 4: the decoded fourth byte is the result of substep 3
- Step 5: perform the OR operation among the results of step 1, step 2, step 3, and step 4 to get the code point value.
- Final step: the code point of UTF-8 sequence is the result of step 5
For example, decoding the UTF-8 sequence 0xf09f9880:
- Step 1: decode the first byte
- Substep 1: the first byte of the UTF-8 sequence = 0x000000f0
- Substep 2: 0x000000f0 & 0x07 = 0x00000000
- Substep 3: 0x00000000 << 18 = 0x00000000
- Substep 4: the decoded first byte is 0x00000000
- Step 2: decode the second byte
- Substep 1: the second byte of the UTF-8 sequence = 0x0000009f
- Substep 2: 0x0000009f & 0x3f = 0x0000001f
- Substep 3: 0x0000001f << 12 = 0x0001f000
- Substep 4: the decoded second byte is 0x0001f000
- Step 3: decode the third byte
- Substep 1: the third byte of the UTF-8 sequence = 0x00000098
- Substep 2: 0x00000098 & 0x3f = 0x00000018
- Substep 3: 0x00000018 << 6 = 0x00000600
- Substep 4: the decoded third byte is 0x00000600
- Step 4: decode the fourth byte
- Substep 1: the fourth byte of the UTF-8 sequence = 0x00000080
- Substep 2: 0x00000080 & 0x3f = 0x00000000
- Substep 3: 0x00000000 << 0 = 0x00000000
- Substep 4: the decoded fourth byte is 0x00000000
- Step 5: 0x00000000 | 0x0001f000 | 0x00000600 | 0x00000000 = 0x0001f600
- Final step: the code point of UTF-8 sequence 0xf09f9880 is U+1F600
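All four decoding cases can be folded into one routine that first classifies the lead byte and then consumes the continuation bytes. This is a sketch under stated assumptions: the names `utf8_lead_length` and `utf8_decode` are hypothetical, the value is accumulated by shifting left 6 bits per continuation byte (equivalent to the per-byte shifts and final OR described above), and overlong forms and surrogates are not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sequence length implied by a lead byte, or 0 if
   the byte cannot start a sequence (e.g. a continuation byte). */
size_t utf8_lead_length(unsigned char lead)
{
    if (lead <= 0x7F)          return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}

/* Hypothetical helper: decodes one sequence from in[0..len-1] into *cp.
   Returns the number of bytes consumed, or 0 on malformed or truncated
   input. Overlong encodings are not detected in this sketch. */
size_t utf8_decode(const unsigned char *in, size_t len, uint32_t *cp)
{
    /* Payload mask of the lead byte, indexed by sequence length. */
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    size_t n, i;
    uint32_t value;

    if (len == 0)
        return 0;
    n = utf8_lead_length(in[0]);
    if (n == 0 || n > len)
        return 0;

    value = (uint32_t)(in[0] & lead_mask[n]);
    for (i = 1; i < n; ++i) {
        if ((in[i] & 0xC0) != 0x80)
            return 0; /* not a 10xxxxxx continuation byte */
        value = (value << 6) | (uint32_t)(in[i] & 0x3F);
    }
    *cp = value;
    return n;
}
```

Given the bytes F0 9F 98 80, the routine reports four bytes consumed and stores 0x1F600.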
Byte Order Mark (BOM)
The Byte Order Mark (BOM) is the Unicode code point U+FEFF. In UTF-8, this code point is encoded as the three-byte sequence EF BB BF. It may be placed at the beginning of a file or stream to signal that the contents are encoded in UTF-8. Because UTF-8 is a sequence of single bytes, it has no little-endian or big-endian variants, so here the BOM acts only as an encoding signature and is optional.
For example, suppose we have a text file containing the string "Hello, World!" encoded in UTF-8 and written with a BOM. The file would start with the BOM sequence EF BB BF, followed by the UTF-8 encoding of the string:
EF BB BF 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21
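A reader that wants to ignore the signature can check the first three bytes before decoding. The sketch below assumes a helper named `utf8_bom_length` (hypothetical) that reports how many leading bytes to skip.

```c
#include <stddef.h>

/* Hypothetical helper: returns 3 if buf starts with the UTF-8 BOM
   (EF BB BF), otherwise 0. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

For the byte sequence shown above it returns 3, so decoding would start at the byte 48 (the letter H).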
UTF-8 Encoding and Decoding Algorithm in Programming Languages
C Programming Language
In the C programming language, the UTF-8 encoding and decoding algorithms look as follows:
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

typedef struct {
    char mask;       /* char data will be bitwise AND with this */
    char lead;       /* start bytes of current char in utf-8 encoded character */
    uint32_t beg;    /* beginning of codepoint range */
    uint32_t end;    /* end of codepoint range */
    int bits_stored; /* the number of bits from the codepoint that fits in char */
} utf_t;

/* Row 0 describes continuation bytes; rows 1-4 describe the lead byte of
   1- to 4-byte sequences. beg/end are octal literals (0177 == 0x7F). */
utf_t * utf[] = {
    /*             mask        lead        beg      end       bits */
    [0] = &(utf_t){0b00111111, 0b10000000, 0,       0,        6 },
    [1] = &(utf_t){0b01111111, 0b00000000, 0000,    0177,     7 },
    [2] = &(utf_t){0b00011111, 0b11000000, 0200,    03777,    5 },
    [3] = &(utf_t){0b00001111, 0b11100000, 04000,   0177777,  4 },
    [4] = &(utf_t){0b00000111, 0b11110000, 0200000, 04177777, 3 },
          NULL, /* sentinel: stops the `*u` loops below at the end of the table */
};

/* All lengths are in bytes */
int codepoint_len(const uint32_t cp); /* len of associated utf-8 char */
int utf8_len(const char ch);          /* len of utf-8 encoded char */

char *to_utf8(const uint32_t cp);
uint32_t to_cp(const char chr[4]);

int codepoint_len(const uint32_t cp)
{
    int len = 0;
    for(utf_t **u = utf; *u; ++u) {
        if((cp >= (*u)->beg) && (cp <= (*u)->end)) {
            break;
        }
        ++len;
    }
    if(len > 4) /* Out of bounds */
        exit(1);

    return len;
}

int utf8_len(const char ch)
{
    int len = 0;
    for(utf_t **u = utf; *u; ++u) {
        if((ch & ~(*u)->mask) == (*u)->lead) {
            break;
        }
        ++len;
    }
    if(len > 4) { /* Malformed leading byte */
        exit(1);
    }
    return len;
}

/* Returns a pointer to a static buffer that is overwritten by the next call. */
char *to_utf8(const uint32_t cp)
{
    static char ret[5];
    const int bytes = codepoint_len(cp);

    /* Lead byte: the highest bits of the code point plus the length prefix. */
    int shift = utf[0]->bits_stored * (bytes - 1);
    ret[0] = (cp >> shift & utf[bytes]->mask) | utf[bytes]->lead;
    shift -= utf[0]->bits_stored;
    /* Continuation bytes: 6 bits each, prefixed with 10. */
    for(int i = 1; i < bytes; ++i) {
        ret[i] = (cp >> shift & utf[0]->mask) | utf[0]->lead;
        shift -= utf[0]->bits_stored;
    }
    ret[bytes] = '\0';
    return ret;
}

uint32_t to_cp(const char chr[4])
{
    int bytes = utf8_len(*chr);
    int shift = utf[0]->bits_stored * (bytes - 1);
    uint32_t codep = (*chr++ & utf[bytes]->mask) << shift;

    for(int i = 1; i < bytes; ++i, ++chr) {
        shift -= utf[0]->bits_stored;
        codep |= ((char)*chr & utf[0]->mask) << shift;
    }

    return codep;
}

int main(void)
{
    /* The list of code points is terminated by 0x0. */
    const uint32_t *in, input[] = {0x0041, 0x00f6, 0x0416, 0x20ac, 0x1d11e, 0x0};

    printf("Character Unicode UTF-8 encoding (hex)\n");
    printf("----------------------------------------\n");

    char *utf8;
    uint32_t codepoint;
    for(in = input; *in; ++in) {
        utf8 = to_utf8(*in);
        codepoint = to_cp(utf8);
        printf("%s U+%-7.4x", utf8, codepoint);

        for(int i = 0; utf8[i] && i < 4; ++i) {
            printf("%hhx ", utf8[i]);
        }
        printf("\n");
    }
    return 0;
}
Compile and run the program:
gcc utf8.c && ./a.out
The output is as follows:
Character Unicode UTF-8 encoding (hex)
----------------------------------------
A U+0041 41
ö U+00f6 c3 b6
Ж U+0416 d0 96
€ U+20ac e2 82 ac
𝄞 U+1d11e f0 9d 84 9e