Unicode

Unicode is a standard for consistent encoding, representation, and handling of text from various writing systems and languages around the world.

A Unicode code point is written as "U+" followed by hexadecimal digits. For example, U+0041 is the code point for the Latin capital letter 'A'.
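
As a tiny sketch in C: the printf below prints the code point of 'A' in that notation, relying on the fact that ASCII characters keep the same numeric value as their Unicode code points.

    #include <stdio.h>

    int main(void) {
        /* 'A' is 65 in ASCII, which is also its Unicode code point: U+0041. */
        printf("U+%04X\n", (unsigned)'A');   /* prints: U+0041 */
        return 0;
    }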

UTF-8

UTF stands for Unicode Transformation Format. The '8' means it uses 8-bit code units (bytes) to represent a character. The number of bytes needed to represent a character varies from 1 to 4.

UTF-8 is a compromise character encoding: it can be as compact as ASCII (if the file is just plain English text) but can also represent any Unicode character (with some increase in file size).
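
Here is a small sketch of that trade-off in C (assuming the string literals below are stored as UTF-8): plain English text costs one byte per character, while the accented 'é' (U+00E9) costs two.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *ascii    = "hello";            /* plain English text            */
        const char *accented = "h\xC3\xA9llo";     /* "héllo": é is C3 A9 in UTF-8  */

        printf("%zu\n", strlen(ascii));      /* 5 bytes: one byte per character     */
        printf("%zu\n", strlen(accented));   /* 6 bytes: 5 characters, é takes 2    */
        return 0;
    }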

One of the really nice features of UTF-8 is that it is compatible with null-terminated strings. No character other than NUL (U+0000) ever contains a null (0) byte when encoded. This means that C code that deals with char[] will "just work".
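
A short sketch of why that works in C: every byte of a multi-byte UTF-8 character has its high bit set, so it can never be 0x00, and byte-oriented functions such as strlen and strcpy stop only at the real terminating NUL. (The byte sequences below are assumed to be valid UTF-8 for '日' and '€'.)

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "日" (U+65E5) is E6 97 A5 and "€" (U+20AC) is E2 82 AC in UTF-8. */
        const char utf8[] = "\xE6\x97\xA5 costs \xE2\x82\xAC 5";

        /* Byte-oriented C string functions "just work": no byte inside a
           character is 0, so only the terminating NUL ends the string. */
        printf("byte length: %zu\n", strlen(utf8));

        char copy[sizeof utf8];
        strcpy(copy, utf8);                  /* copies up to and including the NUL */
        printf("copy: %s\n", copy);
        return 0;
    }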

Binary format of bytes in sequence

1st Byte   2nd Byte   3rd Byte   4th Byte   Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                    7                     007F hex (127)
110xxxxx   10xxxxxx                         (5+6)=11              07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx              (4+6+6)=16            FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21          10FFFF hex (1,114,111)

https://www.fileformat.info/info/unicode/utf8.htm
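
The table above translates directly into code. Below is a minimal sketch of a UTF-8 encoder in C that picks the byte pattern from the table based on the code point's value; a real encoder would also reject the surrogate range U+D800 to U+DFFF, which is omitted here for brevity.

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one Unicode code point as UTF-8; returns the number of bytes
       written (1-4), or 0 if cp is beyond U+10FFFF. */
    static int utf8_encode(uint32_t cp, unsigned char out[4]) {
        if (cp <= 0x7F) {                          /* 0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {                  /* 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp <= 0xFFFF) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp <= 0x10FFFF) {               /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                                  /* not a valid Unicode code point */
    }

    int main(void) {
        unsigned char buf[4];
        int n = utf8_encode(0x20AC, buf);          /* U+20AC, the euro sign */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);               /* prints: E2 82 AC */
        printf("\n");
        return 0;
    }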

🎰