Unicode In 5 Minutes
Unicode is a standard for digital processing of written characters and text. By enabling the exchange of text data internationally, it is a foundation for global software.
Unicode encodes characters, the smallest components of written language that have semantic value. It assigns each one a unique code point, or number. Code points are expressed as U+n where n is the number in hexadecimal.
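As a quick illustration of the U+n notation, here is a minimal Python sketch. The helper name `code_point_label` is made up for this example; `ord()` is the built-in that returns a character's code point as an integer.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+n notation."""
    # ord() gives the code point as an integer; by convention the
    # hexadecimal value is padded to at least four digits.
    return f"U+{ord(ch):04X}"


# Lowercase a, the euro sign, and an emoji outside the first 65,536 code points.
labels = [code_point_label(ch) for ch in ["a", "\u20ac", "\U0001F600"]]
print(labels)  # ['U+0061', 'U+20AC', 'U+1F600']
```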
Unicode does not encode font or stylistic differences. For example, most forms of the lower case 'a' are the same code point, U+0061. However, for compatibility with other systems, variants are sometimes given independent code points.
The full Unicode codespace supports 1,114,112 code points (U+0000 through U+10FFFF), of which well over 100,000 are currently assigned. The vast majority of characters used in modern languages are allocated within the first 65,536 code points, called the Basic Multilingual Plane. Notably, the first 128 code points are the same as ASCII.
For Unicode text to be transmitted or stored, it must be encoded into a sequence of bytes. UTF-8 and UTF-16 are common encodings that can handle the entire range of characters. Both are variable-length encodings: UTF-8 uses one to four bytes per code point, and UTF-16 uses two or four.
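To see the variable-length behavior concretely, the following Python sketch counts the bytes each encoding needs for a few sample characters. The function name `encoded_lengths` is hypothetical, and the `utf-16-le` codec is chosen here only so that no byte order mark is added to the count.

```python
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" fixes the byte order, so no byte order mark is prepended.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# a, e with acute accent, euro sign, and an emoji beyond the BMP.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Characters inside the Basic Multilingual Plane take two bytes in UTF-16, while those beyond it take four; UTF-8 grows from one byte for ASCII up to four bytes.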
ASCII, Latin-1 (ISO-8859-1) and Shift JIS (SJIS) are common encodings that are outside of Unicode. Each covers only a limited set of characters. ASCII has the interesting property that every ASCII sequence is also a valid UTF-8 sequence that encodes the same characters.
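Both points can be checked with a short Python sketch. The helper `can_encode` is a made-up name for this example; the codec names `ascii`, `latin-1` and `shift_jis` are the ones Python's standard library uses for these encodings.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Bytes produced by the ASCII codec decode unchanged as UTF-8.
ascii_bytes = "Unicode".encode("ascii")
round_trip = ascii_bytes.decode("utf-8")
print(round_trip == "Unicode")  # True

# Limited repertoires: e with acute accent (U+00E9) and a kanji (U+65E5).
print(can_encode("\u00e9", "ascii"))      # False
print(can_encode("\u00e9", "latin-1"))    # True
print(can_encode("\u65e5", "latin-1"))    # False
print(can_encode("\u65e5", "shift_jis"))  # True
print(can_encode("\u65e5", "utf-8"))      # True
```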
For a variety of reasons, some text can be represented in Unicode by more than one sequence of code points. Depending on the intended processing, text may need to be normalized into one particular form. For example, people expect the word "file" to sort and compare the same whether the first two letters are stored as two characters or as a single ligature character. Unicode defines normalization forms that convert text into a regularized form for text processing.
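The "file" example can be reproduced with Python's standard `unicodedata` module. This is only a sketch: choosing the NFKC form for comparison is an assumption of the example, not a recommendation for every use, since compatibility normalization discards distinctions that some applications need to keep.

```python
import unicodedata

ligature_word = "\ufb01le"  # starts with U+FB01, LATIN SMALL LIGATURE FI
plain_word = "file"

# Without normalization the two spellings are different strings.
print(ligature_word == plain_word)  # False

# NFKC applies compatibility mappings, expanding the ligature to "f" + "i".
nfkc_word = unicodedata.normalize("NFKC", ligature_word)
print(nfkc_word == plain_word)  # True

# NFC handles canonical equivalence only: it composes "e" + combining acute
# accent (U+0301) into U+00E9, but leaves the ligature untouched.
composed = unicodedata.normalize("NFC", "e\u0301")
print(composed == "\u00e9")  # True
nfc_word = unicodedata.normalize("NFC", ligature_word)
print(nfc_word == ligature_word)  # True
```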
The Unicode Standard is a fascinating document with many interesting write-ups about various writing systems, and pictures of fantastic characters. These are a few of my favorites.
- Peace in every language can be found at http://www.columbia.edu/~fdc/pace/
- More fun characters can be found at: http://www.inference.phy.cam.ac.uk/cjb/codepoints.html
- The Unicode web site has the freely available full standard, and wonderful code charts. Chapters 7 through 14 of the standard have interesting background information on each writing system supported.
- Thanks to Poppy Linden for introducing me to Spidery Ha