UEFI News and Commentary

Sunday, December 06, 2009

UEFI HII (Part 7): Character Encoding

How do you know that the character 'A' corresponds to the character value 0x0041? Or that character value 0x215D is the character '⅝'? Well, standards such as ASCII or Unicode (or, going further back, EBCDIC) describe the mapping between a numeric value and a specific character.

But how do you convert those numeric values into actual bits and bytes? 7 bits, 8 bits, 16 bits, 32 bits? And if more than one byte, is it big-endian or little-endian? That conversion process is called encoding.
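As a rough illustration of the difference between a character value and its encoding, here is a small Python sketch. The helper name `encode_all` is made up for this example; the codec names are the ones that ship with Python.

```python
# One character value, several possible byte sequences.
# encode_all is a hypothetical helper written for this illustration.

CODECS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")


def encode_all(ch):
    """Return the bytes (as hex text) that each codec produces for ch."""
    return {codec: ch.encode(codec).hex() for codec in CODECS}


if __name__ == "__main__":
    for ch in ("A", "\u215d"):
        print("character value 0x%04X" % ord(ch))
        for codec, hex_bytes in encode_all(ch).items():
            print("  %-10s %s" % (codec, hex_bytes))
```

The character value never changes, but the number of bytes and their order depend on the encoding chosen.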

I have the Unicode 1.0 specification sitting on my shelf. At that time, there was some optimism that all the character values that anyone would ever need could be contained in 16 bits: 65,536 character values. But even then, there were some signs that people were inventing and had invented many more character glyphs than could be contained. For example, in some scripts, individual cities had their own glyphs. And what about the Mahjong tile characters (0x1F000-0x1F02B)? So, prior to the Unicode 2.0 specification, the predominant form of encoding (known as UCS-2) embodied this 16-bit assumption.

But prior to that point, many operating systems (such as Windows NT) and firmware specifications (such as EFI 1.10) had taken root. While most operating systems have since migrated to the preferred encoding standard (UTF-16), which can handle the full set of Unicode character values, the UEFI specification still retains UCS-2.

So what is the big difference between UCS-2 and UTF-16? They are both 16-bit encoding schemes. For all the character values we care about, they are identical. The real difference comes in how they handle character values of 0x10000 and above (that is, beyond the first 65,536 values).

For UCS-2, these characters don't exist. There is no proper way to encode character 0x10001. For UTF-16, character values of 0x10000 and above are encoded by combining two adjacent 16-bit values, known as a surrogate pair. No one was particularly happy about this solution, since it took everyone back into the multi-byte character encoding nightmare that had plagued so many previous standards attempts (Shift-JIS, Big5, Windows code pages, etc.). But the alternative was to require a 32-bit unsigned integer to represent each character, which would add a lot of bloat.

The advantage of this technique is that only a limited range of values can appear as the first half of a surrogate pair (0xD800-0xDBFF) and another limited range can appear as the second half (0xDC00-0xDFFF). To calculate a character value, you take the low 10 bits from each half, combine them into a 20-bit number, and add 0x10000 to the result. This gives you a range of possible character values between 0x10000 and 0x10FFFF. Separating the surrogate values into first-half and second-half ranges allows string processing functions to work no matter where in the string they start and in which direction they process it. For more information, you can read the FAQ at the Unicode site.
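The surrogate pair arithmetic can be sketched in a few lines of Python. The function names `to_surrogates` and `from_surrogates` are invented for this example; the ranges and the 0x10000 offset are the ones defined for UTF-16.

```python
# A minimal sketch of UTF-16 surrogate pair arithmetic.
# to_surrogates / from_surrogates are hypothetical names for illustration.


def to_surrogates(value):
    """Split a character value in 0x10000-0x10FFFF into (first, second)."""
    if not 0x10000 <= value <= 0x10FFFF:
        raise ValueError("value does not need a surrogate pair")
    offset = value - 0x10000              # 20 bits remain
    first = 0xD800 | (offset >> 10)       # top 10 bits
    second = 0xDC00 | (offset & 0x3FF)    # bottom 10 bits
    return first, second


def from_surrogates(first, second):
    """Combine a first-half and second-half value into a character value."""
    if not 0xD800 <= first <= 0xDBFF:
        raise ValueError("not a first-half surrogate")
    if not 0xDC00 <= second <= 0xDFFF:
        raise ValueError("not a second-half surrogate")
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)


if __name__ == "__main__":
    # The first Mahjong tile character, 0x1F000.
    first, second = to_surrogates(0x1F000)
    print("0x%04X 0x%04X" % (first, second))
    print("0x%X" % from_surrogates(first, second))
```

Because the two ranges do not overlap, a function that lands on any single 16-bit value can tell at once whether it is looking at the first half, the second half, or an ordinary character.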

But, for the purposes of the UEFI specification, those surrogate pair character values have no special meaning and would each be treated as a separate character. That is, 0xD800 0xDC00 would be a single character in UTF-16, but two characters under UEFI and UCS-2.
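To make the difference concrete, here is a small Python sketch that counts characters in the same sequence of 16-bit values under both interpretations. The function names are made up for this example, and the strings are plain lists of integers rather than any real UEFI data type.

```python
# Counting characters in a list of 16-bit values two ways.
# ucs2_length and utf16_length are hypothetical names for illustration.


def ucs2_length(units):
    """UCS-2 view: every 16-bit value is one character."""
    return len(units)


def utf16_length(units):
    """UTF-16 view: a valid surrogate pair counts as one character."""
    count = 0
    i = 0
    while i < len(units):
        is_first = 0xD800 <= units[i] <= 0xDBFF
        has_second = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_first and has_second) else 1
        count += 1
    return count


if __name__ == "__main__":
    text = [0x0041, 0xD800, 0xDC00]   # 'A' followed by one surrogate pair
    print(ucs2_length(text), utf16_length(text))
```

In this sketch an unpaired surrogate value is simply counted as one character in both views; a stricter UTF-16 decoder might reject it instead.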

The UEFI specification does use another range of characters with special meaning defined in the Unicode specification: the Private Use Area. The Private Use Area, a range of character values from 0xE000 to 0xF8FF, is left open for use by applications. In UEFI, these values are used for embedding font control information directly into strings. The values 0xF620-0xF62B control turning specific font styles (bold, italics, etc.) on and off. The values 0xF700-0xF7FF select a specific font. The values 0xF800-0xF8FF select a font of a specific cell height. These values will never occur in normal text.
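A string processing routine might sort 16-bit values into these categories along the following lines. This is only a sketch in Python: the ranges are copied from the paragraph above rather than checked against the specification, the Private Use Area bounds (0xE000-0xF8FF) are the ones Unicode defines, and the names `classify` and `strip_font_controls` are invented for the example.

```python
# Sorting 16-bit values into the categories described above.
# All names here are hypothetical; the ranges come from this post.

FONT_STYLE = range(0xF620, 0xF62C)    # 0xF620-0xF62B
FONT_SELECT = range(0xF700, 0xF800)   # 0xF700-0xF7FF
FONT_HEIGHT = range(0xF800, 0xF900)   # 0xF800-0xF8FF
PRIVATE_USE = range(0xE000, 0xF900)   # 0xE000-0xF8FF


def classify(unit):
    """Return a short label for one 16-bit character value."""
    if unit in FONT_STYLE:
        return "font style"
    if unit in FONT_SELECT:
        return "font select"
    if unit in FONT_HEIGHT:
        return "font height"
    if unit in PRIVATE_USE:
        return "private use (other)"
    return "text"


def strip_font_controls(units):
    """Keep only the values that are not in the Private Use Area."""
    return [u for u in units if classify(u) == "text"]


if __name__ == "__main__":
    sample = [0x0048, 0xF620, 0x0069, 0xF700, 0xF800]
    for unit in sample:
        print("0x%04X %s" % (unit, classify(unit)))
```

Testing the narrow ranges before the general Private Use Area range keeps the more specific label from being hidden by the broader one.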

For more information on these character values, as well as others given special treatment in UEFI, see section 28.2.6.2 of the UEFI 2.3 specification.

Next time we'll continue by looking at the characteristics of Proportional Fonts.