In the Common.Text project (Text II, part 1), the text is stored internally as 64-bit character codes (ulong in C#). At first, using 64 bits to store a single character may seem foolish, as most Western languages would be perfectly happy with an 8-bit encoding. Since .NET uses 16-bit character codes to represent Unicode text, 16 bits (char) would seem to be the logical choice for an internationally oriented piece of software.
However, 16 bits are not enough to cover the full Unicode range without using surrogate pairs for the codes above U+FFFF. In UTF-16, these codes are represented using a high-surrogate code (in the range U+D800 to U+DBFF) and a low-surrogate code (in the range U+DC00 to U+DFFF); together, this pair provides 20 bits of useful information, which are used to map the codes from U+10000 to U+10FFFF. Having to deal with surrogate pairs is just a pain in the neck, and I wanted to get rid of them in order to have a one-to-one mapping between the cursor position and the character offset. So I chose to use at least 21 bits to store a character code (which currently covers the whole defined Unicode range).
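To make the surrogate arithmetic concrete, here is a minimal sketch (not the Common.Text code) that folds a UTF-16 string into one code per character. The helper names `CombineSurrogates` and `ToCodePoints` are made up for this illustration; .NET also offers the built-in `char.ConvertToUtf32` for the same conversion. It is written with C# 9 top-level statements to keep it short.

```csharp
using System;
using System.Collections.Generic;

// Combine a UTF-16 surrogate pair into a single code (U+10000..U+10FFFF).
int CombineSurrogates(char high, char low)
{
    if (high < 0xD800 || high > 0xDBFF) throw new ArgumentOutOfRangeException(nameof(high));
    if (low < 0xDC00 || low > 0xDFFF) throw new ArgumentOutOfRangeException(nameof(low));
    // Each surrogate carries 10 payload bits; the 20-bit result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Expand a UTF-16 string into one code per character, so that the
// array index is the character offset.
int[] ToCodePoints(string text)
{
    var result = new List<int>(text.Length);
    for (int i = 0; i < text.Length; i++)
    {
        bool isPair = char.IsHighSurrogate(text[i])
                      && i + 1 < text.Length
                      && char.IsLowSurrogate(text[i + 1]);
        if (isPair)
        {
            result.Add(CombineSurrogates(text[i], text[i + 1]));
            i++; // skip the low surrogate, it has been consumed
        }
        else
        {
            result.Add(text[i]);
        }
    }
    return result.ToArray();
}

string sample = "a\U0001D11Eb";      // 'a', U+1D11E (stored as D834 DD1E), 'b'
int[] codes = ToCodePoints(sample);
Console.WriteLine(sample.Length);    // 4 UTF-16 code units
Console.WriteLine(codes.Length);     // 3 characters
```

With the string, the cursor has to step over two `char` values to move past the middle character; with the array, every cursor position is simply an index.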
So why not use the next natural size (24 bits or 32 bits)? Simply because I also wanted to attach to every character its complete formatting information, plus some status flags which would allow me to decide very quickly whether or not a piece of text belongs to the active selection. I settled on the following internal representation:
As you can see, the formatting information is stored in three chunks, in the top 32 bits (7+7+18). Then come 5 bits used as markers (flags), 3 bits used to cache the results of a line break analysis, 3 bits to store Unicode-related flags and 21 bits for the full Unicode character code, for a total of 64 bits.
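The layout can be sketched with a few shifts and masks. The field widths below come from the description above; the constant names, the helpers `GetField` and `SetField`, and the assignment of the 18-bit chunk to the style index and the two 7-bit chunks to the local and extra settings indexes are assumptions made for this example, not the actual Common.Text definitions.

```csharp
using System;

// Field positions, counted from bit 0 (least significant bit).
// Widths follow the text; names and chunk order are assumptions.
const int CodeShift = 0, CodeBits = 21;                 // Unicode character code
const int UnicodeFlagsShift = 21, UnicodeFlagsBits = 3; // Unicode-related flags
const int LineBreakShift = 24, LineBreakBits = 3;       // cached line break analysis
const int MarkerShift = 27, MarkerBits = 5;             // markers (e.g. selection)
const int StyleShift = 32, StyleBits = 18;              // style index (assumed chunk)
const int LocalShift = 50, LocalBits = 7;               // local settings index (assumed chunk)
const int ExtraShift = 57, ExtraBits = 7;               // extra settings index (assumed chunk)

// Read one field out of a packed 64-bit character.
ulong GetField(ulong value, int shift, int bits)
{
    return (value >> shift) & ((1UL << bits) - 1);
}

// Return a copy of the packed character with one field replaced.
ulong SetField(ulong value, int shift, int bits, ulong field)
{
    ulong mask = ((1UL << bits) - 1) << shift;
    return (value & ~mask) | ((field << shift) & mask);
}

ulong packed = 0;
packed = SetField(packed, CodeShift, CodeBits, 0x1D11E);
packed = SetField(packed, UnicodeFlagsShift, UnicodeFlagsBits, 0);
packed = SetField(packed, LineBreakShift, LineBreakBits, 0);
packed = SetField(packed, MarkerShift, MarkerBits, 1);   // hypothetical "selected" marker
packed = SetField(packed, StyleShift, StyleBits, 42);
packed = SetField(packed, LocalShift, LocalBits, 3);
packed = SetField(packed, ExtraShift, ExtraBits, 0);

Console.WriteLine($"U+{GetField(packed, CodeShift, CodeBits):X}"); // U+1D11E
Console.WriteLine(GetField(packed, StyleShift, StyleBits));        // 42
Console.WriteLine(GetField(packed, MarkerShift, MarkerBits) != 0); // True
```

Testing whether a character belongs to the selection is then a single mask on the value itself, with no lookup into a separate structure.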
The formatting information has been split into three parts in order to speed up the layout analysis. The style index is used as an index into a table which stores font and font geometry information. The local settings index and the extra settings index are used as indexes into subtables, which store information such as a glyph replacement code, an image name, a link target, a language or a color.
With this setup, characters which share the same font face, font style and font size also share the same style index. Therefore, when analysing a piece of text with a given style index, the layout engine only needs to look up the formatting information once, even if the characters have different colors.
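A rough sketch of that idea follows: the text is scanned once, and the style table is consulted only when the style index changes. The position and width of the style index (bit 32, 18 bits), the `Pack` helper, the table contents and the `SplitIntoRuns` function are all assumptions for illustration; the real layout engine is certainly more involved.

```csharp
using System;
using System.Collections.Generic;

const int StyleShift = 32;                 // assumed position of the style index
const ulong StyleMask = (1UL << 18) - 1;   // assumed 18-bit width

// Build a packed character from a code and a style index (other fields left at 0).
ulong Pack(int code, int style)
{
    return ((ulong)style << StyleShift) | (uint)code;
}

// Split the text into runs of identical style index.
// The style table is consulted once per run, not once per character.
List<(string Font, int Length)> SplitIntoRuns(ulong[] text, Dictionary<int, string> table)
{
    var runs = new List<(string Font, int Length)>();
    int current = -1;
    foreach (ulong c in text)
    {
        int style = (int)((c >> StyleShift) & StyleMask);
        if (style != current)
        {
            runs.Add((table[style], 0));   // the only table lookup for this run
            current = style;
        }
        var last = runs[runs.Count - 1];
        runs[runs.Count - 1] = (last.Font, last.Length + 1);
    }
    return runs;
}

// Hypothetical style table: style index -> font description.
var styles = new Dictionary<int, string>
{
    [1] = "Arial 10pt",
    [2] = "Arial 10pt bold",
};

ulong[] line = { Pack('H', 1), Pack('i', 1), Pack('!', 2) };
foreach (var run in SplitIntoRuns(line, styles))
{
    Console.WriteLine($"{run.Font}: {run.Length} character(s)");
}
```

Because colors and similar attributes would live behind the other two indexes, two characters of different colors still compare equal on the style index and stay in the same run.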