In the Common.Text project (Text II, part 1), the text is stored internally as 64-bit character codes (ulong in C#). At first, using 64 bits to store a single character may seem foolish, as most Western languages would be perfectly happy with an 8-bit encoding. Since .NET uses 16-bit character codes to represent Unicode text, 16-bit (char) should have been the logical choice for an internationally oriented piece of software.
However, 16 bits are not enough to cover the full Unicode range without using surrogate pairs for the codes above U+FFFF. In UTF-16, these codes are represented using a high-surrogate code (in the range U+D800 to U+DBFF) and a low-surrogate code (in the range U+DC00 to U+DFFF); together, this pair provides 20 bits of useful information, which is used to map the codes from U+10000 to U+10FFFF. Having to deal with surrogate pairs is just a pain in the neck, and I wanted to get rid of them in order to have a one-to-one mapping between the cursor position and the character offset. So I chose to use at least 21 bits to store a character code (which currently covers the whole defined Unicode range).
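To make the surrogate arithmetic concrete, here is a minimal sketch of how a UTF-16 string could be expanded into one 64-bit code per character. The class and method names (SurrogateDemo, Combine, ToCodes) are made up for this illustration and are not taken from Common.Text; only the char.IsHighSurrogate and char.IsLowSurrogate calls come from .NET.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not the Common.Text code).
public static class SurrogateDemo
{
    // Combines a UTF-16 surrogate pair into a single Unicode code point.
    public static uint Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
        {
            throw new ArgumentException("Not a valid surrogate pair.");
        }

        // Each surrogate carries 10 payload bits; the resulting 20-bit
        // value is offset by 0x10000, the first code above U+FFFF.
        return 0x10000u + (((uint)high - 0xD800u) << 10) + ((uint)low - 0xDC00u);
    }

    // Expands a UTF-16 string into one 64-bit code per character, so that
    // an array index is also a character offset.
    public static ulong[] ToCodes(string text)
    {
        var codes = new List<ulong>();

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                codes.Add(Combine(c, text[i + 1]));
                i++; // skip the low surrogate, it has been consumed
            }
            else
            {
                codes.Add(c); // plain 16-bit code (unpaired surrogates are kept as-is)
            }
        }

        return codes.ToArray();
    }
}
```

With this sketch, the string "a\uD834\uDD1Eb" (four UTF-16 code units) becomes three codes, the middle one being 0x1D11E. In production code, the built-in char.ConvertToUtf32 does the same pair-to-code-point conversion.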
So why not use the next natural size (24 bits or 32 bits)? Simply because I also wanted to attach to every character its complete formatting information, plus some status flags which would allow me to decide very quickly whether a piece of text belongs to the active selection or not. I settled on the following internal representation:
The formatting information is stored in three chunks, in the top 7+7+18 bits. Then come 5 bits used as markers (flags), 3 bits used to cache the results of a line break analysis, 3 bits to store Unicode related flags and 21 bits for the full Unicode character code, for a total of 64 bits.
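The layout described above can be sketched with a few shift and mask helpers. The field widths come from the text; everything else is an assumption made for this example: the names, which of the three formatting chunks gets the 18 bits (here, the style index), their relative order, and the use of the lowest marker bit as a "selected" flag.

```csharp
using System;

// Hypothetical sketch of the 64-bit layout; only the bit widths come from
// the article, the names and the order of the formatting chunks are assumed.
public static class PackedChar
{
    public const int CodeShift = 0, CodeBits = 21;                 // Unicode code
    public const int UnicodeFlagsShift = 21, UnicodeFlagsBits = 3; // Unicode related flags
    public const int LineBreakShift = 24, LineBreakBits = 3;       // cached line break analysis
    public const int MarkersShift = 27, MarkersBits = 5;           // markers (flags)
    public const int StyleShift = 32, StyleBits = 18;              // assumed position and width
    public const int LocalShift = 50, LocalBits = 7;               // assumed position and width
    public const int ExtraShift = 57, ExtraBits = 7;               // assumed position and width

    // Assumption: the lowest marker bit means "part of the active selection".
    public const ulong SelectedMarker = 1UL << MarkersShift;

    private static ulong Mask(int bits) => (1UL << bits) - 1;

    // Extracts the field of the given width at the given bit offset.
    public static ulong Get(ulong value, int shift, int bits)
        => (value >> shift) & Mask(bits);

    // Returns a copy of value with the field replaced; extra bits in field are dropped.
    public static ulong Set(ulong value, int shift, int bits, ulong field)
        => (value & ~(Mask(bits) << shift)) | ((field & Mask(bits)) << shift);

    public static uint Code(ulong value) => (uint)Get(value, CodeShift, CodeBits);

    // A single AND is enough to test the selection state of a character.
    public static bool IsSelected(ulong value) => (value & SelectedMarker) != 0;
}
```

For instance, PackedChar.Set(0, PackedChar.StyleShift, PackedChar.StyleBits, 1234) yields a code whose style index reads back as 1234 while the character code and the flags stay at zero.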
The formatting information has been split in three in order to speed up the layout analysis. The style index is used as an index into a table which stores font and font geometry related information. The local settings index and the extra settings index are used as indexes into subtables, which store information such as a glyph replacement code, an image name, a link target, a language or a color.
With this setup, characters which share the same font face, font style and font size also share the same style index. Therefore, when analysing a piece of text with a given style index, the layout engine only needs to look up the formatting information once, even if the characters have different colors.
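As a rough illustration of this point, the following sketch splits a buffer of 64-bit codes into runs of consecutive characters sharing the same style index, so that a layout engine would need only one font lookup per run. The StyleRuns class is hypothetical, and the position of the style index (18 bits starting at bit 32) is an assumption, not something the representation above guarantees.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the style index position (bits 32..49) is assumed.
public static class StyleRuns
{
    private const int StyleShift = 32;
    private const ulong StyleMask = (1UL << 18) - 1;

    public static int StyleIndex(ulong code) => (int)((code >> StyleShift) & StyleMask);

    // Returns one (Start, Length, Style) entry per run of consecutive
    // characters that share the same style index.
    public static List<(int Start, int Length, int Style)> Split(ulong[] text)
    {
        var runs = new List<(int Start, int Length, int Style)>();
        int start = 0;

        while (start < text.Length)
        {
            int style = StyleIndex(text[start]);
            int end = start + 1;

            // Extend the run while the style index stays the same; bits
            // outside the style index (colors, flags...) are ignored here.
            while (end < text.Length && StyleIndex(text[end]) == style)
            {
                end++;
            }

            runs.Add((start, end - start, style));
            start = end;
        }

        return runs;
    }
}
```

Characters which differ only in their local or extra settings index (a different color, for example) end up in the same run, which is where the saving comes from.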