Text II, part 2

Sunday, October 30. 2005

In the Common.Text project (Text II, part 1), the text is stored internally as 64-bit character codes (ulong in C#). At first, using 64 bits to store a single character may seem foolish, as most Western languages would be perfectly happy with an 8-bit encoding. Since .NET uses 16-bit character codes to represent Unicode text, 16 bits (char) would have been the logical choice for an internationally oriented piece of software.

However, 16 bits are not enough to cover the full Unicode range without using surrogate pairs for the codes above U+FFFF. In UTF-16, these codes are represented using a high-surrogate code (in the range U+D800 to U+DBFF) followed by a low-surrogate code (in the range U+DC00 to U+DFFF); together, the pair provides 20 bits of useful information, which are used to map the codes from U+10000 to U+10FFFF. Having to deal with surrogate pairs is just a pain in the neck, and I wanted to get rid of them in order to have a one-to-one mapping between the cursor position and the character offset. So I chose to use at least 21 bits to store a character code (which currently covers the whole defined Unicode range).
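To make the surrogate arithmetic concrete, here is a minimal sketch of how a UTF-16 string could be expanded into one 64-bit element per code point. The class and method names (`CodePointDecoder`, `ToCodePoints`) are hypothetical and not taken from Common.Text; the framework's own `char.ConvertToUtf32` performs the same pair decoding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of Common.Text).
public static class CodePointDecoder
{
    // Converts a UTF-16 string into one 64-bit element per Unicode code point,
    // so that index i in the result is also the i-th character of the text.
    public static ulong[] ToCodePoints(string text)
    {
        List<ulong> codes = new List<ulong>(text.Length);

        for (int i = 0; i < text.Length; i++)
        {
            int code;

            if (char.IsHighSurrogate(text[i]) &&
                i + 1 < text.Length &&
                char.IsLowSurrogate(text[i + 1]))
            {
                // Each surrogate carries 10 bits; together they form the
                // 20-bit offset above U+10000.
                int high = text[i] - 0xD800;
                int low = text[i + 1] - 0xDC00;
                code = 0x10000 + (high << 10) + low;
                i++; // skip the low surrogate, it has been consumed
            }
            else
            {
                // BMP character (a lone surrogate is kept unchanged).
                code = text[i];
            }

            codes.Add((ulong)code);
        }

        return codes.ToArray();
    }
}
```

With this representation, a string such as "A", U+1F600, "B" occupies four `char` values in UTF-16 but only three elements in the resulting array, so a cursor position can be used directly as an array index.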

So why not use the next natural size (24 bits or 32 bits)? Simply because I also wanted to attach to every character its complete formatting information, plus some status flags which would allow me to decide very quickly whether or not a piece of text belongs to the active selection. I settled on the following internal representation:

[Figure: Meaning of the 64 bits used to represent a character]

As you can see, the formatting information is stored in three chunks, in the top 7+7+18 bits. Then come 5 bits used as markers (flags), 3 bits used to cache the results of a line break analysis, 3 bits to store Unicode-related flags, and 21 bits for the full Unicode character code, for a total of 64 bits.
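The layout described above can be sketched with a few shifts and masks. This is only an illustration: the exact bit positions are an assumption derived from the order in which the fields are listed (character code in the lowest bits, formatting in the highest), and the names (`PackedChar`, `GetField`, `SetField`, the `...Shift` constants) are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only (not part of Common.Text).
public static class PackedChar
{
    // Field offsets, counted from the least significant bit. The order follows
    // the textual description; the exact positions are an assumption.
    public const int CodeShift = 0;          // 21 bits: Unicode character code
    public const int UnicodeFlagsShift = 21; //  3 bits: Unicode-related flags
    public const int LineBreakShift = 24;    //  3 bits: cached line break analysis
    public const int MarkersShift = 27;      //  5 bits: markers (flags)
    public const int Format18Shift = 32;     // 18 bits: formatting chunk
    public const int Format7LowShift = 50;   //  7 bits: formatting chunk
    public const int Format7HighShift = 57;  //  7 bits: formatting chunk

    // Reads 'width' bits starting at bit 'shift'.
    public static ulong GetField(ulong value, int shift, int width)
    {
        ulong mask = (1UL << width) - 1;
        return (value >> shift) & mask;
    }

    // Returns a copy of 'value' with the given field replaced by 'field'.
    public static ulong SetField(ulong value, int shift, int width, ulong field)
    {
        ulong mask = ((1UL << width) - 1) << shift;
        return (value & ~mask) | ((field << shift) & mask);
    }

    // Extracts the 21-bit Unicode character code.
    public static int GetCode(ulong value)
    {
        return (int)GetField(value, CodeShift, 21);
    }

    // Tests one of the five marker bits (index 0..4).
    public static bool HasMarker(ulong value, int index)
    {
        return (value & (1UL << (MarkersShift + index))) != 0;
    }
}
```

If one of the marker bits were reserved for the selection, checking whether a character is selected would cost a single AND on the 64-bit value, with no lookup in a separate structure.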

The formatting information has been split into three parts in order to speed up the layout analysis. The style index is used as an index into a table which stores font and font geometry information. The local settings index and the extra settings index are used as indexes into subtables, which store information such as a glyph replacement code, an image name, a link target, a language or a color.

With this setup, characters which share the same font face, font style and font size also share the same style index. Therefore, when analysing a piece of text with a given style index, the layout engine only needs to look up the formatting information once, even if the characters have different colors.
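The following sketch shows how a layout pass might take advantage of this: it splits the text into runs of consecutive characters with the same style index, so that the font lookup happens once per run instead of once per character. Which of the three chunks holds the style index is not stated here, so the code assumes it is the 18-bit chunk in bits 32 to 49; the names (`StyleRuns`, `GetStyleIndex`, `FindRuns`) are hypothetical as well.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of Common.Text).
public static class StyleRuns
{
    // Assumption: the style index is the 18-bit chunk stored in bits 32..49.
    const int StyleShift = 32;
    const ulong StyleMask = (1UL << 18) - 1;

    public static int GetStyleIndex(ulong value)
    {
        return (int)((value >> StyleShift) & StyleMask);
    }

    // Splits the text into runs of consecutive characters which share the
    // same style index. Each entry is { start offset, length, style index }.
    public static List<int[]> FindRuns(ulong[] text)
    {
        List<int[]> runs = new List<int[]>();
        int start = 0;

        while (start < text.Length)
        {
            int style = GetStyleIndex(text[start]);
            int end = start + 1;

            while (end < text.Length && GetStyleIndex(text[end]) == style)
            {
                end++;
            }

            // A layout engine would look up the font information for 'style'
            // here, once, and then measure the characters in [start, end).
            runs.Add(new int[] { start, end - start, style });
            start = end;
        }

        return runs;
    }
}
```

Characters which differ only in their local or extra settings (a different color, for example) stay in the same run, because only the style index bits are compared.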

Posted by Pierre Arnaud at 11:46
