Errata and annotations for Unicode Explained

This document contains a few errata for my book Unicode Explained (O’Reilly, 2006; available from Amazon) as well as some notes on the typography and annotations on topics discussed there, such as clarifications or additional remarks on details or information discovered after the book was published.


Page 32, second line (in the table): Change “Cooper BlkIt BT” to “Script.” (Cooper Black fonts aren’t really cursive fonts. The Script font, available from Bitstream, is cursive.)

Page 139, the paragraph in the middle starting with UTF-8, 5th line: Replace “two to six octets” by “two to four octets”, i.e. with six (6) replaced by four (4). (Explanation: I had been thinking about the original design of UTF-8, which used up to six octets, before the Unicode encoding space was restricted so that four octets are enough.)
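As a quick check of the corrected statement, the following minimal sketch (not from the book; the sample characters and the helper name utf8_length are chosen here for illustration) counts the octets that UTF-8 uses for characters from different parts of the Unicode range, including the highest code point, U+10FFFF.

```python
# Count UTF-8 octets for sample characters (illustrative choice of samples).
def utf8_length(ch: str) -> int:
    """Return the number of octets in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

samples = ["A", "\u00e9", "\u20ac", "\U0001d11e", chr(0x10FFFF)]

for ch in samples:
    # Basic Latin takes one octet; everything else takes two to four.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} octet(s)")
```

No character needs more than four octets, since the encoding space ends at U+10FFFF.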

Page 172, the first item in the list under “Allocation Areas”: in the sublist, Asian Scripts are mentioned twice.

Page 229, second line from bottom: read “mappingials” as “mappings.”

Page 428, the first sentence under the heading “Superscripts and Subscripts”: the first occurrence of “1st” should have the letters “st” in superscript style (since this is what the sentence talks about).

Page 429, under “Roman numerals”, the second line mentions U+2612 as Roman numeral three, but the correct reference is U+2162. Similarly, in the next paragraph, read U+2610 as U+2160.
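The mix-up is easy to verify with the character names in the Unicode database. The sketch below (not from the book; describe is a helper name introduced here) uses Python’s standard unicodedata module to show what the four code numbers actually denote.

```python
import unicodedata

def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' for a code point, using the Unicode database."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"

# The correct references (2160, 2162) and the transposed ones (2610, 2612).
for cp in (0x2160, 0x2162, 0x2610, 0x2612):
    print(describe(cp))
```

The transposed numbers refer to ballot box symbols, not to Roman numerals.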

Page 432, the third bullet: At the end of the text of the bulleted point, replace the sentence “There are three different possible approaches:” by the following: “See notes on this in section ‘Space Characters,’ earlier in this chapter.” Specifically, the intent is to refer to previous discussion on page 416.

Page 430, sixth line: The following text “ shall have a space before the fraction slash(⁄):

even in an 40° angle, as in⁄.

Page 447, the short paragraph in the middle should have the word foo underlined as the text says: “For example, in the RTF format, underlined text like foo is written as {\ul foo}.”

Page 451, the CSS code sample: read div.greering as div.greeting (i.e., change the second “r” to “t”).

Pages 467, 468: The reference to UTR #21 shall be read as referring to UTR #20.

Page 478: In the second paragraph, the notation #,##90.00 shall have the digit 9 removed in order to match the content of Figure 9-5.

Page 485: At the start of the page, “Obsoletesubtype” and “simplerich” should be two words each: “Obsolete subtype” and “simple rich”.

Page 532, the 3rd item in the numbered list, 2nd line: Replace “form C (NFKC)” by “form KC (NFKC)”, i.e. add the K.
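The difference between the two forms matters because NFKC also applies compatibility mappings, whereas NFC does not. A minimal sketch (not from the book; the sample string is an arbitrary choice) using Python’s unicodedata module:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI followed by U+00B2 SUPERSCRIPT TWO.
text = "\ufb01\u00b2"

nfc = unicodedata.normalize("NFC", text)    # canonical composition only
nfkc = unicodedata.normalize("NFKC", text)  # compatibility mappings as well

print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfkc])
```

Here NFC leaves the string unchanged, while NFKC replaces the ligature and the superscript digit with the plain characters “fi2”.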

Page 563, the 7th line from bottom: The function name toChar should be toChars (with an s at the end).

Page 624, under the heading “Arrows,” in the rows describing Up down arrow (↕) and Up down arrow with base (↨), add the entries ↕ and ↨, respectively, into the “HTML” column.
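One way to write these two arrows in HTML is with numeric character references. The exact form used in the book’s table is not reproduced here, so the sketch below is only an assumption about a workable notation; numeric_reference is a helper name introduced for illustration.

```python
import html

def numeric_reference(ch: str) -> str:
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"

for ch in ("\u2195", "\u21a8"):  # up down arrow, up down arrow with base
    ref = numeric_reference(ch)
    # html.unescape turns the reference back into the character itself.
    print(f"U+{ord(ch):04X} {ref} round-trips: {html.unescape(ref) == ch}")
```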


Page 62, in the middle, the URL for the Code2000, Code2001, and Code2002 fonts has stopped working. The new URL is Note: These fonts do not look good without font smoothing. If they look pixelated, consider enabling smoothing (also known as anti-aliasing or grayscaling) in your system settings.

Page 160, third paragraph says that the ISO 10646 standard has not been put onto the Web. Now it has been made available, through the Publicly Available Standards page of ISO/IEC JTC1. There you can find ISO 10646 as a zipped PDF file (about 80 megabytes).

Page 170, the text after the table mentions that Unicode 5.0 is intended to add 1,365 characters. The number changed to 1,369 due to the addition of 4 Sindhi characters.

Page 365, the last statement in the last full paragraph mentions work in progress to create the successor of RFC 3066. The successor was issued as RFC 4646, Tags for Identifying Languages, in September 2006. It is accompanied by RFC 4647, Matching of Language Tags.
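To give an idea of what RFC 4647 covers, here is a minimal sketch of its simplest scheme, basic filtering: a language range matches a tag if the two are equal, ignoring case, or if the range is a prefix of the tag followed by a hyphen, and the range “*” matches any tag. The function name and the sample tags are hypothetical, and the RFC’s extended filtering and lookup schemes are not covered.

```python
from typing import List

def basic_filter(language_range: str, tags: List[str]) -> List[str]:
    """Return the tags matched by a basic language range (case-insensitive)."""
    rng = language_range.lower()
    if rng == "*":
        return list(tags)  # the wildcard range matches every tag
    matched = []
    for tag in tags:
        low = tag.lower()
        # Match on equality, or on a prefix that ends at a subtag boundary.
        if low == rng or low.startswith(rng + "-"):
            matched.append(tag)
    return matched

print(basic_filter("de", ["de", "de-CH", "den", "en-DE", "DE-at"]))
```

Note that “den” and “en-DE” are not matched by the range “de”: the prefix must end exactly at a hyphen.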


General note on the Appendix (Tables for writing characters): due to typesetting problems, some characters do not appear as distinguishable enough. For example, the different quotation marks on p. 622 are rather difficult to distinguish from each other, and on p. 627, under the heading “Spaces,” the symbol for space (consisting of small-size S and P in a special setting) used in the “Word” column is barely noticeable.

In the text of the book in general, there is some usage of fonts that might be regarded as inconsistent. For example, if you look closely at the IPA text [ɑ̃bʀ] on the last line of page 355, you’ll see that the letter b is of different (and smaller) design than the other characters. The typesetting of the book is not perfect, since it was rather difficult to make various Unicode characters in the text appear at all. A mixture of fonts was used (see the Colophon on page 659), causing typographic misfits at times. (In the particular case described, I should have used the same font for “b” as for the specifically IPA characters.)

Mixing fonts has also caused some uneven line spacing. For example, on page 42, the second bulleted point contains the Cyrillic letter yu, ю, and later the same letter with an acute accent. If you look closely, you’ll notice that the first three lines are evenly spaced vertically, but for the rest, there is slightly more vertical spacing. The reason is that the Cyrillic letters are from another font, for which the typesetting system used a larger line height. Due to the somewhat ad hoc approach used in typesetting, this problem was not detected early enough. By the way, similar problems easily arise in Microsoft Word, too, when you mix fonts in text. The general idea in fixing such problems is to set the line height of a text paragraph to a fixed value.

In Word, you could select the paragraph, give the command Format/Paragraph, select “Exactly” from the menu for line height options, and then select a suitable value in points. Now the problem vanishes: line height is even in the paragraph. Note that after selecting “Exactly,” Word seems to default the line height to the size of the font, which is of course too small. You may need to experiment a little to find out what line height Word has used for the font, unless you know it or can guess it. For example, for 12pt Times, Word seems to use a line height of 14pt. You can also avoid the problem beforehand by setting the style of the paragraph so that line height is set exactly to a suitable value.


Regarding Sending Unicode Email (p. 53–54), the text of the book basically discusses the use of normal email programs. In webmail systems, which have become rather important, the situation is worse. Most webmail systems seem to use a fixed non-Unicode encoding, and they display Unicode encoded messages wrongly even if they have been sent properly. The reason is that these systems read the message according to a fixed encoding instead of checking the message headers for the encoding. In some situations, you can read e.g. a UTF-8 encoded email message by setting the encoding used by the browser to UTF-8. In some situations, you can get Unicode characters through if you use an attachment or use the HTML format.
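The difference between checking the headers and assuming a fixed encoding can be sketched with Python’s standard email package. The message below is made up for illustration, and ISO-8859-1 stands in for whatever fixed encoding a webmail system might assume.

```python
from email import message_from_bytes

# A made-up message that correctly declares UTF-8 in its headers.
raw = (
    b"From: sender@example.org\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Gr\u00fc\u00dfe".encode("utf-8")
)

msg = message_from_bytes(raw)
declared = msg.get_content_charset()    # charset parameter from the header
payload = msg.get_payload(decode=True)  # the body as raw bytes

right = payload.decode(declared)        # honours the declared encoding
wrong = payload.decode("iso-8859-1")    # assumes a fixed encoding instead

print(declared, right, repr(wrong))
```

With the declared encoding the text comes out as “Grüße”; with the fixed encoding each non-ASCII letter turns into two wrong characters.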

Moreover, there are some issues in Unicode email even in normal email programs.

Regarding Methods Using the Alt key on Windows on p. 84–88, note that this approach, though of great practical importance to most computer users, does not conform to any standard. There is an international standard, ISO 14755, that specifies a method for typing a character by its code number in a rather similar way, though using a different specific technique. In this method, you would hold the Ctrl and Shift keys pressed down when typing the code number, so that e.g. Ctrl-Shift-a9 would produce U+00A9, i.e. the copyright sign (©). This method is available in some Linux systems. For a description, consult the text of the standard, officially ISO/IEC 14755 (in PDF format; identical in content to the approved standard).
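The mapping from the typed code number to the character is just a hexadecimal conversion. As a minimal sketch (the function name char_from_hex is introduced here; it models only the conversion, not the keyboard handling defined in the standard):

```python
def char_from_hex(code: str) -> str:
    """Interpret typed hexadecimal digits as a Unicode code number."""
    return chr(int(code, 16))

typed = "a9"
ch = char_from_hex(typed)
print(f"{typed} -> U+{ord(ch):04X} {ch}")
```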

As a general note on indicating key combinations, the notations used in the book try to be practical and to conform to the publisher’s general style. For a really universal and unambiguous notation for them, I think we would need something markup-like, such as using Ctrl(x) to indicate typing x when the Ctrl key is held down. This would result in somewhat long notations like Ctrl(Shift(a9)) and would deviate from all the varying notations currently used for such things. We should perhaps also indicate somehow, e.g. by underlining, which characters are to be typed using the numeric keypad or its equivalent. Yet, I think that e.g. Alt(0169) would be more understandable and easier to interpret correctly, once the idea of the notation has been explained. In particular, the method that is now referred to as the Alt-+n method (p. 87–88) would more understandably be called the Alt(+n) method.

Regarding DOS Code Pages (p. 128–129), note that on Windows XP and Windows 2000, you can find out (or change) the system’s current DOS code page setting by entering the command prompt and using the command chcp there.
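Why the current code page matters can be seen by decoding the same octet under two DOS code pages. The sketch below (not from the book) uses Python’s built-in cp437 and cp850 codecs; the octet value 9B hexadecimal and the helper name are chosen only for illustration.

```python
def decode_in_code_page(value: int, code_page: str) -> str:
    """Decode a single octet according to the named code page."""
    return bytes([value]).decode(code_page)

# The same octet means different characters in different DOS code pages.
for code_page in ("cp437", "cp850"):
    print(code_page, decode_in_code_page(0x9B, code_page))
```

In code page 437 the octet is the cent sign (¢), whereas in code page 850 it is the letter ø.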

Regarding Encodings for Chinese (p. 141–142) and Chinese characters in general, see the extensive web site

For generalities, see also my blog entry The Paradox of Unicode Adoption: Unicode works in casual memos but not in books.

I’m grateful to Joe Clark and Asmus Freytag for pointing out some of the errors documented here.
Last update of this page: November 25, 2011.
Jukka K. Korpela