Special characters


XML is based on Unicode, which contains thousands of characters and symbols. A given XML document specifies its encoding, which is a mapping of characters to bytes. But not all encodings include all Unicode characters. Also, your keyboard may not let you directly enter all the characters in the encoding. Any character you cannot enter directly can be entered as an entity, which consists of an ampersand, followed by a name, followed by a semicolon.

There are two kinds of character entities:

numerical character entities: An entity name that consists of a # followed by the Unicode number for the character, such as `&#225;` for á. The number can be expressed as a decimal number such as `&#225;`, or a hexadecimal number (which is indicated by x as a prefix), such as `&#xE1;`.
named character entities: A readable name such as `&trade;` can be assigned to represent any Unicode character.

The following table shows examples of some characters expressed in both kinds of entities.

Table 19.2. Examples of character entities

Character	Decimal numerical entity	Hexadecimal numerical entity	Named entity
á	`&#225;`	`&#x00E1;`	`&aacute;`
ß	`&#223;`	`&#x00DF;`	`&szlig;`
©	`&#169;`	`&#x00A9;`	`&copy;`
¥	`&#165;`	`&#x00A5;`	`&yen;`
±	`&#177;`	`&#x00B1;`	`&plusmn;`
✓	`&#10003;`	`&#x2713;`	`&check;`

Note

Leading zeros can be omitted in numerical entities. So `&#xE1;` is the same as `&#x00E1;`.

The set of numerical entities is defined as part of the Unicode standard adopted by XML, but the names used for named entities are not standardized. There are several standard named character entity sets defined by ISO, however, and these are incorporated into the DocBook DTD. Among the collection of files in the DocBook XML DTD distribution, there is an ent subdirectory that contains a set of files that declare named entities. Each declaration looks like this:

<!ENTITY  plusmn  "&#x00B1;">    <!-- PLUS-MINUS SIGN -->

This declaration assigns the numerical character entity `&#x00B1;` to the plusmn entity name. Either `&plusmn;` or `&#x00B1;` can be used in your DocBook document to represent the plus-minus sign.

Note

If you use DocBook named character entities in your document, you must also make sure your document's DOCTYPE properly identifies the DocBook DTD, and the processor must be able to find and load the DTD. If not, then the entities will be considered unresolved and the processing will fail.
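
As a minimal sketch of a document that satisfies this requirement, the example below assumes DocBook XML 4.5 and a `book` root element; both are choices made for this illustration, so substitute the version and root element you actually use. The DOCTYPE names the DTD by its public and system identifiers, which lets the processor load the declaration of `plusmn`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumes DocBook XML 4.5; adjust the version and root element to your document. -->
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
    "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<book>
  <title>Entity example</title>
  <chapter>
    <title>Tolerances</title>
    <!-- Named entity: resolved through the entity declarations in the DTD -->
    <para>Named: 5 &plusmn; 1</para>
    <!-- Numerical entity: resolved by the parser without any DTD -->
    <para>Numerical: 5 &#x00B1; 1</para>
  </chapter>
</book>
```

Because the system identifier here is a URL, the processor typically needs either network access or an XML catalog that maps the identifier to a local copy of the DTD.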

Use this reference to look up DocBook entities that you need in your documents:

DocBook Character Entity Reference

If you are a user of the Emacs text editor, then you might want to check out Norm Walsh's Emacs extensions for DocBook, which include a selector for special characters. See http://nwalsh.com/emacs/xmlchars/.

A Unicode reference such as Unicode Code Charts can be used to look up numerical character entities. Most online references use PDF to display the characters because most browsers cannot display all of Unicode.

Special characters in output

When an XSLT processor reads an entity from the input stream, it tries to resolve the entity. If it is a numerical entity, it converts it to a Unicode character in memory. If it is a named entity, it checks the list of entity names that were loaded from the DTD. For the DocBook named character entities, these resolve to numerical entities. Then these numerical entities are also converted to Unicode characters in memory. All the characters in the document are handled as Unicode characters in memory during processing.

See the section “Output encoding” for a description of how special characters in memory are converted to output characters.

Missing characters

When a DocBook document is processed with an XSLT processor, you may find that some or all special characters are missing in the output. There are several possible causes for this problem.

If only one entity is not resolving, it may simply be entered wrong. If you misspell a named entity, it won't resolve. If you enter a numerical entity wrong, the number may resolve to a range in Unicode that does not have printable characters. Also, if you intend to enter a hexadecimal value, be sure to include the x prefix or it will be interpreted as a decimal value.
If you are using named character entities, the DocBook XML DTD must be available to the processor. That's because the named entities are defined in the DTD, and the processor won't know what the names mean unless it can load the DTD. Most XSLT processors do not do full validation, but they do load the entities defined in the DTD. If the DTD is not available, the processor may continue processing the document as a well-formed document, but it won't be able to resolve the named character entities.
If the output encoding does not include the character, the XSLT processor should convert it to a numerical character entity. Then the downstream viewer or processor must be able to handle such entities. For example, old browsers may not recognize the full range of numerical character entities in Unicode.
The output medium may not have a font loaded that can display a special character. For example, a PDF file that does not contain embedded fonts relies on the system to supply requested fonts. If the font in use does not have a given character, it may not show up in the display. The same can happen with HTML when the system viewing the file does not have a screen font installed for a given encoding.
If you are doing XSL-FO processing, then your FO processor must be able to switch to a symbol font for special characters that are not in the main body font. The DocBook FO stylesheet automatically uses a list of fonts wherever the body font is called for. The list is made up of the value of the body.font.family parameter and the symbol.font.family parameter, so the default list is serif,Symbol,ZapfDingbats, and it is stored in the internal body.fontset parameter. A similar list is created using the title.font.family parameter and stored in the internal title.fontset parameter. To make use of these lists, the DocBook FO stylesheet sets the root property font-selection-strategy="character-by-character", which means the processor searches the list of font names for each character. If a special character is not in the body font, then the Symbol font is searched, then the ZapfDingbats font. You can expand that list by adding more font names to the symbol.font.family parameter, assuming those extra fonts are configured into your processor.
If you are using FOP version 0.20.5 or earlier, then this scheme doesn't work because it doesn't support the font-selection-strategy property. See the section “Switching to Symbol font” for a workaround for this problem.
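
Expanding the symbol font list, as described in the XSL-FO item above, might look like the following customization layer. This is a sketch under assumptions: the import path is hypothetical, and `Arial Unicode MS` is only an illustrative font name that must be installed and configured in your FO processor before it has any effect.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Hypothetical path: point this at your copy of the DocBook FO stylesheet -->
  <xsl:import href="docbook-xsl/fo/docbook.xsl"/>
  <!-- Symbol,ZapfDingbats is the default described above;
       the third font name is an illustrative addition -->
  <xsl:param name="symbol.font.family">Symbol,ZapfDingbats,Arial Unicode MS</xsl:param>
</xsl:stylesheet>
```

With this in place, the body font list would become serif,Symbol,ZapfDingbats,Arial Unicode MS, so a character missing from the first three fonts is looked up in the fourth.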

HTML encoding

For a browser to know what encoding an HTML file is written in, it must be told. If a browser isn't told, then it may guess wrong and render many characters wrong. The encoding can be communicated to the browser in two ways for HTML files:

A META tag embedded in the HTML HEAD element of the file:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

An encoding instruction in the HTTP header that accompanies the HTML file. The HTTP header is not in the HTML, but is sent by the HTTP server before the file and is hidden from the viewer. It provides information to the browser about the file.
```
Content-type: text/html; charset=UTF-8
```

It is best if both of these methods are used, and of course they must both agree with the actual file encoding. Often you don't have control of the HTTP header, so using the META element is the only option.
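
How the META element gets into the generated HTML depends on the stylesheet. As a sketch that relies only on standard XSLT 1.0 behavior, an `xsl:output` element with `method="html"` asks the processor to serialize the result in the named encoding and, according to the XSLT 1.0 recommendation, to add a matching META element inside HEAD. The import path below is hypothetical, and the example assumes single-file HTML output; chunked output may need different settings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Hypothetical path: point this at your copy of the DocBook HTML stylesheet -->
  <xsl:import href="docbook-xsl/html/docbook.xsl"/>
  <!-- Serialize the result as UTF-8; with method="html" the processor
       should also emit a META element naming that encoding -->
  <xsl:output method="html" encoding="UTF-8" indent="no"/>
</xsl:stylesheet>
```

The encoding named here is the one the processor actually writes, so the META element and the file contents agree by construction. The HTTP header still has to be checked separately.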

For XHTML output, there is a third avenue to convey the encoding. Because the output is XML, it should have an XML declaration, and the declaration should also contain the encoding:

<?xml version="1.0" encoding="UTF-8"?>

Odd characters in HTML output

If you are seeing odd accented characters when you browse your HTML output, then you probably have an encoding problem. You are seeing special characters that are encoded one way being misinterpreted by the browser as a different encoding. For example, if a file is encoded as UTF-8 and the browser thinks it is ISO-8859-1, then a special character such as an em-dash will appear as “â€” in the browser. More commonly, a nonbreaking space character in UTF-8 will appear as a “Â ” when viewed as ISO-8859-1.

The previous section describes how to set the encoding in the HTML. But if the HTTP server delivering the HTML declares the wrong encoding, then the browser may use the HTTP header instead of the hints within the HTML. If the odd characters become normal when the HTML file is browsed as a local file, then it is likely an HTTP server issue. For example, an Apache server might have an AddDefaultCharset directive that sets the default encoding for all files to iso-8859-1. If you cannot fix the Apache server configuration, then you could try adding a .htaccess file to your HTML directory with your own AddDefaultCharset directive in it. See the Apache documentation for more details.

Switching to Symbol font

If you are generating PDF output and you find certain special characters are missing, then the problem might be that the body font does not contain the character you need. Many special characters such as math symbols are in the Symbol font, and are not in Times or Helvetica. Your FO processor may not be able to switch to the Symbol font for a single character.

The following is a customization that you can use to coerce a special character into the Symbol font.

<xsl:template match="symbol[@role = 'symbolfont']">
  <fo:inline font-family="Symbol">
    <xsl:call-template name="inline.charseq"/>
  </fo:inline>
</xsl:template>

With this added to your customization layer, you can mark up a special character such as `&#x2264;` (≤, less than or equal to) to be coerced into the Symbol font with <symbol role="symbolfont">&#x2264;</symbol>. The customization wraps a fo:inline element around the character to explicitly switch in and out of Symbol font for that character.
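
Because the template uses the `fo:` prefix, the customization layer that contains it must declare the XSL-FO namespace. A minimal sketch of a complete layer follows; the import path is hypothetical, and the match pattern assumes non-namespaced DocBook elements, as in the template above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                version="1.0">
  <!-- Hypothetical path: point this at your copy of the DocBook FO stylesheet -->
  <xsl:import href="docbook-xsl/fo/docbook.xsl"/>

  <!-- Force any symbol element marked with role="symbolfont" into the Symbol font -->
  <xsl:template match="symbol[@role = 'symbolfont']">
    <fo:inline font-family="Symbol">
      <xsl:call-template name="inline.charseq"/>
    </fo:inline>
  </xsl:template>
</xsl:stylesheet>
```

In the source document, the markup might then be used like this:

```xml
<para>The test passes when x <symbol role="symbolfont">&#x2264;</symbol> y.</para>
```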

