Regardless of the encoding of your source documents, an XSLT engine can write its output in a different encoding if you need it. When a document is loaded into memory, XML applications such as XSLT engines convert it to Unicode. The XSLT engine then uses the stylesheet templates to build a transformed version of the content in memory structures. When it is done, it serializes that internal content into a stream of bytes that it feeds to the outside world. During serialization, it can convert the internal Unicode to some other encoding for the output.
An XSL stylesheet usually sets the output encoding in an xsl:output element at the top of the stylesheet file. Here is that element for the html/docbook.xsl stylesheet:
<xsl:output method="html" encoding="ISO-8859-1" indent="no"/>
The encoding="ISO-8859-1" attribute means all documents processed with that stylesheet are to be output with the ISO-8859-1 encoding. If a stylesheet's xsl:output element does not have an encoding attribute, then the default output encoding is UTF-8. That is what the fo/docbook.xsl stylesheet for print output does.
When the output method="html", the XSLT processor also adds an HTML META tag that identifies the HTML file's encoding:
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
When a browser opens the HTML file, it reads this tag and knows that the bytes it finds in the file map to the ISO-8859-1 character set for display.
What if the document contains characters that are not available in the specified output encoding? As with input, the characters are expressed as numeric character entities such as &#8482;. It is up to the browser to figure out how to display such characters. Most browsers cover a wide range of character entities, but there are so many that sometimes a browser has no way to display a given character.
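To see this behavior for yourself, a small test stylesheet along the following lines can be run against any XML file. This is only a sketch: the literal text and the choice of the trademark sign are illustrative assumptions, not part of the DocBook stylesheets. The exact form of the reference written to the output (numeric, or a named entity with the html method) depends on the processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- ISO-8859-1 has no code for the trademark sign (U+2122) -->
  <xsl:output method="html" encoding="ISO-8859-1" indent="no"/>

  <!-- Ignore the input content and emit a single test page -->
  <xsl:template match="/">
    <html>
      <head><title>Encoding test</title></head>
      <body>
        <!-- U+2122 is outside ISO-8859-1, so the serializer has to
             write it as a character reference. U+00E9 (e with acute)
             is inside ISO-8859-1, so it can be written natively. -->
        <p>DocBook&#8482; and caf&#233;</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Changing the encoding attribute to UTF-8 and rerunning the transformation should show the trademark sign written as encoded bytes instead of a reference, since UTF-8 can represent every Unicode character.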
Most modern graphical browsers can display HTML files encoded with UTF-8, which covers a much wider set of characters than ISO-8859-1. To change the output encoding for the non-chunking docbook.xsl stylesheet, you have to use a stylesheet customization layer. That is because the XSLT specification does not permit the encoding attribute of xsl:output to be a variable or parameter value. Your stylesheet customization must provide a new xsl:output element such as the following:
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:import href="/path/to/html/docbook.xsl"/>
  <xsl:output method="html"
              encoding="UTF-8"
              indent="no"/>
</xsl:stylesheet>
This is a complete stylesheet customization that you can save in a file such as docbook-utf8.xsl and use in place of the stock html/docbook.xsl stylesheet. All it does is import the stock stylesheet and set a new output encoding, in this instance to UTF-8. Any HTML files generated with this stylesheet will have their characters encoded as UTF-8, and each file will include a meta tag like this:
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
Changing the output encoding of the chunking stylesheet is much easier. It can be done with the chunker.output.encoding parameter, either on the command line or in a customization layer. That's because the chunking stylesheet uses EXSLT extensions to generate HTML files. See the section “Output encoding for chunk HTML” for more information.
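As a minimal sketch of the customization-layer route, the following stylesheet imports the chunking stylesheet and sets the parameter. It assumes the chunking stylesheet is the html/chunk.xsl file of the distribution, and the import path is a placeholder to adjust for your installation.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Placeholder path: point this at your installed chunking stylesheet -->
  <xsl:import href="/path/to/html/chunk.xsl"/>

  <!-- Encoding used when each chunk file is written -->
  <xsl:param name="chunker.output.encoding">UTF-8</xsl:param>

</xsl:stylesheet>
```

No xsl:output element is needed here, because the chunk files are written by the extension mechanism and not by the processor's main output. On the command line, the same value would be passed as a string parameter, for example with the --stringparam option of xsltproc.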
If you are using the Saxon processor with the chunking stylesheet for non-English HTML output, then you may want to set the stylesheet parameter saxon.character.representation to a value of 'native;decimal'. By default, this parameter (which is defined in html/chunker.xsl) is set to 'entity;decimal'. The value entity before the semicolon means that any non-ASCII characters within the encoding are converted to named entities such as &aacute; instead of the character code for that encoding. For example, when using the iso-8859-1 output encoding, this means one native character is replaced by the 8 ASCII characters that form the named entity, which can make files with many such characters considerably larger. When entity is replaced with native, the single character code of the encoding is output.
The value after the semicolon controls how characters that are outside the encoding are output by Saxon. They must be converted to some kind of entity, and the value can be entity (named entity such as &aacute; if one exists), decimal (decimal numeric entity such as &#225;), or hex (hexadecimal numeric entity such as &#xE1;). Saxon outputs named entities only for characters in ISO-8859-1, not for all DocBook named character entities.
If you are using the chunking stylesheet, then you can use this parameter to set the Saxon output character representation. If you are using the non-chunking stylesheet, then your customization of xsl:output as described above needs to be enhanced as follows:
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:saxon="http://icl.com/saxon"
                extension-element-prefixes="saxon">
  <xsl:import href="file:///c:/docbook/xsl/html/docbook.xsl"/>
  <xsl:output method="html"
              encoding="UTF-8"
              indent="no"
              saxon:character-representation="native;decimal"/>
</xsl:stylesheet>
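For comparison, the chunking case might be sketched as the customization layer below. It sets the parameter directly and needs no Saxon namespace declaration. The import path is a placeholder, html/chunk.xsl is assumed to be the chunking stylesheet, and ISO-8859-1 is chosen only so that both halves of the value have a visible effect.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Placeholder path: point this at your installed chunking stylesheet -->
  <xsl:import href="/path/to/html/chunk.xsl"/>

  <!-- Encoding used when each chunk file is written -->
  <xsl:param name="chunker.output.encoding">ISO-8859-1</xsl:param>

  <!-- native: characters inside the encoding are written as single codes;
       decimal: characters outside it become decimal numeric entities -->
  <xsl:param name="saxon.character.representation">native;decimal</xsl:param>

</xsl:stylesheet>
```

The second parameter only matters when the processor is Saxon.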
DocBook XSL: The Complete Guide - 3rd Edition | Copyright © 2002-2005 Sagehill Enterprises