XML Is Picky (Perl for System Administration)


C.2. XML Is Picky

Despite XML's flexibility, it is pickier in places than HTML. There are syntax and grammar rules that your data must follow. These rules are set down rather tersely in the XML specification found at http://www.w3.org/TR/1998/REC-xml-19980210. Rather than poring through the official spec, I recommend you seek out one of the annotated versions, like Tim Bray's version at http://www.xml.com, or Robert Ducharme's book XML: The Annotated Specification (Prentice Hall). The former is online and free; the latter has many good examples of actual XML code.

Here are two of the XML rules that tend to trip up people who know HTML:

If you begin something, you must end it. In the above example we started a machine listing with <machine> and finished it with </machine>. Leaving off the ending tag would not have been acceptable XML.

In HTML, tags like <img src="picture.jpg" > are allowed to stand by themselves. Not so in XML; this would have to be written either as:

<img src="picture.jpg" > </img>

or:

<img src="picture.jpg" />

The extra slash at the end of this last tag lets the XML parser know that this single tag serves as both its own start and end tag. A piece of data together with its surrounding start and end tags is called an element.

Start tags and end tags must mirror each other exactly. Mixing case is not allowed. If your start tag is <MaChINe>, your end tag must be </MaChINe>, and cannot be </MACHine> or any other case combination. HTML is much more forgiving in this regard.
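Both rules are easy to see in action by handing a few snippets to any XML parser. Here is a minimal sketch in Python, using only the xml.etree.ElementTree module from its standard library; the is_well_formed helper and the sample strings are illustrative names of ours, not part of any library:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Rule 1: every start tag needs an end tag (or the empty-element form).
missing_end = '<machine><name>example</name>'
explicit_end = '<img src="picture.jpg" > </img>'
self_closed = '<img src="picture.jpg" />'

# Rule 2: start and end tags must match exactly, including case.
same_case = '<MaChINe>data</MaChINe>'
mixed_case = '<MaChINe>data</MACHine>'

for sample in (missing_end, explicit_end, self_closed, same_case, mixed_case):
    print(is_well_formed(sample), sample)
```

The parser accepts the two <img> forms and the consistently cased pair, and rejects the unclosed <machine> and the mixed-case end tag.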

These are two of the general rules in the XML specification. But sometimes you want to define your own rules for an XML parser to enforce. By "enforce" we mean "complain vociferously" or "stop parsing" while reading the XML data. If we use our previous machine database XML snippet as an example, one additional rule we might want to enforce is "all <machine> entries must contain a <name> and an <ipaddress> element." You may also wish to restrict the contents of an element to a set of specific values like "YES" or "NO."
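To make the idea of a custom rule concrete before we look at how to declare one, here is a rough sketch that checks both kinds of rule by hand in Python (standard library only). It is a stand-in for what a validating parser would do for you, not a validator itself. The <backedup> element, the host names, and the check_machine function are hypothetical, invented for this illustration:

```python
import xml.etree.ElementTree as ET

REQUIRED = ("name", "ipaddress")   # elements every <machine> must contain
ALLOWED = {"YES", "NO"}            # the only values the flag element may hold
FLAG_ELEMENT = "backedup"          # hypothetical element, not in our database


def check_machine(machine):
    """Return a list of complaints about one <machine> element."""
    problems = []
    for tag in REQUIRED:
        if machine.find(tag) is None:
            problems.append("missing <%s>" % tag)
    flag = machine.find(FLAG_ELEMENT)
    if flag is not None and (flag.text or "").strip() not in ALLOWED:
        problems.append("<%s> must be YES or NO" % FLAG_ELEMENT)
    return problems


good = ET.fromstring(
    "<machine><name>host1</name><ipaddress>192.0.2.10</ipaddress>"
    "<backedup>YES</backedup></machine>")
bad = ET.fromstring(
    "<machine><name>host2</name><backedup>MAYBE</backedup></machine>")

print(check_machine(good))
print(check_machine(bad))
```

Writing checks like this for every rule gets tedious quickly, which is the motivation for a declarative definition language.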

How these rules get defined is less straightforward than the other material we'll cover because there are several complementary and competing proposals for a definition "language" afloat at the moment. XML will eventually be self-defining (i.e., the document itself or something linked into the document describes its structure).

The current XML specification uses a DTD (Document Type Definition), the SGML standby. Here's an example piece of XML code from the XML specification that has its definition code at the beginning of the document itself:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>

The first line of this example specifies the version of XML in use and the character encoding (UTF-8, an encoding of Unicode) for the document. The next three lines define the types of data in this document. This is followed by the actual document content (the <greeting> element) in the final line of the example.
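A parser sees those three parts separately. As a sketch, assuming Python's standard xml.parsers.expat module (a non-validating parser that still reports the declarations it reads), we can record what arrives from each part. The handler function names and the seen dictionary are our own:

```python
import xml.parsers.expat

DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>"""

seen = {"doctype": None, "declared": [], "text": []}


def start_doctype(name, system_id, public_id, has_internal_subset):
    # Called when the parser reaches <!DOCTYPE ...>.
    seen["doctype"] = name


def element_decl(name, model):
    # Called once for each <!ELEMENT ...> declaration.
    seen["declared"].append(name)


def character_data(data):
    # Called for text inside elements; may arrive in several chunks.
    seen["text"].append(data)


parser = xml.parsers.expat.ParserCreate()
parser.StartDoctypeDeclHandler = start_doctype
parser.ElementDeclHandler = element_decl
parser.CharacterDataHandler = character_data
parser.Parse(DOCUMENT, True)

print(seen["doctype"], seen["declared"], "".join(seen["text"]))
```

Because expat does not validate, it reports the declarations but will not complain if the content fails to match them.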

If we wanted to define how the <machine> XML code at the beginning of this appendix should be validated, we could place something like this at the beginning of the file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE machines [
  <!ELEMENT machine (name,department,room,owner,ipaddress)>
  <!ELEMENT name       (#PCDATA)>
  <!ELEMENT department (#PCDATA)>
  <!ELEMENT room       (#PCDATA)>
  <!ELEMENT owner      (#PCDATA)>
  <!ELEMENT ipaddress  (#PCDATA)>
]>

This definition requires that a machine element consist of name, department, room, owner, and ipaddress elements (in this specific order). Each of those elements is described as being PCDATA (see Section C.4, "Leftovers," at the end of this appendix).
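Python's standard library has no validating parser, so the following is only a hand-written approximation of the check a validating parser would derive from the machine declaration: exactly these five children, in exactly this order. The matches_content_model and build helpers and the placeholder text are ours, invented for the illustration:

```python
import xml.etree.ElementTree as ET

# Child order required by the <!ELEMENT machine (...)> declaration above.
EXPECTED = ["name", "department", "room", "owner", "ipaddress"]


def matches_content_model(machine):
    """True if machine has exactly the expected children, in order."""
    return [child.tag for child in machine] == EXPECTED


def build(order):
    """Build a <machine> whose children appear in the given order."""
    machine = ET.Element("machine")
    for tag in order:
        ET.SubElement(machine, tag).text = "placeholder"
    return machine


in_order = build(EXPECTED)
swapped = build(["department", "name", "room", "owner", "ipaddress"])
short = build(["name", "ipaddress"])

print(matches_content_model(in_order))
print(matches_content_model(swapped))
print(matches_content_model(short))
```

Only the first element passes; the second has the right children in the wrong order, and the third is missing three of them.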

Another popular set of proposals, none of them yet a specification, recommends using data descriptions called schemas for DTD-like purposes. Schemas are themselves written in XML code. Here's an example of schema code that uses the Microsoft implementation of the XML-data proposal found at http://www.w3.org/TR/1998/NOTE-XML-data/:

<?xml version='1.0' ?>
<schema id='MachineSchema' 
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">

<!-- define our element types (they are all just strings/PCDATA) -->
    <elementType id="name">
        <string/>
    </elementType>
    <elementType id="department">
        <string/>
    </elementType>
    <elementType id="room">
        <string/>
    </elementType>
    <elementType id="owner">
        <string/>
    </elementType>
    <elementType id="ipaddress">
        <string/>
    </elementType>

    <!-- now define our actual machine element -->
    <elementType id="Machine" content="CLOSED">
       <element type="#name"       occurs="REQUIRED"/>
       <element type="#department" occurs="REQUIRED"/>
       <element type="#room"       occurs="REQUIRED"/>
       <element type="#owner"      occurs="REQUIRED"/>
       <element type="#ipaddress"  occurs="REQUIRED"/>
    </elementType>
</schema>
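Because a schema is itself XML, an ordinary XML parser can read it with no special tooling. The sketch below, in Python with only the standard library, loads an abbreviated copy of the schema above (two element types instead of five, and without the unused dt namespace) and lists which children each element type requires. It only reads the schema; it does not validate anything against it, and the variable names are our own:

```python
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced tag as {namespace-uri}localname.
NS = "{urn:schemas-microsoft-com:xml-data}"

SCHEMA = """<schema id='MachineSchema'
        xmlns="urn:schemas-microsoft-com:xml-data">
    <elementType id="name"><string/></elementType>
    <elementType id="ipaddress"><string/></elementType>
    <elementType id="Machine" content="CLOSED">
       <element type="#name"      occurs="REQUIRED"/>
       <element type="#ipaddress" occurs="REQUIRED"/>
    </elementType>
</schema>"""

root = ET.fromstring(SCHEMA)

# Map each element type's id to the child types it marks as REQUIRED.
required = {}
for element_type in root.findall(NS + "elementType"):
    children = [
        ref.get("type").lstrip("#")
        for ref in element_type.findall(NS + "element")
        if ref.get("occurs") == "REQUIRED"
    ]
    if children:
        required[element_type.get("id")] = children

print(required)
```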

XML schema technology is (as of this writing) still very much in the discussion phase in the standards process. XML-data, which we used in the above example, is just one of the proposals in front of the Working Group studying this issue. Because the technology moves fast, I recommend paying careful attention to the most current standards (found at http://www.w3.org) and your software's level of compliance with them.

Both the mature DTD and fledgling schema mechanisms can get complicated quickly, so we're going to leave further discussion of them to the books that are dedicated to XML/SGML.

