XML/HTML handling APIs

Some XML APIs are designed exclusively for reading, while others provide for both reading and building XML documents. Some are appropriate for complex XML containing data that needs to be navigated in arbitrary ways; some are appropriate for simpler XML whose data just needs to be read in.

The choice of software depends very much upon the application as well as programming preferences. This document is an attempt to sort out the practical issues of picking an API for handling XML.

Validation and well-formedness

At a minimum, an XML document should be well-formed in order to be read. The requirement is defined precisely in a fairly simple standard. Basically, the text body of an XML file should contain XML elements, whose tags and attributes are properly coded. In particular, in contrast with classical HTML, all elements are required to be closed somehow.
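As a minimal sketch of what well-formedness means in practice, the following uses Python's standard xml.etree.ElementTree to test two short strings. The helper name is_well_formed and the sample strings are illustrative assumptions, not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

closed = "<p>one<br/>two</p>"     # every element is closed
unclosed = "<p>one<br>two</p>"    # classical HTML style: <br> is never closed

print(is_well_formed(closed))
print(is_well_formed(unclosed))
```

The second string would be acceptable as classical HTML, but an XML parser rejects it because the br element is never closed.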

An XML document should be associated with a document type definition (DTD), which defines what elements are allowed in the document, what attributes the elements have, and what the relationships between elements are. This is a robustness issue: unless the input is completely understood, it is very hard to predict what a program may make of it. A validating parser should validate that the document conforms exactly to its DTD.

However, the DTD usually resides in a file at a URL, and real validation involves obtaining that file. Obtaining it may be a serious time overhead, and the network may be down. Also, for many applications, the XML may already have been validated, so further validation is redundant. Furthermore, checking that a document is well-formed involves analyzing the whole thing; for many applications that isn't necessary, and it is a time overhead as well.

Many parsers fall back to insisting only that the XML be well-formed, without really checking that it conforms to a DTD. Some just blindly trundle ahead until they either come to completion or fail. In such cases, the onus is on the programmer to properly handle failures.
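The sketch below illustrates this behaviour with the standard library's xml.etree.ElementTree, whose parser is non-validating. The document carries a small inline DTD (used here only so that the example is self-contained) and deliberately violates it. The function read_recipient is a hypothetical helper showing one way for application code to cope with a missing element.

```python
import xml.etree.ElementTree as ET

# The internal DTD says <note> may contain only a single <to> element.
doc = """<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Alice</from><unexpected/></note>"""

# The document is well-formed but does not conform to its DTD.
# A non-validating parser reads it without complaint.
root = ET.fromstring(doc)
children = [child.tag for child in root]

def read_recipient(note):
    """Return the text of <to>, or None when the element is missing."""
    to = note.find("to")
    return None if to is None else to.text

print(children)
print(read_recipient(root))
```

Because nothing upstream has checked the structure, the application code has to test for the missing element itself.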

Tree parsing

By nature, XML is a hierarchical data format—one element may be contained within another. This hierarchy of containment is modeled as a tree structure (in computer jargon).

For many applications, this tree structure is of primary importance—one may need to navigate through it in all sorts of ways. For such applications, it is best to use a parser that converts the XML document directly into a tree structure in memory, and provides an API for navigating that structure.
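A minimal sketch of tree navigation, assuming Python's standard xml.etree.ElementTree as the tree parser. The library/shelf/book document is invented for illustration. It shows walking down, across, and (with a small workaround) up the tree.

```python
import xml.etree.ElementTree as ET

doc = """<library>
  <shelf id="a">
    <book year="1999"><title>First</title></book>
    <book year="2005"><title>Second</title></book>
  </shelf>
  <shelf id="b">
    <book year="2010"><title>Third</title></book>
  </shelf>
</library>"""

root = ET.fromstring(doc)

# Walk down: every <title>, in document order, at any depth.
titles = [t.text for t in root.iter("title")]

# Walk across: count the books that are direct children of each shelf.
books_per_shelf = {
    shelf.get("id"): len(shelf.findall("book"))
    for shelf in root.findall("shelf")
}

# Walk up: ElementTree elements keep no parent pointer,
# so build a child -> parent map first.
parent_of = {child: parent for parent in root.iter() for child in parent}
third = root.find("./shelf[@id='b']/book")
shelf_of_third = parent_of[third].get("id")

print(titles, books_per_shelf, shelf_of_third)
```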

DOM

The Document Object Model (DOM) specifies a complete, standard programming interface (API) for handling XML documents in their full tree structure.

The DOM is designed to be programming language-neutral. The big payoff is that programmers familiar with the DOM can easily move between programming environments, even between programming languages. The other side is that to practitioners of any particular language, DOM implementations may seem clumsy and unnatural.

The DOM solution is surprising at first look, in that it involves two layers of entities: a Node (a generic container) and an Element (an XML element), which is a Node. The data of an Element is itself a text Node—but not an Element. The Document itself is also a Node, but not an Element. The aims of the designers are clear—it isn't clear to me that their solution was optimal.
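The Node/Element layering can be seen directly with xml.dom.minidom from the Python standard library, used here as one convenient DOM implementation. This is a small sketch over an invented one-element document.

```python
from xml.dom import Node, minidom

doc = minidom.parseString("<greeting lang='en'>hello</greeting>")
element = doc.documentElement   # the single top-level Element
text = element.firstChild       # its character data, held in a separate Text node

# Name the kind of Node each object is.
kinds = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
}
for node in (doc, element, text):
    # Every object is a Node, but only one of them is an Element.
    print(kinds[node.nodeType],
          isinstance(node, minidom.Node),
          isinstance(node, minidom.Element))

print(element.tagName, element.getAttribute("lang"), text.data)
```

Note that the text "hello" is reached through a child node of the element rather than through the element itself.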

Sequential parsing

For some applications, the data in the XML may have a very simple format, for which the complexities of a tree are unnecessary. For some applications, the whole XML document may not even be available.

Many parsers, instead of assuming the whole document is already in memory, just read the document sequentially from the beginning, passing elements to user-supplied callback functions as they are found. There are typically no separate entities for XML elements or the tree structure.

A sequential parser may be expected to require less memory and computational resources than does a tree parser (in particular, by nature, it needn't keep a representation of the document in memory).

From a programming perspective, the advantage is that the API may be very simple indeed. The responsibility for interpreting the contents of elements is moved entirely to the user-supplied callback function. Here is the question a programmer needs to ask: how much of the structure do they want to code for personally?
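To make the callback style concrete, here is a minimal sketch using xml.parsers.expat from the Python standard library. The orders document and the bookkeeping in the state dictionary are illustrative assumptions. Note how the callbacks, not the parser, keep track of which element is currently open.

```python
import xml.parsers.expat

doc = "<orders><order id='1'>tea</order><order id='2'>coffee</order></orders>"

orders = {}
state = {"current": None}   # id of the <order> we are inside, if any

def start_element(name, attrs):
    if name == "order":
        state["current"] = attrs["id"]
        orders[state["current"]] = ""

def char_data(data):
    # May be called more than once for one run of text, so accumulate.
    if state["current"] is not None:
        orders[state["current"]] += data

def end_element(name):
    if name == "order":
        state["current"] = None

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.CharacterDataHandler = char_data
parser.EndElementHandler = end_element
parser.Parse(doc, True)   # True marks this as the final chunk of input

print(orders)
```

The parser API is tiny, but all knowledge of the document's structure lives in the three callbacks.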

Sequential parsers typically do not validate the XML (although some do).

Sequential parsing is appropriate for applications whose XML is very simple, or where the XML may not be completely available.
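As a sketch of handling XML that arrives piecemeal, the following feeds a document to xml.etree.ElementTree.XMLPullParser (Python 3.4 and later) in arbitrary chunks. The chunks list stands in for data read from a socket or pipe; it is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Pretend these pieces arrive one at a time; the splits fall mid-tag and mid-text.
chunks = ["<log><entry>sta", "rt</entry><ent", "ry>stop</entry>", "</log>"]

parser = ET.XMLPullParser(events=("end",))
entries = []
for chunk in chunks:
    parser.feed(chunk)
    # Process whatever elements have been completed so far.
    for event, elem in parser.read_events():
        if elem.tag == "entry":
            entries.append(elem.text)
            elem.clear()   # discard content that is no longer needed
parser.close()

print(entries)
```

Each entry is handled as soon as its end tag has been seen, without waiting for the rest of the document.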

SAX

One category of sequential parsers is SAX (Simple API for XML) parsers, which take the XML as a data stream.

Unlike the DOM, there is no formal standard for SAX; really, it's so simple that none is called for.
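A minimal SAX sketch using xml.sax from the Python standard library. The class name TitleCollector and the books document are hypothetical, chosen only to show the usual pattern of a handler object that receives start, character, and end events.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collect the text content of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None   # list of text pieces while inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may be delivered in several pieces.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None

doc = b"<books><book><title>First</title></book><book><title>Second</title></book></books>"
handler = TitleCollector()
xml.sax.parseString(doc, handler)

print(handler.titles)
```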

HTML parsing

Although HTML standards are nowadays essentially XML, the practical fact is that a huge fraction of existing HTML fails to conform to any standard, or is outright broken. Furthermore, HTML has some special features that need to be handled.

Web pages in HTML are often extremely badly formed (to the point where they don't work right, and the intent of the code isn't at all clear). There is little that a parser can do for badly mangled documents. But often just an end-tag is left off, and it's obvious where it needs to be. So HTML parsers are mostly very forgiving, at the expense of being much slower and less robust than XML parsers.
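The contrast can be sketched with two standard-library modules: html.parser (a tolerant HTML tokenizer, not otherwise discussed in this document) and xml.etree.ElementTree. The class ItemCollector and its rule for implicitly closing list items are assumptions made for this example, not a general repair strategy.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class ItemCollector(HTMLParser):
    """Collect the text of each <li>; a new <li> or a </ul> implicitly ends the previous one."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items.append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag in ("li", "ul"):
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data.strip()

broken = "<ul><li>tea<li>coffee</li><li>milk</ul>"   # two </li> tags are missing

collector = ItemCollector()
collector.feed(broken)
collector.close()

# The same input is not well-formed XML, so an XML parser refuses it.
try:
    ET.fromstring(broken)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

print(collector.items, xml_ok)
```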

The most obvious special HTML feature is its special set of tags, which have a fairly standard meaning. Another special feature is its large set of named character entities, such as &nbsp;, which provide for special characters that are awkward to enter directly in HTML. XML predefines only five (&amp;, &lt;, &gt;, &apos;, and &quot;). Another HTML feature is meta tags, which have traditionally been used to specify the document's character encoding, which is crucial information for parsing the document.
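The entity difference is easy to demonstrate with the standard library. In this sketch, xml.etree.ElementTree stands in for a generic XML parser, and html.unescape and html.entities.name2codepoint stand in for HTML-aware handling. The sample snippet is invented.

```python
import html
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

snippet = "<p>caf&eacute;&nbsp;&amp; bar</p>"

# An XML parser knows &amp; but not the HTML-only names.
try:
    ET.fromstring(snippet)
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False

# HTML-aware code resolves all of them.
decoded = html.unescape("caf&eacute;&nbsp;&amp; bar")

print(xml_accepts)
print(repr(decoded))
print(name2codepoint["nbsp"], name2codepoint["eacute"])
```

The XML parse fails on the first HTML-only entity it meets, while the HTML routines map each name to its Unicode code point.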

Even if the application is expected to read only conformant XHTML, a special HTML parser is very helpful in handling these features. If the application is meant to handle generic HTML from the Web, an HTML parser may be necessary (one might be able to get by with a SAX parser, but one might be sorry).

Document building

The building of XML documents is even more application-dependent than parsing. Except for the simplest XML documents, any solution is better than assembling strings for XML documents by hand.

The main question is, does it make sense to build the document sequentially? If the application calls only for sequential building, there is little call for the added complexities of the tree-based builders. On the other hand, if there is any need to modify elements in an existing tree, or navigate the tree in any way other than sequentially, a tree-based builder is strongly called for.
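The two building styles can be sketched side by side with xml.etree.ElementTree from the Python standard library. The inventory document is invented. TreeBuilder is driven strictly in document order, while the Element API lets the program go back and change what it has already built.

```python
import xml.etree.ElementTree as ET

# Sequential building: start, data, and end calls in document order.
builder = ET.TreeBuilder()
builder.start("inventory", {})
builder.start("item", {"sku": "42"})
builder.data("widget")
builder.end("item")
builder.end("inventory")
sequential_root = builder.close()

# Tree-based building: create elements, then revisit and modify them.
tree_root = ET.Element("inventory")
item = ET.SubElement(tree_root, "item", sku="42")
item.text = "widget"
item.set("sku", "43")                       # change an existing element
tree_root.insert(0, ET.Element("header"))   # add a sibling before it

sequential_xml = ET.tostring(sequential_root, encoding="unicode")
tree_xml = ET.tostring(tree_root, encoding="unicode")

print(sequential_xml)
print(tree_xml)
```

In both cases the library takes care of escaping and tag matching, which is the main reason to avoid assembling the strings by hand.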

If tree navigation standards such as XPath are involved, a tree-based implementation that supports them is required.
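As a small illustration of path-based navigation, the sketch below uses the limited XPath subset understood by xml.etree.ElementTree in the Python standard library. This is not full XPath: anything outside the subset, such as the numeric comparison here, has to be done in ordinary code or with a library that implements the complete standard. The catalog document is invented.

```python
import xml.etree.ElementTree as ET

doc = """<catalog>
  <book lang="en"><title>Alpha</title><price>10</price></book>
  <book lang="fr"><title>Beta</title><price>25</price></book>
  <book lang="en"><title>Gamma</title><price>30</price></book>
</catalog>"""
root = ET.fromstring(doc)

# Attribute predicate: titles of the English-language books.
english_titles = [t.text for t in root.findall("./book[@lang='en']/title")]

# Positional predicate: the second <book> child (positions are 1-based).
second_title = root.find("./book[2]/title").text

# A numeric test on <price> is done here in plain Python instead of in the path.
pricey = [
    book.findtext("title")
    for book in root.findall("book")
    if int(book.findtext("price")) > 20
]

print(english_titles, second_title, pricey)
```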

Implementations

Most modern application programming languages have some XML processing facilities.

There are multiple variations of the standards. In particular, since the DOM is rather complicated, there are multiple simplified implementations of it, as well as simpler non-DOM tree parsers.

Python

Standard XML handling implementations are built into the usual Python distribution; several other very good ones are readily available. Some of the XML packages are quite complex, so it may not be obvious which low-level parser they use. (Some permit the user to replace the low-level parser.)

About performance: All you beginners, listen up: inappropriate concern about speed or performance is a glaring mark of a greenhorn. (Experienced programmers, please accept my apologies.)

Implementations that are written in Python down to a low level may suffer performance-wise for applications involving large amounts of data; Python wrappers for machine-native XML handling libraries have therefore appeared. These provide better performance for sequential parsing of simple files than pure Python implementations. However, they tend to reflect the idiosyncratic interface of their underlying native libraries.

Here are some of the more popular XML handling packages available.

xml.dom: the DOM API definition (node types, constants, exceptions) shared by the standard library's DOM implementations
xml.dom.minidom: pure Python simplified DOM implementation
xml.etree.ElementTree: standard library interface to XML, simpler than the DOM, based on a single XML element class which is also a container. Also provides sequential-parsing and XML generation capabilities. Its parser is non-validating.
xml.sax: SAX parser
lxml: wrapper for the native libxml2/libxslt libraries. Provides an ElementTree-compatible API (lxml.etree).
xml.parsers.expat: wrapper for native non-validating, sequential parser expat
lxml.html: HTML parser
lxml.html.soupparser: lxml interface to the BeautifulSoup HTML parser
xml.etree.ElementTree.TreeBuilder: full tree-based XML document building
xmlbuilder: pure Python sequential XML generation, built on top of xml.etree.ElementTree.TreeBuilder