XML Software

There are two separate groups of XML software: Low-level, configurable software designed to work with any XML-based format, and high-level, special-purpose software designed to work with one or more specific XML formats. Like most people writing about XML, I will focus on the first group, which includes low-level tools and libraries like parsers, editors, browsers, transformation engines, and search and query tools.

In fact, most people working with XML actually need something from the second group, if they can find it. After all, if you needed to install a Web browser, would you start looking for HTML parsers and TCP/IP stacks to download? The trouble is that it hardly makes sense to write about the second group in a single "XML" article because the range of applications is so broad. Robin Cover lists over 500 different XML-based formats and projects, ranging from the American Iron and Steel Institute XML Workgroup to the VISA XML Invoice Specification to the Mind Reading Markup Language, and there are hundreds (perhaps thousands) more XML-based formats not in the list, not to mention the dozens (or hundreds) of applications like AbiWord, Gnumeric, and Flight Gear that happen to use proprietary XML-based save or configuration formats.

So, if you're looking for information on setting up distributed computing with SOAP and .NET, handling VISA transactions, or interpreting mind-reading data sheets, you should start by going to sites that deal specifically with those areas. If they're actually using XML (rather than just talking about it), they should have links to any available software and tutorials that you can use.

If you're still reading, I will assume that you could not find any high-level software, and that you have decided to build your own system from some of the low-level XML building blocks that follow.

A Note on Programming Languages

From the start, the best support for XML has been in Java. Many of the original XML designers and implementors were Open Source Java programmers, and both Java and XML are Unicode-based. (XML causes headaches for languages that use 8-bit character sets.)

The Perl and Python communities, however, have worked very hard to catch up. Python can wrap any Java-based XML library, so Python programmers can (technically) claim that they can do anything the Java users can do, while announcements are constantly appearing for new XML-related Perl modules.

Support for XML and XML-related applications in C and C++ is growing, but is still weak relative to Java, Python, and Perl. XML-based projects sometimes end up shifting from C++ to Java to take advantage of better support (and to avoid the character-size problem).

There is at least some XML software for nearly every programming language currently in use (and some that are not); if in doubt, try a freshmeat search in the XML category.

XML Parsers

An XML parser is a tool to read raw XML in text form and convert it into tokens or a parse tree (much as the parser in a C compiler does). Parsers are usually available as libraries that you can link into your application, and some are also available as standalone applications for error-checking documents in batch mode.

Many of the more naive XML books and articles will try to convince you that the parser is the starting point for all XML work. Technically, that's true (there's a parser hidden at the bottom of almost all XML software), but parsers operate at a low level and require an advanced knowledge of XML. There are two good reasons to download and install a parser:

  1. You want or need to write an XML-based library or application from scratch, starting at the lowest possible level, or
  2. You are using an application or library that requires an XML parser but does not bundle one with it.

If neither of these applies to you, skip the rest of this section for now.

In the XML community, there are two standard interfaces that nearly all parsers implement for passing information to an application: The Simple API for XML (SAX) and the Document Object Model (DOM). SAX is a streaming, event-based interface, while DOM is a random-access, tree-based interface. XML programmers usually code against these interfaces rather than the parsers' own native interfaces, so it will be easy to change to a different parser library later (in Java, users can even change parser libraries at runtime). If you need help choosing between DOM and SAX, take a look at my old Events vs. Trees paper at the SAX site. It is best to look for parsers that support SAX2 and DOM2 rather than the original SAX and DOM interfaces.
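To make the difference between the two interfaces concrete, here is a minimal sketch in Python, assuming the standard library's xml.sax and xml.dom.minidom modules stand in for a SAX and a DOM implementation. The sample document and the names ItemCounter, count_items_sax, and count_items_dom are invented for illustration. Both functions count the same elements: the first reacts to events as they stream past, the second builds the whole tree and then queries it.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for this illustration.
SAMPLE = b"<order><item>bolt</item><item>nut</item></order>"


class ItemCounter(xml.sax.ContentHandler):
    """Streaming (SAX) style: react to each event as the parser emits it."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per start tag; nothing is kept in memory afterwards.
        if name == "item":
            self.count += 1


def count_items_sax(data):
    """Count <item> elements without building a tree."""
    handler = ItemCounter()
    xml.sax.parseString(data, handler)
    return handler.count


def count_items_dom(data):
    """Tree (DOM) style: load the whole document, then query it at random."""
    doc = minidom.parseString(data)
    return len(doc.getElementsByTagName("item"))


if __name__ == "__main__":
    print(count_items_sax(SAMPLE), count_items_dom(SAMPLE))
```

The SAX version uses roughly constant memory regardless of document size, while the DOM version holds the entire document but lets you revisit any node as often as you like.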

In the Java world, the Apache XML Project's Xerces parser is probably the most popular and full-featured, and it includes SAX2 and DOM2 support. For applications in which size is important (such as Java applets), the AElfred parser (reportedly now bundled with the GNU JAXP project) has a very small footprint and SAX2 support.

Most Perl XML packages rely on the XML::Parser module, which you will need to install before you can use anything else.

In the C/C++ world, the SAX and DOM interfaces are not standardized as well as they are for Java, so some parser dependencies are inevitable. Apache's C++ version of Xerces is becoming stable and more feature-rich, though it usually lags behind the Java version. For a smaller footprint, you can use the C-based Expat parser, which provides a streaming interface similar (but not identical) to SAX. Expat provides the core XML support for both Perl and Mozilla, so it is very stable and well-supported. For other Open Source projects (including most Gnome software), the C-based XML parser of choice is libxml. Like Expat, libxml has been heavily tested in demanding environments.
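To give a feel for Expat's callback style without writing C, the following sketch uses Python's xml.parsers.expat module, which is a wrapper around Expat. It is only an illustration: the function name collect_text_by_element and the sample input are assumptions, and a C program would register its handlers through Expat's own C API instead. Note how handlers are attached as plain attributes on the parser rather than as methods on a SAX ContentHandler.

```python
import xml.parsers.expat


def collect_text_by_element(data):
    """Return (element name, text) pairs for every element that has text."""
    results = []
    stack = []  # one [name, text_parts] entry per currently open element

    def start(name, attrs):
        stack.append([name, []])

    def chars(text):
        # Expat may deliver one text run in several chunks, so accumulate.
        if stack:
            stack[-1][1].append(text)

    def end(name):
        elem_name, parts = stack.pop()
        text = "".join(parts).strip()
        if text:
            results.append((elem_name, text))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return results


if __name__ == "__main__":
    sample = "<order><item>bolt</item><item>nut</item></order>"
    print(collect_text_by_element(sample))
```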

There are XML parsers available for most programming languages, including ECMAScript (JavaScript), PHP, Tcl, and others. Try a quick search for "parser" in the freshmeat XML category to find what you are looking for.

XML Transformation Engines

A transformation engine is a library or program that can modify an XML document automatically according to a set of rules, much as sed or awk do for ordinary text. Instead of regular expressions, most transformation engines read rules from a separate XML file using a format called XSLT.

Some XSLT engines are available as libraries that you can link into your applications, and most are available standalone (for use in batch scripts). Unfortunately, XSLT libraries have no standard interfaces like SAX and DOM for XML parsers, so once you've coded for one XSLT library, switching to another might require a fair bit of refactoring. You need to spend more time up front ensuring that you have the right library for your needs.

In the Java world, the two most popular XSLT engines are Michael Kay's SAXON and the Apache XML Project's Xalan. James Clark's XT, the original reference XSLT engine, is not as full-featured or actively maintained as the others, but is still solid.

C++-based XSLT engines can run slightly faster than their Java-based cousins, but, in general, the C/C++ versions are not as mature or as thoroughly debugged, so crashes or missing features are much more common. Apache's Xalan-C++ is probably the most stable of them.

In Perl or Python, your best bet is to invoke either a Java- or C++-based XSLT engine as an external process (or to wrap one in Python), though Perl does have a native XML::XSLT module for those who want to try it.

Note that XSLT is not your only option for transformations, and is often not the best one. Since XSLT has a tree-based data model (usually in-memory), it can be extremely slow and resource-hungry for larger XML documents, making it inappropriate for high-demand server-side use. Sometimes there is no alternative but to custom-code your transformation in Java, C/C++, Perl, or Python, working directly with the streaming parser output. SAXON has some support for on-the-fly streaming transformations, so it might be worth investigating if speed or resource usage is an issue.
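As an illustration of custom-coding a transformation directly on streaming parser output, here is a minimal sketch that assumes Python's standard xml.sax module. It renames one element on the fly and never builds a tree. The class RenameElement and the function rename_streaming are hypothetical names, and a real transformation would usually need to handle namespaces and attributes as well.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameElement(XMLFilterBase):
    """SAX filter that renames one element and passes everything else on."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self._old = old
        self._new = new

    def _map(self, name):
        return self._new if name == self._old else name

    def startElement(self, name, attrs):
        # Forward the (possibly renamed) event to the downstream handler.
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))


def rename_streaming(data, old, new):
    """Rewrite the XML in `data` (bytes), renaming element `old` to `new`."""
    out = io.StringIO()
    reader = RenameElement(xml.sax.make_parser(), old, new)
    # XMLGenerator serializes the events it receives back into XML text.
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    source = xml.sax.InputSource()
    source.setByteStream(io.BytesIO(data))
    reader.parse(source)
    return out.getvalue()


if __name__ == "__main__":
    print(rename_streaming(b"<order><item>bolt</item></order>", "item", "part"))
```

Because each event is written out as soon as it arrives, memory use stays flat even for very large inputs, which is the property that makes this approach attractive for server-side work.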

XML Browsers

An XML browser is a viewer that can display an arbitrary XML document in a way optimized for human users. Since XML data can represent just about anything, that's not a trivial problem. There are two main types of general-purpose XML browsers:

  1. Tree-oriented browsers that display the XML document in a tree widget, and
  2. Document-oriented browsers that use stylesheets to format the XML data.

Tree-oriented browsers have some (limited) value for technical specialists navigating through a tree of hierarchical data structures, but their main advantage is that they are easy to write (just hook a tree widget up to the DOM), so there are many of them available. If you have a free evening, you can roll your own with (say) the Java Swing JTree widget.
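The "hook a tree up to the DOM" idea can be sketched without any GUI toolkit. The example below assumes Python's xml.dom.minidom and prints an indented outline rather than filling a real tree widget; the recursive walk would be the same if each line were added to a JTree or similar control. The names outline and outline_document are invented for this sketch.

```python
from xml.dom import minidom


def outline(node, depth=0):
    """Return indented outline lines for a DOM node and its descendants."""
    lines = []
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        attrs = "".join(f' {k}="{v}"' for k, v in node.attributes.items())
        lines.append(f"{indent}<{node.tagName}{attrs}>")
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # Show non-whitespace text as a quoted leaf under its element.
        lines.append(indent + repr(node.data.strip()))
    return lines


def outline_document(data):
    """Parse `data` and return the outline of its root element."""
    doc = minidom.parseString(data)
    return outline(doc.documentElement)


if __name__ == "__main__":
    sample = '<order id="7"><item>bolt</item><item>nut</item></order>'
    print("\n".join(outline_document(sample)))
```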

For most serious work, you will need a document-oriented browser. The bad news is that they are very hard to write and maintain. The good news is that the two major ones are also two of the world's best-known software applications: The closed source Microsoft Internet Explorer (version 5+) and the Open Source Mozilla (or Netscape version 6+). Both can take any arbitrary XML document, together with a stylesheet in XSLT or CSS format, and render it so that it looks like a regular HTML page.

XML Editors

XML editors face the same problem as XML browsers: There is no single, obvious editing interface for all types of XML data. What is appropriate for an XML-based government report, for example, is probably not appropriate for an XML-based vector graphic.

As with browsers, there are two major types of general-purpose XML editors:

  1. Tree-based editors that display the XML document in an editable tree widget, and
  2. Document-based editors that allow the XML to be edited in place and (sometimes) use stylesheets to give the XML data a semi-WYSIWYG appearance.

In nearly all cases, you will want either a document-based editor (possibly with a tree-based sidebar to help with navigation) or a customized form interface (in which case you may need to build from the parser level up). Some editors allow the use of DTDs or Schemas for guided authoring, which can be extremely valuable.

The one robust, production-grade, general-purpose, Open Source XML editor is the PSGML package for Emacs and XEmacs. PSGML uses DTDs for guided authoring and autocompletion, but it has extremely limited visual capabilities (mainly just markup highlighting). Unfortunately, it has no facility for turning off DTD validation, so it produces many annoying errors and warnings when working with XML documents that have no fixed DTD.

Another fairly robust (but restricted and closed source) XML editor is Henry Thompson's XED, which uses heuristics rather than DTD validation for guided authoring.

XML Search and Query Tools

Regular expressions and proximity tests are useful for full-text searches on prose documents, while tightly-structured SQL queries are useful for searches through relational data tables. XML falls somewhere between, and does not yet have a finalized query language of its own (the XML Query language is still under development).

As a result, the most common way to query an XML document programmatically (without setting up a full-fledged XML-oriented database and search engine) is through XPath, a simple language for specifying a location in an XML document. Many XML-based tools, including all XSLT engines, have XPath support built in. If you are designing an application that requires standalone XPath support, a good bet is Jaxen.
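For a sense of what an XPath query looks like in code, here is a minimal sketch using Python's xml.etree.ElementTree. This is not Jaxen, and ElementTree supports only a limited subset of XPath, so treat it as an illustration of the idea rather than of full XPath. The catalog document and the function titles_for_language are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, used only for this illustration.
CATALOG = """<catalog>
  <book lang="en"><title>XML Basics</title><price>20</price></book>
  <book lang="fr"><title>XML Avance</title><price>35</price></book>
</catalog>"""


def titles_for_language(xml_text, lang):
    """Return the titles of all books whose lang attribute equals `lang`."""
    root = ET.fromstring(xml_text)
    # Path expression: child <book> elements with a matching attribute,
    # then their <title> children. `lang` is assumed to be a trusted value.
    path = f"./book[@lang='{lang}']/title"
    return [title.text for title in root.findall(path)]


def all_titles(xml_text):
    """Return every <title> anywhere below the root element."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.findall(".//title")]


if __name__ == "__main__":
    print(titles_for_language(CATALOG, "en"))
    print(all_titles(CATALOG))
```

The path expression replaces what would otherwise be several nested loops over child nodes, which is why XPath support is so commonly built into higher-level XML tools.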

Open Source XML search engines should all be considered experimental at this point, since none has been widely implemented, and (as mentioned above) there is not yet a finished XML Query language.

Recent comments

02 Oct 2002 17:46 vprog

Xerlin
Didn't see a mention of Xerlin (http://www.xerlin.org), which is a very nice Java/OpenSource XML editor.

Also for Windows there's a free (but not open source) program called Cooktop (http://www.xmlcooktop.com) that's nice, especially if you are on a budget.

09 Sep 2002 04:47 mgruenke

Re: Check out XML-Extractor, for JavaDoc-like applications

Oops, I forgot to include a link. Here:
https://sourceforge.net/projects/xml-extractor/

The Freshmeat project page is:
http://freshmeat.net/projects/xml-extractor/

BTW, also check out Lars Marius Garshol's DTDDoc, for documenting your XML vocabularies:
http://www.garshol.priv.no/download/software/dtddoc/

09 Sep 2002 04:10 mgruenke

Check out XML-Extractor, for JavaDoc-like applications
Since this post is about XML software, I hope no one will mind if I use it to spread awareness of my little project.

XML-Extractor was originally described as "sourcecode XML metadata extraction tools". It consists of tools for extracting and transforming XML-like mark-up, embedded in source code comments, into proper external entities or well-formed XML files. It can be used for JavaDoc-like "literate programming", or embedding other build-related or CM metadata.

It was originally written to facilitate the authoring and maintenance of the functional specification and design documentation for embedded firmware libraries and applications, written in assembly language (there is no other language supported on this device).

The idea is that you use XML-like markup in your source code comments. You can use whatever kind of vocabulary you like - the tool only looks at the syntax (similar to the way non-validating XML parsers are vocabulary-independent). It extracts the markup into well-formed XML - either external entities, or XML documents. Then, you can use XML-based applications or XSLT to process the results and produce documents (we run the output through another processing stage, and then use XSLT to produce XML DocBook).

Release 0.3.0 just went out. 1.0 is right around the corner (I want to get some more mileage on a few new features, but it's pretty stable & mature).

15 Jul 2002 10:43 bkorb

AutoGen transforms XML to any kind of text
AutoGen (http://autogen.sf.net/) now includes a program for extracting XML data and plugging it into a template describing an output file. This makes it fairly easy to produce simple reports (or program fragments) in legible text derived from the XML information. xml2ag (http://autogen.sourceforge.net/doc/autogen_8.html#SEC277) is the AutoGen wrapper that provides the functionality. If that link has grown stale, look for it in the Table of Contents (http://autogen.sourceforge.net/doc/autogen_toc.html).

29 Jun 2002 18:31 lsh

very strange list of packets is observed....
.. for example, I am really surprised that LibXSLT (an XSLT engine based on LibXML), the XML editor MLView, and a few other XML packages were not included in this review. A lot of interesting features like XInclude, SGML support, etc. were also left out.
As a comment on the review quality, I have to note that the statement "C++ based XSLT engines can run *slightly* faster than their Java-based cousins" is nowhere near reality. I would suggest that the author compare, for example, LibXML + LibXSLT and Xalan results.
Probably if the author had contacted the authors of the reviewed packages, the review would have been much better.
