Articles / Build Your Own XML Parser

Build Your Own XML Parser

Recently, I needed to write a script that would parse an XML file and extract various bits of information. I'm sure there are plenty of excellent XML modules for Perl, but I didn't want to go through the pain of having to find one and install it (along with its tree of dependencies). Besides, I was sure that I was dealing with well-formed XML, and all I was doing was extracting fields, so I didn't need error-checking, XSLT, XInput, and all that fancy stuff. I just rolled my own XML parser in about 100 lines. It's not fancy. It makes all sorts of assumptions that will cause it to break in a production environment, but I thought I'd show how it's built.

Outer Loop

"<" and ">" are reserved characters, and can only appear at the beginning and end of tags. If you're used to lex-style parsers, you're thinking "Ah, so I should read until I see a "<" or ">". But in Perl, you can just read the input until you see a ">", and then you know that you've got, at the very least, a tag. So the outline of the parser is:

$/ = ">";

while (<>)
{
	…
}

Note the use of $/ to set the input record separator. This means that if our input file is

<?xml version="1.0" ?>
<address>
  <friend />
  <name>John <nickname>Spike</nickname> Smith</name>
  <streetAddress>123 Maple Ave.</streetAddress>
</address>

then successive values of $_ will be

  1. "<?xml version="1.0" ?>"
  2. "<address>"
  3. " <friend />"
  4. " <name>"
  5. "John<nickname>"
  6. "Spike</nickname>"
  7. " Smith</name>"
  8. " <streetAddress>"
  9. "123 Maple Ave.</streetAddress>"
  10. "</address>"

In other words, $_ always ends with a complete XML tag, which might be preceded by other text. So, the first order of business is to separate the tag from the text that precedes it:

	my $text;
	my $tag;

	m{^(.*)<(.*?)>$}s;
	$text = $1;
	$tag = $2;

(For those who don't remember, m{pattern} is equivalent to /pattern/.) The "s" modifier is there so that a dot will match anything, including a newline.

Data Representation

Now that we have some content, it would be a good idea to think about how the data in the XML file will be represented. XML is a nested set of elements, each of which has a name, optional attributes, and the contents (the stuff between <foo> and </foo>). The contents can be either text data or other elements. I chose a rather simple data structure to hold an element's data, an array of the form "(name, attributes, contents…)". Thus, the streetAddress element in the example above would be turned into:

( "streetAddress", "", "123 Maple Ave." )

The name element contains three items: the strings "John" and "Smith", and a <nickname> element. It would be represented as:

( "name",
  "",
  "John",
  [ "nickname",
    "",
    "Spike",
  ],
  " Smith",
)

Eventually, we want to store the entire file in a treelike list like this.

Context

When we see an open tag like <foo>, we're going to start parsing it. Anything that we read will go inside that <foo> element until we see the closing tag, at which point we're done with the <foo> tag, and should go back to whichever element we were processing before (the element that contains the <foo> element). Naturally, this suggests a context stack.

In my code, I did a somewhat bad thing. I used two variables with the same name. @context is the context stack (stacks are naturally represented using arrays), and $context is a reference-to-array which refers to whichever element we're looking at at the moment.

At any given moment, $context is a reference to an array in the (name, attributes, contents…) format described above, and @context is a stack of such references-to-arrays, describing the elements inside which the current element is embedded.

That is, at some point, $context will be

[ "nickname", "" ]
and at that time, @context will be
(
  [ "address", "" ],
  [ "name", "", "John" ]
)

In the section on the main loop, we'd gotten as far as separating the tag from the text that precedes it. Now we see that the text part should be appended to the array that $context points to (never mind for now how $context was set up; we'll worry about that later):

our @tree = ("xml", "");	# Parsed XML tree
our $context = \@tree;		# Context stack
our @context = ();		# Current context

$/ = ">";

while (<>)
{
	last if $_ eq "";	# End of file
	# There might be text in front of the XML element.
	my $text;
	my $tag;

	m{^(.*)<(.*?)>$}s;
	$text = $1;
	$tag = $2;

	if ($text ne "")
	{
		push @{$context}, $text;
	}
	…
}

Parsing Tags

Now we just need to deal with the text in $tag. There are three types that we need to be concerned with:

  1. <foo>: an opening tag
  2. </foo>: a closing tag
  3. <foo />: a singleton tag

Opening and singleton tags can also have attributes after the tag name:

<name added="2007-07-19">
<address format="us-postal" category="personal">

In my script, I didn't need to worry about attributes, so I chose just to store them as raw, unparsed strings. If they matter to you, I suggest representing them as hashes that map each attribute's name to its value.

We can parse a tag using (what else?) a regular expression. This one's complicated enough that it's a good idea to use the "x" flag, which allows us to embed whitespace (including newlines) and comments inside the regular expression:

if ($tag =~ m{
	^(/?)		# Closing tag
	(\S+)		# Tag name
	# Optional attributes
	(
	 # <space> <attr-name> = "<attr-value>"
	 (?:\s+
	   [\w:]+	# Attribute name
	   =
	   \"[^\"]*\"	# Attribute value
	 )*
	)
	\s*
	(/?)		# Optional closing slash, for singleton tag
	$
	}xs)
{
	# This is a regular <foo>, </foo>, or <foo /> tag.
	my $closing = ($1 ne "");	# Is this a closing tag?
	my $name = $2;		# Tag name
	my $attrs = $3;		# Attribute assignments
	my $singleton = ($4 ne "");	# Is this a singleton tag?
	…
}

Now $name is the name of the tag, $attrs is the (unparsed) attribute string, and $closing and $singleton are boolean flags that tell us whether this is a closing or singleton tag, respectively.

Closing tags are easiest to deal with. We're done parsing the element, and the parsed data is in $context, so all we need to do is append it to its parent and pop the @context stack to return to that element:

if ($closing)
{
	# End of current context

	# Append $context to last item in @context
	push @{$context[$#context]}, $context;

	# Restore previous context from stack
	$context = pop @context;
	next;
}

This leaves us with opening tags and singleton tags. Singleton elements like <foo /> are equivalent to <foo></foo>. That is, a singleton has no children. In either case, though, we need to start a new context:

# Save the old context on the context stack
push @context, $context;

# Start a new context
$context = [ $name, $attrs ];

Of course, if we're looking at a singleton tag, then we already know that it has no contents and should be closed immediately. And since we've already taken care of closing tags, we already know how to close tags:

if ($singleton)
{
	push @{$context[$#context]}, $context;
	$context = pop @context;
}

Finally, we can tackle the hardest of the three tag types: opening tags.

For these, we need to save the old context to the stack and start a new context (which we've already done), and then… actually, that's all we need to do at this stage. We can't add the contents of the element, because we haven't read them from the input file yet.

Remember at the top, when we separated the tag from the text that preceded it, and appended the text to @{$context}? Now we see how $context was set up: We created a new anonymous array when we saw the opening tag, so that later iterations of the loop would have a place to put their text.

And that's pretty much it! You can read the full script, which includes a &dumptree function for printing the parsed tree and a &lookup function for looking up elements.

Lessons Learned

Input records don't have to end with a newline. If there's a more convenient record terminator or separator, use $/ to read the input in way that makes your life easier.

Perl's regular expressions are powerful. Don't bother trying to read the "<", then the tag name, then the attributes, then the ">", as you would with lex. Just read the whole thing and extract the interesting bits with parenthesized expressions inside a regular expression.

If you're trying to solve a difficult problem, try to break it down into not-quite-so-difficult problems, and break those down into easier subproblems. Take care of the easy parts first. This allows you to make simplifying assumptions (if we know that we're not looking at a closing tag, we know that we need to save the old context and start a new one). Once you've taken care of enough easy bits, you might find that your hard problem has been simplified to the point that you don't need to do anything at all.

RSS Recent comments

25 Aug 2007 09:54 Avatar jengelh

lex me
&gt;Perl's regular expressions are powerful. Don't bother trying to read the &quot;&lt;&quot;, then the tag name, then the attributes, then the &quot;&gt;&quot;, as you would with lex.

In the end, Perl has to read each character (optimizations using a buffer and reading more kilobytes at once aside), so as to give you the behavior that $/ = &quot;&gt;&quot; brings. Just lexing done in a smart package for you, sort of. Since I primarily use C, and Perl does the same, I suppose one would use strchr(s, '&gt;') [or strstr] here to find the end of a tag. (And actually, I'd use while(isspace(*s)) ++s; after strchr()/strstr() so that the whitespace is at the end of strings rather than in front - just a miniature optimization.)

25 Aug 2007 11:02 alexhudson

Use standard XML libraries
This really does undermine the reason for using XML, and I can't think how this could be recommended in any circumstance.

It doesn't take into account character set encoding, or CDATA sections, or entities, or namespaces, or a host of other things - it's not just fragile to bad XML, it'll fall over with all sorts of legal XML.

It's surely quicker to learn to use a standard library that parses XML correctly than to hand-roll your own parser which doesn't conform to the XML spec?

I guess if anything, this is a sad commentary on the state of XML in Perl - there ought to be a library which comes by default, and hacks like this would be unnecessary.

25 Aug 2007 12:59 kreiger

Re: Use standard XML libraries

> This really does undermine the reason

> for using XML, and I can't think how

> this could be recommended in any

> circumstance.

Agreed. One can only hope that nobody follows this example.

This does not parse XML. This parses a language somewhat similar to XML.

That right there is a huge difference, and the author does his readers a great disservice by referring to his newly invented language as "XML".

25 Aug 2007 13:34 tehmoth

Re: Use standard XML libraries

> I guess if anything, this is a sad

> commentary on the state of XML in Perl -

> there ought to be a library which comes

> by default, and hacks like this would be

> unnecessary.

>

Hardly. By not including a default XML library in the standard installation, Perl allows programmers the freedom to decide which XML Library to use, or to write their own (and to not have to install one if they never have to deal with XML). Yes, the "XML" library above is not going to work on a block of XML from some random source, but that is not what was required. It works for him, and his data. Writing your own parser for data in a known format is not a "hack", nor is it "required" by "the sad state of XML in Perl". If it was a completely new format that he was parsing it wouldn't be considered a hack (except perhaps in the positive sense), but unfortunately the fashion these days is to transform everything into XML.

If he needed to support "character set encoding, or CDATA sections, or entities, or namespaces, or a host of other things" then he would probably have bitten the bullet and learned an existing XML library. His introduction even states "It's not fancy. It makes all sorts of assumptions that will cause it to break in a production environment,". I'm not sure what your reasons for using XML would be but this example

hardly "undermines the reason for using XML".

25 Aug 2007 15:34 endecotp

Not a great advert for Freshmeat!
&gt; I'm sure there are plenty of excellent XML modules for Perl, but I

&gt; didn't want to go through the pain of having to find one and

&gt; install it

What an odd article to post on Freshmeat! This is a site which is designed to make it easy to find software that does what you need, so that you don't have to roll your own inevitably-inferior solution. But you're saying that, despite sites like Freshmeat and CPAN, it would be a &quot;pain&quot; to find a Perl XML parser - so much so that you've written your own.

Hmm. Freshmeat admins, take note. You might like to interview this guy and find out what it is about your site that he finds so difficult to use.

11 Sep 2007 02:58 ronfox

Why?
&quot;, fun coding by night, and obeying the whims of his feline overlords.&quot;

That's why everyone knows cats are just plain strange ;-)

19 Oct 2007 01:09 jzhang2005

vtd-xml
You should try vtd-xml

vtd-xml.sf.net

Screenshot

Project Spotlight

juntaDados

A GNU/Linux multimedia distribution that is targeted at audio, video, and graphics producers.

Screenshot

Project Spotlight

GNOME Commander

A GNOME based filemanager.