Perl XML Parsing

Friday, August 31st, 2007

Originally, I built the framework for this blog entirely on PHP, using XML to store my articles, comments, and all other data (see this and this for a detailed explanation of the original structure). Unfortunately, the XML parsing side of this design didn’t scale well and made it hard to add functionality. In contrast, Perl has a DOM-based parser that makes parsing XML nice and easy. I decided to have a go at writing an XML parser in Perl with identical functionality to the PHP one I had been using, thinking that it’d be a lot easier to enhance my blog.

I’ll start off by briefly explaining both parsers. PHP’s is handler-based, meaning that the parser object starts at the top of the document, and whenever it gets to a tag, it calls either a start tag handler or an end tag handler. When the parser gets to text between tags, it calls a data handler. The parser is good for crawling a file top down, but it’s not that easy to use if you’re looking for specific tags scattered throughout the file. You have to have dummy handlers that don’t do anything, and you keep switching handlers when you come across the tag that you’re looking for. It’s also not very good for counting the instances of a specific tag, because you may have to use global variables (generally a bad practice that should be avoided if possible) as counters. What’s more, each handler function is essentially just an if/elseif ladder that does something depending on what tag it’s analyzing, so adding functionality meant changing every handler to recognize a certain tag or to change behavior at a certain tag. In my case, I had about 12 different handlers, things were really redundant, and adding functionality was enough of a pain that I just decided not to do it (not the right solution). Using the PHP parser led to really sloppy code and made it hard to add features to the parser.
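To make the handler style concrete, here is a minimal sketch of the same event-driven approach. It is written in Python with the standard library’s xml.parsers.expat rather than PHP, and it is not my original code: the sample feed, the tag names, and the function names are all made up for illustration. Note how the handlers need shared state just to remember where they are in the document.

```python
import xml.parsers.expat

# Hypothetical sample feed; the element names are illustrative only.
SAMPLE = (
    "<feed>"
    "<story><title>First</title></story>"
    "<story><title>Second</title></story>"
    "</feed>"
)


def collect_titles(xml_text):
    """Collect <title> text and count <story> tags using event handlers."""
    # Shared state stands in for the global counters and flags that
    # handler-based parsing tends to require.
    state = {"in_title": False, "titles": [], "story_count": 0}

    def start_element(name, attrs):
        # Called for every opening tag, whether we care about it or not.
        if name == "story":
            state["story_count"] += 1
        elif name == "title":
            state["in_title"] = True
            state["titles"].append("")

    def end_element(name):
        # Called for every closing tag.
        if name == "title":
            state["in_title"] = False

    def char_data(data):
        # Called for text between tags; may fire more than once per element.
        if state["in_title"]:
            state["titles"][-1] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return state["titles"], state["story_count"]
```

Every new tag means another branch in each handler, which is where the if/elseif ladders come from.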

In contrast, the Perl module XML::DOM is a DOM-based parser; it works a lot like JavaScript parsing of HTML pages. You can call methods like “getElementsByTagName”, “getAttributes”, and “getChildNodes” on any node to extract information from that node. It’s pretty simple, easy to use, and the online documentation at CPAN is very useful.
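For comparison, here is a rough sketch of the DOM style. It uses Python’s xml.dom.minidom as a stand-in for XML::DOM (both follow the W3C DOM, so the method names are similar, but this is not the Perl API itself). The sample feed and the “id” attribute are assumptions for illustration.

```python
from xml.dom import minidom

# Hypothetical sample feed; tag and attribute names are illustrative only.
SAMPLE = (
    "<feed>"
    '<story id="1"><title>First</title></story>'
    '<story id="2"><title>Second</title></story>'
    "</feed>"
)


def story_titles(xml_text):
    """Return (id, title) pairs by querying the tree directly."""
    doc = minidom.parseString(xml_text)
    result = []
    # Jump straight to the tags of interest; no handlers, no shared flags.
    for story in doc.getElementsByTagName("story"):
        title = story.getElementsByTagName("title")[0]
        result.append((story.getAttribute("id"), title.firstChild.data))
    return result
```

Counting stories is just the length of the list returned by getElementsByTagName, with no global counter needed.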

So why switch over from PHP to Perl when it involves hours of reprogramming already working code? For one, I want to add functionality. If I kept the parser in PHP, this would take a lot of time and cause me plenty of frustration. Now that things are in Perl, adding a feature is as straightforward as writing a simple subroutine. What’s more, the Perl parser eliminates the redundancy: I’ve modularized everything into subroutines so that I never have repeated code in my parser. My Perl code is also a lot cleaner. I don’t have long if ladders; instead, I have a hash that maps tag names to subroutines. Everything in the Perl parser just seems a lot cleaner than the PHP one, and I don’t regret spending a couple of hours to fix up my code.

One of the cooler aspects of the Perl XML parser that I’ve written is that I’m using mutual recursion very liberally. I have a couple of main functions: one for generating the HTML for multiple stories, one for generating the HTML for just one story, one for getting all the titles of the stories in a given feed, and one for retrieving the latest story. The first two functions call a private parse_story method, which goes into the mutual recursion. Each tag name in the story is a key in a hash that maps tags to subroutines. I iterate through all the child tags and execute the matching subroutine for each one. Each of these subroutines spits out some HTML and then tries to parse all of its children again, recursively. With this recursion, I only need one subroutine per valid tag name, and the XML is parsed much like running a recursive algorithm on a tree structure. To add support for a tag, I just have to add it to the hash table and build its subroutine. With this mutual recursion, the parsing code is much simpler and a lot easier to build upon.
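The dispatch-table idea can be sketched roughly like this. It is in Python with xml.dom.minidom rather than Perl, and it is an illustration under assumptions, not my actual parser: only parse_story is a name from the real code, while the tag names, the HTML they produce, and the other function names are invented.

```python
from html import escape
from xml.dom import minidom

# Hypothetical story markup; the real tag set is not shown in this post.
SAMPLE = "<story><title>Hello</title><para>Some <bold>bold</bold> text</para></story>"


def render_children(node):
    """Render every child of node, dispatching element children by tag name."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(escape(child.data))
        elif child.nodeType == child.ELEMENT_NODE:
            handler = HANDLERS.get(child.tagName)
            if handler is not None:  # tags without a handler are skipped
                parts.append(handler(child))
    return "".join(parts)


# One small function per tag. Each calls back into render_children,
# which in turn calls the per-tag functions: that is the mutual recursion.
def render_story(node):
    return "<div>" + render_children(node) + "</div>"


def render_title(node):
    return "<h2>" + render_children(node) + "</h2>"


def render_para(node):
    return "<p>" + render_children(node) + "</p>"


def render_bold(node):
    return "<b>" + render_children(node) + "</b>"


# The table mapping tag names to functions (a hash of subroutines in Perl).
HANDLERS = {
    "story": render_story,
    "title": render_title,
    "para": render_para,
    "bold": render_bold,
}


def parse_story(xml_text):
    """Entry point: parse the XML and render from the root element down."""
    root = minidom.parseString(xml_text).documentElement
    return HANDLERS[root.tagName](root)
```

Supporting a new tag in this sketch takes one new function and one new entry in HANDLERS; nothing else changes.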

As you may have noticed, all of my served pages are still in PHP, so how am I calling this Perl parser from PHP? I’m using PHP’s shell_exec function, which executes a shell command and returns a string of all the output from that shell command. The Perl parser returns a string of HTML that my web pages echo out to the client. I don’t think this is optimal in terms of performance, because every page load has PHP starting a separate Perl process, but I didn’t want to rebuild my entire framework in Perl, so it’s a hacked solution.
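The calling pattern is roughly the following, sketched in Python with subprocess.run standing in for PHP’s shell_exec. The run_parser name is hypothetical, and a one-line Python command stands in for the Perl script, whose real path and arguments are not shown here. Unlike shell_exec, this version passes an argument list and does not go through a shell.

```python
import subprocess
import sys


def run_parser(command):
    """Run an external command and return everything it wrote to stdout."""
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode the output to str
        check=True,           # raise if the command exits with an error
    )
    return completed.stdout


# Stand-in for the real parser script: a child process that prints HTML.
html = run_parser([sys.executable, "-c", "print('<p>story</p>')"])
```

The page then just echoes the returned string, at the cost of starting a new process per request.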

So Perl’s the way to go for XML parsing. It’s relatively easy and allows for maintainable code (unlike PHP’s parser). Unfortunately, Perl’s not nearly as popular for web scripting as PHP is, so if I wanted to distribute this, I’d probably have to use a PHP parser. Anyway, I’ve had the Perl parser running on the site for about two weeks now, and I haven’t received complaints or seen any bugs, so things seem to be working well.