by Kip Hampton
Over the last several months we have explored some of the ways that Perl's XML modules can be used to create complex, modern Web publishing systems. The growing success of projects like AxKit, Bricolage, and others shows that the combination of Perl and XML is quite capable of creating large-scale applications. However, the conceptual focus of recent columns, together with the fact that the Perl/XML combination is most often seen in these kinds of complex systems, seems to be giving the larger Perl community the impression that processing XML with Perl is itself complex and only worth the effort for big projects.
The truth is that putting Perl's XML processing facilities to work is no harder than using any other part of Perl; and if the applications that feature Perl/XML in a visible way are complex, it is because the problems that those applications are designed to solve are complex. To drive this point home, this month we will get back to our Perlish roots by examining how Perl can be used on the command line to perform a range of common XML tasks.
For our first few examples we will focus on those modules that ship with command line tools as part of their distributions.
XML::XPath and the xpath Utility
Requires: XML::XPath, XML::Parser
Matt Sergeant's fine XML::XPath module provides a way to access the contents of XML documents using the W3C-recommended XPath language. This module also installs a utility called xpath that allows XPath expressions to be used to examine the contents of XML documents from the command line. The XML document can be specified either by passing the path to a file on disk as the first argument, or by piping the document in via STDIN.
Find all section titles in a DocBook XML document:
xpath mybook.xml //section/title
The same, using a pipe instead:
cat files/mybook.xml | xpath //section/title
Retrieve just the significant text (ignoring text nodes that contain only whitespace) from a given document:
xpath somefile.xml "//text()[string-length(normalize-space(.)) > 0]"
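If you need the same query from inside a script rather than at the shell, the XML::XPath API is just as direct. Here is a minimal sketch, reusing the mybook.xml file from the first example above:

use strict;
use XML::XPath;

# Load the document and evaluate the same expression
# that the xpath utility ran above.
my $xp = XML::XPath->new(filename => 'mybook.xml');

for my $title ($xp->findnodes('//section/title')) {
    print $title->string_value, "\n";
}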
DBIx::XML_RDB and the sql2xml Utility
Requires: DBIx::XML_RDB, DBI
Fans of Matt's popular DBIx::XML_RDB module will be pleased to know that it, too, ships with a command line tool, sql2xml, that returns an entire database table as a single XML document.
Save the data stored in the 'users' table as the file users.xml:
sql2xml.pl -sn myserver -driver Oracle -uid user -pwd seekrit -table users -output users.xml
The same, but send the data to STDOUT:
sql2xml.pl -sn myserver -driver Oracle -uid user -pwd seekrit -table users -output -
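Under the hood, sql2xml.pl is a thin wrapper around the module itself, so arbitrary queries are only a few lines away. Here is a minimal sketch based on the module's documented synopsis; the server name, credentials, and table are the same placeholders used in the examples above:

use strict;
use DBIx::XML_RDB;

# Connect, run a query, and print the resulting XML document.
my $xmlout = DBIx::XML_RDB->new('myserver', 'Oracle', 'user', 'seekrit')
    or die "Failed to connect to database";
$xmlout->DoSql('SELECT * FROM users');
print $xmlout->GetData;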
XML::Handler::YAWriter and the xmlpretty Utility
Requires: XML::Handler::YAWriter, XML::Parser::PerlSAX
No matter how careful folks are about indenting and so on while editing XML documents, those documents often need reformatting before they can reasonably be called "human-readable". Michael Koehne's XML::Handler::YAWriter SAX handler installs an XML pretty-printer called xmlpretty that reduces this task to a quick one-liner.
Passing a file name:
xmlpretty overwrought.xml > new.xml
The same, reading from STDIN:
cat overwrought.xml | xmlpretty > new.xml
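The same result is available from inside a program by wiring XML::Handler::YAWriter to a PerlSAX parser yourself. A minimal sketch, with the Pretty options taken from YAWriter's documentation and the file name borrowed from the example above:

use strict;
use XML::Handler::YAWriter;
use XML::Parser::PerlSAX;

# Write pretty-printed XML to STDOUT ('-') as the SAX events arrive.
my $writer = XML::Handler::YAWriter->new(
    AsFile => '-',
    Pretty => {
        PrettyWhiteIndent  => 1,
        PrettyWhiteNewline => 1,
    },
);

my $parser = XML::Parser::PerlSAX->new(Handler => $writer);
$parser->parse(Source => { SystemId => 'overwrought.xml' });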
XML::SemanticDiff and the xmlsemdiff Utility
Requires: XML::SemanticDiff, XML::Parser
Unfortunately, standard command line text-processing tools like diff often fall short when dealing with XML documents. My XML::SemanticDiff module was designed to make comparing the relevant parts of two XML documents easy and straightforward, while ignoring insignificant details like extra whitespace or the same namespace URI being bound to different prefixes. Newer versions of this module install the xmlsemdiff tool, which allows simple access from the shell.
Print the semantic differences between two XML documents to STDOUT:
xmlsemdiff file1.xml file2.xml
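The xmlsemdiff script is itself little more than a few lines wrapped around the module's compare method. A minimal sketch of the same operation in a script, taken almost directly from the module's documented interface:

use strict;
use XML::SemanticDiff;

# compare() returns one hashref per semantic difference found.
my $diff = XML::SemanticDiff->new;

foreach my $change ($diff->compare('file1.xml', 'file2.xml')) {
    print "$change->{message} (in $change->{context})\n";
}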
Xerces-Perl
Requires: XML::Xerces
The Apache Software Foundation's Xerces-Perl project offers a Perl interface to the Xerces C++ XML parser. Xerces-Perl ships with several sample scripts that can be copied into your favorite bin directory. The most notable difference between Xerces and the other XML parsers available to the Perl world is that it provides a way to validate XML documents against W3C XML Schemas.
Calculate the time needed to process an XML document while validating it against an XML Schema:
DOMCount.pl -v=auto -s mydoc.xml
xmllint
Developers using XML::LibXML for Perl XML processing often aren't aware of the feature-rich command line XML processing tool xmllint that is installed with the C libraries that XML::LibXML depends upon. No, xmllint is not a Perl tool, but its many features, and the fact that it can easily be piped together with other tools, make it more than worthy of mention here.
Use the built-in HTML parser to convert not-well-formed HTML to XML before further processing:
xmllint --html khampton_perl_xml_17.html | xpath "//a[@href]"
Similar, but using the DocBook SGML parser:
xmllint --sgml ye-olde.sgml | xpath "//chapter[@id='chapt4']"
Using xmllint as a pretty-printer:
cat some.xml | xmllint --format -
Using xmllint to validate a document against an external DTD:
cat some.xml | xmllint --postvalid --dtdvalid my.dtd -
Perl One-Liners
Requires: Devel::TraceSAX, XML::SAX, XML::SAX::Machines
While the syntax may be a bit verbose, it is entirely possible to use XML::SAX::Machines to bring the power of Perl SAX2 to the command line.
Using XML::SAX::Machines to produce an XML document on STDOUT after applying a SAX filter:
perl -MXML::SAX::Machines=Pipeline -e 'Pipeline("XML::MyFilter", \*STDOUT)->parse_uri("files/camelids.xml");'
The same, reading from STDIN:
cat files/camelids.xml | perl -MXML::SAX::Machines=Pipeline -e 'Pipeline("XML::MyFilter", \*STDOUT)->parse_string(join "", <STDIN>);'
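Both one-liners assume a filter class named XML::MyFilter, which is a stand-in rather than a real CPAN module. A minimal sketch of what such a filter might look like, built on XML::SAX::Base (this one simply upper-cases element names and forwards everything else untouched):

package XML::MyFilter;
use strict;
use base qw(XML::SAX::Base);

# Upper-case each element name; XML::SAX::Base forwards any
# events we do not override to the next handler in the pipeline.
sub start_element {
    my ($self, $element) = @_;
    $element->{Name}      = uc $element->{Name};
    $element->{LocalName} = uc $element->{LocalName};
    $self->SUPER::start_element($element);
}

sub end_element {
    my ($self, $element) = @_;
    $element->{Name}      = uc $element->{Name};
    $element->{LocalName} = uc $element->{LocalName};
    $self->SUPER::end_element($element);
}

1;

With that package saved somewhere in @INC as XML/MyFilter.pm, the one-liners above should run as written.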
Also, when writing custom SAX filters, it is often very helpful to be able to examine which events are being generated and forwarded to which classes. Barrie Slaymaker's Devel::TraceSAX makes this a painless process.
Debugging SAX events by tracing them through multiple filters:
perl -d:TraceSAX -MXML::SAX::Machines=Pipeline -e 'Pipeline("XML::Filter1", "XML::Filter2")->parse_uri("file.xml");'
Processing XML with Perl does not mean buying into a huge XML-centric application with a steep learning curve, or departing from Perl's long history as a command line tool. You may not use all of the tools or techniques described here, but it is nice to know that they are available when and if you need them.