O'Reilly Media | Perl and XML
Perl and XML
By Erik T. Ray, Jason McIntosh
Book Price: $34.95 USD
£24.95 GBP
PDF Price: $24.99
Table of Contents
- Chapter 1: Perl and XML
- Content preview · Buy PDF of this chapter | Buy reprint rights for this chapter

Perl is a mature but eccentric programming language that is tailor-made for text manipulation. XML is a fiery young upstart of a text-based markup language used for web content, document processing, web services, or any situation in which you need to structure information flexibly. This book is the story of the first few years of their sometimes rocky (but ultimately happy) romance.

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!

- Why Use Perl with XML?
First and foremost, Perl is ideal for crunching text. It has filehandles, "here" docs, string manipulation, and regular expressions built into its syntax. Anyone who has ever written code to manipulate strings in a low-level language like C and then tried to do the same thing in Perl has no trouble telling you which environment is easier for text processing. XML is text at its core, so Perl is uniquely well suited to work with it.

Furthermore, starting with Version 5.6, Perl has been getting friendly with Unicode-flavored character encodings, especially UTF-8, which is important for XML processing. You'll read more about character encoding in Chapter 3.

Second, the Comprehensive Perl Archive Network (CPAN) is a multimirrored heap of modules free for the taking. You could say that it takes a village to make a program; anyone who undertakes a programming project in Perl should check the public warehouse of packaged solutions and building blocks to save time and effort. Why write your own parser when CPAN has plenty of parsers to download, all tested and chock full of configurability? CPAN is wild and woolly, with contributions from many people and not much supervision. The good news is that when a new technology emerges, a module supporting it pops up on CPAN in short order. This feature complements XML nicely, since it's always changing and adding new accessory technologies.

Early on, modules sprouted up around XML like mushrooms after a rain. Each module brought with it a unique interface and style that was innovative and Perlish, but not interchangeable. Recently, there has been a trend toward creating a universal interface so modules can be interchangeable. If you don't like this SAX parser, you can plug in another one with no extra work. Thus, the CPAN community does work together and strive for internal coherence.

Third, Perl's flexible, object-oriented programming capabilities are very useful for dealing with XML. An XML document is a hierarchical structure made of a single basic atomic unit, the XML element, that can hold other elements as its children. Thus, the elements that make up a document can be represented by one class of objects that all have the same, simple interface. Furthermore, XML markup encapsulates content the way objects encapsulate code and data, so the two complement each other nicely. You'll also see that objects are useful for modularizing XML processors. These objects include parser objects, parser factories that serve up parser objects, and parsers that return objects. It all adds up to clean, portable code.

Additional content appearing in this section has been removed.
- XML Is Simple with XML::Simple
Many people, understandably, think of XML as the invention of an evil genius bent on destroying humanity. The embedded markup, with its angle brackets and slashes, is not exactly a treat for the eyes. Add to that the business about nested elements, node types, and DTDs, and you might cower in the corner and whimper for nice, tab-delineated files and a split function.

Here's a little secret: writing programs to process XML is not hard. A whole spectrum of tools that handle the mundane details of parsing and building data structures for you is available, with convenient APIs that get you started in a few minutes. If you really need the complexity of a full-featured XML application, you can certainly get it, but you don't have to. XML scales nicely from simple to bafflingly complex, and if you deal with XML on the simple end of the continuum, you can pick simple tools to help you.

To prove our point, we'll look at a very basic module called XML::Simple, created by Grant McLean. With minimal effort up front, you can accomplish a surprising amount of useful work when processing XML.

A typical program reads in an XML document, makes some changes, and writes it back out to a file. XML::Simple was created to automate this process as much as possible. One subroutine call reads in an XML document and stores it in memory for you, using nested hashes to represent elements and data. After you make whatever changes you need to make, call another subroutine to print it out to a file.

Let's try it out. As with any module, you have to introduce XML::Simple to your program with a use pragma like this:

  use XML::Simple;

When you do this, XML::Simple exports two subroutines into your namespace:

- XMLin(): This subroutine reads an XML document from a file or string and builds a data structure to contain the data and element structure. It returns a reference to a hash containing the structure.
- XMLout(): Given a reference to a hash containing an encoded document, this subroutine generates XML markup and returns it as a string of text.

Additional content appearing in this section has been removed.
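The XMLin/XMLout round trip described in this section can be sketched in a few lines. The document, the element names, and the RootName option used here are our own illustration, not an example from the book:

```perl
use strict;
use warnings;
use XML::Simple;

# Read a document into nested hashes.  XMLin() accepts a filename,
# a filehandle, or (as here) a string of XML.
my $config = XMLin('<config><host>localhost</host><port>8080</port></config>');

# The element structure is now ordinary Perl data.
$config->{port} = 9090;

# Write it back out as XML markup.  By default XML::Simple emits
# simple scalar values as attributes of the root element.
my $xml = XMLout($config, RootName => 'config');
print $xml;
```

In a real program you would pass XMLin a filename and write XMLout's return value back to disk; the string form just keeps the sketch self-contained.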
- XML Processors
Now that you see the easy side of XML, we will expose some of XML's quirks. You need to consider these quirks when working with XML and Perl.

When we refer in this book to an XML processor (which we'll often refer to in shorthand as a processor, not to be confused with the central processing unit of a computer system that has the same nickname), we refer to software that can either read or generate XML documents. We use this term in the most general way: what the program actually does with the content it might find in the XML it reads is not the concern of the processor itself, nor is it the processor's responsibility to determine the origin of the document or decide what to do with one that is generated.

As you might expect, a raw XML processor working alone isn't very interesting. For this reason, a computer program that actually does something cool or useful with XML uses a processor as just one component. It usually reads an XML file and, through the magic of parsing, turns it into in-memory structures that the rest of the program can do whatever it likes with.

In the Perl world, this behavior becomes possible through the use of Perl modules: typically, a program that needs to process XML embraces, through a use directive, an existing package that makes a programmer interface available (usually an object-oriented one). This is why, before they get down to business, many XML-handling Perl programs start out with

  use XML::Parser;

or something similar. With one little line, they're able to leave all the dirty work of XML parsing to another, previously written module, leaving their own code to decide what to do pre- and post-processing.

Additional content appearing in this section has been removed.
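To make that division of labor concrete, here is a minimal sketch of a program built around XML::Parser. The handler, the sample document, and the element-counting task are our own invention, not from the book:

```perl
use strict;
use warnings;
use XML::Parser;

my $count = 0;

# The module does the dirty work of parsing; our code only supplies
# handlers.  The Start handler is called once per start tag, as
# ($expat, $element, %attrs).
my $parser = XML::Parser->new(
    Handlers => { Start => sub { $count++ } },
);

# parse() accepts a string or filehandle; parsefile() reads from disk.
$parser->parse('<memo><to>All</to><para>Hello</para></memo>');

print "Saw $count elements\n";
```

Everything before the parse call and after it is the pre- and post-processing the text mentions; the parser itself never cares what the counts are for.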
- A Myriad of Modules
One of Perl's strengths is that it's a community-driven language. When Perl programmers identify a need and write a module to handle it, they are encouraged to distribute it to the world at large via CPAN. The advantage of this is that if there's something you want to do in Perl and there's a possibility that someone else wanted to do it previously, a Perl module is probably already available on CPAN.

However, for a technology that's as young, popular, and creatively interpretable as XML, the community-driven model has a downside. When XML first caught on, many different Perl modules written by different programmers appeared on CPAN, seemingly all at once. Without a governing body, they all coexisted in inconsistent glee, with a variety of structures, interfaces, and goals.

Don't despair, though. In the time since the mist-enshrouded elder days of 1998, a movement towards some semblance of organization and standards has emerged from the Perl/XML community (which primarily manifests on ActiveState's perl-xml mailing list, as mentioned in the preface). The community built on these first modules to make tools that followed the same rules that other parts of the XML world were settling on, such as the SAX and DOM parsing standards, and implemented XML-related technologies such as XPath. Later, the field of basic, low-level parsers started to widen. Recently, some very interesting systems have emerged (such as XML::SAX) that bring truly Perlish levels of DWIMminess out of these same standards.

Of course, the goofy, quick-and-dirty tools are still there if you want to use them, and XML::Simple is among them. We will try to help you understand when to reach for the standards-using tools and when it's OK to just grab your XML and run giggling through the daffodils.

Additional content appearing in this section has been removed.
- Keep in Mind...
In many cases, you'll find that the XML modules on CPAN satisfy 90 percent of your needs. Of course, that final 10 percent is the difference between being an essential member of your company's staff and ending up slated for the next round of layoffs. We're going to give you your money's worth out of this book by showing you in gruesome detail how XML processing in Perl works at the lowest levels (relative to any other kind of specialized text munging you may perform with Perl). To start, let's go over some basic truths:
- It doesn't matter where it comes from. By the time the XML parsing part of a program gets its hands on a document, it doesn't give a camel's hump where the thing came from. It could have been received over a network, constructed from a database, or read from disk. To the parser, it's good (or bad) XML, and that's all it knows. Mind you, the program as a whole might care a great deal. If we write a program that implements XML-RPC, for example, it better know exactly how to use TCP to fetch and send all that XML data over the Internet! We can have it do that fetching and sending however we like, as long as the end product is the same: a clean XML document fit to pass to the XML processor that lies at the program's core. We will get into some detailed examples of larger programs later in this book.
- Structurally, all XML documents are similar. No matter why or how they were put together or to what purpose they'll be applied, all XML documents must follow the same basic rules of well-formedness: exactly one root element, no overlapping elements, all attributes quoted, and so on. Every XML processor's parser component will, at its core, need to do the same things as every other XML processor. This, in turn, means that all these processors can share a common base. Perl XML-processing programs usually observe this in their use of one of the many free parsing modules, rather than having to reimplement basic XML parsing procedures every time. Furthermore, the one-document, one-element nature of XML makes processing a pleasantly fractal experience, as any document invoked through an external entity by another document magically becomes "just another element" within the invoker, and the same code that crawled the first document can skitter into the meat of any reference (and anything to which the reference might refer) without batting an eye.
Additional content appearing in this section has been removed.
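As a sketch of the "same basic rules" point: because every conforming parser must reject ill-formed input, any of them can double as a well-formedness checker. This example is our own, not from the book; it relies on the fact that XML::Parser dies on a parse error, so the parse is wrapped in eval:

```perl
use strict;
use warnings;
use XML::Parser;

# Returns true if the string is well-formed XML, false otherwise.
# A fresh parser is created per call, since a parser instance should
# not be reused after a failed parse.
sub is_well_formed {
    my ($xml) = @_;
    return eval { XML::Parser->new->parse($xml); 1 } ? 1 : 0;
}

print is_well_formed('<a><b>ok</b></a>') ? "good\n" : "bad\n";  # properly nested
print is_well_formed('<a><b></a></b>')   ? "good\n" : "bad\n";  # overlapping elements
```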
- XML Gotchas
This section introduces topics we think you should keep in mind as you read the book. They are the source of many of the problems you'll encounter when working with XML.
- Well-formedness
- XML has built-in quality control. A document has to pass some minimal syntax rules in order to be blessed as well-formed XML. Most parsers fail to handle a document that breaks any of these rules, so you should make sure any data you input is of sufficient quality.
- Character encodings
- Now that we're in the 21st century, we have to pay attention to things like character encodings. Gone are the days when you could be content knowing only about ASCII, the little character set that could. Unicode is the new king, presiding over all major character sets of the world. XML prefers to work with Unicode, but there are many ways to represent it, including Perl's favorite Unicode encoding, UTF-8. You usually won't have to think about it, but you should still be aware of the potential.
- Namespaces
- Not everyone works with or even knows about namespaces. It's a feature in XML whose usefulness is not immediately obvious, yet it is creeping into our reality slowly but surely. These devices categorize markup and declare tags to be from different places. With them, you can mix and match document types, blurring the distinctions between them. Equations in HTML? Markup as data in XSLT? Yes, and namespaces are the reason. Older modules don't have special support for namespaces, but the newer generation will. Keep it in mind.
- Declarations
- Declarations aren't part of the document per se; they just define pieces of it. That makes them weird, and something you might not pay enough attention to. Remember that documents often use DTDs and have declarations for such things as entities and attributes. If you forget, you could end up breaking something.
- Entities
- Entities and entity references seem simple enough: they stand in for content that you'd rather not type in at that moment. Maybe the content is in another file, or maybe it contains characters that are difficult to type. The concept is simple, but the execution can be a royal pain. Sometimes you want to resolve references and sometimes you'd rather keep them there. Sometimes a parser wants to see the declarations; at other times it doesn't care. Entities can contain other entities to an arbitrary depth. They're tricky little beasties and we guarantee that if you don't give careful thought to how you're going to handle them, they will haunt you.
Additional content appearing in this section has been removed.
- Chapter 2: An XML Recap
XML is a revolutionary (and evolutionary) markup language. It combines the generalized markup power of SGML with the simplicity of free-form markup and well-formedness rules. Its unambiguous structure and predictable syntax make it a very easy and attractive format to process with computer programs.

You are free, with XML, to design your own markup language that best fits your data. You can select element names that make sense to you, rather than use tags that are overloaded and presentation-heavy. If you like, you can formalize the language by using element and attribute declarations in the DTD.

XML has syntactic shortcuts such as entities, comments, processing instructions, and CDATA sections. It allows you to group elements and attributes by namespace to further organize the vocabulary of your documents. Using the xml:space attribute, you can regulate whitespace, sometimes a tricky issue in markup in which human readability is as important as correct formatting.

Some very useful technologies are available to help you maintain and mutate your documents. Schemas, like DTDs, can measure the validity of XML as compared to a canonical model. Schemas go even further by enforcing patterns in character data and improving content model syntax. XSLT is a rich language for transforming documents into different forms. It can be an easier way to work with XML than writing a program, but isn't always.

This chapter gives a quick recap of XML, where it came from, how it's structured, and how to work with it. If you choose to skip this chapter (because you already know XML or because you're impatient to start writing code), that's fine; just remember that it's here if you need it.

Additional content appearing in this section has been removed.
- A Brief History of XML
Early text processing was closely tied to the machines that displayed it. Sophisticated formatting was tied to a particular device—or rather, a class of devices called printers.

Take troff, for example. Troff was a very popular text formatting language included in most Unix distributions. It was revolutionary because it allowed high-quality formatting without a typesetting machine.

Troff mixes formatting instructions with data. The instructions are symbols composed of characters, with a special syntax so a troff interpreter can tell the two apart. For example, the symbol \fI changes the current font style to italic. Without the backslash character, it would be treated as data. This mixture of instructions and data is called markup.

Troff can be even more detailed than that. The instruction .vs 18p tells the formatter to insert 18 points of vertical space at whatever point in the document the instruction appears. Beyond aesthetics, we can't tell just by looking at it what purpose this spacing serves; it gives a very specific instruction to the processor that can't be interpreted in any other way. This instruction is fine if you only want to prepare a document for printing in a specific style. If you want to make changes, though, it can be quite painful.

Suppose you've marked up a book in troff so that every newly defined term is in boldface. Your document has thousands of bold font instructions in it. You're happy and ready to send it to the printer when suddenly, you get a call from the design department. They tell you that the design has changed and they now want the new terms to be formatted as italic. Now you have a problem. You have to turn every bold instruction for a new term into an italic instruction.

Your first thought is to open the document in your editor and do a search-and-replace maneuver. But, to your horror, you realize that new terms aren't the only places where you used bold font instructions. You also used them for emphasis and for proper nouns, meaning that a global replace would also mangle these instances, which you definitely don't want. You can change the right instructions only by going through them one at a time, which could take hours, if not days.

Additional content appearing in this section has been removed.
- Markup, Elements, and Structure
A markup language provides a way to embed instructions inside data to help a computer program process the data. Most markup schemes, such as troff, TeX, and HTML, have instructions that are optimized for one purpose, such as formatting the document to be printed or to be displayed on a computer screen. These languages rely on a presentational description of data, which controls typeface, font size, color, or other media-specific properties. Although such markup can result in nicely formatted documents, it can be like a prison for your data, consigning it to one format forever; you won't be able to extract your data for other purposes without significant work.

That's where XML comes in. It's a generic markup language that describes data according to its structure and purpose, rather than with specific formatting instructions. The actual presentation information is stored somewhere else, such as in a stylesheet. What's left is a functional description of the parts of your document, which is suitable for many different kinds of processing. With proper use of XML, your document will be ready for an unlimited variety of applications and purposes.

Now let's review the basic components of XML. Its most important feature is the element. Elements are encapsulated regions of data that serve a unique role in your document. For example, consider a typical book, composed of a preface, chapters, appendixes, and an index. In XML, marking up each of these sections as a unique element within the book would be appropriate. Elements may themselves be divided into other elements; you might find the chapter's title, paragraphs, examples, and sections all marked up as elements. This division continues as deeply as necessary, so even a paragraph can contain elements such as emphasized text, quotations, and hypertext links.

Besides dividing text into a hierarchy of regions, elements associate a label and other properties with the data. Every element has a name, or element type, usually describing its function in the document. Thus, a chapter element could be called a "chapter" (or "chapt" or "ch"—whatever you fancy). An element can include other information besides the type, using a name-value pair called an

Additional content appearing in this section has been removed.
- Namespaces
It's sometimes useful to divide up your elements and attributes into groups, or namespaces. A namespace is to an element somewhat as a surname is to a person. You may know three people named Mike, but no two of them have the same last name. To illustrate this concept, look at the document in Example 2-2.

Example 2-2. A document using namespaces
  <?xml version="1.0"?>
  <report>
    <title>Fish and Bicycles: A Connection?</title>
    <para>I have found a surprising relationship of fish to bicycles,
    expressed by the equation <equation>f = kb+n</equation>. The graph
    below illustrates the data curve of my experiment:</para>
    <graph:chart xmlns:graph="https://mathstuff.com/dtds/chartml/">
      <graph:dimension>
        <graph:axis>fish</graph:axis>
        <graph:start>80</graph:start>
        <graph:end>99</graph:end>
        <graph:interval>1</graph:interval>
      </graph:dimension>
      <graph:dimension>
        <graph:axis>bicycle</graph:axis>
        <graph:start>0</graph:start>
        <graph:end>1000</graph:end>
        <graph:interval>50</graph:interval>
      </graph:dimension>
      <graph:equation>fish=0.01*bicycle+81.4</graph:equation>
    </graph:chart>
  </report>
Two namespaces are at play in this example. The first is the default namespace, where elements and attributes lack a colon in their name. The elements whose names contain graph: are from the "chartml" namespace (something we just made up). graph: is a namespace prefix that, when attached to an element or attribute name, becomes a qualified name. The two <equation> elements are completely different element types, with a different role to play in the document. The one in the default namespace is used to format an equation literally, and the one in the chart namespace helps a graphing program generate a curve.

A namespace must always be declared in an element that contains the region where it will be used. This is done with an attribute of the form xmlns:prefix="URL", where prefix is the namespace prefix to be used (in this case, graph) and URL is a unique identifier in the form of a URL or other resource identifier. Outside of the scope of this element, the namespace is not recognized.

Additional content appearing in this section has been removed.
- Spacing
You'll notice in examples throughout this book that we indent elements and add spaces wherever it helps make the code more readable to humans. Doing so is not unreasonable if you ever have to edit or inspect XML code personally. Sometimes, however, this indentation can result in space that you don't want in your final product. Since XML has a make-no-assumptions policy toward your data, it may seem that you're stuck with all that space.

One solution is to make the XML processor smarter. Certain parsers can decide whether to pass space along to the processing application. They can determine from the element declarations in the DTD when space is only there for readability and is not part of the content. Alternatively, you can instruct your processor to specialize in a particular markup language and train it to treat some elements differently with respect to space.

When neither option applies to your problem, XML provides a way to let a document tell the processor when space needs to be preserved. The reserved attribute xml:space can be used in any element to specify whether space should be kept as is or removed. For example:

  <address-label xml:space='preserve'>246 Marshmellow Ave.
  Slumberville, MA 02149</address-label>

In this case, the characters used to break lines in the address are retained for all future processing. The other setting for xml:space is "default", which means that the XML processor has to decide what to do with extra space.

Additional content appearing in this section has been removed.
- Entities
For your authoring convenience, XML has another feature called entities. An entity is useful when you need a placeholder for text or markup that would be inconvenient or impossible to just type in. It's a piece of XML set aside from your document; you use an entity reference to stand in for it. An XML processor must resolve all entity references with their replacement text at the time of parsing. Therefore, every referenced entity must be declared somewhere so that the processor knows how to resolve it.

The Document Type Declaration (DTD) is the place to declare an entity. It has two parts, the internal subset that is part of your document, and the external subset that lives in another document. (Often, people talk about the external subset as "the DTD" and call the internal subset "the internal subset," even though both subsets together make up the whole DTD.) In both places, the method for declaring entities is the same. The document in Example 2-3 shows how this feature works.

Example 2-3. A document with entity declarations
<!DOCTYPE memo SYSTEM "/xml-dtds/memo.dtd" [ <!ENTITY companyname "Willy Wonka's Chocolate Factory"> <!ENTITY healthplan SYSTEM "hp.txt"> ]> <memo> <to>All Oompa-loompas</to> <para> &companyname; has a new owner and CEO, Charlie Bucket. Since our name, &companyname;, has considerable brand recognition, the board has decided not to change it. However, at Charlie's request, we will be changing our healthcare provider to the more comprehensive Ümpacare, which has better facilities for 'Loompas (text of the plan to follow). Thank you for working at &companyname;! </para> &healthplan; </memo>
Let's examine the new material in this example. At the top is the document type declaration, a special markup instruction that contains a lot of important information, including the internal subset and a path to the external subset. Like all declarative markup (i.e., it defines something new), it starts with an exclamation point, and is followed by a keyword,

Additional content appearing in this section has been removed.
- Unicode, Character Sets, and Encodings
At low levels, computers see text as a series of positive integers mapped onto character sets, which are collections of numbered characters (and sometimes control codes) that some standards body created. A very common collection is the venerable US-ASCII character set, which contains 128 characters, including upper- and lowercase letters of the Latin alphabet, numerals, various symbols and space characters, and a few special print codes inherited from the old days of teletype terminals. By adding an eighth bit, this 7-bit system is extended into a larger set with twice as many characters, such as ISO Latin-1, used in many Unix systems. These characters include other European characters, such as Latin letters with accents, Icelandic characters, ligatures, footnote marks, and legal symbols. Alas, humanity, a species bursting with both creativity and pride, has invented many more linguistic symbols than can be mapped onto an 8-bit number.

For this reason, a new character encoding architecture called Unicode has gained acceptance as the standard way to represent every written script in which people might want to store data (or write computer code). Depending on the flavor used, it uses up to 32 bits to describe a character, giving the standard room for millions of individual glyphs. For over a decade, the Unicode Consortium has been filling up this space with characters ranging from the entire Han Chinese character set to various mathematical, notational, and signage symbols, and still leaves the encoding space with enough room to grow for the coming millennium or two.

Given all this effort we're putting into hyping it, it shouldn't surprise you to learn that, while an XML document can use any type of encoding, it will by default assume the Unicode-flavored, variable-length encoding known as UTF-8. This encoding uses between one and six bytes to encode the number that represents the character's Unicode address; for any address greater than 127, the first byte also records how many bytes the character occupies. It's possible to write an entire document in 1-byte characters and have it be indistinguishable from US-ASCII (a humble address block with addresses ranging from 0 to 127), but if you need the occasional high character, or if you need a lot of them (as you would when storing Asian-language data, for example), it's easy to encode in UTF-8. Unicode-aware processors handle the encoding correctly and display the right glyphs, while older applications simply ignore the multibyte characters and pass them through unharmed. Since Version 5.6, Perl has handled UTF-8 characters with increasing finesse. We'll discuss Perl's handling of Unicode in more depth in Chapter 3.

Additional content appearing in this section has been removed.
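A quick way to watch those byte counts from Perl (a sketch using the core Encode module; the three characters are chosen arbitrarily for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

# How many bytes UTF-8 spends on a few characters: an ASCII letter,
# a Latin-1 accented letter, and a character from a higher block.
for my $char ( "A", "\x{00E9}", "\x{263A}" ) {
    my $octets = encode( 'UTF-8', $char );
    printf "U+%04X takes %d byte(s)\n", ord( $char ), length( $octets );
}
# The ASCII letter stays at 1 byte; the others grow to 2 and 3 bytes.
```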
- The XML Declaration
- Content preview·Buy PDF of this chapter|Buy reprint rights for this chapterAfter reading about character encodings, an astute reader may wonder how to declare the encoding in the document so an XML processor knows which one you're using. The answer is: declare the encoding in the XML declaration. The XML declaration is a line at the very top of a document that describes the kind of markup you're using, including XML version, character encoding, and whether the document requires an external subset of the DTD. The declaration looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The declaration is optional, as is each of its parameters (except for the required version attribute). The encoding parameter is important only if you use a character encoding other than UTF-8 (since it's the default encoding). If explicitly set to "yes", the standalone declaration causes a validating parser to raise an error if the document references external entities.

Additional content appearing in this section has been removed.
- Processing Instructions and Other Markup
Besides elements, you can use several other syntactic objects to make XML easier to manage. Processing instructions (PIs) are used to convey information to a particular XML processor. They specify the intended processor with a target parameter, which is followed by an optional data parameter. Any program that doesn't recognize the target simply skips the PI and pretends it never existed. Here is an example based on an actual behind-the-scenes O'Reilly book hacking experience:
    <?file-breaker start chap04.xml?><chapter>
      <title>The very long title<?lb?>that seemed to go on forever and
      ever</title>
      <?xml2pdf vspace 10pt?>
The first PI has a target called file-breaker and its data is chap04.xml. A program reading this document will look for a PI with that target keyword and will act on that data. In this case, the goal is to create a new file and save the following XML into it.

The second PI has only a target, lb. We have actually seen this example used in documents to tell an XML processor to create a line break at that point. This example has two problems. First, the PI is a replacement for a space character; that's bad because any program that doesn't recognize the PI will not know that a space should be between the two words. It would be better to place a space after the PI and let the target processor remove any following space itself. Second, the target is an instruction, not an actual name of a program. A more unique name like the one in the next PI, xml2pdf, would be better (with the lb appearing as data instead).

PIs are convenient for developers. They have no solid rules that specify how to name a target or what kind of data to use, but in general, target names ought to be very specific and data should be very short.

Those who have written documents using Perl's built-in Plain Old Documentation mini-markup language may note a similarity between PIs and certain POD directives, particularly the =for paragraphs and =begin/=end blocks. In these paragraphs and blocks, you can leave little messages for a POD processor with a target and some arguments (or any string of text).
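For comparison, the rough POD counterparts of a targeted instruction look like this (a sketch; the xml2pdf target name is borrowed from the PI example above):

```pod
=for xml2pdf vspace 10pt

=begin xml2pdf

vspace 10pt

=end xml2pdf
```

As with PIs, a POD formatter that doesn't recognize the xml2pdf target simply skips these regions.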
- Free-Form XML and Well-Formed Documents
XML's grandfather, SGML, required that every element and attribute be documented thoroughly with a long list of declarations in the DTD. We'll describe what we mean by that thorough documentation in the next section, but for now, imagine it as a blueprint for a document. This blueprint adds considerable overhead to the processing of a document and was a serious obstacle to SGML's status as a popular markup language for the Internet. HTML, which was originally developed as an SGML instance, was hobbled by this enforced structure, since any "valid" HTML document had to conform to the HTML DTD. Hence, extending the language was impossible without approval by a web committee.

XML does away with that requirement by allowing a special condition called free-form XML. In this mode, a document has to follow only minimal syntax rules to be acceptable. If it follows those rules, the document is well-formed. Following these rules is wonderfully liberating for a developer because it means that you don't have to scan a DTD every time you want to process a piece of XML. All a processor has to do is make sure that minimal syntax rules are followed.

In free-form XML, you can choose the name of any element. It doesn't have to belong to a sanctioned vocabulary, as is the case with HTML. Including frivolous markup in your program is a risk, but as long as you know what you're doing, it's okay. If you don't trust the markup to fit a pattern you're looking for, then you need to use element and attribute declarations, as we describe in the next section.

What are these rules? Here's a short list as seen through a coarse-grained spyglass:
- A document can have only one top-level element, the document element, that contains all the other elements and data. This element does not include the XML declaration and document type declaration, which must precede it.
- Every element with content must have both a start tag and an end tag.
- Element and attribute names are case sensitive, and only certain characters can be used (letters, underscores, hyphens, periods, and numbers), with only letters and underscores eligible as the first character. Colons are allowed, but only as part of a declared namespace prefix.
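For instance, a document as small as this sketch already satisfies all three rules:

```xml
<?xml version="1.0"?>
<greeting type="casual">Hello, world!</greeting>
```

There is one document element (<greeting>) preceded only by the XML declaration, its start and end tags match, and every name uses legal characters.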
Additional content appearing in this section has been removed.
- Declaring Elements and Attributes
When you need an extra level of quality control (beyond the healthful status implied by the "well-formed" label), define the grammar patterns of your markup language in the DTD. Defining the patterns will make your markup into a formal language, documented much like a standard published by an international organization. With a DTD, a program can tell in short order whether a document conforms to, or, as we say, is a valid example of, your document type.

Two kinds of declarations allow a DTD to model a language. The first is the element declaration. It adds a new name to the allowed set of elements and specifies, in a special pattern language, what can go inside the element. Here are some examples:
    <!ELEMENT sandwich (((meat | cheese)+ | (peanut-butter, jelly)), condiment+, pickle?)>
    <!ELEMENT pickle EMPTY>
    <!ELEMENT condiment (#PCDATA | mustard | ketchup)*>
The first parameter declares the name of the element. The second parameter is a pattern, a content model in parentheses, or a keyword such as EMPTY. Content models resemble regular expression syntax, the main differences being that element names are complete tokens and a comma is used to indicate a required sequence of elements. Every element mentioned in a content model should be declared somewhere in the DTD.

The other important kind of declaration is the attribute list declaration. With it, you can declare a set of optional or required attributes for a given element. The attribute values can be controlled to some extent, though the pattern restrictions are somewhat limited. Let's look at an example:

    <!ATTLIST sandwich
      id     ID    #REQUIRED
      price  CDATA #IMPLIED
      taste  CDATA #FIXED "yummy"
      name   (reuben | ham-n-cheese | BLT | PB-n-J) 'BLT'
    >

The general pattern of an attribute declaration has three parts: a name, a data type, and a behavior. This example declares four attributes for the element <sandwich>. The first, named id, is of type ID, which is a unique string of characters that can be used only once in any ID-type attribute throughout the document, and is required because of the

Additional content appearing in this section has been removed.
- Schemas
Several proposed schema languages address the shortcomings of DTD declarations. The W3C's recommended language for doing this is called XML Schema. You should know, however, that it is only one of many competing schema-type languages, some of which may be better suited to your needs. If you prefer a competing schema language, check CPAN to see if a module has been written to handle your favorite flavor of schemas.

Unlike DTD syntax, XML Schemas are themselves XML documents, making it possible to use many XML tools to edit them. Their real power, however, is in their fine-grained control over the form your data takes. This control makes the language more attractive for documents in which checking the quality of data is at least as important as ensuring it has the proper structure. Example 2-4 shows a schema designed to model census forms, where data type checking is necessary.

Example 2-4. An XML schema
    <xs:schema xmlns:xs="https://www.w3.org/2001/XMLSchema">
      <xs:annotation>
        <xs:documentation>
          Census form for the Republic of Oz
          Department of Paperwork, Emerald City
        </xs:documentation>
      </xs:annotation>
      <xs:element name="census" type="CensusType"/>
      <xs:complexType name="CensusType">
        <xs:element name="censustaker" type="xs:decimal" minOccurs="0"/>
        <xs:element name="address" type="Address"/>
        <xs:element name="occupants" type="Occupants"/>
        <xs:attribute name="date" type="xs:date"/>
      </xs:complexType>
      <xs:complexType name="Address">
        <xs:element name="number" type="xs:decimal"/>
        <xs:element name="street" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="province" type="xs:string"/>
        <xs:attribute name="postalcode" type="PCode"/>
      </xs:complexType>
      <xs:simpleType name="PCode" base="xs:string">
        <xs:pattern value="[A-Z]-\d{3}"/>
      </xs:simpleType>
      <xs:complexType name="Occupants">
        <xs:element name="occupant" minOccurs="1" maxOccurs="20">
          <xs:complexType>
            <xs:element name="firstname" type="xs:string"/>
            <xs:element name="surname" type="xs:string"/>
            <xs:element name="age">
              <xs:simpleType base="xs:positive-integer">
                <xs:maxExclusive value="200"/>
              </xs:simpleType>
            </xs:element>
          </xs:complexType>
        </xs:element>
      </xs:complexType>
    </xs:schema>
Additional content appearing in this section has been removed.
- Transformations
The last topic we want to introduce is the concept of transformations. In XML, a transformation is a process of restructuring or converting a document into another form. The W3C recommends a language for transforming XML called XSL Transformations (XSLT). It's an incredibly useful and fun technology to work with.

Like XML Schema, an XSLT transformation script is an XML document. It's composed of template rules, each of which is an instruction for how to turn one element type into something else. The term template is often used to mean an example of how something should look, with blanks that you should fill in. That's exactly how template rules work: they are examples of how the final document should be, with the blanks filled in by the XSLT processor.

Example 2-5 is a rudimentary transformation that converts a simple DocBook XML document into an HTML page.

Example 2-5. An XSLT transformation document
    <xsl:stylesheet xmlns:xsl="https://www.w3.org/1999/XSL/Transform"
                    version="1.0">
      <xsl:output method="html"/>

      <!-- RULE FOR BOOK ELEMENT -->
      <xsl:template match="book">
        <html>
          <head>
            <title><xsl:value-of select="title"/></title>
          </head>
          <body>
            <h1><xsl:value-of select="title"/></h1>
            <h3>Table of Contents</h3>
            <xsl:call-template name="toc"/>
            <xsl:apply-templates select="chapter"/>
          </body>
        </html>
      </xsl:template>

      <!-- RULE FOR CHAPTER -->
      <xsl:template match="chapter">
        <xsl:apply-templates/>
      </xsl:template>

      <!-- RULE FOR CHAPTER TITLE -->
      <xsl:template match="chapter/title">
        <h2>
          <xsl:text>Chapter </xsl:text>
          <xsl:number count="chapter" level="any" format="1"/>
        </h2>
        <xsl:apply-templates/>
      </xsl:template>

      <!-- RULE FOR PARA -->
      <xsl:template match="para">
        <p><xsl:apply-templates/></p>
      </xsl:template>

      <!-- NAMED RULE: TOC -->
      <xsl:template name="toc">
        <xsl:if test="count(chapter)>0">
          <xsl:for-each select="chapter">
            <xsl:text>Chapter </xsl:text>
            <xsl:value-of select="position()"/>
            <xsl:text>: </xsl:text>
            <i><xsl:value-of select="title"/></i>
            <br/>
          </xsl:for-each>
        </xsl:if>
      </xsl:template>
    </xsl:stylesheet>
Additional content appearing in this section has been removed.
- Chapter 3: XML Basics: Reading and Writing
This chapter covers the two most important tasks in working with XML: reading it into memory and writing it out again. XML is a structured, predictable, and standard data storage format, and as such carries a price. Unlike the line-by-line, make-it-up-as-you-go style that typifies text hacking in Perl, XML expects you to learn the rules of its game—the structures and protocols outlined in Chapter 2—before you can play with it. Fortunately, much of the hard work is already done, in the form of module-based parsers and other tools that trailblazing Perl and XML hackers already created (some of which we touched on in Chapter 1).

Knowing how to use parsers is very important. They typically drive the rest of the processing for you, or at least get the data into a state where you can work with it. Any good programmer knows that getting the data ready is half the battle. We'll look deeply into the parsing process and detail the strategies used to drive processing.

Parsers come with a bewildering array of options that let you configure the output to your needs. Which character set should you use? Should you validate the document or merely check if it's well formed? Do you need to expand entity references, or should you keep them as references? How can you set handlers for events or tell the parser to build a tree for you? We'll explain these options fully so you can get the most out of parsing.

Finally, we'll show you how to spit XML back out, which can be surprisingly tricky if one isn't aware of XML's expectations regarding text encoding. Getting this step right is vital if you ever want to be able to use your data again without painful hand fixing.

Additional content appearing in this section has been removed.
- XML Parsers
File I/O is an intrinsic part of any programming language, but it has always been done at a fairly low level: reading a character or a line at a time, running it through a regular expression filter, etc. Raw text is an unruly commodity, lacking any clear rules for how to separate discrete portions, other than basic, flat concepts such as newline-separated lines and tab-separated columns. Consequently, more data packaging schemes are available than even the chroniclers of Babel could have foreseen. It's from this cacophony that XML has risen, providing clear rules for how to create boundaries between data, assign hierarchy, and link resources in a predictable, unambiguous fashion. A program that relies on these rules can read any well-formed XML document, as if someone had jammed a babelfish into its ear.

Where can you get this babelfish to put in your program's ear? An XML parser is a program or code library that translates XML data into either a stream of events or a data object, giving your program direct access to structured data. The XML can come from one or more files or filehandles, a character stream, or a static string. It could be peppered with entity references that may or may not need to be resolved. Some of the parts could come from outside your computer system, living in some far corner of the Internet. It could be encoded in a Latin character set, or perhaps in a Japanese set. Fortunately for you, the developer, none of these details have to be accounted for in your program because they are all taken care of by the parser, an abstract tunnel between the physical state of data and the crystallized representation seen by your subroutines.

An XML parser acts as a bridge between marked-up data (data packaged with embedded XML instructions) and some predigested form your program can work with. In Perl's case, we mean hashes, arrays, scalars, and objects made of references to these old friends. XML can be complex, residing in many files or streams, and can contain unresolved regions (entities) that may need to be patched up. Also, a parser usually tries to accept only good XML, rejecting it if it contains well-formedness errors. Its output has to reflect the structure (order, containment, associative data) while ignoring irrelevant details such as what files the data came from and what character set was used. That's a lot of work. To itemize these points, an XML parser:

Additional content appearing in this section has been removed.
- XML::Parser
Writing a parser requires a lot of work. You can't be sure if you've covered everything without a lot of testing. Unless you're a mutant who loves to write efficient, low-level parser code, your program will probably be slow and resource-intensive. The good news is that a wide variety of free, high-quality, and easy-to-use XML parser packages (written by friendly mutants) already exist to help you. People have bashed Perl and XML together for years, and you have a barnful of conveniently pre-invented wheels at your disposal.

Where do Perl programmers go to find ready-made modules to use in their programs? They go to the Comprehensive Perl Archive Network (CPAN), a many-mirrored public resource full of free, open-source Perl code. If you aren't familiar with using CPAN, you must change your isolationist ways and learn to become a programmer of the world. You'll find a multitude of modules authored by folks who have walked the path of Perl and XML before you, and who've chosen to share the tools they've made with the rest of the world.

Don't think of CPAN as a catalog of ready-made solutions for all specific XML problems. Rather, look at it as a toolbox or a source of building blocks you can assemble and configure to craft a solution. While some modules specialize in popular XML applications like RSS and SOAP, most are more general-purpose. Chances are, you won't find a module that specifically addresses your needs. You'll more likely take one of the general XML modules and adapt it somehow. We'll show that this process is painless and reveal several ways to configure general modules to your particular application.

XML parsers differ from one another in two major ways. First, they differ in their parsing style, which is how the parser works with XML. There are a few different strategies, such as building a data structure or creating an event stream. Another attribute of parsers, called standards-completeness, is a spectrum ranging from ad hoc on one extreme to an exhaustive, standards-based solution on the other. The balance on the latter axis is slowly moving from the eccentric, nonstandard side toward the other end as the Perl community agrees on how to implement major standards like SAX and DOM.

Additional content appearing in this section has been removed.
- Stream-Based Versus Tree-Based Processing
Remember the Perl mantra, "There's more than one way to do it"? It is also true when working with XML. Depending on how you want to work and what kind of resources you have, many options are available. One developer may prefer a low-maintenance parsing job and is prepared to be loose and sloppy with memory to get it. Another will need to squeeze out faster and leaner performance at the expense of more complex code. XML processing tasks vary widely, so you should be free to choose the shortest path to a solution.

There are a lot of different XML processing strategies. Most fall into two categories: stream-based and tree-based. With the stream-based strategy, the parser continuously alerts a program to patterns in the XML. The parser functions like a pipeline, taking XML markup on one end and pumping out processed nuggets of data to your program. We call this pipeline an event stream because each chunk of data sent to the program signals something new and interesting in the XML stream. For example, the beginning of a new element is a significant event. So is the discovery of a processing instruction in the markup. With each update, your program does something new—perhaps translating the data and sending it to another place, testing it for some specific content, or sticking it onto a growing heap of data.

With the tree-based strategy, the parser keeps the data to itself until the very end, when it presents a complete model of the document to your program. Instead of a pipeline, it's like a camera that takes a picture and transmits the replica to you. The model is usually in a much more convenient state than raw XML. For example, nested elements may be represented in native Perl structures like lists or hashes, as we saw in an earlier example. Even more useful are trees of blessed objects with methods that help navigate the structure from one place to another. The whole point to this strategy is that your program can pull out any data it needs, in any order.

Why would you prefer one over the other? Each has strong and weak points. Event streams are fast and often have a much slimmer memory footprint, but at the expense of greater code complexity and impermanent data. Tree building, on the other hand, lets the data stick around for as long as you need it, and your code is usually simple because you don't need special tricks to do things like backwards searching. However, trees wither when it comes to economical use of processor time and memory.

Additional content appearing in this section has been removed.
- Putting Parsers to Work
Enough tinkering with the parser's internal details. We want to see what you can do with the stuff you get from parsers. We've already seen an example of a complete, parser-built tree structure in Example 3-3, so let's do something with the other type. We'll take an XML event stream and make it drive processing by plugging it into some code to handle the events. It may not be the most useful tool in the world, but it will serve well enough to show you how real-world XML processing programs are written.
XML::Parser (with Expat running underneath) is at the input end of our program. Expat subscribes to the event-based parsing school we described earlier. Rather than loading your whole XML document into memory and then turning around to see what it hath wrought, it stops every time it encounters a discrete chunk of data or markup, such as an angle-bracketed tag or a literal string inside an element. It then checks to see if our program wants to react to it in any way.

Your first responsibility is to give the parser an interface to the pertinent bits of code that handle events. Each type of event is handled by a different subroutine, or handler. We register our handlers with the parser by setting the Handlers option at initialization time. Example 3-5 shows the entire process.

Example 3-5. A stream-based XML processor

    use XML::Parser;

    # initialize the parser
    my $parser = XML::Parser->new( Handlers => {
                                       Start => \&handle_start,
                                       End   => \&handle_end,
                                   });

    my @element_stack;                # remember which elements are open

    $parser->parsefile( shift @ARGV );

    # process a start-of-element event: print message about element
    #
    sub handle_start {
        my( $expat, $element, %attrs ) = @_;

        # ask the expat object about our position
        my $line = $expat->current_line;
        print "I see an $element element starting on line $line!\n";

        # remember this element and its starting position by pushing a
        # little hash onto the element stack
        push( @element_stack, { element => $element, line => $line });

        if( %attrs ) {
            print "It has these attributes:\n";
            while( my( $key, $value ) = each( %attrs )) {
                print "\t$key => $value\n";
            }
        }
    }

    # process an end-of-element event
    #
    sub handle_end {
        my( $expat, $element ) = @_;

        # We'll just pop from the element stack with blind faith that
        # we'll get the correct closing element, unlike what our
        # homebrewed well-formedness checker did, since XML::Parser
        # will scream bloody murder if any well-formedness errors
        # creep in.
        my $element_record = pop( @element_stack );
        print "I see that $element element that started on line ",
              $$element_record{ line }, " is closing now.\n";
    }
Additional content appearing in this section has been removed.
- XML::LibXML
- Content preview·Buy PDF of this chapter|Buy reprint rights for this chapter
XML::LibXML, like XML::Parser, is an interface to a library written in C. Called libxml2, it's part of the GNOME project. Unlike XML::Parser, this new parser supports a major standard for XML tree processing known as the Document Object Model (DOM).

DOM is another much-ballyhooed XML standard. It does for tree processing what SAX does for event streams. If you have your heart set on climbing trees in your program and you think there's a likelihood that it might be reused or applied to different data sources, you're better off using something standard and interchangeable. Again, we're happy to delve into DOM in a future chapter and get you thinking in standards-compliant ways. That topic is coming up in Chapter 7.

Now we want to show you an example of another parser in action. We'd be remiss if we focused on just one kind of parser when so many are out there. Again, we'll show you a basic example, nothing fancy, just to show you how to invoke the parser and tame its power. Let's write another document analysis tool like we did in Example 3-5, this time printing a frequency distribution of elements in a document.

Example 3-6 shows the program. It's a vanilla parser run because we haven't set any options yet. Essentially, the parser parses the filehandle and returns a DOM object, which is nothing more than a tree structure of well-designed objects. Our program finds the document element, and then traverses the entire tree one element at a time, all the while updating the hash of frequency counters.

Example 3-6. A frequency distribution program

use XML::LibXML;
use IO::Handle;

# initialize the parser
my $parser = new XML::LibXML;

# open a filehandle and parse
my $fh = new IO::Handle;
if( $fh->fdopen( fileno( STDIN ), "r" )) {
    my $doc = $parser->parse_fh( $fh );
    my %dist;
    &proc_node( $doc->getDocumentElement, \%dist );
    foreach my $item ( sort keys %dist ) {
        print "$item: ", $dist{ $item }, "\n";
    }
    $fh->close;
}

# process an XML tree node: if it's an element, update the
# distribution list and process all its children
#
sub proc_node {
    my( $node, $dist ) = @_;
    return unless( $node->nodeType eq XML_ELEMENT_NODE() );
    $dist->{ $node->nodeName }++;
    foreach my $child ( $node->getChildnodes ) {
        &proc_node( $child, $dist );
    }
}
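The recursive walk in proc_node doesn't depend on libxml2 in any essential way; the same pattern applies to any tree of nodes. Here is a minimal sketch of that traversal over a plain Perl hash-of-arrays tree standing in for the DOM (the tree shape and field names here are our own invention, not part of XML::LibXML):

```perl
use strict;
use warnings;

# A toy tree: each node is { name => ..., children => [ ... ] }.
my $tree = {
    name     => 'contacts',
    children => [
        { name => 'entry', children => [
            { name => 'name', children => [] },
            { name => 'zip',  children => [] },
        ]},
        { name => 'entry', children => [
            { name => 'name', children => [] },
        ]},
    ],
};

my %dist;
proc_node( $tree, \%dist );

# same shape as Example 3-6: count this node, then recurse on children
sub proc_node {
    my( $node, $dist ) = @_;
    $dist->{ $node->{name} }++;
    proc_node( $_, $dist ) for @{ $node->{children} };
}

print "$_: $dist{$_}\n" for sort keys %dist;
```

Run against the toy tree, this prints a count of 1 for contacts, 2 for entry, 2 for name, and 1 for zip.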
XML::XPath
We've seen examples of parsers that dutifully deliver the entire document to you. Often, though, you don't need the whole thing. When you query a database, you're usually looking for only a single record. When you crack open a telephone book, you're not going to sit down and read the whole thing. There is obviously a need for some mechanism of extracting a specific piece of information from a vast document. Look no further than XPath.

XPath is a recommendation from the folks who brought you XML. It's a grammar for writing expressions that pinpoint specific pieces of documents. Think of it as an addressing scheme. Although we'll save the nitty-gritty of XPath wrangling for Chapter 8, we can tantalize you by revealing that it works much like a mix of regular expressions with Unix-style file paths. Not surprisingly, this makes it an attractive feature to add to parsers.

Matt Sergeant's XML::XPath module is a solid implementation, built on the foundation of XML::Parser. Given an XPath expression, it returns a list of all document parts that match the description. It's an incredibly simple way to perform some powerful search and retrieval work.

For instance, suppose we have an address book encoded in XML in this basic form:

<contacts>
  <entry>
    <name>Bob Snob</name>
    <street>123 Platypus Lane</street>
    <city>Burgopolis</city>
    <state>FL</state>
    <zip>12345</zip>
  </entry>
  <!--More entries go here-->
</contacts>
Suppose you want to extract all the zip codes from the file and compile them into a list. Example 3-7 shows how you could do it with XPath.

Example 3-7. Zip code extractor

use XML::XPath;

my $file = 'customers.xml';
my $xp = XML::XPath->new( filename => $file );

# An XML::XPath nodeset is an object which contains the result of
# smacking an XML document with an XPath expression; we'll do just
# this, and then query the nodeset to see what we get.
my $nodeset = $xp->find( '//zip' );

my @zipcodes;                  # Where we'll put our results
if( my @nodelist = $nodeset->get_nodelist ) {
    # We found some zip elements! Each node is an object of the class
    # XML::XPath::Node::Element, so I'll use that class's 'string_value'
    # method to extract its pertinent text, and throw the result for all
    # the nodes into our array.
    @zipcodes = map( $_->string_value, @nodelist );

    # Now sort and prepare for output
    @zipcodes = sort( @zipcodes );
    local $" = "\n";
    print "I found these zipcodes:\n@zipcodes\n";
} else {
    print "The file $file didn't have any 'zip' elements in it!\n";
}
Document Validation
Being well-formed is a minimal requirement for XML everywhere. However, XML processors have to accept a lot on blind faith. If we try to build a document to meet some specific XML application's specifications, it doesn't do us any good if a content generator slips in a strange element we've never seen before and the parser lets it go by with nary a whimper. Luckily, a higher level of quality control is available to us when we need to check for things like that. It's called document validation.

Validation is a sophisticated way of comparing a document instance against a template or grammar specification. It can restrict the number and type of elements a document can use and control where they go. It can even regulate the patterns of character data in any element or attribute. A validating parser tells you whether a document is valid or not, when given a DTD or schema to check against.

Remember that you don't need to validate every XML document that passes over your desk. DTDs and other validation schemes shine when working with specific XML-based markup languages (such as XHTML for web pages, MathML for equations, or CaveML for spelunking), which have strict rules about which elements and attributes go where (because having an automated way to draw attention to something fishy in the document structure becomes a feature). However, validation usually isn't crucial when you use Perl and XML to perform a less specific task, such as tossing together XML documents on the fly based on some other, less sane data format, or when ripping apart and analyzing existing XML documents.

Basically, if you feel that validation is a needless step for the job at hand, you're probably right. However, if you knowingly generate or modify some flavor of XML that needs to stick to a defined standard, then taking the extra step or three necessary to perform document validation is probably wise.
Your toolbox, naturally, gives you lots of ways to do this. Read on.

Document type descriptions (DTDs) are documents written in a special markup language defined in the XML specification, though they themselves are not XML. Everything within these documents is a declaration starting with a
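For a flavor of the declaration syntax, here is a hypothetical DTD for the little <contacts> address book shown earlier in this chapter (our own sketch, not taken from the book's examples):

```dtd
<!ELEMENT contacts (entry*)>
<!ELEMENT entry    (name, street, city, state, zip)>
<!ELEMENT name     (#PCDATA)>
<!ELEMENT street   (#PCDATA)>
<!ELEMENT city     (#PCDATA)>
<!ELEMENT state    (#PCDATA)>
<!ELEMENT zip      (#PCDATA)>
```

A validating parser handed this grammar would flag an <entry> that omitted its <zip> or smuggled in an element not declared here.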
XML::Writer
Compared to all we've had to deal with in this chapter so far, writing XML will be a breeze. It's easier to write it because now the shoe's on the other foot: your program has a data structure over which it has had complete control and knows everything about, so it doesn't need to prepare for every contingency that it might encounter when processing input.

There's nothing particularly difficult about generating XML. You know about elements with start and end tags, their attributes, and so on. It's just tedious to write an XML output method that remembers to cross all the t's and dot all the i's. Does it put a space between every attribute? Does it close open elements? Does it put that slash at the end of empty elements? You don't want to have to think about these things when you're writing more important code. Others have written modules to take care of these serialization details for you.

David Megginson's XML::Writer is a fine example of an abstract XML generation interface. It comes with a handful of very simple methods for building any XML document. Just create a writer object and call its methods to crank out a stream of XML. Table 3-1 lists some of these methods.

Table 3-1: XML::Writer methods

end( )
    Close the document and perform simple well-formedness checking (e.g., make sure that there is one root element and that every start tag has an associated end tag). If the option UNSAFE is set, however, most well-formedness checking is skipped.

xmlDecl([$encoding, $standalone])
    Add an XML declaration at the top of the document. The version is hard-wired as "1.0".
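To make the table concrete, here is roughly what a minimal XML::Writer session looks like, assuming the module is installed (the element and attribute names are our own; the method names come from the module's documented interface):

```perl
use strict;
use warnings;
use XML::Writer;

# Write into an in-memory filehandle so we can inspect the result.
open( my $fh, '>', \my $xml ) or die $!;
my $writer = XML::Writer->new( OUTPUT => $fh );

$writer->xmlDecl( 'UTF-8' );
$writer->startTag( 'greeting', style => 'casual' );
$writer->characters( 'Hello, world!' );
$writer->endTag( 'greeting' );
$writer->end();                 # well-formedness check happens here
close $fh;

print $xml;
```

The writer, not your code, worries about quoting the attribute value and matching the end tag to the start tag.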
Character Sets and Encodings
No matter how you choose to manage your program's output, you must keep in mind the concept of character encoding—the protocol your output XML document uses to represent the various symbols of its language, be they an alphabet of letters or a catalog of ideographs and diacritical marks. Character encoding may represent the trickiest part of XML-slinging, perhaps especially so for programmers in Western Europe and the Americas, most of whom have not explored the universe of possible encodings beyond the 128 characters of ASCII.

While it's technically legal for an XML document's encoding declaration to contain the name of any text encoding scheme, the only ones that XML processors are, according to spec, required to understand are UTF-8 and UTF-16. UTF-8 and UTF-16 are two flavors of Unicode, a recent and powerful character encoding architecture that embraces every funny little squiggle a person might care to make.

In this section, we conspire with Perl and XML to nudge you gently into thinking about Unicode, if you're not pondering it already. While you can do everything described in this book by using the legacy encoding of your choice, you'll find, as time passes, that you're swimming against the current.

Unicode has crept in as the digital age's way of uniting the thousands of different writing systems that have paid the salaries of monks and linguists for centuries. Of course, if you program in an environment where non-ASCII characters are found in abundance, you're probably already familiar with it. However, even then, much of your text processing work might be restricted to low-bit Latin alphanumerics, simply because that's been the character set of choice—of fiat, really—for the Internet. Unicode hopes to change this trend, Perl hopes to help, and sneaky little XML is already doing so.

As any Unicode-evangelizing document will tell you, Unicode is great for internationalizing code. It lets programmers come up with localization solutions without the additional worry of juggling different character architectures. However, Unicode's importance increases by an order of magnitude when you introduce the question of data representation. The languages that a given program's users (or programmers) might prefer is one thing, but as computing becomes more ubiquitous, it touches more people's lives in more ways every day, and some of these people speak Kurku. By understanding the basics of Unicode, you can see how it can help to transparently keep all the data you'll ever work with, no matter the script, in one architecture.
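Perl's core Encode module is the usual bridge between Perl's internal character strings and concrete byte encodings such as UTF-8. A minimal sketch of the round trip:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string containing U+00E9 (e with an acute accent):
# four characters, however it might later be stored.
my $chars = "caf\x{e9}";
print length( $chars ), " characters\n";

# Encoding to UTF-8 yields bytes; the accented e takes two octets,
# so the byte string is one unit longer than the character string.
my $bytes = encode( 'UTF-8', $chars );
print length( $bytes ), " bytes\n";

# Decoding reverses the trip exactly.
die "round trip failed" unless decode( 'UTF-8', $bytes ) eq $chars;
```

The distinction matters for XML because the encoding declaration describes the bytes on disk or on the wire, while your handlers deal in characters.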
Chapter 4: Event Streams
Now that you're all warmed up with parsers and have enough knowledge to make you slightly dangerous, we'll analyze one of the two important styles of XML processing: event streams. We'll look at some examples that show the basic theory of stream processing and graduate with a full treatment of the standard Simple API for XML (SAX).

In the world of computer science, a stream is a sequence of data chunks to be processed. A file, for example, is a sequence of characters (one or more bytes each, depending on the encoding). A program using this data can open a filehandle to the file, creating a character stream, and it can choose to read in data in chunks of whatever size it chooses. Streams can be dynamically generated too, whether from another program, received over a network, or typed in by a user. A stream is an abstraction, making the source of the data irrelevant for the purpose of processing.

To summarize, here are a stream's important qualities:
- It consists of a sequence of data fragments.
- The order of fragments transmitted is significant.
- The source of data (e.g., file or program output) is not important.
XML streams are just clumpy character streams. Each data clump, called a token in parser parlance, is a conglomeration of one or more characters. Each token corresponds to a type of markup, such as an element start or end tag, a string of character data, or a processing instruction. It's very easy for parsers to dice up XML in this way, requiring minimal resources and time.

What makes XML streams different from character streams is that the context of each token matters; you can't just pump out a stream of random tags and data and expect an XML processor to make sense of it. For example, a stream of ten start tags followed by no end tags is not very useful, and definitely not well-formed XML. Any data that isn't well-formed will be rejected. After all, the whole purpose of XML is to package data in a way that guarantees the integrity of a document's structure and labeling, right?

These contextual rules are helpful to the parser as well as the front-end processor. XML was designed to be very easy to parse, unlike other markup languages that can require look-ahead or look-behind. For example, SGML does not have a rule requiring nonempty elements to have an end tag. To know when an element ends requires sophisticated reasoning by the parser. This requirement leads to code complexity, slower processing speed, and increased memory usage.
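To see how naturally XML dices into tokens, here is a toy tokenizer. It is only a sketch: a real parser also handles attributes, comments, CDATA sections, and entities, all of which this regex ignores.

```perl
use strict;
use warnings;

# Split a small document into markup and character-data tokens.
# The capturing parentheses make split keep the tag delimiters.
my $xml = '<recipe><name>toast</name></recipe>';
my @tokens = grep { length } split /(<[^>]+>)/, $xml;

for my $token (@tokens) {
    if    ( $token =~ m{^</(\w+)>$} ) { print "end tag:    $1\n" }
    elsif ( $token =~ m{^<(\w+)}    ) { print "start tag:  $1\n" }
    else                              { print "characters: $token\n" }
}
```

Running this prints the five tokens in document order: start tags for recipe and name, the character data "toast", then the two end tags.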
Working with Streams
Events and Handlers
Why do we call it an event stream and not an element stream or a markup object stream? The fact that XML is hierarchical (elements contain other elements) makes it impossible to package individual elements and serve them up as tokens in the stream. In a well-formed document, all elements are contained in one root element. A root element that contains the whole document is not a stream. Thus, we really can't expect a stream to give a complete element in a token, unless it's an empty element.

Instead, XML streams are composed of events. An event is a signal that the state of the document (as we've seen it so far in the stream) has changed. For example, when the parser comes across the start tag for an element, it indicates that another element was opened and the state of parsing has changed. An end tag affects the state by closing the most recently opened element. An XML processor can keep track of open elements in a stack data structure, pushing newly opened elements and popping off closed ones. At any given moment during parsing, the processor knows how deep it is in the document by the size of the stack.

Though parsers support a variety of events, there is a lot of overlap. For example, one parser may distinguish between a start tag and an empty element, while another may not, but all will signal the presence of that element. Let's look more closely at how a parser might dole out tokens, as shown in Example 4-1.

Example 4-1. XML fragment
<recipe>
  <name>peanut butter and jelly sandwich</name>
  <!-- add picture of sandwich here -->
  <ingredients>
    <ingredient>Gloppy™ brand peanut butter</ingredient>
    <ingredient>bread</ingredient>
    <ingredient>jelly</ingredient>
  </ingredients>
  <instructions>
    <step>Spread peanut butter on one slice of bread.</step>
    <step>Spread jelly on the other slice of bread.</step>
    <step>Put bread slices together, with peanut butter and jelly touching.</step>
  </instructions>
</recipe>
Apply a parser to the preceding example and it might generate this list of events:

- A document start (if this is the beginning of a document and not a fragment)
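The open-element stack described above fits in a few lines of Perl. Here a hand-written event list stands in for a parser, and a pair of handlers pushes and pops as tags open and close (a sketch of the bookkeeping only, not a parser):

```perl
use strict;
use warnings;

my @element_stack;    # open elements, innermost last

sub handle_start {
    my $name = shift;
    push @element_stack, $name;
    print "depth ", scalar @element_stack, ": open $name\n";
}

sub handle_end {
    my $name = pop @element_stack;
    print "depth ", scalar @element_stack, ": close $name\n";
}

# Events as a parser might emit them for <recipe><name>...</name></recipe>
my @events = ( [ start => 'recipe' ], [ start => 'name' ],
               [ end   => 'name'   ], [ end   => 'recipe' ] );

for my $event (@events) {
    my( $type, $name ) = @$event;
    $type eq 'start' ? handle_start( $name ) : handle_end( $name );
}
die "unbalanced document!" if @element_stack;
```

At any point, the stack size is exactly the processor's depth in the document; an empty stack at the end is a crude well-formedness check.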
The Parser as Commodity
You don't have to write an XML processing program that separates parser from handler, but doing so can be advantageous. By making your program modular, you make it easier to organize and test your code. The ideal way to modularize is with objects, communicating on sanctioned channels and otherwise leaving one another alone. Modularization makes swapping one part for another easier, which is very important in XML processing.

The XML stream, as we said before, is an abstraction, which makes the source of data irrelevant. It's like the spigot you have in the backyard, to which you can hook up a hose and water your lawn. It doesn't matter where you plug it in; you just want the water. There's nothing special about the hose, either. As long as it doesn't leak and it reaches where you want to go, you don't care if it's made of rubber or bark. Similarly, XML parsers have become a commodity: something you can download, plug in, and see it work as expected. Plugging it in easily, however, is the tricky part.

The key is the screwhead on the end of the spigot. It's a standard gauge of pipe that uses a specific thread size, and any hose you buy should fit. With XML event streams, we also need a standard interface there. XML developers have settled on SAX, which has been in use for a few years now. Until recently, Perl XML parsers were not interchangeable. Each had its own interface, making it difficult to swap out one in favor of another. That's changing now, as developers adopt SAX and agree on conventions for hooking up handlers to parsers. We'll see some of the fruits of this effort in Chapter 5.
Stream Applications
Stream processing is great for many XML tasks. Here are a few of them:
- Filter
- A filter outputs an almost identical copy of the source document, with a few small changes. Every incidence of an <A> element might be converted into a <B> element, for example. The handler is simple, as it has to output only what it receives, except to make a subtle change when it detects a specific event.
- Selector
- If you want a specific piece of information from a document, without the rest of the content, you can write a selector program. This program combs through events, looking for an element or attribute containing a particular bit of unique data called a key, and then stops. The final job of the program is to output the sought-after record, possibly reformatted.
- Summarizer
- This program type consumes a document and spits out a short summary. For example, an accounting program might calculate a final balance from many transaction records; a program might generate a table of contents by outputting the titles of sections; an index generator might create a list of links to certain keywords highlighted in the text. The handler for this kind of program has to remember portions of the document to repackage it after the parser is finished reading the file.
- Converter
- This sophisticated type of program turns your XML-formatted document into another format—possibly another application of XML. For example, turning DocBook XML into HTML can be done in this way. This kind of processing pushes stream processing to its limits.
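The filter pattern at the top of that list can be sketched in miniature. Here a hand-built token list stands in for a parser's event stream, and the filter echoes everything while renaming <A> elements to <B> (a toy illustration of the idea, not a real stream filter):

```perl
use strict;
use warnings;

# Tokens as (type, text) pairs; a real filter would get these from a parser.
my @tokens = ( [ start => 'A' ], [ chars => 'hello' ], [ end => 'A' ] );

my $out = '';
for my $token (@tokens) {
    my( $type, $text ) = @$token;
    $text = 'B' if $text eq 'A';          # the one subtle change
    $out .= $type eq 'start' ? "<$text>"
          : $type eq 'end'   ? "</$text>"
          :                    $text;     # character data passes through
}
print $out, "\n";
```

Everything flows through untouched except the single event the filter cares about, which is exactly what makes filters such easy stream programs to write.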
XML stream processing works well for a wide variety of tasks, but it does have limitations. The biggest problem is that everything is driven by the parser, and the parser has a mind of its own. Your program has to take what it gets in the order given. It can't say, "Hold on, I need to look at the token you gave me ten steps back" or "Could you give me a sneak peek at a token twenty steps down the line?" You can look back to the parsing past by giving your program a memory. Clever use of data structures can be used to remember recent events. However, if you need to look behind a lot, or look ahead even a little, you probably need to switch to a different strategy: tree processing, the topic of Chapter 6.
XML::PYX
In the Perl universe, standard APIs have been slow to catch on for many reasons. CPAN, the vast storehouse of publicly offered modules, grows organically, with no central authority to approve of a submission. Also, with XML, a relative newcomer on the data format scene, the Perl community has only begun to work out standard solutions.

We can characterize the first era of XML hacking in Perl to be the age of nonstandard parsers. It's a time when documentation is scarce and modules are experimental. There is much creativity and innovation, and just as much idiosyncrasy and quirkiness. Surprisingly, many of the tools that first appeared on the horizon were quite useful. It's fascinating territory for historians and developers alike.

XML::PYX is one of these early parsers. Streams naturally lend themselves to the concept of pipelines, where data output from one program can be plugged into another, creating a chain of processors. There's no reason why XML can't be handled that way, so an innovative and elegant processing style has evolved around this concept. Essentially, the XML is repackaged as a stream of easily recognizable and transmutable symbols, even as a command-line utility.

One example of this repackaging is PYX, a symbolic encoding of XML markup that is friendly to text processing languages like Perl. It presents each XML event on a separate line very cleverly. Many Unix programs like awk and grep are line oriented, so they work well with PYX. Lines are happy in Perl too.

Table 4-1 summarizes the notation of PYX.

Table 4-1: PYX notation

Symbol    Represents
(         An element start tag
)         An element end tag
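A few lines of Perl suffice to emit PYX-style lines from simple markup. This is only a sketch: a full PYX emitter also writes attribute lines, handles processing instructions, and escapes newlines in character data, none of which this toy attempts.

```perl
use strict;
use warnings;

my $xml = '<name>Bob Snob</name>';

# One PYX line per token: "(" opens an element, ")" closes one,
# and "-" prefixes character data.
my @pyx;
for my $token ( grep { length } split /(<[^>]+>)/, $xml ) {
    if    ( $token =~ m{^</(\w+)>$} ) { push @pyx, ")$1" }
    elsif ( $token =~ m{^<(\w+)>$}  ) { push @pyx, "($1" }
    else                              { push @pyx, "-$token" }
}
print "$_\n" for @pyx;
```

The output is three lines, one event each: "(name", "-Bob Snob", ")name", ready for grep or awk in a pipeline.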
XML::Parser
Another early parser is XML::Parser, the first fast and efficient parser to hit CPAN. We detailed its many-faceted interface in Chapter 3. Its built-in stream mode is worth a closer look, though. Let's return to it now with a solid stream example.

We'll use XML::Parser to read a list of records encoded as an XML document. The records contain contact information for people, including their names, street addresses, and phone numbers. As the parser reads the file, our handler will store the information in its own data structure for later processing. Finally, when the parser is done, the program sorts the records by the person's name and outputs them as an HTML table.

The source document is listed in Example 4-3. It has a <list> element as the root, with four <entry> elements inside it, each with an address, a name, and a phone number.

Example 4-3. Address book file

<list>
  <entry>
    <name><first>Thadeus</first><last>Wrigley</last></name>
    <phone>716-505-9910</phone>
    <address>
      <street>105 Marsupial Court</street>
      <city>Fairport</city><state>NY</state><zip>14450</zip>
    </address>
  </entry>
  <entry>
    <name><first>Jill</first><last>Baxter</last></name>
    <address>
      <street>818 S. Rengstorff Avenue</street>
      <zip>94040</zip>
      <city>Mountainview</city><state>CA</state>
    </address>
    <phone>217-302-5455</phone>
  </entry>
  <entry>
    <name><last>Riccardo</last> <first>Preston</first></name>
    <address>
      <street>707 Foobah Drive</street>
      <city>Mudhut</city><state>OR</state><zip>32777</zip>
    </address>
    <phone>111-222-333</phone>
  </entry>
  <entry>
    <address>
      <street>10 Jiminy Lane</street>
      <city>Scrapheep</city><state>PA</state><zip>99001</zip>
    </address>
    <name><first>Benn</first><last>Salter</last></name>
    <phone>611-328-7578</phone>
  </entry>
</list>
This simple structure lends itself naturally to event processing. Each <entry> start tag signals the preparation of a new part of the data structure for storing data. An </entry> end tag indicates that all data for the record has been collected and can be saved. Similarly, start and end tags for
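The record-collecting strategy just described can be simulated without a parser at all. Here hand-made events stand in for XML::Parser's stream mode: the handlers fill a buffer while inside an element and save the record on the </entry> end tag (a sketch of the bookkeeping, using names from Example 4-3):

```perl
use strict;
use warnings;

my %record;     # fields collected for the current entry
my @records;    # finished entries
my $field;      # which text-bearing element we are currently inside

sub handle_start {
    my $element = shift;
    $field  = $element if $element =~ /^(?:name|phone)$/;
    %record = () if $element eq 'entry';
}
sub handle_char { $record{$field} .= shift if defined $field }
sub handle_end {
    my $element = shift;
    push @records, { %record } if $element eq 'entry';
    $field = undef;
}

# A miniature event stream for two entries
my @events = (
    [ start => 'entry' ], [ start => 'name' ],  [ char => 'Jill Baxter' ],
    [ end   => 'name'  ], [ start => 'phone' ], [ char => '217-302-5455' ],
    [ end   => 'phone' ], [ end   => 'entry' ],
    [ start => 'entry' ], [ start => 'name' ],  [ char => 'Benn Salter' ],
    [ end   => 'name'  ], [ end   => 'entry' ],
);

for my $event (@events) {
    my( $type, $data ) = @$event;
    $type eq 'start' ? handle_start( $data )
  : $type eq 'end'   ? handle_end( $data )
  :                    handle_char( $data );
}

# Once parsing is over, sort and print the saved records
print "$_->{name}\n" for sort { $a->{name} cmp $b->{name} } @records;
```

Only after the stream runs dry does the program sort and output, which is why stream handlers so often need a memory of their own.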
Chapter 5: SAX
XML::Parser has done remarkably well as a multipurpose XML parser and stream generator, but it really isn't the future of Perl and XML. The problem is that we don't want one standard parser for all ends and purposes; we want to be able to choose from multiple parsers, each serving a different purpose. One parser might be written completely in Perl for portability, while another is accelerated with a core written in C. Or, you might want a parser that translates one format (such as a spreadsheet) into an XML stream. You simply can't anticipate all the things a parser might be called on to do. Even XML::Parser, with its many options and multiple modes of operation, can't please everybody. The future, then, is a multiplicity of parsers that cover any situation you encounter.

An environment with multiple parsers demands some level of consistency. If every parser had its own interface, developers would go mad. Learning one interface and being able to expect all parsers to comply to that is better than having to learn a hundred different ways to do the same thing. We need a standard interface between parsers and code: a universal plug that is flexible and reliable, free from the individual quirks of any particular parser.

The XML development world has settled on an event-driven interface called SAX. SAX evolved from discussions on the XML-DEV mailing list and, shepherded by David Megginson, was quickly shaped into a useful specification. The first incarnation, called SAX Level 1 (or just SAX1), supports elements, attributes, and processing instructions. It doesn't handle some other things like namespaces or CDATA sections, so the second iteration, SAX2, was devised, adding support for just about any event you can imagine in generic XML.

SAX has been a huge success. Its simplicity makes it easy to learn and work with. Early development with XML was mostly in the realm of Java, so SAX was codified as an interface construct. An interface construct is a special kind of class that declares an object's methods without implementing them, leaving the implementation up to the developer.
SAX Event Handlers
To use a typical SAX module in a program, you must pass it an object whose methods implement handlers for SAX events. Table 5-1 describes the methods in a typical handler object. A SAX parser passes a hash to each handler containing properties relevant to the event. For example, in this hash, an element handler would receive the element's name and a list of attributes.

Table 5-1: PerlSAX handlers

start_document
    The document processing has started (this is the first event).
    Properties: (none defined)

end_document
    The document processing is complete (this is the last event).
    Properties: (none defined)

start_element
    An element start tag or empty element tag was found.
    Properties: Name, Attributes

end_element
    An element end tag or empty element tag was found.
    Properties: Name

characters
    A string of nonmarkup characters (character data) was found.
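The handler-object convention is easy to picture in miniature: a parser-side loop calls methods on whatever object you hand it, passing each a hash of properties. Here is a pure-Perl sketch of that calling convention (not a real SAX parser; the Name and Data property keys follow the PerlSAX style shown in the table):

```perl
use strict;
use warnings;

package MyHandler;
sub new           { bless { log => [] }, shift }
sub start_element { my( $self, $p ) = @_; push @{ $self->{log} }, "open $p->{Name}" }
sub end_element   { my( $self, $p ) = @_; push @{ $self->{log} }, "close $p->{Name}" }
sub characters    { my( $self, $p ) = @_; push @{ $self->{log} }, "text: $p->{Data}" }

package main;

# What a SAX driver does, in miniature: turn each event into a
# method call on the handler object, with a property hash.
my $handler = MyHandler->new;
my @events = (
    [ start_element => { Name => 'zip' } ],
    [ characters    => { Data => '94040' } ],
    [ end_element   => { Name => 'zip' } ],
);
for my $event (@events) {
    my( $method, $properties ) = @$event;
    $handler->$method( $properties );
}
print "$_\n" for @{ $handler->{log} };
```

Because the driver only ever calls methods by name, any object implementing these methods can be plugged into any conforming parser.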
DTD Handlers
XML::Parser::PerlSAX supports another group of handlers used to process DTD events. It takes care of anything that appears before the root element, such as the XML declaration, doctype declaration, and the internal subset of entity and element declarations, which are collectively called the document prolog. If you want to output the document literally as you read it (e.g., in a filter program), you need to define some of these handlers to reproduce the document prolog. Defining these handlers is just what we needed in the previous example.

You can use these handlers for other purposes. For example, you may need to pre-load entity definitions for special processing rather than rely on the parser to do its default substitution for you. These handlers are listed in Table 5-2.

Table 5-2: PerlSAX DTD handlers

entity_decl
    The parser sees an entity declaration (internal or external, parsed or unparsed).
    Properties: Name, Value, PublicId, SystemId, Notation

notation_decl
    The parser found a notation declaration.
    Properties: Name, PublicId, SystemId, Base

unparsed_entity_decl
    The parser found a declaration for an unparsed entity (e.g., a binary data entity).
External Entity Resolution
By default, the parser substitutes all entity references with their actual values for you. Usually that's what you want it to do, but sometimes, as in the case with our filter example, you'd rather keep the entity references in place. As we saw, keeping the entity references is pretty easy to do; just include an entity_reference( ) handler method to override that behavior by outputting the references again. What we haven't seen yet is how to override the default handling of external entity references. Again, the parser wants to replace the references with their values by locating the files and inserting their contents into the stream. Would you ever want to change that behavior, and if so, how would you do it?

Storing documents in multiple files is convenient, especially for really large documents. For example, suppose you have a big book to write in XML and you want to store each chapter in its own file. You can do so easily with external entities. Here's an example:

  <?xml version="1.0"?>
  <!DOCTYPE book [
    <!ENTITY intro-chapter   SYSTEM "chapters/intro.xml">
    <!ENTITY pasta-chapter   SYSTEM "chapters/pasta.xml">
    <!ENTITY stirfry-chapter SYSTEM "chapters/stirfry.xml">
    <!ENTITY soups-chapter   SYSTEM "chapters/soups.xml">
  ]>
  <book>
    <title>The Bonehead Cookbook</title>
    &intro-chapter;
    &pasta-chapter;
    &stirfry-chapter;
    &soups-chapter;
  </book>

The previous filter example would resolve the external entity references for you diligently and output the entire book in one piece. Your file separation scheme would be lost and you'd have to edit the resulting file to break it back into multiple files. Fortunately, we can override the resolution of external entity references using a handler called resolve_entity( ).
This handler has four properties: Name, the entity's name; SystemId and PublicId, identifiers that help you locate the file containing the entity's text; and Base, which helps resolve relative URLs, if any exist. Unlike the other handlers, this one should return a value to tell the parser what to do. Returning undef tells the parser to load the external entity as it normally would. Otherwise, you need to return a hash describing an alternative source from which the entity should be loaded. The hash is the same type you would use to give to the object's

Additional content appearing in this section has been removed.
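The return-value convention can be sketched like this; the handler below is our own illustration, not the book's, and it elides every external entity by handing back a literal string in the same kind of Source hash you would pass to parse( ).

```perl
# A hypothetical resolve_entity() handler: external entities are
# replaced with a placeholder comment instead of being read from disk.
sub resolve_entity {
    my ( $self, $props ) = @_;

    # No system identifier: let the parser resolve the entity as usual
    return undef unless defined $props->{SystemId};

    # Otherwise return an alternative source, here a literal string
    return { String => "<!-- contents of $props->{SystemId} omitted -->" };
}
```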
Drivers for Non-XML Sources
The filter example used a file containing an XML document as an input source. This example shows just one of many ways to use SAX. Another popular use is to read data from a driver, which is a program that generates a stream of data from a non-XML source, such as a database. A SAX driver converts the data stream into a sequence of SAX events that we can process the way we did previously. What makes this so cool is that we can use the same code regardless of where the data came from. The SAX event stream abstracts the data and markup so we don't have to worry about it. Changing the program to work with files or other drivers would be trivial.

To see a driver in action, we will write a program that uses Ilya Sterin's module XML::SAXDriver::Excel to convert Microsoft Excel spreadsheets into XML documents. This example shows how a data stream can be processed in a pipeline fashion to ultimately arrive in the form we want it. A Spreadsheet::ParseExcel object reads the file and generates a generic data stream, which an XML::SAXDriver::Excel object translates into a SAX event stream. This stream is then output as XML by our program.

Here's a test Excel spreadsheet, represented as a table:

        A                B
    1   baseballs        55
    2   tennisballs      33
    3   pingpong balls   12
    4   footballs        77

The SAX driver will create new elements for us, giving us the names in the form of arguments to handler method calls. We will just print them out as they come and see how the driver structures the document. Example 5-6 is a simple program that does this.

Additional content appearing in this section has been removed.
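Wiring the pipeline together might look like the sketch below. The handler package name is hypothetical, and the constructor parameters are assumptions based on the module's Source/Handler convention, so check the XML::SAXDriver::Excel documentation for your installed version.

```perl
use XML::SAXDriver::Excel;

# EventPrinter is a hypothetical handler package that prints each event
my $driver = XML::SAXDriver::Excel->new(
    Source  => { SystemId => 'inventory.xls' },   # the spreadsheet file
    Handler => EventPrinter->new(),               # receives SAX events
);
$driver->parse();
```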
A Handler Base Class
SAX doesn't distinguish between different elements; it leaves that burden up to you. You have to sort out the element name in the start_element( ) handler, and maybe use a stack to keep track of element hierarchy. Don't you wish there were some way to abstract that stuff? Ken MacLeod has done just that with his XML::Handler::Subs module.

This module defines an object that branches handler calls to more specific handlers. If you want a handler that deals only with <title> elements, you can write that handler and it will be called. The handler dealing with a start tag must begin with s_, followed by the element's name (replace special characters with an underscore). End tag handlers are the same, but start with e_ instead of s_.

That's not all. The base object also has a built-in stack and provides an accessor method to check if you are inside a particular element. The $self->{Names} variable refers to a stack of element names. Use the method in_element( $name ) to test whether the parser is inside an element named $name at any point in time.

To try this out, let's write a program that does something element-specific. Given an HTML file, the program outputs everything inside an <h1> element, even inline elements used for emphasis. The code, shown in Example 5-7, is breathtakingly simple.

Example 5-7. A program subclassing the handler base

  use XML::Parser::PerlSAX;
  use XML::Handler::Subs;

  #
  # initialize the parser
  #
  my $parser = XML::Parser::PerlSAX->new( Handler => H1_grabber->new( ) );
  $parser->parse( Source => { SystemId => shift @ARGV } );

  ##
  ## Handler object: H1_grabber
  ##
  package H1_grabber;
  use base( 'XML::Handler::Subs' );

  sub new {
      my $type = shift;
      my $self = {@_};
      return bless( $self, $type );
  }

  #
  # handle start of document
  #
  sub start_document {
      my $self = shift;
      $self->SUPER::start_document( @_ );
      print "Summary of file:\n";
  }

  #
  # handle start of <h1>: output bracket as delineator
  #
  sub s_h1 { print "["; }

  #
  # handle end of <h1>: output bracket as delineator
  #
  sub e_h1 { print "]\n"; }

  #
  # handle character data
  #
  sub characters {
      my( $self, $props ) = @_;
      my $data = $props->{Data};
      print $data if( $self->in_element( 'h1' ) );
  }
Additional content appearing in this section has been removed.
XML::Handler::YAWriter as a Base Handler Class
Michael Koehne's XML::Handler::YAWriter serves as the "yet another" XML writer it bills itself as, but in doing so also sets itself up as a handy base class for all sorts of SAX-related work.

If you've ever worked with Perl's various Tie::* base classes, the idea is similar: you start out with a base class with callbacks defined that don't do anything very exciting, but by their existence satisfy all the subroutine calls triggered by SAX events. In your own driver class, you simply redefine the subroutines that should do something special and let the default behavior rule for all the events you don't care much about.

The default behavior, in this case, gives you something nice, too: access to an array of strings (stored as an instance variable on the handler object) holding the XML document that the incoming SAX events built. This isn't necessarily very interesting if your data source was XML, but if you use a PerlSAXish driver to generate an event stream out of an unsuspecting data source, then this feature is lovely. It gives you an easy way to, for instance, convert a non-XML file into its XML equivalent and save it to disk.

The trade-off is that you must remember to invoke $self->SUPER::[methodname] within all your own event handler methods. Otherwise, your class may forget its roots and fail to add things to that internal strings array in its youthful naïveté, and thus leave embarrassing holes in the generated XML document.

Additional content appearing in this section has been removed.
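The subclassing pattern can be sketched in a few lines. This example is ours, not the book's: it uppercases element names while delegating to the base class so the generated XML still accumulates, and it assumes PerlSAX's convention of passing each event a single hash reference.

```perl
package ShoutingWriter;
use base 'XML::Handler::YAWriter';

sub start_element {
    my ( $self, $element ) = @_;
    $element->{Name} = uc $element->{Name};

    # Forgetting this SUPER:: call is the classic mistake: the element
    # would never reach the internal strings array.
    return $self->SUPER::start_element( $element );
}

1;
```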
XML::SAX: The Second Generation
The proliferation of SAX parsers presents two problems: how to keep them all synchronized with the standard API and how to keep them organized on your system. XML::SAX, a marvelous team effort by Matt Sergeant, Kip Hampton, and Robin Berjon, solves both problems at once. As a bonus, it also includes support for SAX Level 2 that previous modules lacked.

"What," you ask, "do you mean about keeping all the modules synchronized with the API?" All along, we've touted the wonders of using a standard like SAX to ensure that modules are really interchangeable. But here's the rub: in Perl, there's more than one way to implement SAX. SAX was originally designed for Java, which has a wonderful interface type of class that nails down things like what type of argument to pass to which method. There's nothing like that in Perl.

This wasn't as much of a problem with the older SAX modules we've been talking about so far. They all support SAX Level 1, which is fairly simple. However, a new crop of modules that support SAX2 is breaking the surface. SAX2 is more complex because it introduces namespaces to the mix. An element event handler should receive both the namespace prefix and the local name of the element. How should this information be passed in parameters? Do you keep them together in the same string like foo:bar? Or do you separate them into two parameters?

This debate created a lot of heat on the perl-xml mailing list until a few members decided to hammer out a specification for "Perlish" SAX (we'll see in a moment how to use this new API for SAX2). To encourage others to adhere to this convention, XML::SAX includes a class called XML::SAX::ParserFactory. A factory is an object whose sole purpose is to generate objects of a specific type, in this case, parsers. XML::SAX::ParserFactory is a useful way to handle housekeeping chores related to the parsers, such as registering their options and initialization requirements. Tell the factory what kind of parser you want and it doles out a copy to you.

XML::SAX represents a shift in the way XML and Perl work together. It builds on the work of the past, including all the best features of previous modules, while avoiding many of the mistakes. To ensure that modules are truly compatible, the kit provides a base class for parsers, abstracting out most of the mundane work that all parsers have to do, leaving the developer the task of doing only what is unique to the task. It also creates an abstract interface for users of parsers, allowing them to keep the plethora of modules organized with a registry that is indexed by properties to make it easy to find the right one with a simple query. It's a bold step and carries a lot of heft, so be prepared for a lot of information and detail in this section. We think it will be worth your while.

Additional content appearing in this section has been removed.
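Asking the factory for a parser takes only a couple of lines. The handler package name below is hypothetical, but the parser( ) and parse_uri( ) calls follow the XML::SAX interface.

```perl
use XML::SAX::ParserFactory;

my $handler = MyHandler->new();    # any SAX handler object
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_uri( 'document.xml' );
```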
Chapter 6: Tree Processing
Having done just about all we can do with streams, it's time to move on to another style of XML processing. Instead of letting the XML fly past the program one tiny piece at a time, we will capture the whole document in memory and then start working on it. Having an in-memory representation built behind the scenes for us makes our job much easier, although it tends to require more memory and CPU cycles.

This chapter is an overview of programming with persistent XML objects, better known as tree processing. It looks at a variety of different modules and strategies for building and accessing XML trees, including the rigorous, standard Document Object Model (DOM), fast access to internal document parts with XPath, and efficient tree processing methods.

Every XML document can be represented as a collection of data objects linked in an acyclic structure called a tree. Each object, or node, is a small piece of the document, such as an element, a piece of text, or a processing instruction. One node, called the root, links to other nodes, and so on down to nodes that aren't linked to anything. Graph this image out and it looks like a big, bushy tree—hence the name.

A tree structure representing a piece of XML is a handy thing to have. Since a tree is acyclic (it has no circular links), you can use simple traversal methods that won't get stuck in infinite loops. Like a filesystem directory tree, you can represent the location of a node easily in simple shorthand. Like real trees, you can break a piece off and treat it like a smaller tree—a tree is just a collection of subtrees joined by a root node. Best of all, you have all the information in one place and search through it like a database.

For the programmer, a tree makes life much easier. Stream processing, you will recall, remembers fleeting details to use later in constructing another data structure or printing out information. This work is tedious, and can be downright horrible for very complex documents. If you have to combine pieces of information from different parts of the document, then you might go mad. If you have a tree containing the document, though, all the details are right in front of you. You only need to write code to sift through the nodes and pull out what you need.

Additional content appearing in this section has been removed.
XML Trees
Of course, you don't get anything good for free. There is a penalty for having easy access to every point in a document. Building the tree in the first place takes time and precious CPU cycles, and even more if you use object-oriented method calls. There is also a memory tax to pay, since each object in the tree takes up some space. With very large documents (trees with millions of nodes are not unheard of), you could bring your poor machine down to its knees with a tree processing program. On the average, though, processing trees can get you pretty good results (especially with a little optimizing, as we show later in the chapter), so don't give up just yet.

Additional content appearing in this section has been removed.
XML::Simple
The simplest tree model can be found in Grant McLean's module XML::Simple. It's designed to facilitate the job of reading and saving datafiles. The programmer doesn't have to know much about XML and parsers—only how to access arrays and hashes, the data structures used to store a document.

Example 6-1 shows a simple datafile that a program might use to store information.

Example 6-1. A program datafile

  <preferences>
    <font role="default">
      <name>Times New Roman</name>
      <size>14</size>
    </font>
    <window>
      <height>352</height>
      <width>417</width>
      <locx>100</locx>
      <locy>120</locy>
    </window>
  </preferences>

XML::Simple makes accessing information in the datafile remarkably easy. Example 6-2 extracts default font information from it.

Example 6-2. Program to extract font information

  use XML::Simple;

  my $simple = XML::Simple->new( );            # initialize the object
  my $tree   = $simple->XMLin( './data.xml' ); # read, store document

  # test access to the tree
  print "The user prefers the font " . $tree->{ font }->{ name } .
        " at " . $tree->{ font }->{ size } . " points.\n";

First we initialize an XML::Simple object, then we trigger the parser with a call to its XMLin( ) method. This step returns a reference to the root of the tree, which is a hierarchical set of hashes. Element names provide keys to the hashes, whose values are either strings or references to other element hashes. Thus, we have a clear and concise way to access points deep in the document.

To illustrate this idea, let's look at the data structure, using Data::Dumper, a module that serializes data structures. Just add these lines at the end of the program:

  use Data::Dumper;
  print Dumper( $tree );

And here's the output:

  $tree = {
      'font' => {
          'size' => '14',
          'name' => 'Times New Roman',
          'role' => 'default'
      },
      'window' => {
          'locx' => '100',
          'locy' => '120',
          'height' => '352',
          'width' => '417'
      }
  };
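The trip works in reverse, too. As a sketch, the hash can be edited in place and written back with XMLout( ); the RootName option restores the <preferences> wrapper that XMLin( ) stripped off.

```perl
use XML::Simple;

my $simple = XML::Simple->new();
my $tree   = $simple->XMLin( './data.xml' );

$tree->{font}{size} = 16;    # bump the default font size

open my $out, '>', './data.xml' or die "can't write: $!";
print {$out} $simple->XMLout( $tree, RootName => 'preferences' );
close $out;
```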
Additional content appearing in this section has been removed.
XML::Parser's Tree Mode
We used XML::Parser in Chapter 4 as an event generator to drive stream processing programs, but did you know that this same module can also generate tree data structures? We've modified our preference-reader program to use XML::Parser for parsing and building a tree, as shown in Example 6-4.

Example 6-4. Using XML::Parser to build a tree

  # initialize parser and read the file
  use XML::Parser;
  $parser = new XML::Parser( Style => 'Tree' );
  my $tree = $parser->parsefile( shift @ARGV );

  # dump the structure
  use Data::Dumper;
  print Dumper( $tree );

When run on a preferences datafile, it gives this output:

  $tree = [
    'preferences', [
      {}, 0, '\n',
      'font', [
        { 'role' => 'console' }, 0, '\n',
        'size',  [ {}, 0, '9' ],       0, '\n',
        'fname', [ {}, 0, 'Courier' ], 0, '\n'
      ], 0, '\n',
      'font', [
        { 'role' => 'default' }, 0, '\n',
        'fname', [ {}, 0, 'Times New Roman' ], 0, '\n',
        'size',  [ {}, 0, '14' ],              0, '\n'
      ], 0, '\n',
      'font', [
        { 'role' => 'titles' }, 0, '\n',
        'size',  [ {}, 0, '10' ],        0, '\n',
        'fname', [ {}, 0, 'Helvetica' ], 0, '\n',
      ], 0, '\n',
    ]
  ];

This structure is more complicated than the one we got from XML::Simple; it tries to preserve everything, including node type, order of nodes, and mixed text. Each node is represented by one or two items in a list. Elements require two items: the element name followed by a list of its contents. Text nodes are encoded as the number 0 followed by their values in a string. All attributes for an element are stored in a hash as the first item in the element's content list. Even the whitespace between elements has been saved, represented as 0, '\n'. Because lists are used to contain element content, the order of nodes is preserved. This order is important for some XML documents, such as books or animations in which elements follow a sequence.

XML::Parser

Additional content appearing in this section has been removed.
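The list encoding is easy to walk with ordinary Perl. The sketch below (ours) prints element names with indentation, relying only on the conventions just described: a content list alternates (tag, body) pairs, where a tag of 0 marks a text node.

```perl
# Walk an XML::Parser Style => 'Tree' structure
sub walk {
    my ( $content, $depth ) = @_;
    my @items = @$content;
    while ( @items ) {
        my ( $tag, $body ) = ( shift @items, shift @items );
        next if $tag eq '0';               # text node: $body is the string
        print '  ' x $depth, $tag, "\n";   # element: $body is [attrs, kids]
        my ( $attrs, @kids ) = @$body;
        walk( \@kids, $depth + 1 );
    }
}

# A trimmed-down tree in the same shape as the dump above
my $tree = [ 'preferences', [ {}, 0, "\n", 'font',
             [ { role => 'default' }, 0, "\n", 'size', [ {}, 0, '14' ] ] ] ];
walk( $tree, 0 );
```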
XML::SimpleObject
Using built-in data types is fine, but as your code becomes more complex and hard to read, you may start to pine for the neater interfaces of objects. Doing things like testing a node's type, getting the last child of an element, or changing the representation of data without breaking the rest of the program is easier with objects. It's not surprising that there are more object-oriented modules for XML than you can shake a stick at.

Dan Brian's XML::SimpleObject starts the tour of object models for XML trees. It takes the structure returned by XML::Parser in tree mode and changes it from a hierarchy of lists into a hierarchy of objects. Each object represents an element and provides methods to access its children. As with XML::Simple, elements are accessed by their names, passed as arguments to the methods.

Let's see how useful this module is. Example 6-5 is a silly datafile representing a genealogical tree. We're going to write a program to parse this file into an object tree and then traverse the tree to print out a text description.

Example 6-5. A genealogical tree

  <ancestry>
    <ancestor><name>Glook the Magnificent</name>
      <children>
        <ancestor><name>Glimshaw the Brave</name></ancestor>
        <ancestor><name>Gelbar the Strong</name></ancestor>
        <ancestor><name>Glurko the Healthy</name>
          <children>
            <ancestor><name>Glurff the Sturdy</name></ancestor>
            <ancestor><name>Glug the Strange</name>
              <children>
                <ancestor><name>Blug the Insane</name></ancestor>
                <ancestor><name>Flug the Disturbed</name></ancestor>
              </children>
            </ancestor>
          </children>
        </ancestor>
      </children>
    </ancestor>
  </ancestry>

Example 6-6 is our program. It starts by parsing the file with XML::Parser in tree mode and passing the result to an XML::SimpleObject constructor. Next, we write a routine begat( ) to traverse the tree and output text recursively. At each ancestor, it prints the name. If there are progeny, which we find out by testing whether the child method returns a non-

Additional content appearing in this section has been removed.
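The setup step looks roughly like this sketch, which wraps the tree-mode output and pulls out the patriarch's name; the child( ) and value( ) methods follow the XML::SimpleObject documentation, but treat the details as assumptions for your installed version.

```perl
use XML::Parser;
use XML::SimpleObject;

my $parser = XML::Parser->new( ErrorContext => 2, Style => 'Tree' );
my $xso    = XML::SimpleObject->new( $parser->parsefile( 'family.xml' ) );

# child() fetches a child element by name; value() returns character data
my $patriarch = $xso->child( 'ancestry' )->child( 'ancestor' );
print $patriarch->child( 'name' )->value, "\n";
```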
XML::TreeBuilder
XML::TreeBuilder is a factory class that builds a tree of XML::Element objects. The XML::Element class inherits from the older HTML::Element class that comes with the HTML::Tree package. Thus, you can build the tree from a file with XML::TreeBuilder and use the XML::Element accessor methods to move around, grab data from the tree, and change the structure of the tree as needed. We're going to focus on that last thing: using accessor methods to assemble a tree of our own.

For example, we're going to write a program that manages a simple, prioritized "to-do" list that uses an XML datafile to store entries. Each item in the list has an "immediate" or "long-term" priority. The program will initialize the list if it's empty or the file is missing. The user can add items by using -i or -l (for "immediate" or "long-term," respectively), followed by a description. Finally, the program updates the datafile and prints it out on the screen.

The first part of the program, listed in Example 6-7, sets up the tree structure. If the datafile can be found, it is read and used to build the tree. Otherwise, the tree is built from scratch.

Example 6-7. To-do list manager, first part

  use XML::TreeBuilder;
  use XML::Element;
  use Getopt::Std;

  # command line options
  #   -i  immediate
  #   -l  long-term
  #
  my %opts;
  getopts( 'il', \%opts );

  # initialize tree
  my $data = 'data.xml';
  my $tree;

  # if file exists, parse it and build the tree
  if( -r $data ) {
      $tree = XML::TreeBuilder->new( );
      $tree->parse_file( $data );

  # otherwise, create a new tree from scratch
  } else {
      print "Creating new data file.\n";
      my @now = localtime;
      my $date = $now[4] . '/' . $now[3];
      $tree = XML::Element->new( 'todo-list', 'date' => $date );
      $tree->push_content( XML::Element->new( 'immediate' ));
      $tree->push_content( XML::Element->new( 'long-term' ));
  }

A few notes on initializing the structure are necessary. The minimal structure of the datafile is this:

  <todo-list date="DATE">
    <immediate></immediate>
    <long-term></long-term>
  </todo-list>
Additional content appearing in this section has been removed.
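The rest of the program is elided here, but as a sketch (ours, not the book's second part), the inherited HTML::Element methods are enough to append a new item under the chosen priority and dump the result:

```perl
my $priority = $opts{i} ? 'immediate' : 'long-term';

if( @ARGV ) {
    my $item = XML::Element->new( 'item' );
    $item->push_content( join ' ', @ARGV );   # the description text

    # find_by_tag_name() is inherited from HTML::Element
    my ( $list ) = $tree->find_by_tag_name( $priority );
    $list->push_content( $item );
}

print $tree->as_XML;    # serialize the updated list
```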
XML::Grove
The last object model we'll examine before jumping into standards-based solutions is Ken MacLeod's XML::Grove. Like XML::SimpleObject, it takes the XML::Parser output in tree mode and changes it into an object hierarchy. The difference is that each node type is represented by a different class. Therefore, an element would be mapped to XML::Grove::Element, a processing instruction to XML::Grove::PI, and so on. Text nodes are still scalar values.

Another feature of this module is that the declarations in the internal subset are captured in lists accessible through the XML::Grove object. Every entity or notation declaration is available for your perusal. For example, the following program counts the distribution of elements and other nodes, and then prints a list of node types and their frequency.

First, we initialize the parser with the style "grove" (to tell XML::Parser that it needs to use XML::Parser::Grove to process its output):

  use XML::Parser;
  use XML::Parser::Grove;
  use XML::Grove;

  my $parser = XML::Parser->new( Style => 'grove', NoExpand => '1' );
  my $grove = $parser->parsefile( shift @ARGV );

Next, we access the contents of the grove by calling the contents( ) method. This method returns a list including the root element and any comments or PIs outside of it. A subroutine called tabulate( ) counts nodes and descends recursively through the tree. Finally, the results are printed:

  # tabulate elements and other nodes
  my %dist;
  foreach( @{$grove->contents} ) {
      &tabulate( $_, \%dist );
  }

  print "\nNODES:\n\n";
  foreach( sort keys %dist ) {
      print "$_: " . $dist{$_} . "\n";
  }

Here is the subroutine that handles each node in the tree. Since each node is a different class, we can use ref( ) to get the type. Attributes are not treated as nodes in this model, but are available through the element class's method attributes( ) as a hash. The call to contents( ) allows the routine to continue processing the element's children:

  # given a node and a table, find out what the node is, add to the count,
  # and recurse if necessary
  #
  sub tabulate {
      my( $node, $table ) = @_;
      my $type = ref( $node );
      if( $type eq 'XML::Grove::Element' ) {
          $table->{ 'element' }++;
          $table->{ 'element (' . $node->name . ')' }++;
          foreach( keys %{$node->attributes} ) {
              $table->{ "attribute ($_)" }++;
          }
          foreach( @{$node->contents} ) {
              &tabulate( $_, $table );
          }
      } elsif( $type eq 'XML::Grove::Entity' ) {
          $table->{ 'entity-ref (' . $node->name . ')' }++;
      } elsif( $type eq 'XML::Grove::PI' ) {
          $table->{ 'PI (' . $node->target . ')' }++;
      } elsif( $type eq 'XML::Grove::Comment' ) {
          $table->{ 'comment' }++;
      } else {
          $table->{ 'text-node' }++;
      }
  }
Additional content appearing in this section has been removed.
Chapter 7: DOM
In this chapter, we return to standard APIs with the Document Object Model (DOM). In Chapter 5, we talked about the benefits of using standard APIs: increased compatibility with other software components and (if implemented correctly) a guaranteed complete solution. The same concept applies in this chapter: what SAX does for event streams, DOM does for tree processing.

DOM is a recommendation by the World Wide Web Consortium (W3C). Designed to be a language-neutral interface to an in-memory representation of an XML document, versions of DOM are available in Java, ECMAScript, Perl, and other languages. Perl alone has several implementations of DOM, including XML::DOM and XML::LibXML.

While SAX defines an interface of handler methods, the DOM specification calls for a number of classes, each with an interface of methods that affect a particular type of XML markup. Thus, every object instance manages a portion of the document tree, providing accessor methods to add, remove, or modify nodes and data. These objects are typically created by a factory object, making it a little easier for programmers who only have to initialize the factory object themselves.

In DOM, every piece of XML (the element, text, comment, etc.) is a node represented by a Node object. The Node class is extended by more specific classes that represent the types of XML markup, including Element, Attr (attribute), ProcessingInstruction, Comment, EntityReference, Text, CDATASection, and Document. These classes are the building blocks of every XML tree in DOM.

The standard also calls for a couple of classes that serve as containers for nodes, convenient for shuttling XML fragments from place to place. These classes are NodeList, an ordered list of nodes, like all the children of an element; and NamedNodeMap, an unordered set of nodes. These objects are frequently required as arguments or given as return values from methods. Note that these objects are all live, meaning that any changes done to them will immediately affect the nodes in the document itself, rather than a copy.

Additional content appearing in this section has been removed.
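As a short sketch of those node classes in action, here is a walk over a root element's live childNodes list, using XML::LibXML's DOM implementation:

```perl
use XML::LibXML;

my $doc  = XML::LibXML->new->parse_string(
    '<list><item>one</item><!-- note --><item>two</item></list>' );
my $root = $doc->documentElement;    # the Element node for <list>

foreach my $node ( $root->childNodes ) {
    # ref() reveals which DOM class each node belongs to
    print ref( $node ), ': ', $node->textContent, "\n";
}
```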
DOM and Perl
When naming these classes and their methods, DOM merely specifies the outward appearance of an implementation, but leaves the internal specifics up to the developer. Particulars like memory management, data structures, and algorithms are not addressed at all, as those issues may vary among programming languages and the needs of users. This is like describing a key so a locksmith can make a lock that it will fit into; you know the key will unlock the door, but you have no idea how it really works. Specifically, the outward appearance makes it easy to write extensions to legacy modules so they can comply with the standard, but it does not guarantee efficiency or speed.

Additional content appearing in this section has been removed.
DOM Class Interface Reference
- Content preview·Buy PDF of this chapter|Buy reprint rights for this chapterSince DOM is becoming the interface of choice in the Perl-XML world, it deserves more elaboration. The following sections describe class interfaces individually, listing their properties, methods, and intended purposes.The DOM specification calls for UTF-16 as the standard encoding. However, most Perl implementations assume a UTF-8 encoding. Due to limitations in Perl, working with characters of lengths other than 8 bits is difficult. This will change in a future version, and encodings like UTF-16 will be supported more readily.The
Document
class controls the overall document, creating new objects when requested and maintaining high-level information such as references to the document type declaration and the root element.Section 7.2.1.1: Properties
- doctype
- Document Type Declaration (DTD).
- documentElement
- The root element of the document.
Section 7.2.1.2: Methods
- createElement, createTextNode, createComment, createCDATASection, createProcessingInstruction, createAttribute, createEntityReference
- Generates a new node object.
- createElementNS, createAttributeNS (DOM2 only)
- Generates a new element or attribute node object with a specified namespace qualifier.
- createDocumentFragment
- Creates a container object for a document's subtree.
- getElementsByTagName
- Returns a NodeList of all elements having a given tag name at any level of the document.
- getElementsByTagNameNS (DOM2 only)
- Returns a NodeList of all elements having a given namespace qualifier and local name. The asterisk character (*) matches any element or any namespace, allowing you to find all elements in a given namespace.
- getElementById (DOM2 only)
- Returns a reference to the node that has a specified ID attribute.
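To show the factory style these methods share (the document manufactures nodes, which you then attach), here is a toy sketch in plain Perl. Everything in it is invented for illustration; real DOM modules also track node types, owner documents, live lists, and much more:

```perl
use strict;
use warnings;

package Toy::Document;

# A toy document object that "manufactures" nodes in the DOM factory
# style. All structure here is invented; a real implementation does
# far more bookkeeping.
sub new            { bless { root => undef }, shift }
sub createElement  { my ( $self, $name ) = @_;
                     return { name => $name, children => [] } }
sub createTextNode { my ( $self, $text ) = @_;
                     return { text => $text } }

# serialize a node (default: the root) back to markup
sub serialize {
    my ( $self, $node ) = @_;
    $node ||= $self->{root};
    return $node->{text} unless $node->{name};
    my $inner = join '', map { $self->serialize( $_ ) }
                             @{ $node->{children} };
    return "<$node->{name}>$inner</$node->{name}>";
}

package main;

my $doc  = Toy::Document->new;
my $para = $doc->createElement( 'p' );
push @{ $para->{children} }, $doc->createTextNode( 'Hello, monkeys!' );
$doc->{root} = $para;
print $doc->serialize, "\n";   # <p>Hello, monkeys!</p>
```

The real modules follow the same shape: ask the document for a node, then attach it with a method such as appendChild.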
- XML::DOM
Enno Derksen's
XML::DOM
module is a good place to start exploring DOM in Perl. It's a complete implementation of Level 1 DOM with a few extra features thrown in for convenience. XML::DOM::Parser extends XML::Parser to build a document tree, installed in an XML::DOM::Document object whose reference it returns. This reference gives you complete access to the tree. The rest, we happily report, works pretty much as you'd expect.

Here's a program that uses DOM to process an XHTML file. It looks inside <p> elements for the word "monkeys," replacing every instance with a link to monkeystuff.com. Sure, you could do it with a regular expression substitution, but this example is valuable because it shows how to search for and create new nodes, and read and change values, all in the unique DOM style. The first part of the program creates a parser object and gives it a file to parse with the call to parsefile( ):

use XML::DOM;

&process_file( shift @ARGV );

sub process_file {
    my $infile = shift;
    my $dom_parser = new XML::DOM::Parser;        # create a parser object
    my $doc = $dom_parser->parsefile( $infile );  # make it parse a file
    &add_links( $doc );                           # perform our changes
    print $doc->toString;                         # output the tree again
    $doc->dispose;                                # clean up memory
}
parsefile( ) returns a reference to an XML::DOM::Document object, which is our gateway to the nodes inside. We pass this reference along to a routine called add_links( ), which does all the processing we require. Finally, we output the tree with a call to toString( ) and then dispose of the object. This last step performs necessary cleanup so that circular references between nodes don't cause a memory leak.

The next part burrows into the tree to start processing paragraphs:

sub add_links {
    my $doc = shift;

    # find all the <p> elements
    my $paras = $doc->getElementsByTagName( "p" );
    for( my $i = 0; $i < $paras->getLength; $i++ ) {
        my $para = $paras->item( $i );

        # for each child of a <p>, if it is a text node, process it
        my @children = $para->getChildNodes;
        foreach my $node ( @children ) {
            &fix_text( $node ) if( $node->getNodeType eq TEXT_NODE );
        }
    }
}
- XML::LibXML
Matt Sergeant's
XML::LibXML
module is an interface to the GNOME project's LibXML library. It's quickly becoming a popular implementation of DOM, demonstrating speed and completeness over the older XML::Parser-based modules. It also implements Level 2 DOM, which means it has support for namespaces.

So far, we haven't worked much with namespaces. A lot of people opt to avoid them, since they add a new level of complexity to markup and code: you have to handle both local names and prefixes. However, namespaces are becoming more important in XML, and sooner or later we will all have to deal with them. The popular transformation language XSLT uses namespaces to distinguish between tags that are instructions and tags that are data (i.e., which elements should be output and which should be used to control the output).

You'll even see namespaces used in good old HTML. Namespaces provide a way to import specialized markup into documents, such as equations into regular HTML pages. The MathML language (https://www.w3.org/Math/) does just that. Example 7-1 incorporates MathML into an HTML document with namespaces.

Example 7-1. A document with namespaces

<html>
  <body xmlns:eq="https://www.w3.org/1998/Math/MathML">
    <h1>Billybob's Theory</h1>
    <p> It is well-known that cats cannot be herded easily. That is,
    they do not tend to run in a straight line for any length of time
    unless they really want to. A cat forced to run in a straight line
    against its will has an increasing probability, with distance, of
    deviating from the line just to spite you, given by this
    formula:</p>
    <p>
      <!-- P = 1 - 1/(x^2) -->
      <eq:math>
        <eq:mi>P</eq:mi><eq:mo>=</eq:mo><eq:mn>1</eq:mn><eq:mo>-</eq:mo>
        <eq:mfrac>
          <eq:mn>1</eq:mn>
          <eq:msup>
            <eq:mi>x</eq:mi>
            <eq:mn>2</eq:mn>
          </eq:msup>
        </eq:mfrac>
      </eq:math>
    </p>
  </body>
</html>
The tags with eq: prefixes are part of a namespace identified by the URI https://www.w3.org/1998/Math/MathML, declared in an attribute on the <body> element. Using a namespace helps the browser distinguish what is native to HTML from what is not. Browsers that understand MathML route the qualified elements to their equation formatter instead of the regular HTML formatter.
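A DOM2-aware parser resolves those eq: prefixes to the namespace URI for you, but the bookkeeping itself is simple. Here is a minimal sketch in plain Perl; the default-namespace URI and all names are assumptions for illustration:

```perl
use strict;
use warnings;

# Prefixes in scope for the <body> element above; '' stands for the
# default namespace. The XHTML URI is assumed for illustration.
my %in_scope = (
    'eq' => 'https://www.w3.org/1998/Math/MathML',
    ''   => 'https://www.w3.org/1999/xhtml',
);

# Split a qualified name into (namespace URI, local name).
sub resolve_qname {
    my ( $qname, $ns ) = @_;
    my ( $prefix, $local ) =
        $qname =~ /:/ ? split( /:/, $qname, 2 ) : ( '', $qname );
    die "undeclared prefix '$prefix'\n" unless exists $ns->{$prefix};
    return ( $ns->{$prefix}, $local );
}

my ( $uri, $local ) = resolve_qname( 'eq:mfrac', \%in_scope );
print "$local lives in $uri\n";
```

A real parser additionally tracks declarations per element, so the in-scope map changes as you descend the tree.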
- Chapter 8: Beyond Trees: XPath, XSLT, and More
In the last chapter, we introduced the concepts behind handling XML documents as memory trees. Our use of them was kind of primitive, limited to building, traversing, and modifying pieces of trees. This is okay for small, uncomplicated documents and tasks, but serious XML processing requires beefier tools. In this chapter, we examine ways to make tree processing easier, faster, and more efficient.

The first in our lineup of power tools is the tree climber. As the name suggests, it climbs a tree for you, finding the nodes in the order you want them, making your code simpler and more focused on per-node processing. Using a tree climber is like having a trained monkey climb up a tree to get you coconuts so you don't have to scrape your own skin on the bark; all you have to do is drill a hole in the shell and pop in a straw.

The simplest kind of tree climber is an iterator (sometimes called a walker). It can move forward or backward in a tree, doling out node references as you tell it to move. Moving forward in a tree means following the order of nodes as they would appear in the text representation of the document. The exact algorithm for iterating forward is this:
- If there's no current node, start at the root node.
- If the current node has children, move to the first child.
- Otherwise, if the current node has a following sibling, move to it.
- If none of these options work, go back up the list of the current node's ancestors and try to find one with an unprocessed sibling.
With this algorithm, the iterator will eventually reach every node in a tree, which is useful if you want to process all the nodes in a document part. You could also implement this algorithm recursively, but the advantage of doing it iteratively is that you can stop between nodes to do other things. Example 8-1 shows how one might implement an iterator object for DOM trees. We've included methods for moving both forward and backward.

Example 8-1. A DOM iterator package

package XML::DOMIterator;

sub new {
    my $class = shift;
    my $self = {@_};
    $self->{ Node } = undef;
    return bless( $self, $class );
}

# move forward one node in the tree
#
sub forward {
    my $self = shift;

    # try to go down to the next level
    if( $self->is_element and $self->{ Node }->getFirstChild ) {
        $self->{ Node } = $self->{ Node }->getFirstChild;

    # try to go to the next sibling, or an ancestor's sibling
    } else {
        while( $self->{ Node }) {
            if( $self->{ Node }->getNextSibling ) {
                $self->{ Node } = $self->{ Node }->getNextSibling;
                return $self->{ Node };
            }
            $self->{ Node } = $self->{ Node }->getParentNode;
        }
    }
}

# move backward one node in the tree
#
sub backward {
    my $self = shift;

    # go to the previous sibling and descend to the last node in its tree
    if( $self->{ Node }->getPreviousSibling ) {
        $self->{ Node } = $self->{ Node }->getPreviousSibling;
        while( $self->{ Node }->getLastChild ) {
            $self->{ Node } = $self->{ Node }->getLastChild;
        }

    # go up
    } else {
        $self->{ Node } = $self->{ Node }->getParentNode;
    }
    return $self->{ Node };
}

# return a reference to the current node
#
sub node {
    my $self = shift;
    return $self->{ Node };
}

# set the current node
#
sub reset {
    my( $self, $node ) = @_;
    $self->{ Node } = $node;
}

# test if current node is an element
#
sub is_element {
    my $self = shift;
    return( $self->{ Node }->getNodeType == 1 );
}
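Example 8-1 needs a DOM tree to climb; the traversal idea itself doesn't. Here is a minimal, self-contained sketch of the same forward ordering over a plain nested-hash tree (the layout is invented for illustration). Because the iterator keeps its own stack, you can stop between nodes and resume later:

```perl
use strict;
use warnings;

# Document-order traversal, iteratively, over a plain nested-hash
# tree. The closure keeps its own stack, so callers can stop between
# nodes and pick up where they left off.
sub make_iterator {
    my $root  = shift;
    my @stack = ( $root );
    return sub {
        return undef unless @stack;
        my $node = shift @stack;                       # current node...
        unshift @stack, @{ $node->{children} || [] };  # ...children come next
        return $node->{name};
    };
}

my $tree = {
    name     => 'html',
    children => [
        { name => 'head', children => [ { name => 'title' } ] },
        { name => 'body', children => [ { name => 'p' } ] },
    ],
};

my $next = make_iterator( $tree );
my @order;
while ( defined( my $name = $next->() ) ) { push @order, $name }
print "@order\n";   # html head title body p
```

The unshift-children-first discipline is exactly the "first child, then next sibling, then climb" rule from the list above, expressed with an explicit stack instead of parent pointers.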
- Tree Climbers
- XPath
Imagine that you have an army of monkeys at your disposal. You say to them, "I want you to get me a banana frappe from the ice cream parlor on Massachusetts Avenue just north of Porter Square." Not being very smart monkeys, they go out and bring back every beverage they can find, leaving you to taste them all to figure out which one you wanted. To retrain them, you send them to night school to learn a rudimentary language, and in a few months you repeat the request. Now the monkeys follow your directions, identify the exact item you want, and return with it.

We've just described the kind of problem XPath was designed to solve. XPath is one of the most useful technologies supporting XML. It provides an interface for finding nodes in a purely descriptive way, so you don't have to write code to hunt them down yourself. You merely specify the kind of nodes that interest you, and an XPath parser retrieves them for you. Suddenly, XML goes from being a vast, confusing pile of nodes to a well-indexed filing cabinet of data.

Consider the XML document in Example 8-4.

Example 8-4. A preferences file
<plist>
  <dict>
    <key>DefaultDirectory</key>
    <string>/usr/local/fooby</string>
    <key>RecentDocuments</key>
    <array>
      <string>/Users/bobo/docs/menu.pdf</string>
      <string>/Users/slappy/pagoda.pdf</string>
      <string>/Library/docs/Baby.pdf</string>
    </array>
    <key>BGColor</key>
    <string>sage</string>
  </dict>
</plist>
This document is a typical preferences file for a program: a series of data keys and values. Nothing in it is too complex. To obtain the value of the key BGColor, you'd have to locate the <key> element containing the word "BGColor" and step ahead to the next element, a <string>. Finally, you would read the value of the text node inside. In DOM, you might do it as shown in Example 8-5.

Example 8-5. Program to get a preferred color

sub get_bgcolor {
    my @keys = $doc->getElementsByTagName( 'key' );
    foreach my $key ( @keys ) {
        if( $key->getFirstChild->getData eq 'BGColor' ) {
            # assumes no whitespace text node sits between the <key>
            # and its <string> sibling
            return $key->getNextSibling->getFirstChild->getData;
        }
    }
    return;
}
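In XPath, that whole hunt collapses into one descriptive expression. A sketch (not runnable on its own; you'd hand it to an XPath-aware module such as XML::LibXML or XML::XPath):

```
/plist/dict/key[ . = "BGColor" ]/following-sibling::string[1]/text()
```

The predicate selects the <key> whose content is BGColor, and the following-sibling axis steps to the next <string>, skipping intervening whitespace nodes without any manual looping.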
- XSLT
If you think of XPath as a regular expression syntax, then XSLT is its pattern-substitution mechanism. XSLT is an XML-based programming language for describing how to transform one document type into another. You can do some amazing things with XSLT, such as describe how to turn any XML document into HTML or tabulate the sum of figures in an XML-formatted table. In fact, you might not need to write a line of code in Perl or any other language. All you really need is an XSLT script and one of the dozens of transformation engines available for processing XSLT.

XSLT stands for Extensible Stylesheet Language Transformations. The name means that it's a component of the Extensible Stylesheet Language (XSL), assigned to handle the task of converting input XML into a special format called XSL-FO (the FO stands for "Formatting Objects"). XSL-FO contains both content and instructions for how to make it pretty when displayed.

Although it's stuck with the XSL name, XSLT is more than just a step in formatting; it's an important XML processing tool that makes it easy to convert from one kind of XML to another, or from XML to text. For this reason, the W3C (yup, they created XSLT too) released the recommendation for it years before the rest of XSL was ready. To read the specification and find links to XSLT tutorials, see its home page at https://www.w3.org/TR/xslt.

An XSLT transformation script is itself an XML document. It consists mostly of rules called templates, each of which tells how to treat a specific type of node. A template usually does two things: it describes what to output and defines how processing should continue.

Consider the script in Example 8-9.

Example 8-9. An XSLT stylesheet

<xsl:stylesheet xmlns:xsl="https://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="html">
    <xsl:text>Title: </xsl:text>
    <xsl:value-of select="head/title"/>
    <xsl:apply-templates select="body"/>
  </xsl:template>
  <xsl:template match="body">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="h1 | h2 | h3 | h4">
    <xsl:text>Head: </xsl:text>
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="p | blockquote | li">
    <xsl:text>Content: </xsl:text>
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
- Optimized Tree Processing
The big drawback to using trees for XML crunching is that they tend to consume scandalous amounts of memory and processor time. This might not be apparent with small documents, but it becomes noticeable as documents grow to many thousands of nodes. A typical book of a few hundred pages' length could easily have tens of thousands of nodes. Each one requires the allocation of an object, a process that takes considerable time and memory.

Perhaps you don't need to build the entire tree to get your work done, though. You might only want a small branch of the tree and can safely do all the processing inside of it. If that's the case, then you can take advantage of the optimized parsing modes in
XML::Twig
(recall that we dealt with this module earlier, in Section 8.2). These modes allow you to specify ahead of time which parts (or "twigs") of the tree you'll be working with, so that only those parts are assembled. The result is a hybrid of tree and event processing with highly optimized performance in speed and memory.

XML::Twig has three modes of operation: the regular old tree mode, similar to what we've seen so far; "chunk" mode, which builds a whole tree but keeps only a fraction of it in memory at a time (sort of like paged memory); and multiple-roots mode, which builds only a few selected twigs from the tree.

Example 8-11 demonstrates the power of XML::Twig in chunk mode. The input to this program is a DocBook book with some <chapter> elements. These documents can be enormous, sometimes a hundred megabytes or more. The program breaks up the processing per chapter so that only a fraction of the space is needed.

Example 8-11. A chunking program

use XML::Twig;

# initialize the twig, parse, and output the revised twig
my $twig = new XML::Twig( TwigHandlers => { chapter => \&process_chapter });
$twig->parsefile( shift @ARGV );
$twig->print;

# handler for chapter elements: process and then flush up the chapter
sub process_chapter {
    my( $tree, $elem ) = @_;
    &process_element( $elem );
    $tree->flush_up_to( $elem );   # comment out this line to waste memory
}

# append 'foo' to the name of an element
sub process_element {
    my $elem = shift;
    $elem->set_gi( $elem->gi . 'foo' );
    my @children = $elem->children;
    foreach my $child ( @children ) {
        next if( $child->gi eq '#PCDATA' );
        &process_element( $child );
    }
}
- Chapter 9: RSS, SOAP, and Other XML Applications
In the next couple of chapters, we'll cover, at long last, what happens when we pull together all the abstract tools and strategies we've discussed and start having XML dance for us. This is the land of the XML application, where parsers pick up documents with a goal in mind. No longer satisfied with picking out the elements and attributes and calling it a day, these higher-level tools look for meaning in all that structure, according to directives that have been programmed into them.

When we say XML application, we are specifically referring to XML-based document formats, not the computer programs (applications of another sort) that do stuff with them. You may run across statements such as "GreenMonkeyML is an XML application that provides semantic markup for green monkeys." Visiting the project's home page at https://www.greenmonkey-markup.com, we might encounter documentation describing how this specific format works, example documents, suggested uses for it, a DTD or schema used to validate GreenMonkeyML documents, and maybe an online validation tool. This content would all fit into the definition of an XML application.

This chapter looks at XML applications that already have a strong presence in the Perl world, by way of publicly available Perl modules that know how to handle them.

The term XML modules narrows us down from the Perl modules on CPAN that send mail, process images, and play games, but it still leaves us with a very broad cross section. So far in this book, we have exhaustively covered Perl extensions that can perform general XML processing, but none that perform more targeted functions based on that general processing. In the end, they hand you a plate of XML chunklets, free of any inherent meaning, and leave it to you to decide what happens next.
In many of the examples we've provided so far in this book, we have written programs that do exactly this: invoke an XML parser to chew up a document and then cook up something interesting out of the elements and attributes we get back.

However, the modules we're thinking about here give you more than the generic parse-and-process module family by building on one of the parsers and abstracting the processing in a specific direction. They then provide an API that, while it might still contain hooks into the raw XML, concentrates on methods and routines particular to the XML application they implement.
- XML Modules
We can divide these XML application-mangling Perl modules into three types. We'll examine an example of each in this chapter, and in the next chapter, we'll try to make some for ourselves.
- XML application helpers
- Helper modules are the humblest of the lot. In practice, they are often little more than wrappers around raw XML processors, but sometimes that's all you need. If you find yourself writing several programs that need to read from and write to a specific XML-based document format, a helper module can provide common methods, freeing the programmer from worrying about the application's exact document format or its well-formedness in generated output. The module will take care of all that.
- Programming helpers that use XML
- This small but growing category describes Perl extensions that use XML to do cool stuff in your program, even if your program's input or output has little to do with XML. Currently, the most prominent examples involve the terrifying, DBI-like powers of
- XML::RSS
By helper modules, we mean more focused versions of the XML processors we've already pawed through in our Perl and XML toolbox. In a way,
XML::Parser
and its ilk are helper applications, since they save you from approaching each XML-chomping job with Perl's built-in file-reading functions and regular expressions, turning documents into immediately useful objects or event streams. Also, XML::Writer and friends replace plain old print statements with a more abstract and safer way to create XML documents.

However, the XML modules we cover now offer their services in a very specific direction. By using one of these modules in your program, you establish that you plan to use XML, but only a small, clearly defined subsection of it. By submitting to this restriction, you get to use (and create) software modules that handle all the toil of working with raw XML, presenting the main part of your code with methods and routines specific to the application at hand.

For our example, we'll look at XML::RSS, a little number by Jonathan Eisenzopf. RSS (short for Rich Site Summary or Really Simple Syndication, depending upon whom you ask) is one of the first XML applications whose use became rapidly popular on a global scale, thanks to the Web. While RSS itself is little more than an agreed-upon way to summarize web page content, it gives the administrators of news sites, web logs, and any other frequently updated web site a standard and sweat-free way of telling the world what's new. A program that can parse RSS can do whatever it likes with the information, perhaps reporting by mail or by web page what interesting things it has learned in its travels. A special type of RSS program is an aggregator: a program that collects RSS from various sources and then knits it together into new RSS documents combining the information, so that lazier RSS-parsing programs won't have to travel so far.

Current popular aggregators include Netscape, by way of its customizable my.netscape.com site (which was, in fact, the birthplace of the earliest RSS versions), and Dave Winer's
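For reference, a minimal RSS document of the sort such programs trade in looks roughly like this; the channel and item values here are invented for illustration, and real feeds carry more metadata:

```
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Monkey News</title>
    <link>https://www.example.org/</link>
    <description>All the monkey news that's fit to syndicate.</description>
    <item>
      <title>Monkeys learn XPath</title>
      <link>https://www.example.org/news/xpath.html</link>
    </item>
  </channel>
</rss>
```

The channel describes the site as a whole; each item summarizes one new piece of content, which is exactly the shape XML::RSS exposes through its API.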
- XML Programming Tools
Now we'll cover software that performs a role somewhat inverse to the ground we just covered. Instead of giving you Perl-lazy ways to work with XML documents, it uses XML standards to make things easier for a task that doesn't explicitly involve XML. Recently, some key folks in the community, from the perl-xml mailing list, have been seeking a mini-platform of universal data handling in Perl with SAX at its core. Some very interesting (and useful) examples have been born from this research, including Ilya Sterin's
XML::SAXDriver::Excel
and XML::SAXDriver::CSV
, and Matt Sergeant's XML::Generator::DBI
All three modules share the ability to take a data format—Microsoft Excel files, Comma-Separated Value files, and SQL databases, respectively—and wrap a SAX API around it (the same sort covered in Chapter 5), so that any programmer can merrily pretend that the format is as well behaved and manageable as all the other XML documents they've seen (even if the underlying module is quietly performing acrobatics akin to medicating cats).

We'll look more closely at one of these tools, as its subject matter has some interesting implications involving recent developments, before we move on to this chapter's final section.

XML::Generator::DBI
is a fine example of a glue module, a simple piece of software whose only job is to take two existing (but not entirely unrelated) pieces of software and let them talk to one another. In this case, when you construct an object of this class, you hand it two objects: a DBI-flavored database handle and a SAX-speaking handler object. XML::Generator::DBI does not know or care how or where the objects came from, but only trusts that they respond to the standard method calls of their respective families (DBI, SAX, or SAX2). Then you can call an execute method on the XML::Generator::DBI object with an ordinary SQL statement, much as you would with a DBI-created database handle.

The following example shows this module in action. The SAX handler in question is an instance of Michael Koehne's
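The glue pattern itself fits in a few lines of Perl. This sketch glues two stand-in objects together; every name in it is invented, and the real XML::Generator::DBI API differs in detail:

```perl
use strict;
use warnings;

package Toy::Generator;

# A minimal "glue" object: it holds a data source and a handler,
# trusts each to honor its family's interface, and forwards between
# them. All names invented for illustration.
sub new { my ( $class, %args ) = @_; return bless { %args }, $class }

sub execute {
    my ( $self, $query ) = @_;
    my $rows = $self->{source}->fetch( $query );   # the DBI-ish side
    $self->{handler}->start_document;              # the SAX-ish side
    $self->{handler}->row( $_ ) for @$rows;
    $self->{handler}->end_document;
}

# Stand-ins for the two families being glued together.
package Toy::Source;
sub new   { bless {}, shift }
sub fetch { [ { name => 'bobo' }, { name => 'slappy' } ] }  # canned rows

package Toy::Handler;
sub new            { bless { out => '' }, shift }
sub start_document { $_[0]{out} .= '<rows>' }
sub row            { $_[0]{out} .= "<row name='$_[1]{name}'/>" }
sub end_document   { $_[0]{out} .= '</rows>' }

package main;

my $handler = Toy::Handler->new;
my $gen = Toy::Generator->new(
    source  => Toy::Source->new,
    handler => $handler,
);
$gen->execute( 'select name from monkeys' );
print $handler->{out}, "\n";
# <rows><row name='bobo'/><row name='slappy'/></rows>
```

Because the glue object only calls interface methods, either side can be swapped for a real DBI handle or a real SAX handler without touching the glue.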
- SOAP::Lite
Finally, we come to the category of Perl and XML software that is so ridiculously abstracted from the book's topic that it's almost not worth covering, but it's definitely much more worth showing off. This category describes modules and extensions that are similar to the
XML::RSS
class of helper modules; they help you work with a specific variety of XML documents, but set themselves apart by the level of aggression they employ to keep programmers separated from the raw, element-encrusted data flowing underneath. They involve enough layers of abstraction to make you forget that you're even dealing with XML in the first place.

Of course, they're perfectly valid in doing so; for example, if we want to write a program that uses the SOAP or XML-RPC protocols to use remote code, nothing could be further from our thoughts than XML. It's all a magic carpet, as far as we're concerned; we just want our program to work! (And when we do care, a good module lets us peek at the raw XML, if we insist.)

The Simple Object Access Protocol (SOAP) gives you the power of object-oriented web services by letting you construct and use objects whose class definitions exist at the other end of a URI. You don't even need to know what programming language they use, because the protocol magically turns the object's methods into a common, XML-based API. As long as the class is documented somewhere, with more details of the available class and object methods, you can hack away as if the class were simply another file on your hard drive, despite the fact that it actually exists on a remote machine.

At this point it's entirely too easy to forget that we're working with XML. At least with RSS, the method names of the object API more or less match those of the resulting output document; in this case, we don't even want to see the horrible machine-readable-only document, any more than we'd want to see the numeric codes representing keystrokes that are sent to our machine's CPU.

SOAP::Lite
's name refers to the amount of work you have to apply when you wish to use it, and does not reflect its own weight. When you install it on your system, it makes a long list of Perl packages available to you, many of which provide a plethora of transportation styles, a
- Chapter 10: Coding Strategies
- Content preview

This chapter sends you off by bringing this book's topics full circle. We return to many of the themes about XML processing in Perl that we introduced in Chapter 3, but in the context of all the detailed material that we've covered in the intervening chapters. Our intent is to take you on one concluding tour through the world of Perl and XML, with its strategies and its gotchas, before sending you on your way.

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing! - Perl and XML Namespaces
- Content preview

You've seen XML namespaces used since we first mentioned this concept back in Chapter 2. Many XML applications, such as XSLT, insist that all their elements claim fealty to a certain namespace. The deciding factor here usually involves how symbiotic the application is in its usual use: does it usually work on its own, with a one-document-per-application style, or does it tend to mix with other sorts of XML?

DocBook XML, for example, is not very symbiotic. An instance of DocBook is almost always a whole XML document, defining a book or an article, and all the elements within such a document that aren't explicitly tied to some other namespace are found in the official DocBook documentation. However, within a DocBook document, you might encounter a clump of MathML elements making their home in a rather parasitic fashion, nestled in among the folds of the DocBook elements, from which it derives nourishing context.

This sort of thing is useful for two reasons: first, DocBook, while its element spread tries to cover all kinds of things you might find in a piece of technical documentation, doesn't have the capacity to richly describe everything that might go into a mathematical equation. (It does have
<equation>
elements, but they are often used to describe the nature of the graphic contained within them.) By adding MathML into the mix, you can use all the tags defined by that markup language's specification inside a DocBook document, tucked away safely in their own namespace. (Since MathML and DocBook work so well together, the DocBook DTD allows a user to plug in a "MathML module," which adds a <mml:math>
element to the mix. Within this mix, everything is handled by MathML's own DTD, which the module imports (along with DocBook's main DTD) into the whole DTD-space when validating.)

Second, and perhaps more interesting from the parser's point of view, tags existing in a given namespace work like embassies; while you stand on its soil (or in its scope), all that country's rules and regulations apply to you, despite the embassy's location in a foreign land. XML namespaces are also similar to Perl namespaces, which let you invoke variables, subroutines, and other symbols that live inside

Additional content appearing in this section has been removed.
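The comparison with Perl namespaces can be sketched in a few lines of pure Perl; the package names below are invented for illustration. Just as a namespace prefix distinguishes a MathML element from a same-named DocBook element, a fully qualified Perl name picks one package's symbol over another's:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two packages may each define a subroutine with the same short name...
package Math::Notation;
sub title { return "MathML-style title" }

package Doc::Notation;
sub title { return "DocBook-style title" }

# ...and the fully qualified name disambiguates, much as mml:title
# and a plain DocBook title would in a mixed-namespace document.
package main;
print Math::Notation::title(), "\n";
print Doc::Notation::title(), "\n";
```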
Purchase this book now or read it online at Safari to get the whole thing! - Subclassing
- Content preview

When writing XML-hacking Perl modules, another path to laziness involves standing on (and reading over) the shoulders of giants by subclassing general XML parsers as a quick way to build application-specific modules.

You don't have to use object inheritance; the least complicated way to accomplish this sort of thing involves constructing a parser object in the usual way, sticking it somewhere convenient, and turning to it whenever you want to do something XMLy. Here is some bogus code for you:
package XML::MyThingy;

use strict;
use warnings;
use XML::SomeSortOfParser;

sub new {
    # Ye Olde Constructor
    my $invocant = shift;
    my $self = {};
    if (ref($invocant)) {
        bless ($self, ref($invocant));
    } else {
        bless ($self, $invocant);
    }
    # Now we make an XML parser...
    my $parser = XML::SomeSortOfParser->new
        or die "Oh no, I couldn't make an XML parser. How very sad.";
    # ...and stick it on this object, for later reference.
    $self->{xml} = $parser;
    return $self;
}

sub parse_file {
    # We'll just pass on the user's request (and the filename) to our
    # parser object (which just happens to have a method named parse_file)...
    my ($self, $file) = @_;
    my $result = $self->{xml}->parse_file($file);
    # What happens now depends on whatever an XML::SomeSortOfParser
    # object does when it parses a file. Let's say it modifies itself and
    # returns a success code, so we'll just keep hold of the now-modified
    # object under this object's 'xml' key, and return the code.
    return $result;
}
Choosing to subclass a parser has some bonuses, though. First, it gives your module the same basic user API as the module in question, including all the methods for parsing, which can be quite lazily useful—especially if the module you're writing is an XML application helper module. Second, if you're using a tree-based parser, you can steal—er, I mean embrace and extend—that parser's data structure representation of the parsed document and then twist it to better serve your own nefarious goal while doing as little extra work as possible. This step is possible through the magic of Perl's class blessing and inheritance functionality.

Additional content appearing in this section has been removed.
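The inheritance route can be sketched in a few lines. XML::SomeSortOfParser is fictional, so a stand-in base class is defined inline here to keep the sketch self-contained; a real subclass would simply `use` the parser module instead:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A stand-in for some general-purpose parser class (fictional).
package XML::SomeSortOfParser;
sub new        { my $class = shift; return bless {}, $class }
sub parse_file { my ($self, $file) = @_; $self->{last_file} = $file; return 1 }

# Our application-specific module inherits the whole parsing API...
package XML::MyThingy;
our @ISA = ('XML::SomeSortOfParser');

# ...and adds (or overrides) only what it needs.
sub last_file { my $self = shift; return $self->{last_file} }

package main;
my $thingy = XML::MyThingy->new;      # inherited constructor
$thingy->parse_file('monkeys.xml');   # inherited parsing method
print $thingy->last_file, "\n";       # our own addition
```

Because `new` blesses into whatever class invoked it, the object comes out as an XML::MyThingy, and every method of the parent parser works on it unchanged.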
Purchase this book now or read it online at Safari to get the whole thing! - Converting XML to HTML with XSLT
- Content preview

If you've done any web hacking with Perl before, then you've kinda-sorta used XML, since HTML isn't too far off from the well-formedness goals of XML, at least in theory. In practice, HTML is used more frequently as a combination of markup, punctuation, embedded scripts, and a dozen other things that make web pages act nutty (with most popular web browsers being rather forgiving about syntax).

Currently, and probably for a long time to come, the language of the Web remains HTML. While you can use bona fide XML in your web pages by clinging to the W3C's XHTML, it's far more likely that you'll need to turn your XML into HTML when you want to apply it to the Web.

You can go about this in many ways. The most sledgehammery of these involves parsing your document and tossing out the results in a CGI script. This example reads a local MonkeyML file of my pet monkeys' names and prints a web page to standard output (using Lincoln Stein's ubiquitous CGI module to add a bit of syntactic sugar):
#!/usr/bin/perl

use warnings;
use strict;
use CGI qw(:standard);
use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc = $parser->parse_file('monkeys.xml');

print header;
print start_html("My Pet Monkeys");
print h1("My Pet Monkeys");
print p("I have the following monkeys in my house:");
print "<ul>\n";
foreach my $name_node ($doc->documentElement->findnodes("//mm:name")) {
    print "<li>" . $name_node->firstChild->getData . "</li>\n";
}
print "</ul>\n";
print end_html;
Another approach involves XSLT. XSLT is used to translate one type of XML into another. XSLT factors in strongly here because using XML and the Web often requires that you extract all the presentable pieces of information from an XML document and wrap them up in HTML. One very high-level XML-using application, Matt Sergeant's AxKit (https://www.axkit.org), bases an entire application server framework around this notion, letting you set up a web site that uses XML as its source files, but whose final output to web browsers is HTML (and whose final output to other devices is whatever format best applies to them).

Let's make a little module that converts DocBook files into HTML on the fly. Though our goals are not as ambitious as AxKit's, we'll still take a cue from that program by basing our code around the Apache

Additional content appearing in this section has been removed.
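The kind of translation XSLT performs can be sketched with a tiny stylesheet fragment. The two templates below, which turn a DocBook para into an HTML paragraph and a DocBook ulink into a hyperlink, are an invented illustration rather than the chapter's actual code:

```
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Turn each DocBook para into an HTML paragraph -->
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Turn each DocBook ulink into an HTML hyperlink -->
  <xsl:template match="ulink">
    <a href="{@url}"><xsl:apply-templates/></a>
  </xsl:template>

</xsl:stylesheet>
```

An XSLT processor walks the source tree, and whichever template matches each node emits the corresponding HTML; everything not matched falls through to the default rules.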
Purchase this book now or read it online at Safari to get the whole thing! - A Comics Index
- Content preview

XSLT is one thing, but the potential for Perl, XML, and the Web working together is as unlimited as, well, anything else you might choose to do with Perl and the Web. Sometimes you can't just toss refactored XML at your clients, but must write Perl that wrings interesting information out of XML documents and builds something Webbish out of the results. We did a little of that in the previous example, mixing the raw XSLT usage when transforming the DocBook documents with index page generation.

Since we've gone through all the trouble of covering syndication-enabling XML technologies such as RSS and ComicsML in this chapter and Chapter 9, let's write a little program that uses web syndication. To prove (or perhaps belabor) a point, we'll construct a simple CGI program that builds an index of the user's favorite online comics (which, in our fantasy world, all have ComicsML documents associated with them):
#!/usr/bin/perl

# A very simple ComicsML muncher; given a list of URLs pointing to
# ComicsML documents, fetch them, flatten their strips into one list,
# and then build a web page listing, linking to, and possibly
# displaying these strips, sorted with newest first.

use warnings;
use strict;

use XML::ComicsML;   # ...so that we can build ComicsML objects
use CGI qw(:standard);
use LWP;
use Date::Manip;     # Cuz we're too bloody lazy to do our own date math

# Let's assume that the URLs of my favorite Internet funnies' ComicsML
# documents live in a plaintext file on disk, with one URL per line
# (What, no XML? For shame...)
my $url_file = $ARGV[0] or die "Usage: $0 url-file\n";

my @urls;            # List of ComicsML URLs
open (URLS, $url_file) or die "Can't read $url_file: $!\n";
while (<URLS>) {
    chomp;
    push @urls, $_;
}
close (URLS) or die "Can't close $url_file: $!\n";

# Make an LWP user agent
my $ua = LWP::UserAgent->new;
my $parser = XML::ComicsML->new;

my @strips;          # This will hold objects representing comic strips

foreach my $url (@urls) {
    my $request = HTTP::Request->new(GET=>$url);
    my $result = $ua->request($request);
    my $comic;       # Will hold the comic we'll get back
    if ($result->is_success) {
        # Let's see if the ComicsML parser likes it.
        unless ($comic = $parser->parse_string($result->content)) {
            # Doh, this is not a good XML document.
            warn "The document at $url is not good XML!\n";
            next;
        }
    } else {
        warn "Error at $url: " . $result->status_line . "\n";
        next;
    }
    # Now peel all the strips out of the comic, pop each into a little
    # hashref along with some information about the comic itself.
    foreach my $strip ($comic->strips) {
        push (@strips, {strip=>$strip, comic_title=>$comic->title,
                        comic_url=>$comic->url});
    }
}

# Sort the list of strips by date. (We use Date::Manip's exported
# UnixDate function here, to turn their unwieldy Gregorian calendar
# dates into nice clean Unixy ones)
my @sorted = sort {UnixDate($$a{strip}->date, "%s") <=>
                   UnixDate($$b{strip}->date, "%s")} @strips;

# Now we build a web page!
print header;
print start_html("Latest comix");
print h1("Links to new comics...");

# Go through the sorted list in reverse, to get the newest at the top.
foreach my $strip_info (reverse(@sorted)) {
    my ($title, $url);
    my $strip = $$strip_info{strip};
    $title = join (" - ", $strip->title, $strip->date);
    # Hyperlink the title to a URL, if there is one provided
    if ($url = $strip->url) {
        $title = "<a href='$url'>$title</a>";
    }
    # Give similar treatment to the comic's title and URL
    my $comic_title = $$strip_info{comic_title};
    if ($$strip_info{comic_url}) {
        $comic_title = "<a href='$$strip_info{comic_url}'>$comic_title</a>";
    }
    # Print the titles
    print p("<b>$comic_title</b>: $title");
    print "<hr />";
}
print end_html;
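The sort-then-reverse idiom at the heart of that script can be tried on its own with plain epoch numbers standing in for Date::Manip's converted dates; the strip hashrefs below are invented stand-ins for the ones the CGI script builds:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented stand-ins for the strip hashrefs built by the CGI script,
# with dates already reduced to Unix epoch seconds.
my @strips = (
    { comic_title => 'Monkey Tales',  date => 1019520000 },
    { comic_title => 'Robot Friday',  date => 1019606400 },
    { comic_title => 'Penguin Diary', date => 1019433600 },
);

# Sort oldest-first on the numeric date key...
my @sorted = sort { $$a{date} <=> $$b{date} } @strips;

# ...then walk the list in reverse to get the newest at the top,
# exactly as the CGI script does.
for my $strip (reverse @sorted) {
    print "$$strip{comic_title}\n";
}
```

Running this prints Robot Friday first and Penguin Diary last, newest to oldest.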
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!