pullparser
This module is currently unmaintained (it is now part of mechanize, but the interface is no longer public).
A simple "pull API" for HTML parsing, after Perl's
HTML::TokeParser. Many simple HTML parsing tasks are
simpler this way than with the HTMLParser module.
pullparser.PullParser is a subclass of
HTMLParser.HTMLParser.
Examples:
This program extracts all links from a document. It will print one line for
each link, containing the URL and the textual description between the
<a>...</a> tags:
    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    for token in p.tags("a"):
        if token.type == "endtag": continue
        url = dict(token.attrs).get("href", "-")
        text = p.get_compressed_text(endat=("endtag", "a"))
        print "%s\t%s" % (url, text)
This program extracts the <title> from the document:
    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    if p.get_tag("title"):
        title = p.get_compressed_text()
        print "Title: %s" % title
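For comparison, here is a rough sketch of the same title extraction written in callback style. It uses Python 3's html.parser.HTMLParser (the renamed successor of the Python 2 HTMLParser module), not pullparser itself. The names TitleExtractor and extract_title are made up for this illustration, and the whitespace collapsing only approximates what the name get_compressed_text suggests; that part is an assumption, not a description of pullparser's implementation.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside the first <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while between <title> and </title>
        self.done = False      # True once the first </title> has been seen
        self.parts = []        # text fragments found inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the first title's text with runs of whitespace collapsed."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    # Assumption: collapse whitespace, as "compressed text" suggests.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    page = "<html><head><title> An  example\n page </title></head></html>"
    print("Title: %s" % extract_title(page))
```

The callback version has to track its own state (in_title, done) across three handler methods, which is the bookkeeping the pull-style get_tag() / get_compressed_text() calls avoid.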
Thanks to Gisle Aas, who wrote HTML::TokeParser.
Download
All documentation (including this web page) is included in the distribution.
Stable release.
- pullparser-0.1.0.tar.gz
- pullparser-0.1.0.zip
- Change Log (included in distribution)
- Older versions.
For installation instructions, see the INSTALL file included in the distribution.
Subversion
The Subversion (SVN) trunk is https://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check out the source:
svn co https://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser
See also
I recommend Beautiful Soup over pullparser for new web scraping code. It is widely recommended, and is more robust and flexible than this module.
FAQs
- Which version of Python do I need?
2.2.1 or above.
- Which license?
pullparser is dual-licensed: you may pick either the BSD license, or the ZPL 2.1 (both are included in the distribution).
- Why does it fail to parse my HTML?
Because module HTMLParser is fussy. Try pullparser.TolerantPullParser instead, which is based on module sgmllib. Note that self-closing tags (<foo/>) will show up as 'starttag' tags, not 'startendtag' tags, if you use this class - this is a limitation of module sgmllib.
- Why don't I see the tokens I expect?
- Are there missing end-tags in your HTML? (Maybe this will improve in future.)
- Element names passed to methods such as PullParser.get_token() must be given in lower case - maybe you forgot that? (Element names in the HTML can be any case, of course.)
HTMLParser.HTMLParser isn't very robust. Would be fairly easy to (perhaps optionally) rebase on the other standard library HTML parsing module, sgmllib.SGMLParser (which is really an HTML parser, not a full SGML parser, despite the name). I'm not going to do that, though.
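The lower-case rule and the 'starttag' / 'startendtag' distinction mentioned above can be seen in the underlying standard library parser. The sketch below uses Python 3's html.parser.HTMLParser rather than pullparser, so it shows only the behaviour of the base parser. TokenRecorder and record_tokens are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class TokenRecorder(HTMLParser):
    """Hypothetical helper: records (type, tag) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("starttag", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        self.tokens.append(("startendtag", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("endtag", tag))


def record_tokens(html):
    """Parse html and return the list of recorded (type, tag) pairs."""
    recorder = TokenRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.tokens


if __name__ == "__main__":
    # Upper-case markup on purpose: the parser reports tag names in lower case.
    for token in record_tokens('<A HREF="x">link</A><BR/>'):
        print(token)
```

Tag names arrive in lower case whatever their case in the markup, which is why a lookup for "A" rather than "a" would never match anything.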
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, May 2006.