pullparser

This module is currently unmaintained (now part of mechanize, but interface no longer public).

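A simple "pull API" for HTML parsing, modeled on Perl's HTML::TokeParser. With a pull API, your code asks the parser for the next token when it wants one, rather than having the parser call your handler methods as the HTMLParser module does. Many simple HTML parsing tasks are easier to write this way. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.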

Examples:

This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:

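import pullparser, sys
f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
for token in p.tags("a"):
    # tags() yields end tags as well as start tags; skip the end tags
    if token.type == "endtag": continue
    url = dict(token.attrs).get("href", "-")  # "-" if the anchor has no href
    # text up to the matching </a>
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)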

This program extracts the <title> from the document:

import pullparser, sys
f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
if p.get_tag("title"):  # advance to the next <title> tag
    title = p.get_compressed_text()
    print "Title: %s" % title
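
The examples above are Python 2 and need the pullparser module itself. For readers who only have Python 3, here is a rough sketch of the same pull style built on the standard library's html.parser. It is not the pullparser implementation: MiniPullParser, Token, get_text and extract_links are hypothetical names made up for this illustration, and unlike a real pull parser it parses the whole document up front and queues the tokens.

```python
from collections import deque
from html.parser import HTMLParser


class Token:
    """One parse event; type is 'starttag', 'endtag' or 'data'."""

    def __init__(self, type, data, attrs=None):
        self.type = type
        self.data = data      # tag name, or text for 'data' tokens
        self.attrs = attrs    # list of (name, value) pairs for start tags


class MiniPullParser(HTMLParser):
    """Hypothetical stand-in for a pull parser (not the pullparser API)."""

    def __init__(self, html):
        super().__init__()
        self._tokens = deque()
        # Assumption for brevity: parse everything now and queue the tokens.
        self.feed(html)
        self.close()

    # The push-style callbacks only record what they are handed.
    def handle_starttag(self, tag, attrs):
        self._tokens.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self._tokens.append(Token("endtag", tag))

    def handle_data(self, data):
        self._tokens.append(Token("data", data))

    def get_token(self):
        """Return the next token, or None when there are no more."""
        return self._tokens.popleft() if self._tokens else None

    def tags(self, *names):
        """Yield start and end tags whose (lower-case) name is in names."""
        while True:
            token = self.get_token()
            if token is None:
                return
            if token.type in ("starttag", "endtag") and token.data in names:
                yield token

    def get_text(self, endat):
        """Collect text up to, but not including, the (type, name) token."""
        parts = []
        while self._tokens:
            token = self._tokens[0]
            if (token.type, token.data) == endat:
                break
            self._tokens.popleft()
            if token.type == "data":
                parts.append(token.data)
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(parts).split())


def extract_links(html):
    """Same logic as the link example, returning (url, text) pairs."""
    p = MiniPullParser(html)
    links = []
    for token in p.tags("a"):
        if token.type == "endtag":
            continue
        url = dict(token.attrs).get("href", "-")
        links.append((url, p.get_text(endat=("endtag", "a"))))
    return links
```

The calling code in extract_links keeps the shape of the link example: a plain loop that asks for the next tag and then pulls text until the closing tag, with no handler methods or state flags of its own.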

Thanks to Gisle Aas, who wrote HTML::TokeParser.

Download

All documentation (including this web page) is included in the distribution.

Stable release.

For installation instructions, see the INSTALL file included in the distribution.

Subversion

The Subversion (SVN) trunk is https://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check out the source:

svn co https://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser

FAQs

Which version of Python do I need?
2.2.1 or above.
Which license?
pullparser is dual-licensed: you may pick either the BSD license, or the ZPL 2.1 (both are included in the distribution).
Why does it fail to parse my HTML?
Because module HTMLParser is fussy. Try pullparser.TolerantPullParser, which uses module sgmllib instead. Note that if you use this class, self-closing tags (<foo/>) will show up as 'starttag' tags, not 'startendtag' tags. This is a limitation of module sgmllib.
Why don't I see the tokens I expect?
- Are there missing end-tags in your HTML? (Maybe this will improve in future.)
- Element names passed to methods such as PullParser.get_tag() and PullParser.tags() must be given in lower case. Maybe you forgot that? (Element names in the HTML can be any case, of course.)
- HTMLParser.HTMLParser isn't very robust. It would be fairly easy to rebase (perhaps optionally) on the other standard library HTML parsing module, sgmllib.SGMLParser (which is really an HTML parser, not a full SGML parser, despite the name). I'm not going to do that, though.
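
Two of the points above (lower-case element names, and how self-closing tags are reported) can be seen directly in the underlying standard library parser. The sketch below uses Python 3's html.parser, the renamed successor of the HTMLParser module. EventRecorder and record_events are hypothetical names for this illustration only. sgmllib was removed in Python 3, so the TolerantPullParser behaviour is not shown.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record (event type, tag name) pairs exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        self.events.append(("startendtag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


def record_events(html):
    """Parse html and return the list of recorded tag events."""
    recorder = EventRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    for event in record_events('<A HREF="x">link</A><BR/>'):
        print(event)
```

The parser hands tag names to the handlers already converted to lower case, whatever case the HTML used, which is why a lookup for "A" never matches. The self-closing <BR/> arrives through handle_startendtag rather than handle_starttag.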

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, May 2006.
