ClientTable
This module is unmaintained.
ClientTable is a Python module for generic HTML table parsing. It
is most useful in conjunction with other parsers (htmllib or
HTMLParser, regular expressions, etc.), so that the parsing work is
divided between your own code and ClientTable.
    import ClientTable
    import urllib2

    response = urllib2.urlopen("https://www.acme.com/tables.html")
    tables = ClientTable.ParseFile(response, collapse_whitespace=1)
    table = tables[0]

    # Indexing a table with a string-like object gets the column under that
    # header.  ClientTable uses the first row of headers in the table by
    # default.
    assert str(table.headers_row[0]) == "Widget production"
    row = table[1]
    col = table["Widget production"]
    cell = col[1]
    cell2 = row["Widget production"]
    cell3 = row.get_cell_by_nr(0)
    assert cell is cell2 is cell3

    # HTMLTables are Python 2.2 iterators
    print ", ".join(table.headers)
    for row in table:
        # TableRows are sequences
        if row.is_header: continue
        print ", ".join(map(str, row))

    # HTMLTable.col_iter returns a Python 2.2 iterator over columns
    for col in table.col_iter():
        # TableColumns are sequences, too
        col_data = filter(lambda item: not item.is_header, col)
        col_data = map(lambda el: int(str(el)), col_data)
        print "sum of", col.header, "=", reduce(lambda x,y: x+y, col_data)
Python 2.2 or above is required. I will probably backport it to at least Python 2.0 later.
For full documentation, see the docstrings in ClientTable.py.
Download
WARNING: This is an alpha release: interfaces will change, and don't
expect everything to work! I'm looking for feedback on the API at the
moment, so comments are particularly welcome.
One thing that will certainly change is the way column headers are
specified by indexing, and by the various get_x_by_name
methods. At the moment, an exact match is required. I plan to change
this so a substring match is the default, with regular expression
search, tag-stripping and exact matching as optional arguments.
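The matching modes described above could look roughly like the sketch below. This is not ClientTable code: header_matches, its mode and strip arguments, and the regex-based tag stripping are all hypothetical names chosen for illustration, and the sketch is written for a current Python 3 rather than the Python 2.2 the module targets.

```python
import re


def _strip_tags(text):
    # Crude illustration only: drop anything that looks like an HTML tag.
    return re.sub(r"<[^>]*>", "", text)


def header_matches(header, wanted, mode="substring", strip=False):
    """Return True if header matches wanted.

    Hypothetical helper, not part of ClientTable.
    mode is one of "exact", "substring" or "regex".
    """
    if strip:
        header = _strip_tags(header)
    if mode == "exact":
        return header == wanted
    if mode == "substring":
        return wanted in header
    if mode == "regex":
        return re.search(wanted, header) is not None
    raise ValueError("unknown mode: %r" % (mode,))


headers = ["<b>Widget production</b>", "Widget sales", "Region"]

# Substring matching (the planned default) finds both "Widget" headers.
widget_headers = [h for h in headers if header_matches(h, "Widget")]

# Exact matching only succeeds once the markup has been stripped.
exact = [h for h in headers
         if header_matches(h, "Widget production", mode="exact", strip=True)]
```

With a substring default, a short string such as "Widget" selects every header containing it, so exact matching would remain useful whenever two headers share a prefix.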
Missing features: the single_span and nr_toplevel_to_parse arguments
to ParseFile are not yet implemented.
- ClientTable-0.0.1a.tar.gz
- ClientTable-0_0_1a.zip
- Change Log (included in distribution)
FAQs - pre-install
- Which version of Python do I need?
2.2 or above.
- Do I have to use urllib2?
Of course not!
- Which license?
The MIT license.
- Where can I find out more about the HTML standard?
- W3C HTML 4.01 Specification.
- RFC 1866 - the HTML 2.0 specification.
- RFC 1942 - HTML Tables.
FAQs - usage
- Does it cope with nested tables?
Yes.
- Does it cope with all the various rowspans and colspans?
Yes.
- How?
You get the same cell object back more than once when indexing or iterating, unless you pass the single_span argument to ParseFile (not implemented yet).
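One way to picture the answer above is a grid in which a spanning cell occupies every position it covers, by reference. The sketch below is an assumption about the general technique, not ClientTable's actual implementation; Cell and build_grid are hypothetical names, and the code targets a current Python 3.

```python
class Cell(object):
    """Hypothetical stand-in for a parsed table cell."""

    def __init__(self, text, rowspan=1, colspan=1):
        self.text = text
        self.rowspan = rowspan
        self.colspan = colspan


def build_grid(rows):
    """Map (row, column) positions to Cell objects.

    A cell with rowspan or colspan greater than 1 is stored at every
    position it covers, so the same object is seen more than once.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = cell
            c += cell.colspan
    return grid


tall = Cell("North", rowspan=2)   # covers column 0 of rows 0 and 1
wide = Cell("Totals", colspan=2)  # covers columns 1 and 2 of row 1
grid = build_grid([
    [tall, Cell("1"), Cell("2")],
    [wide],
])

# Indexing either covered position returns the very same object.
assert grid[(0, 0)] is grid[(1, 0)]
assert grid[(1, 1)] is grid[(1, 2)]
```

Code that sums a column over such a grid would count a spanning cell once per position it covers, which is presumably the situation the single_span argument is meant to address.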
- How do I strip HTML tags from cell contents?
Pass the strip_tags argument to ParseFile. Note that headers are cells, too.
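For a sense of what tag stripping does to cell contents, here is a minimal standalone sketch using the standard library's html.parser in a current Python 3. It does not call ClientTable and may differ from what strip_tags does internally; the strip_tags function and the _TextOnly class below are illustrative names only.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text content and ignore all tags."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    # Illustrative helper: return markup with tags removed and
    # character references such as &amp; decoded.
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


header_text = strip_tags('<b>Widget</b> <a href="/p">production</a>')
assert header_text == "Widget production"
```

Because headers are cells too, stripping tags this way is what lets a header written as bold or linked markup be looked up by its plain text.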
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, May 2006.