mxTextTools - Fast Text Processing for Python

Introduction
mxTextTools™ is a collection of high-speed string manipulation routines and new Python objects for dealing with common text processing tasks.
Tagging Engine
One of the major features of this package is the integrated mxTextTools Tagging Engine, which provides the speed of compiled C programs while maintaining the portability of Python. The Tagging Engine uses byte code "programs" written in the form of Python tuples. These programs are then compiled into an internal binary form that is processed by a very fast virtual machine designed specifically for scanning text data.
As a result, the Tagging Engine allows parsing text at higher speeds than, for example, regular expression packages while still maintaining the flexibility of programming the parser in Python. Callbacks and user-defined matching functions extend this approach far beyond what you could do with other common text processing methods.
About the word "tagging": it originates from what is done in SGML, HTML and XML, namely marking text with extra information. The Tagging Engine abstracts this notion to assigning Python objects to text substrings. Every substring marked in this way carries a 'tag' (the tag object), which can be used to do all kinds of useful things.
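To make the idea of tuple-based "programs" and tagged substrings more concrete, here is a minimal pure-Python sketch. It is a toy illustration only and does not use or reproduce the mxTextTools API: the command names ALL_IN and WORD, the three-element entry layout and the run_table() helper are all assumptions made for this example, and the real engine runs compiled tables in C rather than interpreting them in Python.

```python
# Toy illustration only: this is NOT the mxTextTools API. The command names,
# the entry layout and the run_table() helper are hypothetical.
import string

ALL_IN = "all_in"   # match one or more characters from a set
WORD = "word"       # match a literal string

# A "program" is a plain tuple of (tag_object, command, argument) entries.
TABLE = (
    ("word",   ALL_IN, string.ascii_letters),
    ("number", ALL_IN, string.digits),
    (None,     ALL_IN, " \t\n"),   # whitespace is consumed but not tagged
    ("arrow",  WORD,   "->"),
)

def match_entry(text, pos, command, argument):
    """Return the end index of the match at pos, or pos if nothing matched."""
    if command == ALL_IN:
        end = pos
        while end < len(text) and text[end] in argument:
            end += 1
        return end
    if command == WORD:
        return pos + len(argument) if text.startswith(argument, pos) else pos
    raise ValueError("unknown command: %r" % (command,))

def run_table(text, table):
    """Scan text with the table, returning (taglist, next_index)."""
    taglist = []
    pos = 0
    while pos < len(text):
        for tag_object, command, argument in table:
            end = match_entry(text, pos, command, argument)
            if end > pos:
                # Assign the tag object to the substring text[pos:end].
                if tag_object is not None:
                    taglist.append((tag_object, pos, end))
                pos = end
                break
        else:
            break  # no entry matched at this position: stop scanning
    return taglist, pos

taglist, next_index = run_table("abc -> 42", TABLE)
```

Each entry of the resulting tag list pairs a tag object with the slice of text it marks, so the caller can decide afterwards what to do with every tagged substring.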
Search Objects
The two other major features of mxTextTools are the search and character set objects provided by the package. Both are implemented in C to give you maximum performance on all supported platforms.
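As a rough picture of what a character set object does, the following pure-Python sketch builds a set from a range definition and uses it to scan text. It is a toy illustration only and not the mxTextTools implementation or API: the SimpleCharSet class, its definition syntax and its method names are assumptions made for this example.

```python
# Toy illustration only: NOT the mxTextTools character set API.
class SimpleCharSet:
    """Set of characters built from a definition such as "a-z0-9_"."""

    def __init__(self, definition):
        chars = set()
        i = 0
        while i < len(definition):
            # "x-y" denotes the inclusive range from x to y.
            if i + 2 < len(definition) and definition[i + 1] == "-":
                lo, hi = ord(definition[i]), ord(definition[i + 2])
                chars.update(chr(c) for c in range(lo, hi + 1))
                i += 3
            else:
                chars.add(definition[i])
                i += 1
        self.chars = frozenset(chars)

    def contains(self, char):
        """Return True if char is a member of the set."""
        return char in self.chars

    def search(self, text, start=0):
        """Return the index of the first character in the set, or None."""
        for index in range(start, len(text)):
            if text[index] in self.chars:
                return index
        return None

    def match(self, text, start=0):
        """Return how many consecutive characters from start are in the set."""
        end = start
        while end < len(text) and text[end] in self.chars:
            end += 1
        return end - start

identifier_chars = SimpleCharSet("a-zA-Z0-9_")
```

For example, identifier_chars.search("  foo") locates the first identifier character, and identifier_chars.match("foo_1 = 2") measures how long the leading identifier is.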
Using mxTextTools for Language Parsing
At EuroPython 2007, we gave a talk about mxTextTools and how it can be used to parse languages. Please see our Presentations & Talks section for details.
Features
- Fast, memory efficient, highly customizable.
- High-performance text scanner that runs compiled byte-code on a portable virtual machine.
- Allows writing scanners that work at C speed without the need to drop to C for programming.
- Faster and more flexible than regular expressions.
- Fast search objects.
- Efficient character set matching objects.
- Handy routines for everyday text manipulation work.
- Works on 8-bit strings as well as Unicode text.
- Stable, robust and portable.
- Free to use and redistribute.
System Requirements
mxTextTools is written in a very portable way and works on pretty much all platforms where you can compile Python.
We provide precompiled versions of mxTextTools for all standard platforms, so all you need is a working Python installation. The package supports all Python versions since Python 2.1.
The only requirement for compiling the package from source is an ANSI C compiler. There are no third-party libraries needed.
License
mxTextTools is provided as part of the eGenix.com mx Base Distribution. Please see the mx Base Distribution page for details regarding the license.
Documentation
The following documentation is available for mxTextTools:
mxTextTools User Manual and Reference Guide - HTML and PDF
This manual includes a discussion of the various design principles behind mxTextTools Tagging Engine and the search objects, their implementation, as well as a reference of the available programming interfaces.
The PDF file is also available as part of the installation and can be found in the mx/TextTools/Doc/ folder.
Books
If you are looking for more tutorial-style documentation of mxTextTools, there's a book by David Mertz, Text Processing in Python, which covers mxTextTools and other text-oriented tools at great length.
Download & Installation
mxTextTools is provided as part of the eGenix.com mx Base Distribution. Please see the mx Base Distribution page for downloads and installation instructions.
References
mxTextTools was originally written for the eGenix.com Application Server to allow fast templating of web pages and related resources.
Since then, it has been used in a wide variety of other areas. Some notable and publicly available applications using mxTextTools are BioPython (Andrew Dalke's Martel uses it as its parsing engine) and SimpleParse (Mike Fletcher's parser generator for mxTextTools, which he uses for parsing VRML files). See also David Mertz's article about it on IBM Developer Works.
History & Changes
Please see the change log for details regarding changes to the package between releases.