mxTextTools - Fast Text Manipulation Tools for Python

Version 2.1.0

Introduction

mxTextTools is a collection of high-speed string manipulation routines and new Python objects for dealing with common text processing tasks.

One of the major features of this package is the integrated Tagging Engine which allows accessing the speed of compiled C programs while maintaining the portability of Python. The Tagging Engine uses byte code "programs" written in the form of Python tuples. These programs are then translated into an internal binary form which gets processed by a very fast virtual machine designed specifically for scanning text data.

As a result, the Tagging Engine allows parsing text at higher speeds than e.g. regular expression packages while still maintaining the flexibility of programming the parser in Python. Callbacks and user-defined matching functions extend this approach far beyond what you could do with other common text processing methods.

Two other major features are the search and character set objects provided by the package. Both are implemented in C to give you maximum performance on all supported platforms.

A note about the word 'tagging': This originated from what is done in HTML to mark some text with a certain extra information. The Tagging Engine extends this notion to assigning Python objects to text substrings. Every substring marked in this way carries a 'tag' (the object) which can be used to do all kinds of useful things.

If you are looking for more tutorial style documentation of mxTextTools, there's a new book by David Mertz about Text Processing with Python which covers mxTextTools and other text oriented tools at great length.

Tagging Engine

The Tagging Engine is a low-level virtual machine (VM) which executes text search specific byte codes. This byte code is passed to the engine in the form of Tag Tables which define the "program" to execute in terms of commands and command arguments.

The Tagging Engine is capable of handling 8-bit text and Unicode (with some minor exceptions). Even combinations of the two string formats are accepted, but should be avoided for performance reasons in production code.

Tag List

Marking certain parts of a text should not involve storing hundreds of small strings. This is why the Tagging Engine uses a specially formatted list of tuples to return the results.

A Tag List is a list of tuples marking certain slices of a text. The tuples always have the format

(object, left_index, right_index, subtags)

with the meaning: object contains information about the slice [left_index:right_index] in some text. subtags is either another Tag List created by recursively invoking the Tagging Engine or None.

Note: Only the commands Table and TableInList create new Tag Lists and make them available via subtags and then only if the Tagging Engine was not called with None as value for the taglist. All other commands set this tuple entry to None. This is important to know if you want to analyze a generated Tag List, since it may require recursing into the subtags Tag List if that entry is not None.
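
The recursion mentioned in the note can be sketched in plain Python. The Tag List below is built by hand purely for illustration (a real one would be returned by tag()), and the helper name walk_taglist is hypothetical, not part of mxTextTools.

```python
def walk_taglist(text, taglist, depth=0):
    """Yield (depth, tagobj, matched_text) for each entry, recursing into subtags."""
    for tagobj, left, right, subtags in taglist:
        yield depth, tagobj, text[left:right]
        if subtags is not None:
            # subtags is itself a Tag List, so the same walk applies one level down
            yield from walk_taglist(text, subtags, depth + 1)


text = "key=value"
# Hand-built Tag List in the (object, left_index, right_index, subtags) format.
taglist = [
    ("pair", 0, 9, [("name", 0, 3, None), ("data", 4, 9, None)]),
]

entries = list(walk_taglist(text, taglist))
for depth, tagobj, matched in entries:
    print("  " * depth + f"{tagobj}: {matched!r}")
```

Entries whose subtags field is None are leaves, so the walk stops there.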

Tag Table

To create such taglists, you have to define a Tag Table and let the Tagging Engine use it to mark the text. Tag Tables are defined using standard Python tuples containing other tuples in a specific format:

tag_table = (('lowercase', AllIn, a2z, +1, +2),
             ('upper', AllIn, A2Z, +1),
             (None, AllIn, white+newline, +1),
             (None, AllNotIn, alpha+white+newline, +1),
             (None, EOF, Here, -4))  # EOF

The tuples contained in the table use a very simple format:

(tagobj, command+flags, command_argument [,jump_no_match] [,jump_match=+1])

Starting with version 2.1.0 of mxTextTools, the Tagging Engine no longer uses these tuples directly, but instead compiles the Tag Table definitions into special TagTable objects. These objects are then processed by the Tagging Engine.

Even though the tag() Tagging Engine API compiles Tag Table definitions into the TagTable object on-the-fly, you can also compile the definitions yourself and then pass the TagTable object directly to tag().

Jump Target Support

To simplify writing Tag Table definitions, the Tag Table compiler also allows using string jump targets instead of jump offsets in the tuples:

tag_table = (
    'start',
    ('lowercase', AllIn, a2z, +1, 'skip'),
    ('upper', AllIn, A2Z, 'skip'),

    'skip',
    (None, AllIn, white+newline, +1),
    (None, AllNotIn, alpha+white+newline, +1),

    (None, EOF, Here, 'start'))  # EOF

These strings can be used as jump targets for jne and je when compiling the definition using TagTable() or UnicodeTagTable() and then get replaced with the numeric relative offsets at compile time.

The Tagging Engine has a new command JumpTarget for this purpose which is implemented as no operation (NOP) command.
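
To make the replacement of string targets by relative offsets concrete, here is a minimal pure-Python sketch. It assumes that each target string stays in the table as a JumpTarget no-op entry, so that offsets are counted across those entries; that layout, the string command names and the function resolve_jump_targets are assumptions for illustration, not the actual compiler.

```python
def resolve_jump_targets(definition):
    """Return a copy of the definition with string jump targets made numeric."""
    # Pass 1: remember the table index of every string target.
    targets = {entry: index
               for index, entry in enumerate(definition)
               if isinstance(entry, str)}

    resolved = []
    for index, entry in enumerate(definition):
        if isinstance(entry, str):
            # Assumed layout: the target remains as a no-op entry.
            resolved.append((None, "JumpTarget", entry))
            continue
        head, jumps = entry[:3], entry[3:]
        # Pass 2: a string jump becomes (target index - current index).
        offsets = tuple(targets[j] - index if isinstance(j, str) else j
                        for j in jumps)
        resolved.append(head + offsets)
    return tuple(resolved)


# String stand-ins replace the real command constants and character sets.
definition = (
    "start",
    ("lowercase", "AllIn", "abc", +1, "skip"),
    ("upper", "AllIn", "ABC", "skip"),
    "skip",
    (None, "AllIn", " \n", +1),
    (None, "AllNotIn", "abcABC \n", +1),
    (None, "EOF", "Here", "start"),
)
resolved = resolve_jump_targets(definition)
for entry in resolved:
    print(entry)
```

Only positions 3 and 4 of an entry are treated as jumps, so string command arguments are never mistaken for targets.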

Tag Table Compiler

Starting with version 2.1.0 of mxTextTools, the Tagging Engine uses compiled TagTable instances for performing the scanning. These TagTables are Python objects which can be created explicitly using a tag table definition in the form of a tuple or a list (the latter are not cacheable, so it's usually better to transform the list into a tuple before passing it to the TagTable constructor).

The TagTable() constructor will "compile" and check the tag table definition. It then stores the table in an internal data structure which allows fast access from within the Tagging Engine. The compiler also takes care of any needed conversions such as Unicode to string or vice-versa.

There are generally two different kinds of compiled TagTables: one for scanning 8-bit strings and one for Unicode. tag() will complain if you try to scan strings with a UnicodeTagTable or Unicode with a string TagTable.

Note that tag() can take TagTables and tuples as tag table input. If given a tuple, it will automatically compile the tuple into a TagTable needed for the requested type of text (string or Unicode).

Caching of Compiled Tag Tables

The TagTable() constructor caches compiled TagTables if they are defined by a tuple and declared as cacheable. In that case, the compiled TagTable will be stored in a dictionary addressed by the definition tuple's id() and be reused if the same compilation is requested again at some later point. The cache dictionary is exposed to the user as the tagtable_cache dictionary. It has a hard limit of 100 entries, but can also be managed by user routines to lower this limit.

Semantics

The Tagging Engine reads the Tag Table starting at the top entry. While performing the command actions (see below for details) it moves a read-head over the characters of the text. The engine stops when a command fails to match and no alternative is given or when it reaches a non-existing entry, e.g. by jumping beyond the end of the table.

Tag Table entries are processed as follows:

If the command matched, say the slice text[l:r], the default action is to append (tagobj,l,r,subtags) to the taglist (this behaviour can be modified by using special flags; if you use None as tagobj, no tuple is appended) and to continue matching with the table entry that is reached by adding jump_match to the current position (think of them as relative jump offsets). The head position of the engine stays where the command left it (over index r), e.g. for (None,AllIn,'A') right after the last 'A' matched.

In case the command does not match, the engine either continues at the table entry reached after skipping jump_no_match entries, or if this value is not given, terminates matching the current table and returns not matched. The head position is always restored to the position it was in before the non-matching command was executed, enabling backtracking.

The format of the command_argument is dependent on the command. It can be a string, a set, a search object, a tuple or some other wild animal from Python land. See the command section below for details.

A table matches a string if and only if the Tagging Engine reaches a table index that lies beyond the end of the table. The engine then returns matched ok. Jumping beyond the start of the table (to a negative table index) causes the table to return with result failed to match.
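
The jump rules described in this section can be illustrated with a tiny pure-Python interpreter. This is a sketch of the semantics only, not the C engine: it handles just three commands, uses string stand-ins for the integer command constants, and the name run_table is hypothetical.

```python
import string

ALLIN, ALLNOTIN, EOF = "AllIn", "AllNotIn", "EOF"  # stand-ins for the constants
HERE = None                                        # placeholder argument for EOF


def run_table(text, table, start=0, stop=None):
    """Return (success, taglist, nextindex) following the documented jump rules."""
    stop = len(text) if stop is None else stop
    taglist, x, i = [], start, 0          # x: head position, i: table index
    while 0 <= i < len(table):
        entry = table[i]
        tagobj, command, arg = entry[:3]
        jump_no_match = entry[3] if len(entry) > 3 else None
        jump_match = entry[4] if len(entry) > 4 else 1
        y = x
        if command == ALLIN:
            while y < stop and text[y] in arg:
                y += 1
            matched = y > x               # at least one character must match
        elif command == ALLNOTIN:
            while y < stop and text[y] not in arg:
                y += 1
            matched = y > x
        elif command == EOF:
            matched = x >= stop
        else:
            raise ValueError("unsupported command: %r" % (command,))
        if matched:
            if tagobj is not None:        # None as tagobj appends nothing
                taglist.append((tagobj, x, y, None))
            x = y                         # head stays where the command left it
            i += jump_match
        elif jump_no_match is None:
            return 0, taglist, x          # no alternative given: table fails
        else:
            i += jump_no_match            # head position x is left unchanged
    # Leaving past the end means success; a negative index means failure.
    return (1 if i >= len(table) else 0), taglist, x


a2z, A2Z = string.ascii_lowercase, string.ascii_uppercase
white = string.whitespace
tag_table = (
    ("lowercase", ALLIN, a2z, +1, +2),
    ("upper", ALLIN, A2Z, +1),
    (None, ALLIN, white, +1),
    (None, ALLNOTIN, a2z + A2Z + white, +1),
    (None, EOF, HERE, -4),
)
result = run_table("hello WORLD 42", tag_table)
print(result)
```

The last entry jumps back four entries until the head reaches the end of the text, at which point the default jump_match of +1 leaves the table and the run succeeds.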

Context Object Support

Starting with version 2.1.0, the Tagging Engine supports carrying along an optional context object. The object can be used for storing data specific to the tagging procedure, error information, etc.

You can access the context object in two ways. If you use a Python function as tag object together with the CallTag command flag, the function is called with the context object as last argument. Matching functions which are called as a result of the Call or CallArg commands also receive the context object as last argument.

To remain backward compatible, the context object is only provided as last argument if given to the tag() function.
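
A callback that works both with and without a context object could look like the following sketch. The engine is not involved here: the calls at the bottom only imitate how a CallTag callback would be invoked, and the function name and the dictionary used as context are assumptions chosen for illustration.

```python
def count_words(taglist, text, l, r, subtags, context=None):
    """CallTag-style callback: record the match and update the context if given."""
    taglist.append(("word", l, r, subtags))
    if context is not None:
        # Any Python object can serve as context; a dict is used here.
        context["words"] = context.get("words", 0) + 1


text = "spam eggs"

# Imitate two matches reported while a context object is in use.
taglist, context = [], {}
count_words(taglist, text, 0, 4, None, context)
count_words(taglist, text, 5, 9, None, context)

# Imitate a match reported when tag() was called without a context.
plain_taglist = []
count_words(plain_taglist, text, 0, 4, None)

print(taglist, context, plain_taglist)
```

Giving the context parameter a default value keeps the same function usable in both calling conventions.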

New commands which make use of the context object at a lower level will eventually appear in the Tagging Engine in future releases.

Tagging Commands

The commands and constants used here are integers defined in Constants/TagTables.py and imported into the package's root module. For the purpose of explaining the taken actions we assume that the tagging engine was called with tag(text,table,start=0,stop=len(text)). The current head position is indicated by x.

Command Matching Argument Action

Fail Here Causes the engine to fail matching at the current head position.

Jump To Causes the engine to perform a relative jump by jump_no_match entries.

AllIn string Matches all characters found in text[x:stop] up to the first that is not included in string. At least one character must match.

AllNotIn string Matches all characters found in text[x:stop] up to the first that is included in string. At least one character must match.

AllInSet set Matches all characters found in text[x:stop] up to the first that is not included in the string set. At least one character must match. Note: String sets only work with 8-bit text. Use AllInCharSet if you plan to use the tag table with 8-bit and Unicode text.

AllInCharSet CharSet object Matches all characters found in text[x:stop] up to the first that is not included in the CharSet. At least one character must match.

Is character Matches iff text[x] == character.

IsNot character Matches iff text[x] != character.

IsIn string Matches iff text[x] is in string.

IsNotIn string Matches iff text[x] is not in string.

IsInSet set Matches iff text[x] is in set. Note: String sets only work with 8-bit text. Use IsInCharSet if you plan to use the tag table with 8-bit and Unicode text.

IsInCharSet CharSet object Matches iff text[x] is contained in the CharSet.

Word string Matches iff text[x:x+len(string)] == string.

WordStart string Matches all characters up to the first occurrence of string in text[x:stop].
If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the first character of string in the found occurrence.
At least one character must match.

WordEnd string Matches all characters up to the first occurrence of string in text[x:stop].
If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the last character of string in the found occurrence.

sWordStart TextSearch object Same as WordStart except that the TextSearch object is used to perform the necessary action (which can be much faster) and zero matching characters are allowed.

sWordEnd TextSearch object Same as WordEnd except that the TextSearch object is used to perform the necessary action (which can be much faster).

sFindWord TextSearch object Uses the TextSearch object to find the given substring.
If found, the tagobj is assigned only to the slice of the substring. The characters leading up to it are ignored.
The head position is adjusted to right after the substring -- just like for sWordEnd.

Call function Calls the matching function(text,x,stop) or function(text,x,stop,context) if a context object was provided to the tag() function call.
The function must return the index y of the character in text[x:stop] right after the matching substring.
The entry is considered to be matching iff x != y. The engine's head is positioned on y in that case.

CallArg (function,[arg0,...]) Same as Call except that function(text,x,stop[,arg0,...]) or function(text,x,stop[,arg0,...],context) (if a context object is used) is being called.
The command argument must be a tuple.

Table tagtable or ThisTable Matches iff tagtable matches text[x:stop].
This calls the engine recursively.
In case of success the head position is adjusted to point right after the match and the returned taglist is made available in the subtags field of this table's taglist entry.
You may pass the special constant ThisTable instead of a Tag Table if you want to call the current table recursively.

SubTable tagtable or ThisTable Same as Table except that the subtable reuses this table's tag list for its tag list. The subtags entry is set to None.
You may pass the special constant ThisTable instead of a Tag Table if you want to call the current table recursively.

TableInList (list_of_tables,index) Same as Table except that the matching table to be used is read from the list_of_tables at position index whenever this command is executed.
This makes self-referencing tables possible which would otherwise not be possible (since Tag Tables are immutable tuples).
Note that it can also introduce circular references, so be warned !

SubTableInList (list_of_tables,index) Same as TableInList except that the subtable reuses this table's tag list. The subtags entry is set to None.

EOF Here Matches iff the head position is beyond stop. The match recorded by the Tagging Engine is the text[stop:stop].

Skip offset Always matches and moves the head position to x + offset.

Move position Always matches and moves the head position to slice[position]. Negative indices move the head to slice[len(slice)+position+1], e.g. (None,Move,-1) moves to EOF. slice refers to the current text slice being worked on by the Tagging Engine.

JumpTarget Target String Always matches, does not move the head position.
This command is only used internally by the Tag Table compiler, but can also be used for writing Tag Table definitions, e.g. to follow the path the Tagging Engine takes through a Tag Table definition.

Loop count Remains undocumented for this release.

LoopControl Break/Reset Remains undocumented for this release.
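
As an illustration of the Call and CallArg rows, the following sketch shows matching functions with the documented signatures function(text,x,stop) and function(text,x,stop,arg0). They are called by hand here rather than by the engine, and both function names are hypothetical.

```python
def match_digits(text, x, stop):
    """Call-style matcher: return the index right after a run of digits."""
    y = x
    while y < stop and text[y].isdigit():
        y += 1
    return y  # y == x signals "no match"


def match_any_of(text, x, stop, chars):
    """CallArg-style matcher: the extra argument follows text, x and stop."""
    y = x
    while y < stop and text[y] in chars:
        y += 1
    return y


text = "abc123def"
y_digits = match_digits(text, 3, len(text))
y_none = match_digits(text, 0, len(text))
y_chars = match_any_of(text, 0, len(text), "abc")

print(y_digits, y_digits != 3)   # entry matches because the index moved
print(y_none, y_none != 0)       # entry does not match
print(y_chars)
```

Returning x unchanged is the only way such a function reports a failed match, so it cannot match an empty substring.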

The following flags can be added to the command integers above:

CallTag

Instead of appending (tagobj,l,r,subtags) to the taglist upon successful matching, call tagobj(taglist,text,l,r,subtags) or tagobj(taglist,text,l,r,subtags,context) if a context object was passed to the tag() function.

AppendMatch

Instead of appending (tagobj,l,r,subtags) to the taglist upon successful matching, append the match found as string.

Note that this will produce non-standard taglists! It is useful in combination with join() though and can be used to implement smart split() replacement algorithms.

AppendToTagobj

Instead of appending (tagobj,l,r,subtags) to the taglist upon successful matching, call tagobj.append((None,l,r,subtags)).

AppendTagobj

Instead of appending (tagobj,l,r,subtags) to the taglist upon successful matching, append tagobj itself.

Note that this can cause the taglist to have a non-standard format, i.e. functions relying on the standard format could fail.

This flag is mainly intended to build join-lists usable by the join()-function (see below).

LookAhead

If this flag is set, the current position of the head will be reset to l (the left position of the match) after a successful match.

This is useful to implement lookahead strategies.

Using the flag has no effect on the way the tagobj itself is treated, i.e. it will still be processed in the usual way.
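
The effect of the flags on the taglist can be summarized in a short emulation. The string flag names stand in for the integer flag constants, LookAhead is left out because it only affects the head position, and the helper record_match is hypothetical.

```python
def record_match(flag, taglist, tagobj, text, l, r, subtags=None):
    """Apply the documented per-flag action for a successful match text[l:r]."""
    if flag is None:
        taglist.append((tagobj, l, r, subtags))      # default behaviour
    elif flag == "CallTag":
        tagobj(taglist, text, l, r, subtags)         # tagobj is a callable
    elif flag == "AppendMatch":
        taglist.append(text[l:r])                    # the matched string itself
    elif flag == "AppendToTagobj":
        tagobj.append((None, l, r, subtags))         # tagobj is a list-like object
    elif flag == "AppendTagobj":
        taglist.append(tagobj)                       # the tag object itself
    else:
        raise ValueError("unknown flag: %r" % (flag,))


text = "abcdef"
default_list, match_list, obj_list, own_list, called = [], [], [], [], []

record_match(None, default_list, "tag", text, 1, 3)
record_match("AppendMatch", match_list, "tag", text, 1, 3)
record_match("AppendTagobj", obj_list, "tag", text, 1, 3)
record_match("AppendToTagobj", [], own_list, text, 1, 3)
record_match("CallTag", called,
             lambda tl, t, l, r, s: tl.append(t[l:r].upper()), text, 1, 3)

print(default_list, match_list, obj_list, own_list, called)
```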

Some additional constants that can be used as argument or relative jump position:

To: Useful as argument for 'Jump'.
Here: Useful as argument for 'Fail' and 'EOF'.
MatchOk: Jumps to a table index beyond the table's end, causing the current table to immediately return with 'matches ok'.
MatchFail: Jumps to a negative table index, causing the current table to immediately return with 'failed to match'.
ToBOF,ToEOF: Useful as arguments for 'Move': (None,Move,ToEOF) moves the head to the character behind the last character in the current slice, while (None,Move,ToBOF) moves to the first character.
ThisTable: Useful as argument for 'Table' and 'SubTable'. See above for more information.

Internally, the Tag Table is used as program for a state machine which is coded in C and accessible through the package as tag() function along with the constants used for the commands (e.g. AllIn, AllNotIn, etc.). Note that in computer science one normally differentiates between finite state machines, pushdown automata and Turing machines. The Tagging Engine offers all these levels of complexity depending on which techniques you use, yet the basic structure of the engine is best compared to a finite state machine.

Tip: if you are getting an error 'call of a non-function' while writing a table definition, you probably have a missing ',' somewhere in the tuple !

Third Party Tools for Tag Table Writing

Writing these Tag Tables by hand is not always easy. However, since Tag Tables can easily be generated using Python code, it is possible to write tools which convert meta-languages into Tag Tables which then run on all platforms supported by mxTextTools at nearly C speeds.

Mike C. Fletcher has written a nice tool for generating Tag Tables using an EBNF notation. You may want to check out his SimpleParse add-on for mxTextTools.

Recently, Tony J. Ibbs has also started to work in this direction. His meta-language for mxTextTools aims at simplifying the task of writing Tag Table tuples.

More references to third party extensions or applications built on top of mxTextTools can be found in the Add-ons Section.

Debugging

The package includes a nearly complete Python emulation of the Tagging Engine in the Examples subdirectory (pytag.py). Though it is unsupported, it might still be of some use since it has a builtin debugger that will let you step through the Tag Tables as they are executed. See the source for further details.

As an alternative you can build a version of the Tagging Engine that provides lots of debugging output. See mxTextTools/Setup for explanations on how to do this. When enabled the module will create several .log files containing the debug information of various parts of the implementation whenever the Python interpreter is run with the debug flag enabled (python -d). These files should give a fairly good insight into the workings of the Tagging Engine (though it still isn't as elegant as it could be).

Note that the debug version of the module is almost as fast as the regular build, so you might as well do regular work with it.

TextSearch Object

The TextSearch object is immutable and usable for one search string per object only. However, once created, the TextSearch objects can be applied to as many text strings as you like -- much like compiled regular expressions. Matching is exact (with translations done on-the-fly if supported by the search algorithm).

Furthermore, the TextSearch objects can be pickled and implement the copy protocol as defined by the copy module. Comparisons and hashing are not implemented (the objects are stored by id in dictionaries).

Depending on the search algorithm, TextSearch objects can search in 8-bit strings and/or Unicode. Searching in memory buffers is currently not supported. Accordingly, the search string itself may also be an 8-bit string or Unicode.

TextSearch Object Constructors

In older versions of mxTextTools there were two separate constructors for search objects: BMS() for Boyer-Moore and FS() for the (unpublished) FastSearch algorithm. With 2.1.0 the interface was changed to merge these two constructors into one having the algorithm type as parameter.

Note: The FastSearch algorithm is *not* included in the public release of mxTextTools.

TextSearch(match,translate=None,algorithm=default_algorithm)

Create a TextSearch substring search object for the string match implementing the algorithm specified in the constructor.

algorithm defines the algorithm to use. Possible values are:

BOYERMOORE

Enhanced Boyer-Moore-Horspool style algorithm for searching in 8-bit text. Unicode is not supported. On-the-fly translation is supported.

FASTSEARCH

Enhanced Boyer-Moore style algorithm for searching in 8-bit text. This algorithm provides better performance for match patterns having repeating sequences, like e.g. DNA strings. Unicode is not supported. On-the-fly translation is supported.

Not included in the public release of mxTextTools.

TRIVIAL

Trivial right-to-left search algorithm. This algorithm can be used to search in 8-bit text and Unicode. On-the-fly translation is not supported.

algorithm defaults to BOYERMOORE (or FASTSEARCH if available) for 8-bit match strings and TRIVIAL for Unicode match strings.

translate is an optional translate-string like the one used in the module 're', i.e. a 256 character string mapping the ordinals of the base character set to new characters. It is supported by the BOYERMOORE and the FASTSEARCH algorithm only.

This function supports keyword arguments.
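
For readers unfamiliar with the algorithm family named above, here is a textbook Boyer-Moore-Horspool search in pure Python. It is not the enhanced variant implemented by the package and does not handle translate strings; the function name horspool_find is hypothetical.

```python
def horspool_find(text, match, start=0, stop=None):
    """Return the index of the first occurrence of match in text[start:stop], or -1."""
    stop = len(text) if stop is None else stop
    m = len(match)
    if m == 0:
        return start
    # Shift table: for each character of match except the last one, the
    # distance from its rightmost occurrence to the end of match.
    shift = {ch: m - 1 - i for i, ch in enumerate(match[:-1])}
    pos = start
    while pos + m <= stop:
        if text[pos:pos + m] == match:
            return pos
        # Skip ahead based on the text character aligned with the last
        # character of match; unknown characters allow a full-length skip.
        pos += shift.get(text[pos + m - 1], m)
    return -1


index = horspool_find("the quick brown fox", "brown")
missing = horspool_find("the quick brown fox", "green")
print(index, missing)
```

The skip table is what makes the search sublinear on average: the longer the match string, the larger the possible jumps.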

BMS(match[,translate])

DEPRECATED: Use TextSearch(match, translate, BOYERMOORE) instead.

FS(match[,translate])

DEPRECATED: Use TextSearch(match, translate, FASTSEARCH) instead.

TextSearch Object Instance Variables

To provide some help for reflection and pickling the TextSearch object gives (read-only) access to these attributes.

match: The string that the search object will look for in the search text.
translate: The translate string used by the object or None (if no translate string was passed to the constructor).
algorithm: The algorithm used by the TextSearch object. For possible values, see the TextSearch() constructor documentation.

TextSearch Object Instance Methods

The TextSearch object has the following methods:

search(text,[start=0,stop=len(text)]): Search for the substring match in text, looking only at the slice [start:stop] and return the slice (l,r) where the substring was found, or (start,start) if it was not found.
find(text,[start=0,stop=len(text)]): Search for the substring match in text, looking only at the slice [start:stop] and return the index where the substring was found, or -1 if it was not found. This interface is compatible with string.find.
findall(text,start=0,stop=len(text)): Same as search(), but return a list of all non-overlapping slices (l,r) where the match string can be found in text.

Note that translating the text before doing the search often results in a better performance. Use string.translate() to do that efficiently.
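
The return conventions of search() and findall() differ from those of find(), which the following emulation built on str.find tries to make explicit. These are free functions written for illustration only, with hypothetical names, and they assume a non-empty match string.

```python
def emulate_search(text, match, start=0, stop=None):
    """Return the slice (l, r) of the first hit, or (start, start) if not found."""
    stop = len(text) if stop is None else stop
    l = text.find(match, start, stop)
    return (start, start) if l < 0 else (l, l + len(match))


def emulate_findall(text, match, start=0, stop=None):
    """Return all non-overlapping slices (l, r) where match occurs."""
    if not match:
        raise ValueError("match must not be empty in this sketch")
    stop = len(text) if stop is None else stop
    slices, pos = [], start
    while True:
        l = text.find(match, pos, stop)
        if l < 0:
            return slices
        r = l + len(match)
        slices.append((l, r))
        pos = r  # resume after the hit so that slices never overlap


hit = emulate_search("abcabc", "bc")
miss = emulate_search("abcabc", "x", 2)
all_hits = emulate_findall("aaaa", "aa")
print(hit, miss, all_hits)
```

Note that a failed search() yields an empty slice at start rather than -1, so callers should test for l == r.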

CharSet Object

The CharSet object is an immutable object which can be used for character set based string operations like text matching, searching, splitting etc.

CharSet objects can be pickled and implement the copy protocol as defined by the copy module as well as the 'in'-protocol, so that c in charset works as expected. Comparisons and hashing are not implemented (the objects are stored by id in dictionaries).

The objects support both 8-bit strings and UCS-2 Unicode in both the character set definition and the various methods. Mixing of the supported types is also allowed. Memory buffers are currently not supported.

CharSet Object Constructor

CharSet(definition)

Create a CharSet object for the given character set definition.

definition may be an 8-bit string or Unicode.

The constructor supports the re-module syntax for defining character sets: "a-e" maps to "abcde" (the backslash can be used to escape the special meaning of "-", e.g. r"a\-e" maps to "a-e") and "^a-e" maps to the set containing all but the characters "abcde".

Note that the special meaning of "^" only applies if it appears as first character in a CharSet definition. If you want to create a CharSet with the single character "^", then you'll have to use the escaped form: r"\^". The non-escape form "^" would result in a CharSet matching all characters.

To add the backslash character to a CharSet you have to escape with itself: r"\\".

Watch out for the Python quoting semantics in these explanations: the small r in front of some of these strings makes them raw Python string literals, which means that no interpretation of backslashes is applied: r"\\" == "\\\\" and r"a\-e" == "a\\-e".
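
The definition syntax described above (ranges, a leading "^" for negation and backslash escapes) can be sketched as a small parser. This is an illustration of the rules as stated in this section, not the package's own parser, and the names parse_charset and charset_contains are hypothetical.

```python
def parse_charset(definition):
    """Return (negated, chars) for a CharSet-style definition string."""
    negated = definition.startswith("^")   # "^" is special only in first position
    body = definition[1:] if negated else definition
    chars, i = set(), 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            chars.add(body[i + 1])         # escaped character is taken literally
            i += 2
        elif i + 2 < len(body) and body[i + 1] == "-":
            # Range such as "a-e": include both end points.
            chars.update(chr(c) for c in range(ord(ch), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.add(ch)
            i += 1
    return negated, chars


def charset_contains(definition, char):
    negated, chars = parse_charset(definition)
    return (char in chars) != negated


print(parse_charset("a-e"))
print(parse_charset(r"a\-e"))
print(charset_contains("^a-e", "z"), charset_contains("^", "x"))
```

The last call shows why the bare definition "^" matches every character: it is read as the negation of the empty set.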

CharSet Object Instance Variables

To provide some help for reflection and pickling the CharSet object gives (read-only) access to these attributes.

definition: The definition string which was passed to the constructor.

CharSet Object Instance Methods

The CharSet object has these methods:

contains(char): Return 1 if char is included in the character set, 0 otherwise.
search(text[, direction=1, start=0, stop=len(text)]): Search text[start:stop] for the first character included in the character set. Returns None if no such character is found or the index position of the found character.
direction defines the search direction: a positive value searches forward starting from text[start], while a negative value searches backwards from text[stop-1].
match(text[, direction=1, start=0, stop=len(text)]): Look for the longest match of characters in text[start:stop] which appear in the character set. Returns the length of this match as integer.
direction defines the match direction: a positive value searches forward starting from text[start] giving a prefix match, while a negative value searches backwards from text[stop-1] giving a suffix match.
split(text[, start=0, stop=len(text)]): Split text[start:stop] into a list of substrings using the character set definition, omitting the splitting parts and empty substrings.
splitx(text[, start=0, stop=len(text)]): Split text[start:stop] into a list of substrings using the character set definition, such that every second entry consists only of characters in the set.
strip(text[, where=0, start=0, stop=len(text)]): Strip all characters in text[start:stop] appearing in the character set.
where indicates where to strip (<0: left; =0: left and right; >0: right).
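
The difference between split() and splitx() can be approximated with the standard re module. The functions below take a non-empty plain string of literal characters instead of a CharSet object, and how the real splitx() treats text that starts or ends with set characters is not stated here, so the empty first or last entries produced by re.split are an assumption of this sketch.

```python
import re


def emulate_split(text, chars):
    """Split at runs of characters from chars, dropping separators and empties."""
    pattern = "[" + re.escape(chars) + "]+"
    return [part for part in re.split(pattern, text) if part]


def emulate_splitx(text, chars):
    """Split so that every second entry consists only of characters from chars."""
    pattern = "([" + re.escape(chars) + "]+)"
    return re.split(pattern, text)  # the capturing group keeps the separators


text = "a, b;c"
plain = emulate_split(text, ", ;")
extended = emulate_splitx(text, ", ;")
print(plain)
print(extended)
```

Joining the splitx-style result with the empty string reproduces the original text, which is what makes that form useful for lossless processing.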

Functions

These functions are defined in the package:


		    tag(text,tagtable,sliceleft=0,sliceright=len(text),taglist=[],context=None)

This is the interface to the Tagging Engine.

text may be an 8-bit string or Unicode. tagtable must be either a Tag Table definition (a tuple of tuples) or a compiled TagTable() object matching the text string type. Tag Table definitions are automatically compiled into TagTable() objects by this function.

Returns a tuple (success, taglist, nextindex), where nextindex indicates the next index to be processed after the last character matched by the Tag Table.

In case of a non match (success == 0), it points to the error location in text. If you provide a tag list it will be used for the processing.

Passing None as taglist results in no tag list being created at all.

context is an optional extension to the Tagging Engine introduced in version 2.1.0 of mxTextTools. If given, it is made available to the Tagging Engine during the scan and can be used for e.g. CallTag.

This function supports keyword arguments.


		    join(joinlist[,sep='',start=0,stop=len(joinlist)])

This function works much like the corresponding function in module 'string'. It pastes slices from other strings together to form a new string.

The format expected as joinlist is similar to a tag list: it is a sequence of tuples (string,l,r[,...]) (the resulting string will then include the slice string[l:r]) or strings (which are copied as a whole). Extra entries in the tuple are ignored.

The optional argument sep is a separator to be used in joining the slices together; it defaults to the empty string (unlike string.join). start and stop define the slice of joinlist the function will work on.

Important Note: The syntax used for negative slices is different than the Python standard: -1 corresponds to the first character *after* the string, e.g. ('Example',0,-1) gives 'Example' and not 'Exampl', like in Python. To avoid confusion, don't use negative indices.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.
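
The joinlist format and the unusual negative index rule can be illustrated with a short emulation. It follows only what is described above (tuples contribute string[l:r], plain strings are copied whole, -1 means the position after the last character); the name emulate_join is hypothetical and the start and stop arguments are omitted.

```python
def emulate_join(joinlist, sep=""):
    """Paste together slices and strings from a joinlist."""
    parts = []
    for entry in joinlist:
        if isinstance(entry, str):
            parts.append(entry)            # plain strings are copied as a whole
            continue
        string, l, r = entry[:3]           # extra tuple entries are ignored
        # Documented rule: -1 is the position *after* the last character.
        if l < 0:
            l += len(string) + 1
        if r < 0:
            r += len(string) + 1
        parts.append(string[l:r])
    return sep.join(parts)


source = "Hello world"
greeting = emulate_join([(source, 0, 5), ", ", (source, 6, 11, "ignored")])
whole = emulate_join([("Example", 0, -1)])
dashed = emulate_join(["a", "b", "c"], sep="-")
print(greeting, whole, dashed)
```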


		    cmp(a,b)

Compare two valid taglist tuples w/r to their slice position. This is useful for sorting joinlists and not much slower than sorting integers, since the function is coded in C.


		    joinlist(text,list[,start=0,stop=len(text)])

Produces a joinlist suitable for passing to join() from a list of tuples (replacement,l,r,...) in such a way that all slices text[l:r] are replaced by the given replacement.

A few restrictions apply, though:

the list must be sorted ascending (e.g. using the cmp() as compare function)
it may not contain overlapping slices
the slices may not contain negative indices
if the taglist cannot contain overlapping slices, you can give this function the taglist produced by tag() directly (sorting is not needed, as the list will already be sorted)

If one of these conditions is not met, a ValueError is raised.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.
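
A minimal sketch of how such a joinlist could be built from sorted, non-overlapping replacements follows. The exact layout of the list returned by the real function is an assumption here (untouched text as (text,l,r) tuples, replacements as plain strings), and the helper names are hypothetical.

```python
def emulate_joinlist(text, replacements, start=0, stop=None):
    """Build a joinlist that replaces each slice text[l:r] with its replacement."""
    stop = len(text) if stop is None else stop
    result, pos = [], start
    for entry in replacements:
        replacement, l, r = entry[:3]
        if l < 0 or r < 0:
            raise ValueError("negative indices are not allowed")
        if l < pos:
            raise ValueError("slices must be sorted and must not overlap")
        if l > pos:
            result.append((text, pos, l))  # untouched text before the slice
        result.append(replacement)
        pos = r
    if pos < stop:
        result.append((text, pos, stop))   # untouched tail
    return result


def render(joinlist):
    """Paste a joinlist together (stand-in for join())."""
    return "".join(e if isinstance(e, str) else e[0][e[1]:e[2]] for e in joinlist)


text = "Hello world"
jl = emulate_joinlist(text, [("Goodbye", 0, 5), ("moon", 6, 11)])
replaced = render(jl)
print(jl)
print(replaced)
```

Because the untouched parts are stored as index tuples, no intermediate substrings are created until the final join.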


		    upper(string)

Returns the string with all characters converted to upper case.

Note that the translation string used is generated at import time. Locale settings will only have an effect if set prior to importing the package.

This function is almost twice as fast as the one in the string module.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.


		    lower(string)

Returns the string with all characters converted to lower case. Same note as for upper().

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.


		    is_whitespace(text,start=0,stop=len(text))

Returns 1 iff text[start:stop] only contains whitespace characters (as defined in Constants/Sets.py), 0 otherwise.

This function can handle 8-bit string or Unicode input.


		    replace(text,what,with,start=0,stop=len(text))

Works just like string.replace() -- only faster since a search object is used in the process.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.


		    multireplace(text,replacements,start=0,stop=len(text))

Apply multiple replacements to a text in one processing step.

replacements must be a list of tuples (replacement, left, right). The replacement string is then used to replace the slice text[left:right].

Note that the replacements do not affect one another w/r to indexing: indices always refer to the original text string.

Replacements may not overlap. Otherwise a ValueError is raised.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.


		    find(text,what,start=0,stop=len(text))

Works just like string.find() -- only faster since a search object is used in the process.

This function can handle 8-bit string and Unicode input.


		    findall(text,what,start=0,stop=len(text))

Returns a list of slices representing all non-overlapping occurrences of what in text[start:stop]. The slices are given as 2-tuples (left,right) meaning that what can be found at text[left:right].

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.


		    collapse(text,separator=' ')

Takes a string, removes all line breaks, converts all whitespace to a single separator and returns the result. Tim Peters will like this one with separator '-'.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.


		    charsplit(text,char,start=0,stop=len(text))

Returns a list that results from splitting text[start:stop] at all occurrences of the character given in char.

This is a special case of string.split() that has been optimized for single-character splitting; it runs 40% faster.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.


		    splitat(text,char,nth=1,start=0,stop=len(text))

Returns a 2-tuple that results from splitting text[start:stop] at the nth occurrence of char.

If the character is not found, the second string is empty. nth may also be negative: the search is then done from the right and the first string is empty in case the character is not found.

The splitting character itself is not included in the two substrings.

This function can handle mixed 8-bit string / Unicode input. Coercion is always towards Unicode.
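The rules for positive and negative nth can be sketched in pure Python like this. The name splitat_sketch and the ValueError for nth=0 are assumptions; the text above does not say how the library treats nth=0.

```python
def splitat_sketch(text, char, nth=1, start=0, stop=None):
    """Split text[start:stop] at the nth occurrence of char (illustrative)."""
    if stop is None:
        stop = len(text)
    if nth == 0:
        raise ValueError("nth must not be 0 in this sketch")
    if nth > 0:
        pos = start - 1
        for _ in range(nth):                 # search from the left
            pos = text.find(char, pos + 1, stop)
            if pos < 0:
                return text[start:stop], ""  # not found: second string empty
    else:
        pos = stop
        for _ in range(-nth):                # search from the right
            pos = text.rfind(char, start, pos)
            if pos < 0:
                return "", text[start:stop]  # not found: first string empty
    # The splitting character itself is left out of both parts.
    return text[start:pos], text[pos + 1:stop]


left_part, right_part = splitat_sketch("a.b.c", ".", -1)
```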


		    suffix(text,suffixes,start=0,stop=len(text)[,translate])

Looks at text[start:stop] and returns the first matching suffix out of the tuple of strings given in suffixes.

If no suffix is found to be matching, None is returned. An empty suffix ('') matches the end-of-string.

The optional 256-character translate string is used to translate the text prior to comparing it with the given suffixes. It uses the same format as the search object translate strings. If not given, no translation is performed and the match is exact. On-the-fly translation is not supported for Unicode input.

This function can handle either 8-bit strings or Unicode. Mixing these input types is not supported.


		    prefix(text,prefixes,start=0,stop=len(text)[,translate])

Looks at text[start:stop] and returns the first matching prefix out of the tuple of strings given in prefixes.

If no prefix is found to be matching, None is returned. An empty prefix ('') always matches.

This function can handle either 8-bit strings or Unicode. Mixing these input types is not supported.
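The "first matching entry wins" behaviour of suffix() and prefix() can be sketched with str.endswith() and str.startswith(). The function names are hypothetical and the optional translate argument is deliberately left out of this sketch.

```python
def suffix_sketch(text, suffixes, start=0, stop=None):
    """Return the first entry of suffixes that ends text[start:stop], or None."""
    window = text[start:stop]
    for candidate in suffixes:
        if window.endswith(candidate):
            return candidate
    return None


def prefix_sketch(text, prefixes, start=0, stop=None):
    """Return the first entry of prefixes that starts text[start:stop], or None."""
    window = text[start:stop]
    for candidate in prefixes:
        if window.startswith(candidate):
            return candidate
    return None


found = suffix_sketch("archive.tar.gz", (".zip", ".gz", ".tar.gz"))
```

Note that the order of the tuple matters: '.gz' is returned here because it comes before '.tar.gz'.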


		    splitlines(text)

Splits text into a list of single lines.

The following combinations are considered to be line-ends: '\r', '\r\n', '\n'; they may be used in any combination. The line-end indicators are removed from the strings prior to adding them to the list.

This function allows dealing with text files from Macs, PCs and Unix origins in a portable way.

This function can handle 8-bit string and Unicode input.


		    countlines(text)

Returns the number of lines in text.

Line ends are treated just like for splitlines() in a portable way.

This function can handle 8-bit string and Unicode input.
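A rough pure-Python equivalent of the portable line handling, built on a regular expression. The names are hypothetical, and dropping the empty string after a trailing line end is an assumption of this sketch rather than documented behaviour.

```python
import re

# '\r\n' must be tried first so that it counts as a single line end.
_LINE_END = re.compile(r"\r\n|\r|\n")


def splitlines_sketch(text):
    """Split text at Mac, PC and Unix line ends, removing the line ends."""
    lines = _LINE_END.split(text)
    if lines and lines[-1] == "":
        lines.pop()   # assumption: a trailing line end does not start a new line
    return lines


def countlines_sketch(text):
    """Count lines using the same line-end rules as splitlines_sketch()."""
    return len(splitlines_sketch(text))


lines = splitlines_sketch("mac\rpc\r\nunix\nlast")
```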


		    splitwords(text)

Splits text into a list of single words delimited by whitespace.

This function is just here for completeness. It works in the same way as string.split(text). Note that CharSet().split() gives you much more control over how splitting is performed. Whitespace is defined as given below (see Constants).

This function can handle 8-bit string and Unicode input.

		    str2hex(text)

Returns text converted to a string consisting of two-character HEX values, e.g. ',.-' is converted to '2c2e2d'. The function uses lowercase HEX characters.

Unicode input is not supported.

		    hex2str(hex)

Returns the string hex interpreted as two-character HEX values converted to a string, e.g. '223344' becomes '"3D'. The function expects lowercase HEX characters by default but can also work with upper case ones.

Unicode input is not supported.
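The same conversions can be reproduced with the standard library's binascii module, which is handy for checking results. This sketch uses Python 3 bytes objects as the stand-in for 8-bit strings; the function names are hypothetical.

```python
import binascii


def str2hex_sketch(data):
    """Convert bytes to a lowercase hex string, two characters per byte."""
    return binascii.hexlify(data).decode("ascii")


def hex2str_sketch(hexstr):
    """Convert a hex string (lower or upper case) back to bytes."""
    return binascii.unhexlify(hexstr)


encoded = str2hex_sketch(b",.-")
decoded = hex2str_sketch("223344")
```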

		    isascii(text)

Returns 1/0 depending on whether text only contains ASCII characters or not.

		    set(string[,logic=1])

DEPRECATED: Use CharSet() instead.

Returns a character set for string: a bit encoded version of the characters occurring in string.

If logic is 0, then all characters not in string will be in the set.

Unicode input is not supported.

		    invset(string)

DEPRECATED: Use CharSet("^...") instead.

Same as set(string,0).

Unicode input is not supported.
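To give an idea of what a "bit encoded" character set can look like, here is a sketch that stores one bit per 8-bit character in a 32-byte string. The exact bit layout (byte index code >> 3, bit code & 7) and the helper names are assumptions made for illustration; the text above does not specify the layout the package uses.

```python
def set_sketch(chars, logic=1):
    """Build a 256-bit set (32 bytes) with one bit per 8-bit character."""
    bits = bytearray(32)
    for code in chars:                      # iterating bytes yields integers
        bits[code >> 3] |= 1 << (code & 7)  # assumed layout, see lead-in
    if not logic:
        # logic=0: invert, so all characters NOT in chars are in the set
        bits = bytearray(b ^ 0xFF for b in bits)
    return bytes(bits)


def in_set_sketch(code, bitset):
    """Test whether the character with ordinal `code` is in the bit set."""
    return bool(bitset[code >> 3] & (1 << (code & 7)))


abc = set_sketch(b"abc")
not_abc = set_sketch(b"abc", 0)   # what invset() is described as returning
```

A membership test is then a single index and mask operation, which is why such sets are fast for 8-bit text and do not extend to Unicode.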

		    setfind(text,set[,start=0,stop=len(text)])

DEPRECATED: Use CharSet().find() instead.

Find the first occurrence of any character from set in text[start:stop]. set must be a string obtained from set().

Unicode input is not supported.

		    setstrip(text,set[,start=0,stop=len(text),mode=0])

DEPRECATED: Use CharSet().strip() instead.

Strip all characters in text[start:stop] appearing in set. mode indicates where to strip (<0: left; =0: left and right; >0: right). set must be a string obtained with set().

Unicode input is not supported.

		    setsplit(text,set[,start=0,stop=len(text)])

DEPRECATED: Use CharSet().split() instead.

Split text[start:stop] into substrings using set, omitting the splitting parts and empty substrings. set must be a string obtained from set().

Unicode input is not supported.

		    setsplitx(text,set[,start=0,stop=len(text)])

DEPRECATED: Use CharSet().splitx() instead.

Split text[start:stop] into substrings using set, so that every second entry consists only of characters in set. set must be a string obtained from set().

Unicode input is not supported.
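The difference between setsplit() and setsplitx() is easiest to see side by side. The sketch below uses a plain string of separator characters in place of a set() bit string, and it assumes that the setsplitx() result always starts with a (possibly empty) run of non-set characters so that every second entry is a run of set characters. Both function names are hypothetical.

```python
from itertools import groupby


def setsplitx_sketch(text, chars, start=0, stop=None):
    """Split into alternating runs: non-set, set, non-set, ..."""
    window = text[start:stop]
    parts = ["".join(group)
             for _, group in groupby(window, key=lambda c: c in chars)]
    # Assumption: keep the alternation aligned when the text starts
    # with a character from the set.
    if window and window[0] in chars:
        parts.insert(0, "")
    return parts


def setsplit_sketch(text, chars, start=0, stop=None):
    """Like setsplitx_sketch, but drop the splitting parts and empty strings."""
    parts = setsplitx_sketch(text, chars, start, stop)
    return [part for part in parts[::2] if part]


with_separators = setsplitx_sketch("a, b,c", ", ")
without_separators = setsplit_sketch("a, b,c", ", ")
```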

TextTools.py also defines some other functions, but these are left undocumented since they may disappear in future releases.

Constants

The package exports these constants. They are defined in Constants/Sets.py.

Note that Unicode defines many more characters in the following categories. The character sets defined here are restricted to ASCII (and parts of Latin-1) only.

a2z: 'abcdefghijklmnopqrstuvwxyz'
A2Z: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
umlaute: 'äöüß'
Umlaute: 'ÄÖÜ'
alpha: A2Z + a2z
german_alpha: A2Z + a2z + umlaute + Umlaute
number: '0123456789'
alphanumeric: alpha + number
white: ' \t\v'
newline: '\n\r'
formfeed: '\f'
whitespace: white + newline + formfeed
any: All characters from \000-\377
*_charset: All of the above as CharSet() objects.
*_set: All of the above as set() compatible character sets.
tagtable_cache: This is the cache dictionary which is used by the TagTable() compiler to store compiled Tag Table definitions. It has a hard limit of 100 entries, but can also be managed by user routines to lower this limit.
BOYERMOORE, FASTSEARCH, TRIVIAL: TextSearch() algorithm values.
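Written out as plain Python, the string constants listed above compose like this. The snippet is a reconstruction from the list, not a copy of Constants/Sets.py; all_chars stands in for the constant named any (renamed here to avoid shadowing the builtin), and is_whitespace_sketch is a hypothetical helper that mirrors the description of is_whitespace().

```python
a2z = 'abcdefghijklmnopqrstuvwxyz'
A2Z = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
umlaute = 'äöüß'
Umlaute = 'ÄÖÜ'
alpha = A2Z + a2z
german_alpha = A2Z + a2z + umlaute + Umlaute
number = '0123456789'
alphanumeric = alpha + number
white = ' \t\v'
newline = '\n\r'
formfeed = '\f'
whitespace = white + newline + formfeed
# Stand-in for `any`: all characters from \000 to \377.
all_chars = ''.join(chr(code) for code in range(256))


def is_whitespace_sketch(text, start=0, stop=None):
    """Return 1 if text[start:stop] contains only whitespace characters, else 0."""
    # An empty slice yields 1 here; the documentation does not say
    # what the library returns in that case.
    return int(all(char in whitespace for char in text[start:stop]))


only_blank = is_whitespace_sketch(" \t\r\n")
```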

Examples of Use

The Examples/ subdirectory of the package contains a few examples of how tables can be written and used. Here is a non-trivial example for parsing HTML (well, most of it):


    from simpleparse.stt.TextTools import *

    error = '***syntax error'			# error tag obj

    tagname_set = set(alpha+'-'+number)
    tagattrname_set = set(alpha+'-'+number)
    tagvalue_set = set('"\'> ',0)
    white_set = set(' \r\n\t')

    tagattr = (
	   # name
	   ('name',AllInSet,tagattrname_set),
	   # with value ?
	   (None,Is,'=',MatchOk),
	   # skip junk
	   (None,AllInSet,white_set,+1),
	   # unquoted value
	   ('value',AllInSet,tagvalue_set,+1,MatchOk),
	   # double quoted value
	   (None,Is,'"',+5),
	     ('value',AllNotIn,'"',+1,+2),
	     ('value',Skip,0),
	     (None,Is,'"'),
	     (None,Jump,To,MatchOk),
	   # single quoted value
	   (None,Is,'\''),
	     ('value',AllNotIn,'\'',+1,+2),
	     ('value',Skip,0),
	     (None,Is,'\'')
	   )

    valuetable = (
	# ignore whitespace + '='
	(None,AllInSet,set(' \r\n\t='),+1),
	# unquoted value
	('value',AllInSet,tagvalue_set,+1,MatchOk),
	# double quoted value
	(None,Is,'"',+5),
	 ('value',AllNotIn,'"',+1,+2),
	 ('value',Skip,0),
	 (None,Is,'"'),
	 (None,Jump,To,MatchOk),
	# single quoted value
	(None,Is,'\''),
	 ('value',AllNotIn,'\'',+1,+2),
	 ('value',Skip,0),
	 (None,Is,'\'')
	)

    allattrs = (# look for attributes
	       (None,AllInSet,white_set,+4),
	        (None,Is,'>',+1,MatchOk),
	        ('tagattr',Table,tagattr),
	        (None,Jump,To,-3),
	       (None,Is,'>',+1,MatchOk),
	       # handle incorrect attributes
	       (error,AllNotIn,'> \r\n\t'),
	       (None,Jump,To,-6)
	       )

    htmltag = ((None,Is,'<'),
	       # is this a closing tag ?
	       ('closetag',Is,'/',+1),
	       # a comment ?
	       ('comment',Is,'!',+8),
		(None,Word,'--',+4),
		('text',sWordStart,BMS('-->'),+1),
		(None,Skip,3),
		(None,Jump,To,MatchOk),
		# a SGML-Tag ?
		('other',AllNotIn,'>',+1),
		(None,Is,'>'),
		    (None,Jump,To,MatchOk),
		   # XMP-Tag ?
		   ('tagname',Word,'XMP',+5),
		    (None,Is,'>'),
		    ('text',WordStart,'</XMP>'),
		    (None,Skip,len('</XMP>')),
		    (None,Jump,To,MatchOk),
		   # get the tag name
		   ('tagname',AllInSet,tagname_set),
		   # look for attributes
		   (None,AllInSet,white_set,+4),
		    (None,Is,'>',+1,MatchOk),
		    ('tagattr',Table,tagattr),
		    (None,Jump,To,-3),
		   (None,Is,'>',+1,MatchOk),
		   # handle incorrect attributes
		   (error,AllNotIn,'> \n\r\t'),
		   (None,Jump,To,-6)
		  )

    htmltable = (# HTML-Tag
		 ('htmltag',Table,htmltag,+1,+4),
		 # not HTML, but still using this syntax: error or inside XMP-tag !
		 (error,Is,'<',+3),
		  (error,AllNotIn,'>',+1),
		  (error,Is,'>'),
		 # normal text
		 ('text',AllNotIn,'<',+1),
		 # end of file
		 ('eof',EOF,Here,-5),
		)

I hope this doesn't scare you away :-) ... it's fast as hell.

Package Structure

[TextTools]
       [Constants]
              Sets.py
              TagTables.py
       Doc/
       [Examples]
              HTML.py
              Loop.py
              Python.py
              RTF.py
              RegExp.py
              Tim.py
              Words.py
              altRTF.py
              pytag.py
       [mxTextTools]
              test.py
       TextTools.py

Entries enclosed in brackets are packages (i.e. they are directories that include an __init__.py file). Ones with slashes are just ordinary subdirectories that are not accessible via import.

The package TextTools imports everything needed from the other components. It is sometimes also handy to do a from simpleparse.stt.TextTools.Constants.TagTables import *.

Examples/ contains a few demos of what the Tag Tables can do.

Optional Add-Ons for mxTextTools

Mike C. Fletcher is working on a Tag Table generator called SimpleParse. It works as a parser-generating front end to the Tagging Engine and converts an EBNF-style grammar into a Tag Table directly usable with the tag() function.

Tony J. Ibbs has started to work on a meta-language for mxTextTools. It aims at simplifying the task of writing Tag Table tuples using a Python-style syntax. It also gets rid of the annoying jump offset calculations.

Andrew Dalke has started work on a parser generator called Martel built upon mxTextTools which takes a regular expression grammar for a format and turns the resulting parse tree into a set of callback events emulating the XML/SAX API. The results look very promising!

Support

eGenix.com is providing commercial support for this package. If you are interested in receiving information about this service please see the eGenix.com Support Conditions.

Copyright & License

This software is covered by the eGenix.com Public License Agreement. The text of the license is also included as file "LICENSE" in the package's main directory.

By downloading, copying, installing or otherwise using the software, you agree to be bound by the terms and conditions of the eGenix.com Public License Agreement.

History & Future

Things that still need to be done:

Add a tag command to match word-in-list. This could also be extended to use multi pattern search objects.
Add a command or feature to allow efficient lookahead. A table will have to be able to return differentiated information about what part of it actually did match. E.g. if the table matches A(B|C|D) and the string is found to match AC, there should be a way for the caller to identify and use that information for further execution. [This should be possible now using the new context object.]
Add a per-call stack and command to manipulate it. This would provide for a way to do recursion without relying on the C stack and also provide a means to implement communication between the different recursive levels (might be of use for the above bullet). [Postponed. The new context object can be used for all such tricks, e.g. by maintaining a list as a stack. More commands which will make use of the context object will eventually appear in future versions.]
Convert some more APIs to use the buffer interface instead of insisting on Python string objects. [Postponed. In fact, in 2.1.0 many APIs which previously supported the buffer interface no longer do. This will probably be fixed in some future version if possible. ]
Use a special list implementation for taglists which resizes in larger chunks (e.g. 1024 entries per realloc()). The current scheme implemented in the standard Python list implementation does way too many realloc()s, slowing down the taglist creation considerably. [This still waits to be done. I believe that a special tag list implementation might be needed for this special application.]

Things that changed from 2.0.3 to 2.1.0:

Version 2.1.0 introduces full Unicode support to mxTextTools and the Tagging Engine. As a result, a few things had to be restructured and modified. Hopefully, the new design decisions will provide more room for future enhancements.

The new version is expected to be nearly 100% backward compatible with previous versions. Where needed, aliases or factory functions were provided to maintain interface compatibility.

Moved command constant definitions from Constants.py to the C extension.
Restructured tag commands and their numbering so that low-level commands come before the special ones. Old tag tables need to be "recompiled" due to this change !
Added a Tag-Table compiler. The tagging engine will now only work with compiled TagTables.
Restored Python 1.5.2 compatibility (all Unicode usages are optional).
Made the Tagging Engine (TE) polymorphic with respect to the underlying data type and created two versions: one for unsigned char and one for Py_UNICODE.
Wrote Tag-Table cache support.
tag() now accepts keyword arguments.
Merged BMS and FS into a new TextSearch object. The used algorithm is now an argument to this single object constructor.
Passing an unknown search object type to the TE is now an error.
Nearly all instances where a SystemError could have been raised now raise an mxTextTools.Error instead.
Removed support for buffer-compatible input objects. This will probably be reintegrated in some future release.
Added new AllInCharSet and IsInCharSet commands.
Implemented Unicode support in search objects using a trivial algorithm. Translation is not supported for Unicode.
Added a huge set of regression tests for all the C APIs and the Tagging Engine.
Fixed a bug in the strip APIs which caused a core dump in situations where the complete string contents would have been stripped. Thanks to Jeffrey Chang for finding this one.
Fixed a bug in the handling of SubTable: the subtaglist entries of the tag table entries pointed recursively to the taglist containing them. This was updated to the documented behaviour of using None for the subtaglist entries.
Added support for a context object which is passed along while processing a tag table with the Tagging Engine.
Added more type casts to the C code to make some pedantic compilers happy (eg. the Mac OS X one).
Fixed a bug found by Simon Cusack in the RTF.py example.
Fixed a bug in tagdict(). Thanks to Joel Rosdahl for reporting this.

Things that changed from 2.0.2 to 2.0.3:

Added isascii().

Things that changed from 2.0.0 to 2.0.2:

Fixed a bug in the Words.py example. Thanks to Michael Husmann for finding this one.
Fixed a memory leak in the CallTag processing.

Things that changed from 1.1.1 to 2.0.0:

Fixed a cast bug in mxTextTools which shows up on Alphas. Thanks to Tony Ibbs for reporting this one.
Changed the semantics of the 'Move' command. It now works relative to the current slice rather than with absolute positions as it did before. As a side effect, you can now easily skip back to the first character in the currently processed text slice (note that the 'Table' commands always work on sub-slices of the text slice passed to the tag() function).
Added constant Constants.TagTables.ToBOF.
Changed some internals producing a slight speedup. Converted some of the functions to use the buffer interface instead of string objects.
Fixed a bug that caused the HTML parsers not to detect empty value definitions, e.g. VALUE="". Found by Felix Thibault.
Added multireplace().
Fixed a bug in the code for SubTableInList: it created a new sub tag list even though it should have used the table's tag list.
Fixed a bug in the CALLARG opcode argument handling code. Thanks to Rod Watterworth for spotting this one.
Fixed a typo in the collapse() keyword parameter: seperator -> separator.
Added LookAhead flag. Thanks to Andrew Dalke for inspiring this flag.
Fixed SubTable and SubTableInList to remove any additions to the taglist in case of an unsuccessful match.
Moved the package under a new top-level package 'mx'. It is part of the eGenix.com mx BASE distribution.

Things that changed from 1.1.0 to 1.1.1:

Added a compile time switch for the type code used in parsing input data for the various APIs dealing with text data. It defaults to "s#" meaning that all objects implementing the getreadbuffer interface are useable; this includes text encoding such as Unicode too, so beware of mixing searching pattern object types and text object types.
Fixed a bugglet in the definition of MatchFail. It should be the constant -20000, not -1. Also, there was a bug in the finishing part of the Tagging Engine: jumps to negative table indices did not result in a 'match fail'. Thanks to Tony J. Ibbs for pointing this out.

Things that changed from 1.0.2 to 1.1.0:

Added MatchFail jump offset.
Added suffix() and prefix().
Fixed the debugging output so that it will print to several .log-files instead of stdout.
Changed the search objects to make them work on any type that supports the buffer protocol, e.g. memory mapped files. The Tagging Engine and the other functions still insist on real Python string objects.
Changed join() to accept any sequence as joinlist, not just Python lists.
Made the two search objects pickleable, copyable and added instance variables .match and .translate.
Added start and stop optional arguments to join().
Added AppendMatch flag.
Added splitlines(), countlines(), str2hex() and hex2str().
Added splitwords().
Added SubTableInList command and compactified the Tagging Engine a bit.
Added setstrip().
Changed the compile time flag MAL_PYTHON to MAL_DEBUG_WITH_PYTHON and hacked up Setup.in a little.

Things that changed from 1.0.1 to 1.0.2:

Fixed some of the undocumented printing functions.
Added Tim.py example for dynamic programming using Tag Tables.
Tuned the Tagging Engine a little more. Added optimizations to TextTools.join(). It is faster than string.join() now (but only accepts real Python lists as input).
Added collapse(). Tim Peters will like this one...
Tuned setsplit, setsplitx and joinlist somewhat. The performance is now comparable to string.split (for tasks producing the same output).
Added charsplit() and splitat().
Fixed a bug in join() that prevented the function from returning '' for empty lists. It raised a SystemError instead.
Added better exception reporting to the tagging engine. Errors are now reported together with the index of the Tag Table entry that caused the exception.
Fixed and reformatted included debugging support. If you want the C engine to be very verbose about what it's doing, compile the engine using '-DMAL_DEBUG -DMAL_PYTHON'. If you run the Python interpreter with '-d' option, the engine will print tons of information to stdout, e.g. "python -d Examples/HTML.py Doc/mxTextTools.html". The engine remains silent without the -d switch.
Added special ThisTable constant to simplify writing recursive Tag Tables.

Things that changed from 1.0.0 to 1.0.1:

Added new functions find() and findall().
Fixed a few quirks that caused compilation problems on Windows. Eliminated the dependency on hack.py in TextTools.py and some of the examples.
Added a compiled Windows PYD-file of the C extension. Thanks to Gordon McMillan for providing it and pointing out a couple of portability bugs.
Added instructions on how to build the C extension under WinXX courtesy of Gordon McMillan.
Added some type casts to make CodeWarrior/Mac happy. Thanks to Just van Rossum for this hint.

Things that changed from the really old TagIt module version 0.7 to mxTextTools 1.0.0:

Added lots of new commands, fixed some bugs, added documentation and wrapped everything into a package.
Added character set handling routines and search objects.

Command	Matching Argument	Action
Fail	Here	Causes the engine to fail matching at the current head position.
Jump	To	Causes the engine to perform a relative jump by `jump_no_match` entries.
AllIn	string	Matches all characters found in `text[x:stop]` up to the first that is not included in string. At least one character must match.
AllNotIn	string	Matches all characters found in `text[x:stop]` up to the first that is included in string. At least one character must match.
AllInSet	set	Matches all characters found in `text[x:stop]` up to the first that is not included in the string set. At least one character must match. Note: String sets only work with 8-bit text. Use `AllInCharSet` if you plan to use the tag table with 8-bit and Unicode text.
AllInCharSet	CharSet object	Matches all characters found in `text[x:stop]` up to the first that is not included in the CharSet. At least one character must match.
Is	character	Matches iff `text[x] == character`.
IsNot	character	Matches iff `text[x] != character`.
IsIn	string	Matches iff `text[x] is in string`.
IsNotIn	string	Matches iff `text[x] is not in string`.
IsInSet	set	Matches iff `text[x] is in set`. Note: String sets only work with 8-bit text. Use `IsInCharSet` if you plan to use the tag table with 8-bit and Unicode text.
IsInCharSet	CharSet object	Matches iff `text[x]` is contained in the CharSet.
Word	string	Matches iff `text[x:x+len(string)] == string`.
WordStart	string	Matches all characters up to the first occurrence of string in `text[x:stop]`. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the first character of string in the found occurrence. At least one character must match.
WordEnd	string	Matches all characters up to the first occurrence of string in `text[x:stop]`. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the last character of string in the found occurrence.
sWordStart	TextSearch object	Same as WordStart except that the TextSearch object is used to perform the necessary action (which can be much faster) and zero matching characters are allowed.
sWordEnd	TextSearch object	Same as WordEnd except that the TextSearch object is used to perform the necessary action (which can be much faster).
sFindWord	TextSearch object	Uses the TextSearch object to find the given substring. If found, the tagobj is assigned only to the slice of the substring. The characters leading up to it are ignored. The head position is adjusted to right after the substring -- just like for sWordEnd.
Call	function	Calls the matching `function(text,x,stop)` or `function(text,x,stop,context)` if a context object was provided to the `tag()` function call. The function must return the index `y` of the character in `text[x:stop]` right after the matching substring. The entry is considered to be matching, iff `x != y`. The engine's head is positioned on `y` in that case.
CallArg	(function,[arg0,...])	Same as Call except that `function(text,x,stop[,arg0,...])` or `function(text,x,stop[,arg0,...],context)` (if a `context` object is used) is being called. The command argument must be a tuple.
Table	tagtable or ThisTable	Matches iff tagtable matches `text[x:stop]`. This calls the engine recursively. In case of success the head position is adjusted to point right after the match and the returned taglist is made available in the subtags field of this table's taglist entry. You may pass the special constant `ThisTable` instead of a Tag Table if you want to call the current table recursively.
SubTable	tagtable or ThisTable	Same as Table except that the subtable reuses this table's tag list for its tag list. The `subtags` entry is set to None. You may pass the special constant `ThisTable` instead of a Tag Table if you want to call the current table recursively.
TableInList	(list_of_tables,index)	Same as Table except that the matching table to be used is read from the `list_of_tables` at position `index` whenever this command is executed. This makes self-referencing tables possible which would otherwise not be possible (since Tag Tables are immutable tuples). Note that it can also introduce circular references, so be warned !
SubTableInList	(list_of_tables,index)	Same as TableInList except that the subtable reuses this table's tag list. The `subtags` entry is set to `None`.
EOF	Here	Matches iff the head position is beyond `stop`. The match recorded by the Tagging Engine is `text[stop:stop]`.
Skip	offset	Always matches and moves the head position to `x + offset`.
Move	position	Always matches and moves the head position to `slice[position]`. Negative indices move the head to `slice[len(slice)+position+1]`, e.g. (None,Move,-1) moves to EOF. `slice` refers to the current text slice being worked on by the Tagging Engine.
JumpTarget	Target String	Always matches, does not move the head position. This command is only used internally by the Tag Table compiler, but can also be used for writing Tag Table definitions, e.g. to follow the path the Tagging Engine takes through a Tag Table definition.
Loop	count	Remains undocumented for this release.
LoopControl	Break/Reset	Remains undocumented for this release.
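
The Call row in the table above describes the contract for user-defined matching functions. A minimal sketch of such a function, together with a small pure-Python helper that applies the "matches iff x != y" rule, could look like this. match_digits and call_entry_matches are hypothetical names; the helper only imitates the rule from the table and does not involve the Tagging Engine itself.

```python
def match_digits(text, x, stop):
    """Matching function: return the index right after the digits at text[x:stop]."""
    y = x
    while y < stop and text[y].isdigit():
        y += 1
    return y


def call_entry_matches(function, text, x, stop):
    """Imitate the Call rule: the entry matches iff the function moved past x."""
    y = function(text, x, stop)
    return (y != x), y


# In a tag table such a function would be referenced by an entry of the
# form ('number', Call, match_digits), following the tuple layout used
# in the examples of this document.
matched, head = call_entry_matches(match_digits, "123abc", 0, 6)
```

Returning x unchanged is how the function signals "no match"; there is no separate failure value.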