On the Making of Martel

  The original idea for this program started in late 1997 when I was
working at Molecular Applications Group in Palo Alto, CA.  I was
developing 'DiscoveryBase,' which was a Perl-based bioinformatics
server for a biotech's intranet.  It provided database searches for
PDB, PIR, SwissProt, GenBank as well as interactive interface to
BLAST, FASTA, DSC and Prosite, with more to be added later.

  We wanted to support three different types of interfaces in
DiscoveryBase.  The first was a transformation to HTML, so
bioinformaticians could see the marked-up version of a database record
or a program output.

  The second was a processed output for use by GeneMine, MAG's
flagship product.  GeneMine had a component which was a specialized
web client for sequence analysis data.  It was what is now called a
"screen scraper" because it automatically went to various web sites to
analyze sequence data then parsed the output HTML to extract the
actual data.

  The third was a programming layer, to allow scientists ready access
to the different database formats and program outputs.  For example,
it would allowed the MAG researches to write an interative BLAST in
less than a dozen lines of code.

  So I developed an event-based parser which could handle all three
different approaches.  I had a line-based tokenizer which would
recognize and label each line of the file.  Each line turned into an
event with two parameters; the event name and the text of the line.

  For example, part of the BLAST events would look like:

    &callback("hit_query", "Query 1  MQENISVTDSYST ...");
    &callback("hit_homology", "+L ++L+IYD K +  ...");
    &callback("hit_subject", "Sbjct 29 LLGQVLEIYDQK ...");

  In addition, I supported what I called "synthetic events", which are
named events but not corresponding to a line of data.  All of these
events started with two underscores, like 

  &callback(" __begin");
  &callback("preheader, "");
  &callback(" __begin_header");
  &callback("program", "BLASTP 2.0.2 [Sep-3-1997]");
  &callback("header", "Reference: Altschul, Stephen F., ...");
  &callback(" __end_header");
  ...

The synthetic events were used to tell when to start and stop
accumultating data, eg, for highlighting which parts of an alignment
pair are identical.  You can see how this eventually became the
matched (?P<group>names), which serves the same purpose.

  I described this parsing style on bionet.software in spring of 1998
(from my ks.uiuc.edu account, if you can dig up the archived mail).

  There were a couple of problems with this approach.  First off,
because the fundamental parsing level was a line, the semantic
extraction and the physical conversion functions sometimes had to
duplicate work, as when parsing database names in the FASTA header
line.

  In general, I also assumed that you would always want to extract
data from the record, so the line lexer part and the semantic parser
(which was stored in a Perl anonymous hash) were one and the same.
The downside with this was that indexing a database was slow because
everything in the database was parsed.

  But even with all those problems, it was a very nice system to use,
and gave us huge flexibility in how we could handle our data.

  While this was all going on, in about February of 1998 I came across
XML and some mention of SAX parsing.  This seemed to be similar to
what I was working on, so I kept in in the back of my mind.

  Later that year I started working for Bioreason.  We needed to parse
data, but only for semantic extraction and not for markup.  So I
didn't write generic parser for them.  On the other hand, I did have
to write a PDB parser -- my 6th or so! -- so I decided I wanted to
stop the need to write any more.

  In summer of 1998, I released UPDB, which read the PDB format
description file (as a text document) and produced PDB parsers for
Python, Perl and Tcl.  It showed that I could write a language neutral
description of a file format and use it with several different
languages.  I announced the early (0.5) version on the PDB mailing
list.  I do know a few people used it because it did provide an
easy-to-understand description of all the cards in the PDB.

  While doing UPDB, I started working on an idea not just for support
for different languages, but different styles of support for the same
language.  Hand-written parsers for a format often skip over data, and
don't verify (for example) that all the places that are supposed to be
blank really are blank.  That is, they assume the file format is
valid.  Machine generated parsers (like from lex/yacc) don't make that
assumption and instead verify every character.  By ignoring some of
those extra checks, it should be possible to generate an even faster
parser automatically.  So I wrote some generator programs which
supported a 'lenient' as compared to 'strict' mode.

  All through this time, I'd been having a conversation with Jeff
Chang (now biopython.org lead - I convinced him to learn Python when
he worked at MAG :) on how to generalize this parsing style.  Our
first model stayed with the line-oriented viewpoint, where you only
need to look ahead a single line (or rarely two) to figure out which
the line was supposed to be.  The lines themselves are arranged in
structures which turn out to be regular grammers, but at the line
token level instead of the word token one.

  Jeff developed this approach further and wrote what is now the
biopython scanner/consumer model.

  I quit Bioreason in fall of 1999 and spent some time working on
PyDaylight.  One of the Daylight packages is Merlin, which is a
chemical database for certain types of searches.  To simplify
interaction, they provide MCL, which is the Merlin Control Language.
I wanted people to migrate to PyDaylight, so amoung other things, I
wrote a translater from MCL to Python code using PyDaylight.  The
first parse was written by hand, but it turns out MCL is implemented
using lex/yacc, so I could rewrite the parser using more powerful
tools.  In this case, John Aycock's SPARK parser system, which I had
heard about in the 1998 Python conference.

  This was a joy to use, and an indicator that there are easier ways
to write parsers than in lex/yacc.  Also, although I didn't recognize
it at the time, one of the things which made it easy was its use of
the re module (perl5 regular expressions).  They are both more
powerful than lex's and more standard.

  At the same time, I was reading through a parsing book (ref???) and
looking for existing parser generatation programs which could do what
I wanted.  One of these was Marc-Andre Lemburg's mxTextTools, which
isn't so much as a parser generator as a system for parsing.  It had
been mentioned on the Python newsgroups as being quite fast, although
I hadn't heard about people using it for much.

  So in December 1999 while visiting my sister's family, I started
learning mxTextTools.  It was confusing at first, but managed to
figure out most of the basics.  One of the things I wanted to do was
support regular expressions, since that is the de facto language for
describing strings.  I started writing an mxTextTools parser for
regular expressions, but stopped part-way through, since making state
tables by hand is rather tedious, and because I didn't like
mxTextTool's lack of support for a state object I could pass around.
Instead, it stored state in global variables.

  I rewrote UPDB in January 2000 while on a beach house in Florida.
It now stores the description in a XML file, which makes it more
accessible to other langauges, and gave me a chance to start learning
XML and SAX.  When I was at the Python conference a few weeks later, I
got Andrew Kuchling to look over my XML code and assure me I was using
it the right way.  I haven't yet released this code, though the XML
file was used to create the PDB 2.1 reader distributed with this
package.

  In Feburary 2000, I worked on PyDaylight some more, and added
support for Thor data trees.  Instead of mapping the tree interface
calls directly to PyDaylight functions, I decided to use a SAX-like
interface to traverse the tree and send appropriate callback events.
This turned out very well -- well enough that I hacked up a TDT to XML
converter and back over lunch break at the MUG conference.

  In the meanwhile, I was thinking about writing this parser system as
some sort of DFA or NFA, since they are equivalent to regular
expressions.  In my mind I saw nodes with transitions for each
recognized line.  Each node would have a list of line descriptions
coming out of them, each pointing to their target node.  In addition,
there were a set of triggers to call when entering or leaving the
node.  These were the same as my synthetic events back at MAG.

  I was going to use regular expressions to build up the DFA, so I
neded an interface for manipulating them.  The easiest one I saw was
Plex, from Greg Ewing.  It provides both a converter from regular
expression to an expression tree, and functions to combine expression
trees into more complicated expressions.  My biggest concern with Plex
was its speed, since it's written in pure Python, unlike mxTextTools
which uses a C extension.

  I wrote up a set of requirements for myself and mailed to Jeff to
review and sent a copy to Greg asking if Plex was appropriate.  This
was around the end of March.  I got encouragement, but it didn't look
like Plex was appropriate because of performance.

  So I decided to start writing a C extension.  I knew that Fredrick
Lundh was writing the new sre module for Python 1.6, so I started
looking at the code.  One thing which made me very happy was that the
regular expression parsing was done in 'sre_parse.py' - a Python
module rather than C code.  This let me hack up a regular expression
converter very easily, for a prot.  Looking at my back email logs, my first
mention of using named groups was before 12 Apr 2000 because that's
when I asked Fredrick if there was any way to store a list of all
group positions with a given name, rather than only the last.

  On 24 Apr 2000 I sent email to Fredrick saying
> What I'm currently planning on doing is have a front-end which is a
> merger of regular expressions and the Plex interface, and use it to
> generate an mxTextTools parser for the back-end.
so that's a good date for when things started to settle down.

  I looked over the sre code and it seemed rather simple, so I figured
I could write my own if needed.  I really thought I needed to since I
wanted to have "early" events, that is, have the beginElement event
message sent if there was no chance of backtracing, even if the
endElement hadn't yet been matched.  I had thrown out the idea of
synthetic events since all I really wanted was paired begin/end block
indicators.

  About this point I decided to ask for help, so I sent messages to
comp.lang.python and comp.compilers outlining what I was looking for.
I also started digging through JavaCC and ANTLR, but that didn't pan
out very well for my requirements.

  I was confident enough that I sent in an abstract to the
Bioinformatics Open Source Conference 200 about my having code to
present at the conference, which was on Aug 17-18th.

  While this was all happening, I became a consultant, so I didn't
have much free time for working on this project.  August rolls along,
so I started working on it.  I'd spent several days over at
Daylight/Metaphorics bugging Roger Sayle about regular expression
parsing, since that's what he did for his graduate work.  It turns out
(while comparing notes on Prosite interpreters) that he knows a lot
about DFAs but very little about grouping, non-greedy repeats,
lookahead assertions and other features of modern regular expression
engines.  That's partially because you can't implement some of those
features in a DFA and hence modern regular expressions aren't really
regular.

  So I start working on DFA approaches - drawing more and more
complicated diagrams on the board to handle grouping and backtraces.
Finally, I give up (which is good, since it can't be done with a DFA,
accoring to Friedl's Understanding Regular Expressions book) and start
thinking about the NFA approach.  That's much easier, but I needed to
figure out how to do backtracking.  I had the idea of using a state
table to describe the motions of an abstract machine.  A state table
to describe the motions of an abstract machine...

  I thought about that for a bit and realized I really was thinking of
mxTextTools and that faster implemenations really weren't easily
available.  So that evening I hacked up something to take the output
of sre_parse.parse from 1.6b1 and make a table for mxTextTools.  And
it worked.  I posted my success to Python's string-SIG on the morning
of Aug 2nd.

  There were some problems with it.  The code really was a hack, and
required you to build up a single large string to sre_parse.parse.  If
there was a bug in the expression, it was really hard to track down,
and no easy way to break it apart to test its components.  I turned it
into a package, started a rewrite to have an Expression tree, and
added more support for items like lookahead assertions.  On the 8th I
was going to be vising the Theoretical Biophysics Group at UIUC so
offered to give an informal talk on my work to date.  I was able to
get a pretty complete SwissProt parser done, with enough to show
working details and some very rough timing numbers, which were rather
better than I had hoped.

  I worked on bug fixes and added PDB and BLAST support over the
weekend and a Plex-like front end; releasing the 0.1 version on the
14th.  Talked about it some at Daylight and started adding MDL CT file
support, which needs a repeat count based on a previously matched
field (like "(?P<count>\d+)\n(.*\n){count}").

  Wrote the poster on the 16th.  Presented it at BOSC the 17th and
18th where it was generally well received.  Took a couple of days off
from coding.  Added group references and the experimental {count}
option on the 21st.  Released version 0.2 on the 22nd, which happens
to be my 30th birthday present to the world.  :)