On the Making of Martel The original idea for this program started in late 1997 when I was working at Molecular Applications Group in Palo Alto, CA. I was developing 'DiscoveryBase,' which was a Perl-based bioinformatics server for a biotech's intranet. It provided database searches for PDB, PIR, SwissProt, GenBank as well as interactive interface to BLAST, FASTA, DSC and Prosite, with more to be added later. We wanted to support three different types of interfaces in DiscoveryBase. The first was a transformation to HTML, so bioinformaticians could see the marked-up version of a database record or a program output. The second was a processed output for use by GeneMine, MAG's flagship product. GeneMine had a component which was a specialized web client for sequence analysis data. It was what is now called a "screen scraper" because it automatically went to various web sites to analyze sequence data then parsed the output HTML to extract the actual data. The third was a programming layer, to allow scientists ready access to the different database formats and program outputs. For example, it would allowed the MAG researches to write an interative BLAST in less than a dozen lines of code. So I developed an event-based parser which could handle all three different approaches. I had a line-based tokenizer which would recognize and label each line of the file. Each line turned into an event with two parameters; the event name and the text of the line. For example, part of the BLAST events would look like: &callback("hit_query", "Query 1 MQENISVTDSYST ..."); &callback("hit_homology", "+L ++L+IYD K + ..."); &callback("hit_subject", "Sbjct 29 LLGQVLEIYDQK ..."); In addition, I supported what I called "synthetic events", which are named events but not corresponding to a line of data. All of these events started with two underscores, like &callback(" __begin"); &callback("preheader, ""); &callback(" __begin_header"); &callback("program", "BLASTP 2.0.2 [Sep-3-1997]"); &callback("header", "Reference: Altschul, Stephen F., ..."); &callback(" __end_header"); ... The synthetic events were used to tell when to start and stop accumultating data, eg, for highlighting which parts of an alignment pair are identical. You can see how this eventually became the matched (?Pnames), which serves the same purpose. I described this parsing style on bionet.software in spring of 1998 (from my ks.uiuc.edu account, if you can dig up the archived mail). There were a couple of problems with this approach. First off, because the fundamental parsing level was a line, the semantic extraction and the physical conversion functions sometimes had to duplicate work, as when parsing database names in the FASTA header line. In general, I also assumed that you would always want to extract data from the record, so the line lexer part and the semantic parser (which was stored in a Perl anonymous hash) were one and the same. The downside with this was that indexing a database was slow because everything in the database was parsed. But even with all those problems, it was a very nice system to use, and gave us huge flexibility in how we could handle our data. While this was all going on, in about February of 1998 I came across XML and some mention of SAX parsing. This seemed to be similar to what I was working on, so I kept in in the back of my mind. Later that year I started working for Bioreason. We needed to parse data, but only for semantic extraction and not for markup. So I didn't write generic parser for them. On the other hand, I did have to write a PDB parser -- my 6th or so! -- so I decided I wanted to stop the need to write any more. In summer of 1998, I released UPDB, which read the PDB format description file (as a text document) and produced PDB parsers for Python, Perl and Tcl. It showed that I could write a language neutral description of a file format and use it with several different languages. I announced the early (0.5) version on the PDB mailing list. I do know a few people used it because it did provide an easy-to-understand description of all the cards in the PDB. While doing UPDB, I started working on an idea not just for support for different languages, but different styles of support for the same language. Hand-written parsers for a format often skip over data, and don't verify (for example) that all the places that are supposed to be blank really are blank. That is, they assume the file format is valid. Machine generated parsers (like from lex/yacc) don't make that assumption and instead verify every character. By ignoring some of those extra checks, it should be possible to generate an even faster parser automatically. So I wrote some generator programs which supported a 'lenient' as compared to 'strict' mode. All through this time, I'd been having a conversation with Jeff Chang (now biopython.org lead - I convinced him to learn Python when he worked at MAG :) on how to generalize this parsing style. Our first model stayed with the line-oriented viewpoint, where you only need to look ahead a single line (or rarely two) to figure out which the line was supposed to be. The lines themselves are arranged in structures which turn out to be regular grammers, but at the line token level instead of the word token one. Jeff developed this approach further and wrote what is now the biopython scanner/consumer model. I quit Bioreason in fall of 1999 and spent some time working on PyDaylight. One of the Daylight packages is Merlin, which is a chemical database for certain types of searches. To simplify interaction, they provide MCL, which is the Merlin Control Language. I wanted people to migrate to PyDaylight, so amoung other things, I wrote a translater from MCL to Python code using PyDaylight. The first parse was written by hand, but it turns out MCL is implemented using lex/yacc, so I could rewrite the parser using more powerful tools. In this case, John Aycock's SPARK parser system, which I had heard about in the 1998 Python conference. This was a joy to use, and an indicator that there are easier ways to write parsers than in lex/yacc. Also, although I didn't recognize it at the time, one of the things which made it easy was its use of the re module (perl5 regular expressions). They are both more powerful than lex's and more standard. At the same time, I was reading through a parsing book (ref???) and looking for existing parser generatation programs which could do what I wanted. One of these was Marc-Andre Lemburg's mxTextTools, which isn't so much as a parser generator as a system for parsing. It had been mentioned on the Python newsgroups as being quite fast, although I hadn't heard about people using it for much. So in December 1999 while visiting my sister's family, I started learning mxTextTools. It was confusing at first, but managed to figure out most of the basics. One of the things I wanted to do was support regular expressions, since that is the de facto language for describing strings. I started writing an mxTextTools parser for regular expressions, but stopped part-way through, since making state tables by hand is rather tedious, and because I didn't like mxTextTool's lack of support for a state object I could pass around. Instead, it stored state in global variables. I rewrote UPDB in January 2000 while on a beach house in Florida. It now stores the description in a XML file, which makes it more accessible to other langauges, and gave me a chance to start learning XML and SAX. When I was at the Python conference a few weeks later, I got Andrew Kuchling to look over my XML code and assure me I was using it the right way. I haven't yet released this code, though the XML file was used to create the PDB 2.1 reader distributed with this package. In Feburary 2000, I worked on PyDaylight some more, and added support for Thor data trees. Instead of mapping the tree interface calls directly to PyDaylight functions, I decided to use a SAX-like interface to traverse the tree and send appropriate callback events. This turned out very well -- well enough that I hacked up a TDT to XML converter and back over lunch break at the MUG conference. In the meanwhile, I was thinking about writing this parser system as some sort of DFA or NFA, since they are equivalent to regular expressions. In my mind I saw nodes with transitions for each recognized line. Each node would have a list of line descriptions coming out of them, each pointing to their target node. In addition, there were a set of triggers to call when entering or leaving the node. These were the same as my synthetic events back at MAG. I was going to use regular expressions to build up the DFA, so I neded an interface for manipulating them. The easiest one I saw was Plex, from Greg Ewing. It provides both a converter from regular expression to an expression tree, and functions to combine expression trees into more complicated expressions. My biggest concern with Plex was its speed, since it's written in pure Python, unlike mxTextTools which uses a C extension. I wrote up a set of requirements for myself and mailed to Jeff to review and sent a copy to Greg asking if Plex was appropriate. This was around the end of March. I got encouragement, but it didn't look like Plex was appropriate because of performance. So I decided to start writing a C extension. I knew that Fredrick Lundh was writing the new sre module for Python 1.6, so I started looking at the code. One thing which made me very happy was that the regular expression parsing was done in 'sre_parse.py' - a Python module rather than C code. This let me hack up a regular expression converter very easily, for a prot. Looking at my back email logs, my first mention of using named groups was before 12 Apr 2000 because that's when I asked Fredrick if there was any way to store a list of all group positions with a given name, rather than only the last. On 24 Apr 2000 I sent email to Fredrick saying > What I'm currently planning on doing is have a front-end which is a > merger of regular expressions and the Plex interface, and use it to > generate an mxTextTools parser for the back-end. so that's a good date for when things started to settle down. I looked over the sre code and it seemed rather simple, so I figured I could write my own if needed. I really thought I needed to since I wanted to have "early" events, that is, have the beginElement event message sent if there was no chance of backtracing, even if the endElement hadn't yet been matched. I had thrown out the idea of synthetic events since all I really wanted was paired begin/end block indicators. About this point I decided to ask for help, so I sent messages to comp.lang.python and comp.compilers outlining what I was looking for. I also started digging through JavaCC and ANTLR, but that didn't pan out very well for my requirements. I was confident enough that I sent in an abstract to the Bioinformatics Open Source Conference 200 about my having code to present at the conference, which was on Aug 17-18th. While this was all happening, I became a consultant, so I didn't have much free time for working on this project. August rolls along, so I started working on it. I'd spent several days over at Daylight/Metaphorics bugging Roger Sayle about regular expression parsing, since that's what he did for his graduate work. It turns out (while comparing notes on Prosite interpreters) that he knows a lot about DFAs but very little about grouping, non-greedy repeats, lookahead assertions and other features of modern regular expression engines. That's partially because you can't implement some of those features in a DFA and hence modern regular expressions aren't really regular. So I start working on DFA approaches - drawing more and more complicated diagrams on the board to handle grouping and backtraces. Finally, I give up (which is good, since it can't be done with a DFA, accoring to Friedl's Understanding Regular Expressions book) and start thinking about the NFA approach. That's much easier, but I needed to figure out how to do backtracking. I had the idea of using a state table to describe the motions of an abstract machine. A state table to describe the motions of an abstract machine... I thought about that for a bit and realized I really was thinking of mxTextTools and that faster implemenations really weren't easily available. So that evening I hacked up something to take the output of sre_parse.parse from 1.6b1 and make a table for mxTextTools. And it worked. I posted my success to Python's string-SIG on the morning of Aug 2nd. There were some problems with it. The code really was a hack, and required you to build up a single large string to sre_parse.parse. If there was a bug in the expression, it was really hard to track down, and no easy way to break it apart to test its components. I turned it into a package, started a rewrite to have an Expression tree, and added more support for items like lookahead assertions. On the 8th I was going to be vising the Theoretical Biophysics Group at UIUC so offered to give an informal talk on my work to date. I was able to get a pretty complete SwissProt parser done, with enough to show working details and some very rough timing numbers, which were rather better than I had hoped. I worked on bug fixes and added PDB and BLAST support over the weekend and a Plex-like front end; releasing the 0.1 version on the 14th. Talked about it some at Daylight and started adding MDL CT file support, which needs a repeat count based on a previously matched field (like "(?P\d+)\n(.*\n){count}"). Wrote the poster on the 16th. Presented it at BOSC the 17th and 18th where it was generally well received. Took a couple of days off from coding. Added group references and the experimental {count} option on the 21st. Released version 0.2 on the 22nd, which happens to be my 30th birthday present to the world. :)