@** Release history. \font\df=cmbx12 \def\date#1{{\medskip\noindent\df #1\medskip}} \parskip=1ex \parindent=0pt \date{Release 0.1: November 2002} Initial release. \date{Release 1.0: February 2003} First production release. @** Development log. \date{2002 August 28} Created development tree and commenced implementation. \date{2002 September 1} Release 0.1 circulated for review. \date{2002 September 6} Added the ability to compute descriptive statistics of the dictionary built by parsing the \.{--mail} and \.{--junk} folders, using the facilities of the \.{statlib.w} program. Statistics are written to standard output. Added a \.{--plot} option to plot a histogram of words in a newly parsed dictionary (not a lookup dictionary loaded with \.{--read}). Creating the plot requires the \.{GNUPLOT} and \.{PBMPlus} utilities to be installed. \date{2002 September 7} Well, after a huge amount of hunkering down and twiddling, parsing of MIME multi-part messages and decoding of parts encoded in \.{Base64} and \.{Quoted-Printable} encoding now seems to be working. This drastically improves the quality of parsing, particularly for junk where these forms of encoding are used as ``stealth'' to evade other content-based filters. \date{2002 September 8} Added the ability to read mail folders compressed with \.{gzip} or other compressors detected by the Autoconf script. This saves a lot of space when you're keeping large training archives around. This will work only on systems with suitable decompressors and the |popen| facility. \date{2002 September 9} Added the \.{--pdiag} option to write the parser diagnostics to a designated file. Previously this was controlled by a gnarly |#define|. Added a ``\.{X-Annoyance-Filter-Decoder}'' line to the \.{--pdiag} output to indicate the activation of decoders (including the sink) for MIME parts in the message. These lines are not seen by the token parser. 
Fixed a bug in parsing of tokens including ISO accented characters$\ldots$signed characters strike again. \date{2002 September 10} Added a \.{--ptrace} option to include the actual tokens parsed as indented, quoted lines following each line of parser input in the \.{--pdiag} file. Added code to |classifyMessage| which appends lines to the message header in the \.{--pdiag} file giving the aggregate junk probability and the most significant words and their individual probabilities. Separated the mail and junk thresholds, which may now be set independently by the \.{--threshjunk} and \.{--threshmail} options. The \.{--classify} command now writes ``\.{INDT}'' (for ``indeterminate'') if a message falls between the two thresholds and exits with a return status of 4. Added the \.{--binwrite} and \.{--binread} options to export and import a |dictionary| as a portable (assuming IEEE floating point on all platforms) binary file. This will permit easier distribution of dictionary databases and may be faster to load than the |lookupDictionary|. Added the \.{--clearjunk} and \.{--clearmail} options to clear counts of junk and mail. This can be used, in conjunction with the \.{--binwrite} option, to prepare databases for use by folks who do not wish to prepare their own. \date{2002 September 11} Added the ability to enforce minimum and maximum length constraints on tokens returned by |tokenParser|. The limits are set to accept tokens from 1 to 255 characters in the |tokenParser| constructor, and may be changed at any time with the |setTokenLengthLimits| method. Note that the length limits are not reset by a call to |setSource|. Set the default token parser length limits to accept tokens between 1 and 64 characters. This will doubtless be the subject of yet more command line options before long. Modified the code which decides whether a mail folder is compressed to check for the argument being a symbolic link. 
If so, the link target is tested for the extension indicating a compressed file. I only follow links one level---if this poses a problem, your life is probably too complicated. Fixed computation of probability to avoid crashes if no words are present in a category. Probabilities don't make any sense in such circumstances, but you may wish to create such a database for use with \.{--binread}. Added logic to |dictionary::exportToBinaryFile| and the corresponding binary import method to save and restore the count of messages contributing to the dictionary in the |messageCount| array in a pseudo-word called ``\.{\ COUNTS\ }'' (obligatorily) at the start of the dictionary. These counts are required should we need to recompute the probability subsequent to loading the dictionary. Added the \.{--newword} and \.{--sigwords} options to specify the probability given to words in a message which don't appear in the dictionary and the number of ``most significant'' words whose probabilities are used to determine the aggregate probability a given message is junk. \date{2002 September 12} Added logic to cope with the body of a message being encoded in a \.{Content-Transfer-Encoding}. While processing the header, this and the \.{Content-Type} are parsed as in MIME headers, with their arguments saved in |bodyContentType|, |bodyContentTypeCharset|, and |bodyContentTransferEncoding|. At the end of the header, if a |bodyContentTransferEncoding| has been specified, the values are transferred to the corresponding |mime|$\ldots$ variables and |multiPart| is set with an end terminator of the null string. The latter disables the decoder's test for a part end sentinel and the warning for an unterminated part. Messages with \.{Subject} lines which contain ISO~8859 encoded characters employ a form of \.{Quoted-Printable} encoding to permit these characters to appear in a mail header where only 7 bit ASCII is permitted. 
I added code to |mailFolder| to detect these lines and call a new |decodeEscapedText| method of |quotedPrintableMIMEdecoder| to decode them if properly formed. This will permit parsing of ISO subject lines, which may prove critical in discriminating among messages with very short body copy. Yikes! As far as I can determine from the RFCs, what we're supposed to do with continued header lines is just concatenate them, discarding all white space on the continuation even if this runs together tokens on adjacent lines. At least, if you don't do this, encoded words split across continued \.{Subject} lines end up with nugatory white space in the middle. So, I fixed |@| to ``work this way''. Given our definition of tokens, it's likely to fix more things than it breaks anyway. Added documentation to the \.{CWEB} file for yesterday's new options. \date{2002 September 13} \.{Subject} lines can, of course, also contain sequences encoded in \.{Base64}, tagged with a ``\.{?B?}'' following the {\it charset} specification. Added decoding of these sequences, along with the requisite |decodeEscapedText| method of |base64MIMEdecoder|. Made a slight revision to the definition of tokens in the |tokenParser|. While ``\.{-}'' and ``\.{'}'' continue to be considered part of a token if embedded within it, they can no longer be the first or last characters of a token. This improves recognition of words in typical text, based on tests against the big collection. A new |not_at_ends| array of |bool| is used to define which characters may not begin or end a token. Completely rewrote how the |tokenParser| determines character types in parsing for tokens. Previously, characters were classified by looking them up in a collection of global arrays of |bool|. 
To permit changing the definition of a token on the fly, I defined a new class, |tokenDefinition|, which collects together the lookup tables which determine which characters constitute a token and indicate the sets of characters (if any) which cannot exclusively make up a token and which cannot be the first or last character of a token. In addition, the minimum and maximum acceptable length for tokens are stored and methods permit testing all of these quantities. You can initialise the values as you wish with the methods provided, or use pre-defined initialiser functions for ISO-8859 and ASCII alphanumeric sets. Well, let's declare this a red banner day for the \PRODUCT! No, you're not dreaming$\ldots$we're actually ending this day with {\it fewer} command line options than those which greeted the dawn, and the whole concept of the ``lookup dictionary'' has been banished, along with snowdrifts of prose in the documentation explaining the difference between a ``dictionary'' and a ``lookup dictionary'' and the things you could or couldn't do with, or to, them respectively. The original idea was that you work with |dictionary| objects when assembling the database of mail and junk, and then export the results as a lean and mean lookup dictionary which could be loaded like lightning to classify subsequent messages. Well, it turns out that if you use binary I/O for the |dictionary|, it's just as fast as loading the lookup dictionary, and all of the confusion is eliminated. Further, the user is thereby encouraged to keep a dictionary on hand which can be updated at any time to incorporate new examples of mail and junk. This is all much more the Bayesian spirit of eternal refinement than settling on a probability set without subsequent refinement. Since the lookup dictionary is no more, there's no need to distinguish the |dictionary| read and write commands as binary. 
Hence, the \.{--binread} and \.{--binwrite} options have been renamed \.{--read} and \.{--write}, freed up by the lookup dictionary elimination. \date{2002 September 14} The direct concatenation of multiple-line header items added a couple of days ago broke |@| thanks to fat-fingered character counting in the recognition of sentinels. I fixed this, and modified the code to perform all parsing on a canonicalised string to avoid case sensitivity problems. Note that the \.{boundary} itself {\it is} and must remain case sensitive. Fixed some \.{gcc -Wall} natters which had crept in since the option was accidentally removed by \.{autoconf}. Added the ability to read a |mailFolder| from standard input. If the |fname| argument to the constructor is ``\.{-}'', |cin| is used as the input stream. Renamed the \.{--csv} option \.{--csvwrite} in keeping with nefarious plans soon to be disclosed, and added a pseudo ``\.{\ COUNTS\ }'' word to the start of the CSV file giving the number of mail and junk messages in the dictionary as is done in binary dictionary dumps. Changed the sort order for the CSV file so that words with identical probabilities are sorted into lexical order. Added a \.{--csvread} option to import a dictionary from a CSV file in the format created by \.{--csvwrite}. The CSV file is {\it added} to the existing in-memory dictionary; multiple \.{--csvread} and \.{--read} commands may be used to assemble a dictionary. The CSV file imported need not be sorted in any particular order and may contain comments whose first nonblank character is ``\.{;}'' or ``\.{\#}''. In the process, I found and fixed a bug in updating the message counts which applied to both \.{--csvread} and the existing \.{--read} code, but which only manifested itself when loading multiple dictionaries. Wheels within wheels$\ldots$MIME \.{multipart} messages can, of course, be nested. 
You can be blithely parsing your way through a message when you trip over a part with a \.{Content-type} of ``\.{multipart/alternative}'', which pushes a new part boundary onto the stack, to be popped when the end sentinel of that nested section is encountered. What fun. We consequently introduce a new |partBoundaryStack| to keep track of the nested part boundary sentinels, along with all of the defensive code needed to cope with the realities of real world mail. \date{2002 September 15} Loosened up the test for \.{multipart} \.{Content-type} so that ``\.{multipart/related}'' types will be recognised. Added the long-awaited \.{--transcript} option. (Thanks, Kern, for suggesting it!) A transcript of the input message for a \.{--test} or \.{--classify} operation is written to the argument file name (standard output if the argument is ``\.{-}''), with \.{X-Annoyance-Filter-Junk-Probability} and \.{X-Annoyance-Filter-Classification} items appended to the header indicating the calculated junk probability and classification according to the thresholds. Finished the first cut of multiple byte character set decoders and interpreters. A {\it decoder} scans the mail body (encoded or not), and parses the byte stream into logical characters up to 32 bits in width. An {\it interpreter} expresses these characters in a form suitable for analysis. Ideographic languages are typically interpreted as one word per character, other languages as one letter per character. These components must, of course, be utterly bullet-proof as they will be subjected to every possible kind of garbage in the course of parsing real-world mail. At the moment, we have decoders for EUC and Big5, and interpreters for GB2312 and Big5. Added a decoder for EUC-encoded Korean (\.{euc-kr}) as an example of how to handle an alphabetic language with a non-Western character set. 
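
As an illustration of the decoder half of the decoder and interpreter split described above, here is a minimal sketch in \CPP/. It is not the program's own code: the function name |decodeEUCLine| and the |sawGarbage| flag are hypothetical, and it assumes the simplified two-byte EUC layout (as used by EUC-CN and EUC-KR) in which both bytes of a multi-byte character lie in the range \.{0xA1} to \.{0xFE}. Bytes are handled as |unsigned char| throughout, so sign extension cannot corrupt the comparisons.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch, not the program's actual decoder: turn one line of
// two-byte EUC text into logical characters of up to 32 bits each.
// Assumption: lead and trail bytes both lie in 0xA1..0xFE (EUC-CN, EUC-KR).
std::vector<std::uint32_t> decodeEUCLine(const std::string &line,
                                         bool &sawGarbage) {
    std::vector<std::uint32_t> chars;
    sawGarbage = false;
    auto isEUC = [](unsigned char b) { return b >= 0xA1 && b <= 0xFE; };

    for (std::size_t i = 0; i < line.size(); ++i) {
        unsigned char b1 = static_cast<unsigned char>(line[i]);
        if (b1 < 0x80) {                 // plain ASCII passes straight through
            chars.push_back(b1);
            continue;
        }
        if (isEUC(b1) && i + 1 < line.size()) {
            unsigned char b2 = static_cast<unsigned char>(line[i + 1]);
            if (isEUC(b2)) {             // well-formed two-byte character
                chars.push_back((static_cast<std::uint32_t>(b1) << 8) | b2);
                ++i;                     // consume the trail byte as well
                continue;
            }
        }
        sawGarbage = true;               // malformed sequence: stop decoding
        break;
    }
    // Every logical character consumed at least one input byte.
    assert(chars.size() <= line.size());
    return chars;
}
```

An interpreter would then take each 32-bit value and decide, for example, whether it stands alone as a word (ideographic text) or is one letter of a longer token.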
\date{2002 September 16} Modified |EUC_MBCSdecoder| to discard the balance of any encoded line in which an invalid EUC second byte is encountered. After encountering such garbage, the rest of the line is usually junk and there's no profit in blithering through it. Added logic to scan \.{application} binary byte streams for possible embedded tokens. The new \.{--binword} option sets the length of the shortest sequence of contiguous ASCII alphanumeric characters or dollar signs (with possible embedded hyphens and apostrophes, but not permitting these characters at the start or end of a token) which is recognised as a token; the default is 5 characters, which is a tad more discriminating than the \UNIX/ \.{strings} program, which defaults to 4 printable characters. You can disable the scanning of binary streams entirely by setting \.{--binword} to zero. Scanning binary streams might seem to be a curious endeavour, but it's highly effective at percolating text embedded in viruses and worm attachments to junk mail to the top of the junk probability hit parade, then screening them out when they arrive in incoming mail. Although the \.{Subject} line is the most important, any line in a mail header may actually contain quoted sequences specifying a character set and \.{Quoted-Printable} or \.{Base64} encoded characters. I modified |@| to no longer restrict decoding to the subject line. Once decoded, if the \.{charset} specification in a header line quoted sequence is a character set we understand, it is now decoded and interpreted. \.{ISO-8859} sets of all flavours are decoded but not processed further. Fixed a few \.{gcc -Wall} quibbles in |tokenDefinition| which popped up on the Solaris compiler but didn't seem to perturb the almost identical version of \.{gcc} on Linux. 
Modified the \.{--test} option so that if the \.{--transcript} option has been previously specified with standard output as the destination (``\.{-}''), the junk probability is not written to standard output at the end of the transcript. 
\date{2002 September 17} The \.{Base64} decoder could hang if one of the lines it was decoding contained white space. Fixed. Added logic to detect and discard header items which begin with our own |Xfile| sentinel. This shouldn't happen in the normal course of things, but somebody may try to spoof a downstream filter by sending mail which contains a sentinel purporting to be our classification of its legitimacy. Deleting our own header items also allows us to process our own transcripts containing them and reproduce the same results as if they hadn't been added. Cleaned up the horrific |@| section which ``jes' grew'' in |mailFolder::nextLine| as more and more complexities were cranked in to MIME part decoding, multiple byte character sets, parsing ASCII strings out of binary data streams, etc. \date{2002 September 18} Cleaned up documentation of command line options, clarifying that they are logically commands which must be specified in the order in which they are to be executed. In the process, I added an example of invoking \PRODUCT\ as a pre-processor for a mail sorting program such as \.{Procmail} to the ``Quick and dirty user guide''. Added a new \.{\PRODUCT -run} shell script to execute the program in default filter mode with the executable and dictionary installed in the default ``\.{\$HOME/.\PRODUCT}'' directory. Oh, you haven't heard about that$\ldots$well, stay tuned$\ldots$details in the next episode. Incremental refinement of the \.{README} and \.{INSTALL} files, with many keystrokes to go before we put these documents to sleep. Added \.{--verbose} tell-tales for the \.{--plot} and \.{--statistics} options. Replaced the \.{\PRODUCT.1} manual page with a cop-out which directs the esteemed reader to the PDF program documentation. This thing is changing so rapidly that the last thing I need is to maintain four copies of the bloody command line option documentation. 
{\it Four?} Think about it: the program (\.{CWEB}), its embedded \.{--help} option text, a Web page (nonexistent at the moment, thank Bob), and a manual page. Keeping all four simultaneously in sync is something which could appeal only to an accountant. I'm a programmer, not an accountant---I drink their blood, but I don't do their work. The code which discards header lines we've generated attempted to remove lines from the transcript even when no transcript was being generated, for example, when adding a message we'd previously processed to the \.{--mail} or \.{--junk} database. This caused a |NULL| pointer reference in |@|---fixed. Hours of patient, unremunerated toil cleaning up \.{Makefile.in} to bash things into a distributable form. I added an \.{install} target which installs the program in the default \.{\$HOME/.\PRODUCT} directory, creating a customised \.{run} program (\.{\PRODUCT -run} in the build directory) which supplies the home directory which \.{sendmail} doesn't. Massive clean-up of \.{Makefile.in}, yielding a template which is far more generic for our next foray into software land. \date{2002 September 19} Further testing revealed that the segmentation fault in |dictionary::purge| which I thought I fixed a week or so ago was still lurking to bite the unwary soul whose dictionary contained a large number of words eligible for purging. As far as I can determine, when you |erase| an item from a |set|, not only does the iterator argument to the |erase| become invalid, in certain cases (but not always), an iterator to the {\it previous} item, which was not erased, becomes invalid, leading to perdition when you attempt to pick up the scan for purgable words from that point. After a second tussle with |remove_if|, no more fruitful than the last (for further detail, see the |dictionary::purge| implementation), I gave up and rewrote |purge| to resume the scan from the {\it start} of the |set| every time it erases a member. 
This may not be efficient, but at least it doesn't crash! In circumstances where a large percentage of the dictionary is going to be purged, it would probably be better to scan for contiguous groups of words eligible for purging, then |erase| them with the flavour of the method which takes a start and end iterator, but given how infrequently \.{--purge} is likely to be used, I don't think it's worth the complication. In a fit of false economy, I accidentally left the door open to the possibility that with an improbable albeit conceivable sequence of options we might try to classify a message without updating the probabilities in the dictionary to account for words added in this run. I added calls on |updateProbability()| in the appropriate places to guarantee this cannot happen. The only circumstance in which this will result in redundant computation of probabilities is while building dictionaries, and the probability computation time is trivial next to the I/O and parsing in that process. In the normal course of events the vast majority of runs of the program will load a single dictionary and use it to classify a single message. Since we've guaranteed that the probabilities will always be updated before they're written to a file, there's no need to recompute the probabilities when we're only importing a single dictionary. I added a check for this and optimised out the probability computation. When merging dictionaries with multiple \.{--read} and/or \.{--csvread} commands, the probability is recomputed after adding words to the dictionary. If you used a dictionary in which rare words had not been removed with \.{--purge} to classify a message, you got screwball results because the $-1$ probability used to flag rare words was treated as if it were genuine. 
It occurred to me that folks building a dictionary by progressive additions might want to keep unusual words around on the possibility they'd eventually be seen enough times to assign a significant probability. I fixed |@| to treat words with a probability of $-1$ as if they had not been found, thus simulating the effect of a \.{--purge}. Minor changes were also required to CSV import to avoid confusion between rare words and the pseudo-word used to store message counts. Note that it's still more efficient to \.{--purge} the dictionary you use on classification runs, but if you don't want to keep separate purged and unpurged dictionaries around, you don't need to any more. Added a new \.{--annotate} option, which takes an argument consisting of one or more single character flags (case insensitive) which request annotations to be added to the \.{--transcript}. The first such flag is ``\.{w}'', which adds the list of words and probabilities used to rank the message in the same form as included in the \.{--pdiag} report. To avoid duplication, I broke the code which generates the word list out into a new |addSignificantWordDiagnostics| method of |classifyMessage|. Added a ``\.{p}'' annotation which causes parser diagnostics to be included in the \.{--transcript}. This gets rid of all the conditional compilation based on |PARSE_DEBUG| and automatically copies the diagnostics to standard error if |verbose| is set. Parser diagnostics are reported with the |reportParserDiagnostic| method of |mailFolder|; other classes which report errors do so via a pointer to the |mailFolder| they're acting on behalf of. Well, my sleazy reset to the beginning trick for |dictionary| |purge| really was intolerably slow for real world dictionaries. I pitched the whole mess and replaced it with code which makes a |queue| of the words we wish to leave in the dictionary, then does a |clear| on the dictionary and re-|insert|s the items which survived. 
This is simple enough to entirely avoid |map| iterator hooliganism and runs like lightning, albeit using more memory. Break out the champagne! The detestable |MIME_DEBUG| conditional compilation is now a thing of the past, supplanted by a new ``\.{d}'' \.{--annotate} flag. No need to recompile every time you're inclined to psychoanalyse a message the parser spit up. Added a |name| method to |MIMEdecoder| and all its children, then took advantage of that to dispense with the horrific duplication of decoder diagnostic code in |@|. What was previously dispersed among the several branches of the decoder activation is now collected together in a single case after the decoder has been chosen. Modified \.{Makefile.in} to delete the fussy \.{core.}{\it process} files Linux has taken to producing. Fixed \.{configure.in} to specify \.{-Wall} if we're building with GCC. \date{2002 September 20} On Solaris, GCC is prone to hang if invoked with \.{-O2} (at least as of version 2.95.3). I twiddled the \.{configure.in} to change the compile option to \.{-O} for Solaris builds. \.{ctangle} and \.{cweave} spewed copious warnings on a GCC \.{-Wall} build. To avoid modifying these programs, which are perfectly compliant ANSI C, I changed \.{Makefile.in} to suppress the \.{-Wall} option for them when the compiler is detected as GCC. \.{make dist} didn't do a \.{make distclean} before generating the distribution archive, which could result in build-specific files being included in the archive. Fixed. \date{2002 September 21} Added documentation on how to integrate \PRODUCT\ into a \.{.forward} pipeline to \.{Procmail}, and build a \.{.procmailrc} rule set for typical user-level filtering. It's 03:40 and I'm going to get some sleep before proofing this text---at the moment it's something between a random scribble and a first draft. Okay, I just couldn't {\it stand it}$\ldots$I just {\it had} to take another crack at the infernal |dictionary::purge| method. 
One of the many bees in my bonnet buzzed the idea into my ear that I could avoid both the extra memory consumption of yesterday's scheme and the risk of instability in the container by testing the probability of the first item in the |map|, adding it to the |queue| of survivors if its probability is significant, then performing an |erase(begin())|. Cool, huh? No iterators, no mess, no two copies of any word in memory. The hits just keep on coming$\ldots$the stupid built-in purge in |dictionary::resetCat| also ran afoul of the ``stale iterator'' problem. I blew it away---henceforth, it's up to you to do a \.{--purge} after a \.{--clearmail} or \.{--clearjunk}. With the new tolerance for un-purged dictionaries, no great harm will be done if you forget. Added a \.{\\subsection} macro to create subheads within documentation sections. The section number is automatically grabbed from the \.{cwebmac.tex} definition, but lower level numbering is manual, permitting you to add additional levels of hierarchy with a specification like:\hfill\break \.{\\subsection\{4.2.1\}\{Twiddling little details\}}. It turns out that all the cheesy mess I put in to patch the user's home directory into the \.{\PRODUCT -run} script wasn't necessary after all since \.{sendmail} is kind enough to change to the user's home directory before piping a message to a program. This means we can just \.{cd} to \.{.\PRODUCT} relative to the home directory. This also means one can remove the absolute path name from the \.{.forward} file, which cleans up the documentation on integration with \.{Procmail}. Added a rather tacky \.{check} target to the \.{Makefile.in} to serve as a ``sanity check'' that doesn't require an extensive training database. The scheme is to train the program with the source code for \.{\PRODUCT .w} serving as the mail collection and \.{statlib.w} the junk bucket. Then those programs themselves are tested, and the transcripts verified to confirm they were correctly classified. 
Astute observers will ask where I get off using something which isn't a well-formed mail folder to train the program. Well, it works thanks to a gimmick I put into the probability calculation to keep it from dividing by zero if one or both of the message counts were zero. That keeps anything untoward from happening when we're missing message headers, and the difference in the word content of the two files is so extreme that they reliably score correctly. Added a new Perl gizmo, \.{TestFolder/testfolder.pl}, which walks through a mail folder, breaks out each message, and passes it through \PRODUCT\ to obtain the probability and classification. (The \PRODUCT\ command is defined by a string within the Perl program, so you can modify it as you wish to evaluate the effects of other settings.) At the end of the folder, the total message count, number of messages scored as junk and mail, and the mean probability of messages in the folder are printed. Added a ``back'' command to \.{SplitMail/splitmail.pl}. As you walk through a mail folder, the start address of each message you've seen is kept in a stack. The ``\.{b}'' command pops the stack and backs up to the previous message. This should reduce the pain when you're sorting a folder and accidentally hit ``\.{d}'' when you meant to save the message somewhere. You can even go back after a search operation. Moved the \.{splitmail.pl} and \.{testfolder.pl} from their own dedicated directories into a new \.{utilities} directory which \.{Makefile.in} includes in the archive. If and when these utilities require common code, such as the CSV parser, it will be easier to manage them all in the same directory. Added help, requested by the ``\.{?}'' key, to \.{splitmail.pl} at both the disposition and the ``more'' prompt while viewing message text. If you assign additional folder destinations to disposition keys, they are automatically included in the help output. 
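
The \.{--binword} scan of binary streams described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the program: the name |scanBinaryTokens| is hypothetical, the rule for joiners (hyphen and apostrophe allowed inside a run but trimmed from its ends) is my reading of the description, and the default of 5 is taken from the text.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the program's own code): pull candidate tokens
// out of an arbitrary byte string. A minLength of zero disables the scan.
std::vector<std::string> scanBinaryTokens(const std::string &data,
                                          std::size_t minLength = 5) {
    std::vector<std::string> tokens;
    if (minLength == 0) return tokens;

    // Token body: ASCII alphanumerics and the dollar sign.
    auto isBody = [](unsigned char c) {
        return c < 0x80 && (std::isalnum(c) || c == '$');
    };
    // Joiners may be embedded in a token but may not begin or end it.
    auto isJoiner = [](unsigned char c) { return c == '-' || c == '\''; };

    std::string run;
    auto flush = [&]() {
        while (!run.empty() &&
               isJoiner(static_cast<unsigned char>(run.back())))
            run.pop_back();                       // trim trailing joiners
        if (run.size() >= minLength) tokens.push_back(run);
        run.clear();
    };

    for (unsigned char c : data) {
        if (isBody(c))
            run += static_cast<char>(c);
        else if (isJoiner(c) && !run.empty())     // never start with a joiner
            run += static_cast<char>(c);
        else
            flush();                              // any other byte ends the run
    }
    flush();
    assert(run.empty());                          // flush always resets the run
    return tokens;
}
```

Each token returned would then be handed to the dictionary exactly like a word parsed from message text; runs shorter than the limit are silently dropped.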
Now that \.{splitmail.pl} is equipped with a ``back'' mechanism, there's no reason not to interpret a void disposition as a request to advance to the next message---if it's a fat-finger, just go back. Trolling through a target-sparse folder can now be done at the expense of only one keystroke per message. \date{2002 September 22} Went ahead and added code to dereference symbolic links up to 50 deep when deciding whether files are \.{gzip} compressed in |mailFolder|. What the heck, it's the solstice (well, it was a couple of hours ago) and the full Moon to boot---better to write silly code than trying to balance eggs on their little ends! Much work on the documentation today, but little on the code. Slowly the python peristalsis moves us toward release. \date{2002 September 23} We're off to see the lizard, the wonderful lizard of WIN32\null! Naturally, all of our carefully crafted code to set up pipelines to decompress dictionaries evaporated under the harsh sun of WIN32\null. I added conditional compilation to disable everything that incompetent empire self-defined by its own {\it limes} and rusty Gates doesn't comprehend. Building for WIN32 with DJGPP resulted in a natter about comparison of the |size_type| of a |multimap| to an |unsigned int|. The Linux compiler accepted this without a quibble. I added a |static_cast| to clear up the confusion. OK, it built on WIN32 with DJGPP 2.953 and even passed the rudimentary tests I threw at it. So, I copied the executable back to the development directory, then discovered and fixed numerous bugs in the archive creation code in \.{Makefile.in} when the WIN32 distribution is enabled. Got better. A Zipped WIN32 build is now posted in the Web directory and linked to from the home page. The \.{configure.in} script didn't check for the \.{-lm} math library. This somehow managed to work on Linux and Solaris, but failed on FreeBSD. I added the necessary \.{AC\_CHECK\_LIB} macro. (Reported by Neil Darlow). 
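
The bounded chase of symbolic links added on September 22 might look something like the sketch below. The function name |chaseLinks|, the fixed 4096 byte buffer, and the handling of relative link targets are all assumptions made for illustration; only the POSIX |readlink| call and the depth limit of 50 come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical sketch, not the program's own code. Follows symbolic links
// starting at path, replacing it with the final target. Returns false if
// maxDepth links were followed without reaching the end of the chain,
// which most likely indicates a loop.
bool chaseLinks(std::string &path, int maxDepth = 50) {
    assert(maxDepth > 0);
    char target[4096];                 // assumed large enough for a link body
    for (int depth = 0; depth < maxDepth; ++depth) {
        ssize_t n = readlink(path.c_str(), target, sizeof target);
        if (n <= 0) return true;       // not a link (or unreadable): finished
        std::string next(target, static_cast<std::size_t>(n));
        if (next[0] != '/') {          // relative target: resolve against the
            std::string::size_type slash = path.rfind('/');   // link's directory
            if (slash != std::string::npos)
                next = path.substr(0, slash + 1) + next;
        }
        path = next;
    }
    return false;                      // gave up: probable loop
}
```

Once the chain has been resolved, the final name can be tested for the extension indicating a compressed file; on systems without |readlink| the whole function would simply be compiled out.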
Fixed several typos in the documentation of |computeJunkProbability| and reformatted the formula as a stacked fraction so it fits better on the page. Added logic to \.{configure.in} to test for the presence of the \.{system} function and the \.{gnuplot} and \.{ppmtogif} utilities required by the \.{--plot} option. If any of them is missing, the option will be disabled when the program is compiled. Added a test to \.{configure.in} for the presence of \.{readlink} and disabled the code that chases symbolic links in file name arguments if it's absent. I also added a ``probable loop'' warning if this code exceeds the maximum link depth limit. Added a configurator test for the presence of \.{popen} and code to disable the ability to read compressed files if it's not present. This allowed me to remove the special case for \.{WIN32} I added last night to build on \.{DJGPP}---it's now subsumed into the test for \.{popen}. Designed this version as ``Release Candidate 1'' and indicated this by setting |VERSION| to \.{"0.1-RC1"}. Proofed the program documentation and the formatting of the code listing and fixed numerous typos and infelicitous layout. Defined \.{-t} as a shortcut single-letter option for \.{--test} and \.{-r} as a shortcut for \.{--read}. Release 0.1-RC1. \date{2002 September 24} Hugh Daniel took a look at the program and had many comments and suggestions. Until otherwise noted, the following items result from them. Corrected ``vertical interlace'' terminology in the document to ``vertical retrace''. I'm forever screwing that one up. Renamed \.{--purge} to \.{--prune}, which is a more precise (and less intimidating) description of what it does. For the moment, \.{--purge} is still accepted to ease the transition. Fixed the \.{check} target in \.{Makefile.in} to use \.{--prune}. Added the hideous logic to \.{Makefile.in} to report overall pass/fail status for the \.{check} target. Clarified the infectuous nature of the GPL in \.{COPYING}. 
While I was at it, I added information about the public domain status of DCDFlib. Okay, back to self-generated items$\ldots$\,. Changed the \.{--plot} option to use \.{pnmtopng} to generate the plot in PNG format instead of GIF. Release 0.1-RC2. \date{2002 September 26} Added the ability to treat a directory as a mail folder consisting of messages in individual files in the directory. The contents of the directory are simply logically concatenated and are not restricted to one message per file---they may be \UNIX/ mail folders in their own right. I wasted a huge amount of effort trying to do this in an ultra-clean \CPP/ fashion by defining an |idirstream| flavour of |istream| which returns the concatenated contents of files in a directory. I got {\it that close}, but couldn't make it work with the |getline| function for |string| without stooping to ugliness and making assumptions about the guts of the |iostream| package I believed unwarranted. This dead end is why you see no log entries for yesterday. So, I ripped all that out and simply added logic to |mailFolder| to detect when it's passed a directory and wrap a loop traversing the directory around the main input loop; when end of file is encountered and we're traversing a directory, we look for the next file and commence processing it, declaring a genuine end of file only at the end of the directory. This interacts in an interesting way with the MIME decoders. Recall that they are passed the actual |istream| from which the |mailFolder| normally reads and take charge of it until the end of the encoded section is reached. I added {\it no} logic to them specific to directory traversal---when they hit the end of the stream, they declare a missing terminator at the end of the section and bail out. But that's {\it good}---we don't want a missing terminator to gobble up the contents of a subsequent file in the directory folder.
(Although if each file begins with a ``\.{From\ }'' line, it will cause the decoder to bail out.) This way, it's only after arriving back from the decoder that we detect we're at the end of a file in the directory and progress to the next item, if any, in the directory. Yes, all of this is conditional on the presence of |opendir| and |stat|, which are required to detect and traverse the directory; the whole mess goes away if \.{configure.in} doesn't detect them. Yes, files in the directory may be compressed. And, yes, files in the directory may be symbolic links to compressed files. But no, you can't recursively traverse directories; directories within a directory folder are simply ignored, which nicely avoids a special case for ``\.{.}'' and ``\.{..}''. In the process of putting in all this junk, I discovered that the existing code for decompressing mail folders failed to call |pclose| to close out the pipeline, which is unkind. I added a destructor which makes sure it's called when necessary. Added a new \.{fragmail.pl} program to the \.{utilities} directory. It splits up a monolithic mail folder into a directory with one message per file, making up file names from the message sequence in the input folder. Added a new \.{signatures} target to \.{Makefile.in} which creates \pdfURL{GnuPG}{http://www.gnupg.org/} signatures for each of the downloadable files and added a command to the \.{publish} target which copies them to the distribution directory. Added code to \.{configure.in} to test for the presence of \.{pdftotext}, which we will eventually use to crack PDF files. Let's be realistic, however. This is cool (and will open the door to a general application specific binary file cracker, which I've been itching to do), but in terms of the mission statement of \PRODUCT\ and present day junk mail, is far from important.
I've found precisely one PDF file in each of my mail and junk archives, so with a plane to catch tomorrow, I'm not going to stay up any later tonight worrying about refinements of this kind. Release 0.1-RC3. \date{2002 September 29} Added logic to \.{Makefile.in} to prepare an HTML version of \.{man} page automatically from the \PRODUCT\.{.1} \.{troff} file. The output will require fixup since it is intended to be run from a CGI script, but should eliminate much of the duplication of labour inherent in maintaining parallel documentation in HTML and \.{man} page format. \date{2002 October 1} Expanded documentation of command line options in conjunction with preparation of a manual page using the \.{docutil/options.pl} translator. Added ``{\mc USAGE}'', ``{\mc EXIT~STATUS}'', and ``{\mc FILES}'' sections to the manual page; all of these are specific to the man page and are not derived from \.{\PRODUCT.w}. \date{2002 October 2} Much work yesterday and today on automating the generation of documentation from the \.{CWEB} source file. I wrote a Perl program, \.{docutil/options.pl} to compile the options documentation from \.{\PRODUCT.w} into \.{troff} format with the \.{-man} macros. Actually, although containing special cases for the options, this is reasonably general and may be deployed for other common documentation in the future. The output from \.{man2html} has some infelicitous links and formatting for HTML intended to be shipped with the product and included on its Web page. I wrote a Perl hack, \.{docutil/fixman2html.pl}, to correct these items, and modified the \.{Makefile.in} targets to generate a first draft HTML in \.{\PRODUCT\_man\_raw.html}, which is post-processed by the fixup program into the final \.{\PRODUCT\_man.html} file, which is now included in the distribution by the \.{dist} target and copied to the Web directory by \.{publish}, both of which targets generate it if necessary. 
Added a \.{mantroff} target to \.{Makefile.in} to preview the \.{troff} format manual page using ``\.{groff\ -X}'' (if available on the system---if not, don't do that). Wrote a \.{docutil/cwebextract.pl} Perl program which searches a \.{CWEB} file for a named section (which can be a regular ``\.{@@}'' section, so long as the search target appears on the same line as the ``\.{@@}''). If the section is found (matching is case insensitive and the search target given on the command line matches the first line containing a substring which it matches), the contents of the documentation section are written to standard output, trimming leading and trailing blank lines. The end of the documentation section is the next line which begins with an at sign or the end of file. Moved the \TeX\ definitions used to generate the options list to the top of \.{\PRODUCT.w} so they don't confuse the automatic extraction and translation process. Modified \.{docutil/cwebtex2man.pl} to ignore \TeX\ \.{\bslash bigskip} commands, carefully avoiding generating a nugatory \.{.PP} in the \.{troff} output due to two consecutive blank lines once the command has been ignored. Added the \.{docutil} directory and its contents to the distribution generation target in \.{Makefile.in}. Generation of the ``{\mc OPTIONS}'' section of the \.{\PRODUCT.1} manual page from the corresponding section of \.{\PRODUCT.w} is now completely {\bf Turbo~Digital}$^{\rm TM}$. The invariant parts of the manual page are now defined in the ``manual page macro'' file \.{\PRODUCT.manm}. The \.{Makefile.in} now understands that \.{\PRODUCT.1} is generated by processing this file with \.{docutil/manm\_expand.pl} which expands \.{\bslash"\%include} statements in the macro file by extracting the specified section from the named \.{CWEB} file with \.{docutil/cwebextract.pl}, translating it into manual page \.{troff} with \.{docutil/cwebtex2man.pl}, and inserting it in the output file in place of the include statement.
This completely eliminates all manual labour when updating the options in the manual page and guarantees that changes to the option documentation in \.{\PRODUCT.w} are propagated to the manual page document. The same mechanism can be used for other common documentation as the need arises. \date{2002 October 3} Subtly obfuscated the E-mail address to which bugs should be reported in the manual page so the process of transforming it into HTML won't result in a deadly \.{mailto:} link or a sniffable address in the page. Visual fidelity for human readers is maintained. Updated the Web document to reflect the existence of the HTML manual page and added links to it. Added a reference to the PDF document to the ``{\mc SEE~ALSO}'' section of \.{\PRODUCT.manm}. Fixed an embarrassing hyphenation of a file name by prefixing the offending word with the \.{troff} ``don't hyphenate'' escape ``\.{\bslash\%}''. (Apparently, even in \.{nh} mode, \.{troff} will hyphenate a word which contains an embedded hyphen unless you explicitly forbid it.) Added the \.{.w} files to the \.{winarch.zip} archive used to transfer files to build for Win32. While they aren't strictly required, they're awfully handy to have should you encounter compile errors, which are reported with line numbers from the \.{CWEB} file. Looking it up while on Windows and patching the \CPP/ file is a lot quicker than booting back into a real operating system to explore the problem. In |@| there was an erroneous reference to |dirFolder| not conditional on |HAVE_DIRECTORY_TRAVERSAL|---fixed. The |mailFolder| constructor which accepts a file name in a |string| re-used the |ifstream| |isc|, which was previously used only when reading compressed files. This caused compile errors on systems where |COMPRESSED_FILES| was not defined. We now unconditionally define |isc| in the |mailFolder| class definition. With these fixes, the \.{makew32.bat} build on Win32 now works once again. 
Added a \.{testw32.bat} file which runs a rudimentary test of the Win32 build similar to the \.{check} target in \.{Makefile.in}. I added this file to both the \.{dist} and \.{winarch} archive generation targets in \.{Makefile.in}. Modified \.{Makefile.in} to replace the hard-coded \.{/ftp/\PRODUCT} destination with a \.{PUBDEST} declaration at the top of the file which defaults to the same directory. This permits overriding the default publication destination for use at another site or for nondestructive testing of new releases simply by editing the \.{Makefile}. Some day, it might make sense to permit overriding this with an option at \.{./configure} time, but this is not that day. Release 0.1-RC4. \date{2002 October 11} Integrated the application string parsers for Flash and PDF formats, which were developed in a separate stand-alone test program. These include the classes |applicationStringParser| (mother of all application parsers), |flashStream|, |flashTextExtractor|, and |pdfTextExtractor|, the latter compiled in only if all the utilities it needs to decode PDF via a pipe to \.{pdftotext} are present. At the moment, these aren't hooked up to the mail folder, but are merely exercised by code in the \.{--jig}. Integrated Knuth and Levy's \.{CWEB} version 3.64 in the \.{cweb} directory. The \.{CWEAVE} and \.{CTANGLE} programs are built with a change file, \.{common-bigger.ch} which increases the input line length limit to 400 characters as I did in the earlier 3.63 release. Added plumbing to invoke Flash and PDF parsers for attachments with those application types. Thanks to the inability to take a class member function as an unqualified function pointer, this is somewhat tacky, requiring a pointer to the |mailFolder| to obtain decoded data. \date{2002 October 12} Added decoders and interpreters for Shift-JIS and Unicode (UCS-2, UTF-8, and UTF-16 encodings). 
These are used to decode and interpret these character sets in Flash animations whose fonts are so tagged. Added logic to invoke the new Unicode UTF-8 decoder when a MIME part's \.{charset=} designates it so encoded. \date{2002 October 13} In the process of testing UTF-8 decoding of Unicode messages, I stumbled over a bug in ignoring HTML comments embedded within tokens, a common trick in junk mail to evade na\"\i ve filters, for example, ``\.{remove\ yourself}''. (Yes, I know a valid HTML comment is supposed to contain a space after the initial and before the final sentinel, but junk mail often violates this rule, counting on sloppy browsers not to enforce the standard, so we must comply in the interest of ``seeing what the user would''.) HTML comments are now completely discarded, even when embedded within tokens. The \.{dist} target in \.{Makefile.in} failed to clean the \.{cweb} directory before including it in the source archive, which could have the result of leaving objects and binaries not compatible with the system on which the user is installing. I modified the target to descend into the \.{cweb} directory and \.{make\ clean}. This promptly ran into another problem because the \.{CWEB} \.{Makefile} deletes the \CEE/ source for \.{CWEAVE}, using the bootstrapped \.{CTANGLE} to re-build it. This is clean, but runs afoul of my rebuilding both programs directly in the outer \.{Makefile}. I saved the original \.{CWEB} makefile as \.{Makefile.ORIG} and modified the \.{clean} target in the actual \.{Makefile} to leave \.{cweave.c} around. I also modified our own \.{clean} target to clean the \.{cweb} directory as well. Attempting to build \.{.dvi} or \.{pdf} targets after you'd cleaned the \.{cweb} directory failed for lack of \.{cweave}; I added a dependency to \.{Makefile.in} to ensure it's rebuilt when needed. 
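To make the HTML comment rule from earlier in this entry concrete, here is a minimal sketch of discarding comments that are embedded within tokens. It is not the program's parser; |stripHtmlComments| is a hypothetical stand-alone function written for illustration. It assumes a comment runs from ``\.{<!--}'' to the next ``\.{-->}'' with no requirement for spaces inside the sentinels, and it makes the arbitrary choice of dropping the rest of the input when a comment is never closed.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: return `text` with every "<!-- ... -->" span
// removed, so a token split by a comment is rejoined into one word.
// An unterminated comment discards the remainder of the input; that
// is a choice made for this sketch, not a documented behaviour.
std::string stripHtmlComments(const std::string &text) {
    static const std::string open = "<!--";
    static const std::string close = "-->";
    std::string out;
    std::string::size_type pos = 0;

    while (pos < text.size()) {
        std::string::size_type start = text.find(open, pos);
        if (start == std::string::npos) {
            out.append(text, pos, std::string::npos);   // no more comments
            break;
        }
        out.append(text, pos, start - pos);             // text before the comment
        std::string::size_type end = text.find(close, start + open.size());
        if (end == std::string::npos) {
            break;                                      // unterminated comment
        }
        pos = end + close.size();                       // resume after the comment
    }
    assert(out.size() <= text.size());                  // stripping never adds text
    return out;
}
```

With this in hand, a string in which comments interrupt the words ``remove'' and ``yourself'' comes back as the plain two-word phrase, which is what a reader of the rendered message would see.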
Since certain recent versions of \.{gcc} libraries have begun to natter if \CPP/ include files specify the \.{.h} extension (which, for years, was {\it required} by those self-same libraries), I eliminated them from our list of includes, which finally seems to work on \.{gcc} 2.96. Doubtless this will torpedo somebody using an earlier version. Broke up the unreadably monolithic list of include files into sections which explain what's what. {\bf Dooooh!} Forgot to disable the declaration of the |pdfTextExtractor| in |mailFolder| when \.{HAVE\_PDF\_DECODER} was not defined, which was the undoing of the Win32 build; fixed. Release 0.1-RC5. \date{2002 October 19} Added a check in |classifyMessages| to verify that a dictionary has been loaded before attempting to classify a message. If no dictionary is present, a warning is written to standard error and the junk probability is returned as 0.5. Added a warning if command line arguments are specified after a \.{--classify} command. Since this command always exits with an exit code indicating the classification, specifying subsequent arguments is always an error. Added a bunch of consistency checking for combinations of options which don't make any sense and suggest the user doesn't understand in which order they should be specified. To facilitate this, I modified the code for the \.{--classify} option to set a new |lastOption| flag to bail out of the option processing loop and set |exitStatus| to the classification rather than exiting directly before the option consistency checks are performed. This cleans up the control structure in any case. In the process of adding the above code, I discovered that the |any()| method of |bitset| seems to be broken in the \.{glibc} which accompanies \.{gcc} 2.96. I tested |count()| against zero and that seems to work OK. Implemented phrase tokens.
You can consider phrases of consecutive tokens as primitive tokens by specifying the minimum and maximum words composing a phrase with the \.{--phrasemin} and \.{--phrasemax} options. These default to 1 and 1, which suppresses all phrase-related flailing around. If set otherwise, tokens are assembled into a queue and all phrases within the length bounds are emitted as tokens. How well this works is a research question we may now address with the requisite tool in hand. \date{2002 October 20} Added code to import a binary dictionary file with the \.{--read} option using memory-mapped I/O if \.{./configure} detects that facility and defines \.{HAVE\_MMAP}. This isn't a big win on individual runs of the program, but if you're installing it on a high volume server, multiple read-only references to the dictionary file (be sure to make the file read-only, by the way) can simply bring the file into memory where it is re-used by multiple instances of the program. (Of course, if the system has an efficient file system cache, that may work just as well, but there's no harm in memory mapping in any case.) Thanks to the \CPP/ theologians who deprecated the incredibly useful |strstream| facility, which is precisely what you need to efficiently access a block of memory mapped data as a stream, I included a copy of the definition of this facility in \.{mystrstream.h} so we don't have to depend on the \CPP/ library providing it. I was a little worried about writing phrases in CSV format without quoting the fields, but I did an experiment with Excel and discovered it doesn't quote such fields either---it only uses quotes if the cell contains a comma or a quote (in which case it forces the quote by doubling it). Since our token definition doesn't permit either a comma or a quote within a token, we're still safe. \date{2002 October 21} Added a \.{--phraselimit} option to discard phrases longer than the specified limit on the fly.
This prevents dictionary bloat due to ``phrases'' generated by concatenation of gibberish from headers and strings decoded from binary attachments. These will usually be eliminated by a \.{--prune}, but that doesn't help if the swap file's already filled up with garbage phrases before reaching the end of the mail folder. The default \.{--phraselimit} is 0, which imposes no limit on the length of phrases. \date{2002 October 22} When the default |getNextEncodedLine| of a |MIMEdecoder| encountered the ``\.{From\ }'' line of the next message in a mail folder, it failed to store the line as the part boundary, which in turn caused |mailFolder| to mis-count the number of messages in a folder being parsed when training. I fixed this, and in the process re-wrote an archaic \CEE/ string test used in |@| to use a proper \CPP/ |string| comparison. Corrected some ancient URLs in \.{README}, and added information on the SourceForge project there and in \.{annoyance-filter.manm}. Release 0.1-RC6. \date{2002 October 23} Modified \.{docutil/fixman2html.pl} to include an absolute URL for the ``Fourmilab Home Page'' link. This gets people back to the site when the resulting manual page is posted on SourceForge. Updated the \.{distclean} target in \.{Makefile.in} to get rid of several intermediate files which had crept in since the last housecleaning. These made it more difficult to detect any new files which required adding to the CVS repository. Added the \.{utilities/maildir\_filter.pl} utility contributed by Travis Groth. This has been added with CVS but not yet committed. \date{2002 October 26} Added a \.{--biasmail} option to set the frequency bias for words and phrases found in legitimate mail. Previously this was fixed at 2, which remains the default. Added \.{autoconf} plumbing to detect all the myriad stuff required to support POP3 proxying. 
We attempt to distill all of these detections down to a \.{POP3\_PROXY\_SERVER} definition which controls all code related to that capability. \date{2002 October 27} Integrated the stand-alone POP3 test article as a new |POP3Proxy| class with a hard-coded exerciser in the \.{--jig}. At the moment, it's purely a proxy---it doesn't interpose the filter. \date{2002 October 30} After much struggling, the POP3 proxy now seems to be working, so it's time to integrate it fully into the program. Added a \.{--pop3port} option to specify the port on which the POP3 proxy listens for connections. If not specified, the port number defaults to 9110. Added a \.{--pop3server} option to specify the server and, optionally, port (which defaults to 110 if not given) to which the POP3 proxy server will connect. This must be the last option (a warning is given if it isn't), and causes the server to immediately begin operation. I removed the server test code from the \.{--jig} and physically moved it to a subsection within the ``POP3 proxy server'' section, following the class definition. \date{2002 October 31} Disabled the \.{--jig}, since there's nothing in it at the moment. Added proper conditional setting of |POP3_PROXY_SERVER| based on the capabilities sensed by \.{autoconf} and fixed one compile problem if the proxy server is disabled. At the moment, we assume that if |socket| and |signal| are defined, everything else we'll need will also be defined. \date{2002 November 1} Cleaned up POP3 proxy code and added documentation of the related command line options. I still need to add a main document section on how to install and operate a proxy server. \date{2002 November 2} We weren't activating the byte stream parser for spoofed mail worm attachments which trick Microsoft Outlook into executing an attachment through the incredibly subtle stratagem of declaring the attachment as an innocuous file type such as audio or image, but with an extension which denotes an executable file.
Brain-dead Outlook decides whether to block or confirm executable content based upon the former, but then actually executes the file based upon the latter. Can you say ``duh''? Well, thanks to this particular piece of Redmond rot, tens of millions of these worms continue to pollute the net since, even though the hole has been plugged, millions of the bottom-feeders who use such software continue to use unpatched versions and/or run machines which are already infected and actively propagating the worm. All right, enough polemic. What this means for \PRODUCT\ is that when we see an attachment with a \.{Content-Type} which usually denotes something we're not interested in parsing, but then discover its file name is one of the suspicious executable Microsoft file types, we need to feed it through the byte stream parser just as if it were tagged with an ``\.{application}'' file type. Doing so will extract the inevitable embedded strings, which will act as a signature for subsequent encounters with the same or similar worm. (SourceForge bug 631503, reported by Neil Darlow.) Improved diagnostics for parser errors by saving the ``\.{From\ }'' line and \.{Message-ID} (if any) from the header and then labeling any parser diagnostics written to standard error with the \.{--verbose} option with them. The labels are written only before the first diagnostic for each message in a folder, and diagnostics are now indented to better distinguish them from the labels. Diagnostics from |MBCSdecoder| objects were written to standard error without any identification of the message in which they occurred. I added the ability to link an |MBCSdecoder| to its parent |mailFolder| with the new |setMailFolder| method. If linked, diagnostics from the decoder are emitted via the |reportDecoderDiagnostic| method of the linked folder, permitting them to be labeled with the message identification as described in the previous paragraph.
It's still possible to use an |MBCSdecoder| without linking it to a |mailFolder|---if the link is |NULL|, diagnostics are sent to standard error as before. Improved diagnostics from the various |MBCSdecoder| classes. All reports of invalid two-byte sequences now report both hexadecimal bytes, and other invalid value diagnostics report the offending hexadecimal value. Added the ability to search for a literal substring as well as a regular expression in \.{utilities/splitmail.pl}. If the search target begins with ``\.{+}'' (which is invalid in a regular expression), the balance of the pattern is searched for with case-insensitive comparison. Since so many of the message headers you're likely to be looking for contain regular expression meta-characters, it's a lot more convenient to specify an explicit target than remember what they all are and quote them. Corrected the diagnostic for an unknown character set in a header line string to say ``Header line'' rather than the obsolete and misleading ``Subject line'' it used to say. Added ``\.{us-ascii}'' to the list of character sets which require no multi-byte decoding or interpretation when they appear in header line quoted strings. Junk mail sometimes encodes even ASCII subject lines (and sometimes other headers) as Base64 or Quoted-Printable to hide the text from na\"\i ve filters. Added a script to build under Cygwin, \.{makew32.sh}. Attempting to link in our own copies of \.{getopt.c} and \.{getopt1.c} runs afoul of the Cygwin linker ({\it why?}), so I removed them from the compiles and link done by this script. Building on Cygwin failed because the library I was using didn't define \.{in\_addr\_t}. I'd seen this earlier on Solaris, but had inadvertently added a new reference since I'd last tested there. I changed the offending reference (in a \.{static\_cast} of all places), to our cop-out type \.{u\_int32\_t}, which \.{autoconf} guarantees will always be there. 
With that fix, the program built {\it and worked} on Cygwin, including POP3 proxying! The check for non-white space following a soft line break in a Quoted-Printable MIME part failed for a POP3 proxy message containing CR/LF line terminators. I broadened the definition of white space in |@| to include carriage return. \date{2002 November 3} Scribbled a first cut \.{README.WIN} file to be included in the Win32 executable archive which explains the issues involving the included Cygwin DLL\null. I modified \.{Makefile.in} to include this file, the DLL, and \.{COPYING.GNU} (the GPL) in the Win32 archive. Tested the Win32 archive on a Cygwin-free machine. Seems to work OK, including POP3 proxy from another machine on the LAN. Verified that POP3 proxy on a Cygwin-free machine running Windows 98 works with the version of Outlook furnished with that system, which can be configured to retrieve messages from "localhost" on our default port of 9110. Note, however, that one must first configure the account (defaulting to port 110), then edit the properties of the account, using the ``Advanced'' tab to specify the POP3 port of 9110. Messages embedded within other messages with the \.{Content-Type} specification of \.{message/rfc822} did not have their own MIME parts correctly decoded because |mailFolder| failed to scan the header of the embedded message for its own \.{Content-Type} and \.{boundary} specifications. Fixed. This should get rid of the previously mysterious long gibberish strings which decoded out of forwarded messages with image and other binary attachments. The strings were due to the Base64 decoder not being activated for the embedded message's attachments. \date{2002 November 5} Implemented the first cut of fast dictionary support. Having created a dictionary in memory, you can export it to a file in fast dictionary format with the \.{--fwrite} option. The \.{--fread} option loads such a dictionary and, if loaded, it takes precedence over a regular |dictionary|. 
This permits fast classification of messages without all the overhead of creating a full-fledged in-memory dictionary. Added memory-mapping of the fast dictionary when |HAVE_MMAP| is defined. In the interest of code commonality, the header fields are read from an |istrstream| bound to the memory mapped block, but access to the hash and word tables is pure pointer-whack. Fixed a typo in \.{configure.in} which caused a harmless but ugly warning when running the script. Disabled static linking for SunOS systems in \.{configure.in} due to GCC's inability to find the networking libraries when static linking. Added a list of optional capabilities detected by \.{configure} to the \.{--version} output. This makes long-distance diagnosis of configuration problems easier. The check for attempting to start a POP3 proxy server without having loaded a dictionary didn't test for a fast dictionary's having been loaded. Fixed. The destructor for |fastDictionary| attempted to |delete| the in-memory dictionary even when it was, in fact, memory mapped from a file. I added conditional code to replace the |delete| with a |munmap| and |close| of the file. In addition, I added logic to unmap and close the file if an error was detected while reading its header. Modified the ``\.{check}'' target in the \.{Makefile.in} to use a fast dictionary for the junk test. This guarantees the fast dictionary code will be exercised in the normal course of building and installation. Added the \.{-x} option to the invocation of the shell in the Cygwin \.{makew32.sh} script so we can see what's going on during the build. \date{2002 November 6} Created a \.{pop3proxy.pif} file as a skeleton PIF the user can edit (with ``Properties'' from the right click menu) to set up an auto-start POP3 proxy server. Discovered that \.{README.WIN} (the description of Cygwin related issues for the Windows executable archive) was missing from the comprehensive source archive. It was also missing from the CVS tree.
Both fixed. Added confirmation messages for exporting and loading fast dictionary files when \.{--verbose} is set. Added an option to the \.{tar} command used to create the source archive to exclude the CVS subdirectories. This works only with Gnu \.{tar}, but that should be OK, since we only create distributions on systems so equipped. Release 1.0-RC1. \date{2003 January 22} Added code to the |POPDEBUG| output to echo both status replies from the server and the body of multi-line reply messages. Eliminated some obsolete disabled code in |@| in POP3 proxy support. Promoted the POP3 trace facility from conditional compilation to a full-fledged option, \.{--pop3trace}, which causes the trace output to be written to |cerr|, tagged with a prefix of ``\.{POP3:\ }''. Added trace output to show replies sent to the client, both status lines and multi-line bodies. Removed the disables (got that?) of |HAVE_DIRENT_H| and |HAVE_POPEN| for \.{WIN32} builds, permitting directory traversal when building dictionaries and expansion of compressed files (if \.{gzip} is installed on the system). These were previously disabled when we built with \.{DJGPP}, which didn't support these features; Cygwin does. \date{2003 January 23} Made the |writeMessageTranscript| methods of |mailFolder| |const|, as they don't change any member of the class. Added a new |sizeMessageTranscript| method to |mailFolder| which computes the size of the file written by |writeMessageTranscript|. If you intend to export the transcript with a different per-line overhead than the one byte added by |writeMessageTranscript|, you can pass a |lineOverhead| argument to specify the overhead; the default value is one. I finally figured out what was causing ``hangs'' when transferring large messages as a POP3 filter on \.{WIN32} platforms (Cygwin builds). Well, it {\it wasn't} hung---it had just slowed down by several factors of a thousand and nobody noticed the difference. ``Why?'', you ask. 
Well, it turns out that after all the real work is done, |popFilter| called |writeMessageTranscript| with an |ostringstream| to create the reply message body to be returned to the POP3 client. This apparently trivial operation, which is essentially instantaneous on a Linux or Solaris box with GCC and its libraries, runs a tad slower under the Cygwin version of the very same compiler and libraries. How much slower? Well, for a half-megabyte file, about 1500 times slower! Worse, the slow-down grows much faster than linearly with the size of the file; I tested a one megabyte file and gave up after several hours of watching it. Presumably there is some idiocy in the allocator used to expand the |string| within the |ostringstream| which is causing it to take longer and longer as the string grows. I rewrote the code in question to use a trusty |ostrstream| directed at a dynamically allocated buffer (that's what |sizeMessageTranscript|, discussed above, is for), and the whole thing runs too fast to measure under both Linux and Cygwin now. Ain't ``source compatibility'' fun? Moved the include of \.{mystrstream.h} outside the conditional for |HAVE_MMAP|, as it is now needed by the |popFilter| code as well. Added \.{mystrstream.h} to the files included in the \.{WIN32} transfer archive by the \.{winarch} target in \.{Makefile.in}. To avoid possible copying of the string containing a large message body and to make the code consistent, modified |@| to use an |istrstream| directed at the data of the |reply| string rather than an |istringstream|. Given the adventures we've had with |ostringstream|, the less I have to do with these beasts the better. Added the ability to limit the size of single |send| calls writing a multi-line reply body back to a POP3 client in |@|. If |POP3_MAX_CLIENT_WRITE| is defined, multiple sends no larger than that value will be used. Otherwise, all the data will be sent in a single monolithic |send| as before. 
This was added in the process of chasing down the Cygwin ``hang'' problem, and for the moment I've left the code in place in case it should be needed in the future. The |mailFolder| constructor which takes an |istream| argument did not clear |dirFolder| when built with |HAVE_DIRECTORY_TRAVERSAL|. This ran the risk that, at the end of the folder, we would erroneously call |readdir| to look for the next file in a nonexistent directory. This was particularly a risk for POP3 proxying, where the mail folder is created on the stack and static initialisation doesn't occur. I added an explicit clear of |dirFolder| in the |istream| constructor of |mailFolder|. Added a program, \.{fromtest.pl}, to the \.{utilities} directory, which scans a mail folder and checks for occurrences of the initial string ``\.{From\ }'' not preceded by the start of file or a blank line. Most Unix mail folders obey this convention, but the original definition of BSD mail folders required {\it every} occurrence of ``\.{From\ }'' at the start of a line to be quoted (traditionally with ``\.{>}''). You can use this program to test your mail folders and determine which kind your mail system creates. \date{2003 January 24} Modified the \.{winarch} target in \.{Makefile.in} to exclude any \.{CVS} directories it may encounter. Added |strstream|, |istrstream|, and |ostrstream| to the \CPP/ library type list in \.{cweb/c++lib.w}. \date{2003 February 15} Started to dig into compile incompatibilities in the ``new and improved'' libraries which accompany \.{gcc} 3.2.2. In the language lawyering verbiage below, ``Stroustrup'' refers to ``The \CPP/ Programming Language, Special Edition'' by Bjarne Stroustrup, ISBN ~0-201-70073-5. First of all, my local copy of |strstream| in \.{mystrstream.h} ran afoul of other changes in the ``standard'' library. 
I merged the \.{backward/strstream} and \.{backward/strstream.h} files from the 3.2.2 library and installed them as \.{mystrstream-GCC3.h}, which is included if \.{GCC3} is defined. I have yet to add the \.{autoconf} logic to detect this; at the moment I'm specifying this when I invoke the \.{Makefile}. An include of the now {\it verboten} \.{iostream.h} remained in \.{statlib.w}; I pulled the ``\.{.h}''. In addition, \.{statlib.w} ran afoul of the dreaded ``implicit typename is deprecated'' warning in GCC 3.2. I added the required |typename| qualifier before constructs such as |dataTable::iterator@, p| in the methods of |dataTable|. See section {\bf C.13.5} in Stroustrup for details. Previously, \.{gcc} treated the buffer argument of |ostream::write| like a \CEE/ |void *| pointer. Now one must explicitly coerce it with a |reinterpret_cast|. The same goes for |istream::read|, where the argument must be coerced with |reinterpret_cast|. This played havoc with our binary I/O code in |dictionaryWord| and |fastDictionary|, requiring ugly casts all around. I may go back and prettify these with a macro, but not before I get the sucker past all the other compile problems. In days of yore, when everybody knew that an STL |vector| was just a dynamically sized array, you were allowed to treat an iterator of the |vector| as a \CEE/ pointer to access the contents of the object, as long as you made sure all references were within bounds: no more. No longer can you, for example, write the entire contents of a |vector| to a stream with a single |write|. Instead, you must painfully iterate over every element in the |vector|, doing I/O on each one individually. This is potentially a huge performance hit which may motivate abandonment of the STL |vector| in favour of a \CEE/ array which can be written in one swell foop. Fortunately, all the cases where this occurs in \PRODUCT\ are in exporting |fastDictionary| objects, which happens so infrequently we don't care how fast it runs. 
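The casts and the element-by-element |vector| output described above can be sketched as follows. This is a minimal illustration, not the code in |dictionaryWord| or |fastDictionary|: the helper names |writeRaw|, |readRaw|, |writeVector|, |readVector|, and |roundTrip| are hypothetical, and the sketch assumes the element type is plain data which can safely be copied byte by byte.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical helpers for illustration only; none of these names come
// from the program itself. They assume T is plain data (no pointers,
// no virtual functions) which may be copied byte by byte.

// Write the raw bytes of one object; ostream::write takes a const char *.
template <typename T>
void writeRaw(std::ostream &os, const T &value) {
    os.write(reinterpret_cast<const char *>(&value), sizeof value);
}

// Read the raw bytes of one object; istream::read takes a char *.
template <typename T>
void readRaw(std::istream &is, T &value) {
    is.read(reinterpret_cast<char *>(&value), sizeof value);
}

// Write a vector one element at a time through its iterator.
template <typename T>
void writeVector(std::ostream &os, const std::vector<T> &v) {
    // The typename qualifier here is the one the newer compiler insists upon.
    for (typename std::vector<T>::const_iterator p = v.begin(); p != v.end(); ++p) {
        writeRaw(os, *p);
    }
}

// Read back count elements written by writeVector.
template <typename T>
std::vector<T> readVector(std::istream &is, std::size_t count) {
    std::vector<T> v(count);
    for (std::size_t i = 0; i < count; ++i) {
        readRaw(is, v[i]);
    }
    assert(!is.fail());     // Sketch only: real code should report a short read
    return v;
}

// Round-trip a vector through an in-memory binary stream.
template <typename T>
std::vector<T> roundTrip(const std::vector<T> &v) {
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    writeVector(buffer, v);
    return readVector<T>(buffer, v.size());
}
```

Calling |roundTrip| on a |vector| of |double| values should hand back an identical |vector|, which is a cheap regression test for this kind of binary I/O code.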
\.{Gcc} 3.2 also complains if you declare the values of default arguments in a method within a class, then repeat them in the implementation declared subsequently. I've always written code this way, considering it to better document what's going on, particularly since the poor sucker who has to fix the code later on is probably going to be looking at the implementation and may be unaware of the default argument values declared back in the class definition. Well, it turns out that one can read section {\bf 7.5} of Stroustrup as prohibiting this pursuant to the ``default argument cannot be repeated or changed in a subsequent declaration in the same scope'' prescription and, indeed, the example of default arguments in class methods in section 10.2.3 is coded this way. Okay, what can I do but ``fix'' it, but to my mind this reduces the maintainability of the code. I think you should be able to use precisely the same declaration of the function in its definition and implementation, including default arguments and attributes such as |const|. The compiler should verify that they're identical, but then both the definition and implementation serve as stand-alone descriptions of the calling sequence and method properties. Oh, {\it come on}, guys! Now you're telling me I have to do a |reinterpret_cast| to |istream::read| into a bloody |unsigned char|! You can imagine what this did to |dictionaryWord::importFromBinaryFile|. Unfortunately, I not only had to imagine it, I had to do it. \date{2003 February 16} With \.{gcc} 2.96, when you include \.{math.h}, it doesn't define |abs| for |double|, as it's supposed to do according to section {\bf 22.3} of Stroustrup. Consequently, I defined my own |abs(double)| in the global context to get the job done. Well, on 3.2.2, the existence of this function creates an overloading ambiguity against the built-in one, which has now been added to \.{math.h}. 
It turns out that if you include \.{cmath} in 2.96, you {\it do} get |abs(double)|, although that file and \.{math.h} are documented as being identical. So, I replaced the include of \.{math.h} with \.{cmath} and eliminated my private copy of |abs|. Now it compiles on both of 'em. They've gone and eliminated |fstream::attach(int fd)| from the standard---just try and plumb a pipe into your input or output stream the way you effortlessly used to! As a first cut attempt to detour past this off-ramp to oblivion, I tried building with |HAVE_POPEN| undefined, and promptly fell into a self-dug abyss: bad conditional declaration of the file handle used to read compressed mail folders and messages in |mailFolder|. I fixed that, and for the first time, we actually built and passed ``\.{make{ }check}'' under 3.2.2! Just don't try it with compressed mail folders quite yet$\ldots$\,. Now, of course, we must deal with this. I installed the \.{fdstream.hpp} package developed by \pdfURL{Nicolai M. Josuttis}{http://www.josuttis.com/} in the source directory, extending it to permit declaration of |fdistream| and |fdostream| objects with a default file descriptor of zero, which can be specified later by a new |attach| method, thus requiring fewer changes to existing code which uses the |fstream::attach| mechanism. There is little or no error checking---you can screw things up mightily by swapping file descriptors on the fly, but then you could before with |fstream::attach|! To test this class and dip my toe into the acid bath of post-|fstream::attach| plumbing, I modified |pdfTextExtractor| to use |fdistream| to read the pipe from \.{pdftotext}, which is a simpler case than the tangle associated with compressed file decoding. This worked the first time, meaning I should look over my shoulder when migrating the |attach| references in the compressed file code to the new mechanism.
Note that the existing code has lots of {\it ad hoc} tweaks, all tagged with \.{OLDWAY}, to enable the currently-working code. Before we're ready to ship, all of the OLDWAY dust-bunnies should be cleaned up and a clean build and regression test run on 2.96 and 3.2.2 parameterised exclusively by the \.{configure} script. Added code to |mailFolder| to use a new |fdistream| to read the pipe when decompressing mail folder files and compressed files in mail directories. In the \.{gcc} 3.2.2 library, closing and opening an |ifstream| does not clear |ios::eofbit| in the descriptor as it used to. (I consider this a stone bug---when you close one file and open another, only an idiot would consider the end of file condition from the previous file still asserted.) In any case, I added a |clear()| of the |ifstream| we use while traversing a directory in |@| so this doesn't sabotage reading messages in a directory. Re-tested directory traversal, with and without compressed files in the directory, on \.{gcc} 2.96 and 3.2.2 to verify the modified code works on both. It does. \date{2003 February 18} Added logic to \.{configure.in} to test whether the \CPP/ library is compatible with the \.{fdstream.hpp} package. If so, we use it; otherwise we assume it's an old library which supports the |attach| method for |fstream| I/O. The \.{config.h.in} variable |HAVE_FDSTREAM_COMPATIBILITY| will be defined if \.{fdstream.hpp} is to be used. Added a test to \.{configure.in} which determines whether the \CPP/ library is compatible with the new \.{mystrstream\_new.h}. If so, it's included. Otherwise, the earlier \.{mystrstream.h} is used as before. If the new |strstream| package works, |HAVE_NEW_STRSTREAM| is defined in \.{config.h.in}. With these changes, the source configures and builds correctly on \.{gcc} 2.96 and 3.2.2 without any tweaks or changes. As suggested by Kern Sibbald, I changed the default \.{--phraselimit} to 48 characters. 
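The |ios::eofbit| problem noted under 2003 February 16 comes up whenever one |ifstream| is reused for a series of files. The following is a minimal sketch of the defensive |clear()|, not the directory traversal code in |mailFolder|; the function name |readAllLines| and its arguments are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read the lines of several files in turn through a
// single ifstream, as one does when walking the files in a directory.
std::vector<std::string> readAllLines(const std::vector<std::string> &fileNames) {
    std::vector<std::string> lines;
    std::ifstream is;

    for (std::size_t i = 0; i < fileNames.size(); ++i) {
        assert(!is.is_open());          // Previous file was closed (or never opened)
        is.clear();                     // Forget eofbit/failbit left by the previous file
        is.open(fileNames[i].c_str());
        if (!is) {
            continue;                   // Cannot open this one: move on to the next
        }
        std::string line;
        while (std::getline(is, line)) {
            lines.push_back(line);
        }
        is.close();
    }
    return lines;
}
```

On a library which resets the stream state itself after a successful |open|, the extra |clear()| is redundant but harmless, so the same source should behave identically on both kinds of library.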
As reported by Jim Hamilton, some mail systems which store individual messages as separate files in folder directories do not prefix each message file with the ``\.{From\ }'' sentinel we were counting on to mark message boundaries. This resulted in bad message counts, affecting probability computation and, worse, failure to reset decoder modes, etc.\ after a malformed message. I added a new |expectingNewMessage| flag, which is set at the start of every new file |mailFolder| reads (whether a composite mail folder or a file within a directory). When |expectingNewMessage| is set, the first line of the file with a nonblank character in the leftmost character position is considered the start of a new message regardless of its contents. \date{2003 February 19} Added the ability to parse a composite mail folder file using either pure BSD (``\.{From\ }'' always denotes start of message and is quoted in every other case) or ``consensus \UNIX/'' format, where ``\.{From\ }'' only marks the start of a new message when it appears after a blank line. Sun ``\.{Content-Length:}'' folders are {\it not} supported, as they were a disastrously poor idea---you can generally treat them as usual \UNIX/ folders. By default, folders are parsed using \UNIX/ semantics. A new \.{--bsdfolder} option marks the following \.{--mail} or \.{--junk} folder as following BSD rules. Note that you must specify \.{--bsdfolder} before {\it each} BSD-style folder; it is not modal. This is a change in default behaviour: folders were previously parsed using BSD rules, while \UNIX/ is now the default. The very large |case| statement which processes command line options ran afoul of \.{CWEAVE}'s maximum token per scrap capacity limit. I added a \.{cweb/cweave-bigger.ch} file to increase the limit to 5000 tokens (from 2000), and modified \.{cweb/Makefile} to apply the change file when building \.{CWEAVE}.
I probably ought to break the option processing |case| into one piece for each option, but as there's little or nothing to be said about each one, that really wouldn't improve the readability of the code. \date{2003 February 20} Completed the implementation of \.{--autoprune}. This new option permits you to specify a memory size, in bytes, at which a dictionary to which words are being added with the \.{--mail} or \.{--junk} options will automatically be pruned by discarding all words which appear only once. A new |dictionaryWord::estimateMemoryRequirement| method estimates the memory occupied by an in-memory word, and this is used to compute the total dictionary size. |dictionary::purge| has been extended to accept an optional argument which, if nonzero, causes the pruning of the dictionary to be based on the number of occurrences of the word rather than our ability to compute its probability. If the user sets \.{--autoprune} too low, we can fall into a thrashing situation when the non-unique words in the dictionary exceed the pruning threshold. To keep this from happening, whenever the dictionary size after an automatic prune exceeds 90\% of the \.{--autoprune} threshold, the threshold is increased by 25\%. \date{2003 February 21} Modified the \.{makew32.sh} script to build with \.{gcc} 3.{\it x} rather than 2.{\it x}. Note that this means the source should be \.{./configure}d for a \.{gcc} 3.{\it x} build before creating \.{winarch} to transport to the Cygwin machine. When building on Cygwin with \.{gcc} 3, \.{getopt.h} managed to get included twice for some reason. I changed the condition around our local copy to |__GETOPT_H__| to agree with the symbol in the library include to prevent this from happening. Updated the \.{cygwin.dll} included in the Win32 executable distribution to the January 24, 2003 version we're currently using on Ovni. Release 1.0.
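The \.{--autoprune} threshold adjustment described under 2003 February 20 amounts to a one-line rule. Here is a minimal sketch of it, assuming sizes are kept as byte counts; the function name |adjustAutoPruneThreshold| and its signature are hypothetical and are not taken from the program.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical function: given the current --autoprune threshold and the
// estimated dictionary size (both in bytes) just after an automatic prune,
// return the threshold to use from now on.
std::size_t adjustAutoPruneThreshold(std::size_t threshold, std::size_t sizeAfterPrune) {
    assert(threshold > 0);

    // If pruning left the dictionary above 90% of the threshold, the next
    // prune would come almost at once, so raise the threshold by 25%.
    if (static_cast<double>(sizeAfterPrune) > 0.9 * static_cast<double>(threshold)) {
        return threshold + threshold / 4;
    }
    return threshold;
}
```

Applied after every automatic prune, a rule of this kind grows the threshold geometrically until the words which survive pruning fit comfortably beneath it, at which point it stops changing.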
\date{2003 June 24} As reported and fixed by Wolfgang Schnerring, \.{utilities/splitmail.pl} had an assignment statement in the \.{dispose\_of\_message} subroutine which was missing the dollar sign before the variable name. I integrated his fix. Thank you! \date{2003 August 27} A |pdfTextExtractor| was not restartable---once instantiated, it could only be used once; calling |close| and then re-initialising with the |setMailFolder| method of the parent |applicationStringParser| class left the extractor at end of file. This required fixes both in |pdfTextExtractor|, where the |close| method failed to reset |initialised| to |false|, and in |applicationStringParser|, whose |close| method did not reset the |eof| and |error| flags. \date{2003 August 28} Added a parser diagnostic to |mailFolder::nextLine| to indicate when an |applicationStringParser| is closed. The |close| method of |pdfTextExtractor| failed to close the input stream it used to read the output from the pipe connected to \.{pdftotext}, which caused (for some bizarre reason) the raw binary PDF file to be returned, not the decoded text. I added the requisite |close| of the stream. When |pdfTextExtractor| was transcribing the decoded attachment to the temporary file to be read by \.{pdftotext}, it checked for end of file but not error conditions. I modified it to use |isOK()| to govern the copy loop. The |flashTextExtractor| and its parent |flashStream| were not restartable because they did not propagate the |close| up to the |applicationStringParser| from which all are derived, and because |flashTextExtractor| did not reset its own |initialised| and |textOnly| at end of file. Fixed. Because the |flashStream| decoder usually terminates upon seeing a \.{stagEnd} tag in the input stream, it failed to read from the MIME decoder until end of file was encountered. This caused an extraneous blank line to be inserted in the transcript at the end of the MIME-encoded data and before the part end sentinel.
I added logic to |flashTextExtractor::nextString| to call |get8()| until an end of file is reported before returning the logical end of file for the flash stream. The input stream |close| I added to |pdfTextExtractor::close| ran afoul of the |fdistream| logic used to cope with \.{gcc} 3 which, helpfully, does not define a |close| method. I made the |close| conditional on |HAVE_FDSTREAM_COMPATIBILITY| not being defined. This time, our attempt to rebuild the Win32 version was torpedoed by |getopt| in yet another innovative way. This time, the care we took to avoid including our own \.{getopt.h} stabbed us in the back, because the library's version (I still haven't figured out why it's being included) doesn't define the long version of |getopt|, and wants a different symbol to do so than our include file. So, I added |WIN32| conditional code before the include of our version to force it to be included and define the long option version of |getopt|. This GCC/Cygwin ``compatibility'' is turning out to be a running bad joke. Release 1.0a. \date{2003 September 23} A file whose name contained the string ``\.{.gz}'' (or whatever other compressed file extension was configured) would be fed through the decompressor even if the sequence was embedded in the middle of the file name. I modified the tests to deem a file compressed only if the |Compressed_file_type| string appears at the end of the file name. This applies both to files named directly on the command line and files within directories. A PDF file which has been marked by its creator as view-only will not be processed by \.{pdftotext}---no output is generated and the message ``{\tt Error: Copying of text from this document is not allowed.}'' is sent to standard output. There's nothing we can do about this, absent making a version of \.{pdftotext} which bypasses the PDF file security mechanisms. While there's something to be said for this, it's well beyond the mandate of \PRODUCT .
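The compressed file name test described under 2003 September 23 is a suffix comparison rather than a substring search. Here is a minimal sketch of such a test; the function |isCompressedFileName| and the constant |compressedFileType| are hypothetical stand-ins, the latter for the |Compressed_file_type| value chosen at configuration time, here assumed to be ``\.{.gz}''.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the configured Compressed_file_type string.
static const std::string compressedFileType = ".gz";

// Deem a file compressed only if its name ENDS with the suffix; a name
// such as "archive.gz.old", which merely contains it, does not qualify.
bool isCompressedFileName(const std::string &name,
                          const std::string &suffix = compressedFileType) {
    assert(!suffix.empty());
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

With this rule ``\.{junk.gz}'' is sent through the decompressor, while ``\.{archive.gz.old}'' and ``\.{mail.gzip}'' are read as ordinary files.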
An assertion added to |flashStream::ignoreTag| in the process of debugging problems due to multiple flash attachments could fail when \.{--bsdfolder} mode was used to scan a mail or junk folder. I commented out the assertion. \date{2003 September 24} Phil Karn (KA9Q) reported that on the latest Debian distribution, compilations failed due to a missing definition of |assert|. As far as I can determine, \.{assert.h} was pulled in by other includes in earlier libraries, but now must be included explicitly. I added the requisite includes to \.{annoyance-filter.w} and \.{statlib.w}. Release 1.0b. \date{2003 October 8} Good ole' \.{gcc} 3.3.1 has taken to complaining if you compare |c<=255| where |c| is an |unsigned char|. Well, of course, this cannot be |false|, but it makes perfect sense when used in an assertion within a macro which only writes out a single byte of its argument. But of course we must sacrifice sound engineering safeguards in the interest of the \.{gcc} thought police's finely-honed sense of purity. So, out goes the assertion in |dictionaryWord::exportToBinaryFile| which guards against any programmer accidentally using |outCount| to write a value which may be larger than 255. Suppose you want to initialise a character class or translation table. What could be more natural than to write, say, ``|ctab['.']=CC_PUNCT|''? Well, not if you have to compile with \.{gcc} 3.3.1 which deems using a |char| as a subscript a distinctly unnatural act meriting a compile-time warning. I added a dumb macro to wrap |static_cast()| around the character table entries initialised in this way in |base64MIMEdecoder::initialiseDecodingTable|, |tokenDefinition::setISO_8859defaults|, and |tokenDefinition::setUS_ASCIIdefaults|. Ooooh, that's {\it soooooh} much cleaner now! \date{2004 March 23} Added a kludge attempt to work around the messages which have begun to arrive with no blank line after the header, which runs directly into the first MIME header. 
At the moment, the code is just hammered into |mailFolder::nextLine| conditionally compiled on |SLOPPY_HEADERS|. If it does the job, I will make it a proper option and add it to the documentation. \date{2004 March 30} When configured for a system with |HAVE_MKSTEMP|, PDF decoding left open a file descriptor pointing to each temporary file it created. The error was due to |mkstemp|'s returning a handle to the open file as well as the file name. The code in |@| went ahead and opened the file on its own, as is done after obtaining a name from |tmpnam|. I modified the code to use the open file handle from |mkstemp| and ensure it's closed when we're done writing the file. Of course this puts us right on the tracks where the |HAVE_FDSTREAM_COMPATIBILITY| locomotive is bearing down on us, necessitating conditionals all around to handle the removal of file handle I/O from |fstream| in later GCC libraries. Release 1.0c. \date{2004 August 4} The latest twist of the knife by the GCC priesthood in G++ 3.4 struck the |dataTable| template class in \.{statlib.w}. The ``enhancement'' which torpedoed this code, which previously compiled without warnings in \.{-Wall} mode, is ``described'' by this fine \pdfURL{piece of gibberish}{http://gcc.gnu.org/onlinedocs/gcc/Name-lookup.html} from the G++ documentation. After you've sorted through all the scholasticism, the bottom line is that any template class you define which is derived from an STL template such as |vector| and innocently uses methods from the parent class such as |size|, |begin|, or |end|, will now blow off with a compile error unless you qualify each of these method references with |this->| or |vector::|. How I regret ever getting involved with this crap-bag language and compiler! The latest version of the macros for CWEB, \.{cweb/cwebmac.tex}, fails with \TeX\ 3.14159 (Web2C 7.3.1) because \.{pdfURL} is defined inside a block of code which is only processed by pdf\TeX.
I moved the definition down to the bottom of the file, and now everything seems to work OK{}. To eliminate the need to install \.{cwebmac.tex} in the \TeX\ directory tree and avoid possible version incompatibilities, I added an environment variable set to the invocations of \TeX\ and pdf\TeX\ in \.{Makefile.in} to force the copy in our own \.{cweb} directory to be used. Rebuilt \.{configure} using Autoconf 2.59, which corrects the idiotic \.{char} function type in the |AC_CHECK_FUNCS| macro which was causing |mkstemp| and |system| not to be detected even on systems on which they are present. Promoted the |SLOPPY_HEADERS| code to a full-fledged option, \.{--sloppyheaders}. By default, it is off. To enable parsing of MIME headers with missing blank separators, specify the option both when training and testing messages. Release 1.0d.

%%%%%%%%% Add new entries before this line %%%%%%%%%

\parskip=0pt plus1pt \parindent=20pt