Bayesian Noise Reduction - Contextual Symmetry Logic
http://bnr.nuclearelephant.com
Copyright (c) 2004 Jonathan A. Zdziarski

v2.0

LICENSE

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

ABOUT BAYESIAN NOISE REDUCTION

Modern-day language classification requires the use of machine learning,
which relies heavily on the learning input it is given. Most of today's
algorithms (Bayes, Chi-Square, etc.) are inherently sound and accurate;
however, regardless of which algorithm is used, a great deal of its
accuracy depends directly on the quality of the data provided: the
Garbage In, Garbage Out theory.

Bayesian Noise Reduction is a statistical approach to evaluating coherence
using pattern consistency checking. BNR attempts to solve the problem
commonly referred to as "Bayesian Noise", which, in its simplest
definition, refers to irrelevant or incoherent data present in a message
being classified. Bayesian Noise Reduction dubs out this text in order to
provide cleaner classification, and is implemented as a "pre-filter" to
existing language classification functions.

libbnr is an implementation of the Bayesian Noise Reduction (BNR)
algorithm, which I originally designed to counter directed attacks in
spam. As Dr.
John Graham-Cumming illustrated at the Spam Conference 2004, most
statistical language classifiers are quite resilient to random word
attacks, but fail miserably against directed attacks, in which the spammer
has mined intelligence about the target user(s) and purposely injected
text that is context-specific to the target. Such text can fool spam
filters into believing the message is legitimate. As it turned out, once I
had written version 2.0 of the algorithm, it proved quite efficient at
filtering out all types of noise from all types of text samples. Whether
you're writing a spam filter, writing a document classifier, or performing
some type of Bayesian intrusion detection, the noise reduction library can
help to improve the quality of your classifications.

A full explanation of the algorithm can be found in my white paper at
http://bnr.nuclearelephant.com. In simple terms, the BNR algorithm uses
pattern consistency checking to identify inconsistent data. The library
requires two different sets of input from the implementor:

1. A stream of _ordered_ tokens (words, nGrams, etc.) and their
   associated p-values (probabilities). This could be a message body or
   other input.

2. After a call to bnr_instantiate(), a set of patterns will be
   instantiated. These patterns must also be tracked in the classifier
   (according to the white paper, which treats them like any other token)
   and their probabilities must also be fed into the noise reduction
   context.

Once both pieces of data have been provided, the noise reduction algorithm
will perform its analysis and provide an output stream of what's left
over.

libbnr can be linked in with your classifier and called using the standard
C interface. An example has been provided (example.c) to show developers
how to integrate the tool properly.

One final note: if your classifier implements nGrams, it is usually best
to create a separate BNR context and process each set of nGrams separately.
One stream for single tokens and another for biGrams, etc., will yield the
best results.

BUILDING

  ./configure && make && make install

LINKING

  Compile your application with -lbnr

CODING

  See example.c for more information

Jonathan A. Zdziarski
jonathan@nuclearelephant.com
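
APPENDIX: ILLUSTRATIVE SKETCH

The two inputs described under ABOUT BAYESIAN NOISE REDUCTION (an ordered
token stream with p-values, and pattern probabilities supplied by the
classifier) can be pictured with a toy version of pattern consistency
checking. The sketch below is not libbnr code and does not use the libbnr
API; see example.c for that. Every name in it (struct tok, pattern_lookup,
pattern_name, reduce_noise, demo_lookup, the "bnr." prefix) is
hypothetical, and the window size, band width, and thresholds are
assumptions chosen for the illustration rather than values taken from the
library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* All constants below are assumptions made for this illustration only. */
#define WINDOW 3             /* tokens per pattern */
#define BAND 0.05            /* p-values are rounded to multiples of this */
#define MIN_PATTERN_DEV 0.25 /* skip patterns whose p is this close to 0.5 */
#define EXCLUSION 0.30       /* largest tolerated |token p - pattern p| */

/* One entry in the ordered token stream (hypothetical layout). */
struct tok {
    const char *text;
    double p;        /* classifier's probability for this token */
    int eliminated;  /* set to 1 once the token is flagged as noise */
};

/* Callback through which the classifier supplies a pattern's p-value. */
typedef double (*pattern_lookup)(const char *name);

double abs_diff(double a, double b) { return a > b ? a - b : b - a; }

/* Index of the band nearest to p, for p in [0, 1]. */
int band_of(double p) { return (int)(p / BAND + 0.5); }

/* Write the pattern name for the window starting at t[i],
 * for example "bnr.0.95_0.95_0.10". */
void pattern_name(const struct tok *t, size_t i, char *buf, size_t len)
{
    assert(len >= 8 * WINDOW); /* enough room for prefix plus WINDOW bands */
    size_t used = (size_t)snprintf(buf, len, "bnr.");
    for (size_t k = 0; k < WINDOW; k++)
        used += (size_t)snprintf(buf + used, len - used, "%s%.2f",
                                 k ? "_" : "", band_of(t[i + k].p) * BAND);
}

/* Slide a window over the stream. When the classifier reports a strongly
 * opinionated pattern, flag the tokens in that window whose own p-value
 * disagrees with the pattern. Returns the number of tokens flagged. */
size_t reduce_noise(struct tok *t, size_t n, pattern_lookup lookup)
{
    char name[64];
    size_t removed = 0;

    assert(lookup != NULL);
    for (size_t i = 0; i + WINDOW <= n; i++) {
        pattern_name(t, i, name, sizeof name);
        double pp = lookup(name);
        if (abs_diff(pp, 0.5) <= MIN_PATTERN_DEV)
            continue; /* pattern is too neutral to judge its tokens */
        for (size_t j = i; j < i + WINDOW; j++) {
            if (!t[j].eliminated && abs_diff(t[j].p, pp) > EXCLUSION) {
                t[j].eliminated = 1;
                removed++;
            }
        }
    }
    return removed;
}

/* Stand-in for the classifier's pattern table: one pattern is treated as
 * strongly spammy and every other pattern as neutral. */
double demo_lookup(const char *name)
{
    return strcmp(name, "bnr.0.95_0.95_0.10") == 0 ? 0.99 : 0.50;
}
```

Given a five-token stream with p-values 0.95, 0.95, 0.10, 0.95, 0.95 and
demo_lookup as the callback, reduce_noise() flags only the third token: it
sits inside a window whose pattern the stand-in classifier rates 0.99, and
its own value of 0.10 is far outside the exclusion radius. A real
classifier would learn the pattern probabilities from training data
instead of hard-coding them.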