bogofilter

Name

bogofilter — fast Bayesian spam filter

Synopsis

bogofilter [-s] [-h] [-S] [-H] [-p] [-l] [-d] [-v] [-V]

DESCRIPTION

Bogofilter is a Bayesian spam filter. In its normal mode of operation, it takes an email message or other text on standard input, does a statistical check against lists of "ham" and "spam" words, and returns a status code indicating whether or not the message is spam. Bogofilter is designed around fast algorithms, uses Berkeley DB for fast startup and lookups, is coded directly in C, and is tuned for speed, so it can be used in production by sites that process a lot of mail.

THEORY OF OPERATION

Bogofilter treats its input as a bag of tokens. Each token is checked against "ham" and "spam" wordlists, which maintain counts of the number of times it has occurred in non-spam and spam mails. These counts are used to compute the probability that a mail in which the token occurs is spam. After probabilities for all input tokens have been computed, a fixed number of the probabilities that deviate furthest from average are combined using Bayes's theorem on conditional probabilities. If the computed probability that the input is spam exceeds 0.9, bogofilter returns 0, otherwise 1.
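The combination step can be sketched in a few lines of Python. This is a simplified illustration, not bogofilter's actual C code: the retained-token count (KEEP = 15), the value for unseen tokens (0.4), and the 0.01/0.99 clamp are assumptions borrowed from Graham's essay, and the per-token estimate ignores the message-count normalization a real implementation would likely apply. Only the 0.9 threshold and the 0/1 status come from the text above.

```python
from math import prod

SPAM_THRESHOLD = 0.9   # from the text: above this, the input is judged spam
KEEP = 15              # assumption: the "fixed number" of probabilities retained


def token_probability(ham_count, spam_count, unknown=0.4):
    """Estimate P(spam | token) from raw wordlist counts (simplified)."""
    total = ham_count + spam_count
    if total == 0:
        return unknown  # assumption: value used for never-seen tokens
    p = spam_count / total
    # Clamp so a single token can never force the product to exactly 0 or 1.
    return min(0.99, max(0.01, p))


def combine(probs, keep=KEEP):
    """Combine the `keep` probabilities furthest from 0.5 with Bayes's rule."""
    extremes = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = prod(extremes)
    ham = prod(1.0 - p for p in extremes)
    return spam / (spam + ham)


def exit_status(probs):
    """Map the combined probability to the documented status: 0 spam, 1 non-spam."""
    return 0 if combine(probs) > SPAM_THRESHOLD else 1
```

Because only the most extreme probabilities are kept, neutral tokens (near 0.5) have no influence on the verdict unless the message contains fewer than KEEP tokens.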

While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper A Plan For Spam is recommended reading.

This program substantially improves on Paul's proposal by doing smarter lexical analysis. In particular, hostnames and IP addresses are retained as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are discarded so as not to bloat the word lists. Lex's Swiss-army-knife nature rises again.
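To illustrate what keeping hostnames and IP addresses whole means, here is a minimal Python sketch. Bogofilter itself uses a lex-generated scanner; the regular expression below is a hypothetical stand-in, and it does not attempt the discarding of dates and message-IDs described above.

```python
import re

# Hypothetical pattern: alphanumeric runs joined by single dots or hyphens,
# so "mail.example.com" and "192.168.0.1" each survive as one token.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[.-][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the tokens found in text, keeping dotted names intact."""
    return TOKEN.findall(text)
```

A naive split on non-alphanumeric characters would instead turn an address such as 192.168.0.1 into four short numbers that carry little information on their own.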

The input may be one message or many. Messages are broken up on From_ lines. The algorithm is relatively insensitive to message miscounts.

For speed, MIME and other attachments are not decoded. Experience from watching the token streams suggests that spam with enclosures invariably gives itself away through cues in the headers and non-enclosure parts.

OPTIONS

Without command-line options, bogofilter returns 1 if the message is non-spam, 0 if it is spam. The non-spam wordfile is created if absent.

The -s option tells bogofilter to register the text presented on standard input as spam. The spam wordfile is created if absent.

The -h option tells bogofilter to register the text presented on standard input as non-spam.

The -S option tells bogofilter to register the text presented on standard input as spam and to undo a prior registration of the same message as non-spam.

The -H option tells bogofilter to register the text presented on standard input as non-spam and to undo a prior registration of the same message as spam.

The -d option sets the directory under which wordlists will be found (normally $HOME/.bogofilter).

The -l option lists wordlists. Used with -h, it lists the ham list; used with -s, it lists the spam list.

The -p (passthrough) option writes a copy of the input mail to the output with an X-Spam-Status header (in the style of SpamAssassin) inserted. The header will begin with "Yes" if the mail is judged to be spam and with "No" if it is judged to be non-spam.
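A downstream program can recover the verdict from passthrough output by reading that header. The sketch below uses Python's standard email parser; the function name is hypothetical, and it relies only on the documented fact that the header value begins with "Yes" or "No", since the rest of the header's format is not specified here.

```python
from email import message_from_string


def verdict_from_passthrough(output):
    """Return True if bogofilter -p output is marked as spam, else False."""
    status = message_from_string(output).get("X-Spam-Status")
    if status is None:
        raise ValueError("no X-Spam-Status header found")
    # Only the leading "Yes"/"No" is documented, so nothing else is parsed.
    return status.strip().startswith("Yes")
```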

The -v option produces a report to standard output on bogofilter's analysis of the input. The report lists the tokens with highest deviation from a mean of 0.5 association with spam.

The -V option causes bogofilter to print out the version number and wordlist directory, then exit.

INTEGRATION WITH OTHER TOOLS

The following procmail rule will take mail on stdin and direct it to Mail/spam if bogofilter thinks it's spam:

:0HB:
* ? bogofilter
Mail/spam

If bogofilter fails (returning 2) the message will be treated as non-spam.

The following recipe (a) spam-bins anything that bogofilter rates as spam, (b) adds the words in messages rated as spam to the spam wordlist, and (c) adds the words in messages rated as non-spam to the non-spam wordlist. With this in place, it will normally only be necessary for the user to intervene (with -H or -S) when bogofilter miscategorizes something.

    :0HB
    * ? bogofilter
    {
            :0c
            | bogofilter -s

            :0
            $MUTT/spam
    }

    :0EHBc
    | bogofilter -h

There have been numerous requests for a bogofilter option to do the above, but the current implementation would make this quite painful. The procmail recipe is the best way for now.

The following .muttrc lines will create mutt macros for dispatching mail to bogofilter.

macro index d "<enter-command>unset wait_key\n<pipe-entry>bogofilter -h\n<enter-command>set wait_key\n<delete-message>"
macro index \ed "<enter-command>unset wait_key\n<pipe-entry>bogofilter -s\n<enter-command>set wait_key\n<delete-message>"

RETURN VALUES

0 for spam; 1 for non-spam; 2 for I/O or other errors.

Error 2 usually means that the wordlist files bogofilter wants to read at startup are missing.
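Note that the status convention is inverted relative to a yes/no answer read as a boolean: 0 (shell "success") means spam. A caller outside procmail might wrap the exit status as in the following Python sketch. The helper names are hypothetical, and the sketch assumes a bogofilter executable is on the PATH and is run without options.

```python
import subprocess

VERDICTS = {0: "spam", 1: "non-spam"}


def interpret(returncode):
    """Translate a bogofilter exit status into a verdict string."""
    if returncode in VERDICTS:
        return VERDICTS[returncode]
    # 2 is documented as an I/O or other error; treat anything else the same.
    raise RuntimeError(f"bogofilter failed with status {returncode}")


def classify(message, command=("bogofilter",)):
    """Feed one message (bytes) to bogofilter on stdin and return the verdict."""
    result = subprocess.run(list(command), input=message)
    return interpret(result.returncode)
```

Raising on status 2 differs from the procmail recipes above, which treat a failure as non-spam; which behavior is preferable depends on whether silently delivering mail or surfacing the error matters more.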

FILES

~/.bogofilter/hamlist: List of ham tokens.
~/.bogofilter/spamlist: List of spam tokens.

BUGS

bogofilter counts messages on input by looking for From_ lines. As a special case, a single message without a From_ line is counted correctly. Multiple messages without intervening From_ lines will be counted as one message.
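The counting rule can be modeled as follows. This Python sketch is an illustration of the behavior described above, not bogofilter's code; it assumes a From_ line is any line beginning with "From " (the mbox convention).

```python
def count_messages(text):
    """Count messages by mbox-style "From " separator lines."""
    separators = sum(1 for line in text.splitlines() if line.startswith("From "))
    # Special case from the text: input without any From_ line counts as one.
    return max(separators, 1)
```

Under this rule two messages concatenated without a separator are indistinguishable from one long message, which is the miscount described above.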

AUTHOR

Eric S. Raymond <esr@thyrsus.com>. For updates, see the bogofilter project page.