THEORY OF OPERATION

Bogofilter treats its input as a bag of tokens. Each token is checked against "ham" and "spam" wordlists, which maintain counts of the number of times it has occurred in non-spam and spam mails. These counts are used to compute the probability that a mail in which the token occurs is spam. After probabilities for all input tokens have been computed, a fixed number of the probabilities that deviate furthest from average are combined using Bayes's theorem on conditional probabilities. If the computed probability that the input is spam exceeds 0.9, bogofilter returns 0; otherwise it returns 1.
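
The combination step can be illustrated with a short Python sketch. This is not bogofilter's own code. The per-token estimate, the clamping to [0.01, 0.99], the neutral value of 0.5, and the choice of 15 tokens are assumptions modeled loosely on Graham's paper; only the 0.9 cutoff and the 0/1 return values come from the text above. All function names are hypothetical.

```python
from math import prod

SPAM_CUTOFF = 0.9   # threshold stated in the text
KEEP = 15           # assumed; the text says only "a fixed number"
NEUTRAL = 0.5       # assumed value that counts as "average"


def token_probability(ham_count, spam_count, ham_msgs, spam_msgs):
    """Estimate P(spam | token) from wordlist counts (assumed formula).

    Assumes both corpora contain at least one message.
    """
    ham_freq = ham_count / ham_msgs
    spam_freq = spam_count / spam_msgs
    if ham_freq + spam_freq == 0:
        return NEUTRAL  # token never seen: no evidence either way
    p = spam_freq / (ham_freq + spam_freq)
    # Clamp so that a single token can never force certainty.
    return min(0.99, max(0.01, p))


def combine(probs, keep=KEEP):
    """Combine the `keep` probabilities furthest from NEUTRAL."""
    extreme = sorted(probs, key=lambda p: abs(p - NEUTRAL), reverse=True)[:keep]
    spam = prod(extreme)
    ham = prod(1 - p for p in extreme)
    return spam / (spam + ham)


def exit_status(probs):
    """Return 0 for spam and 1 for non-spam, as described above."""
    return 0 if combine(probs) > SPAM_CUTOFF else 1
```

Two strongly spammy tokens are enough to push the combined value past the cutoff in this sketch, which is why only the most extreme probabilities need to be considered.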

While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper "A Plan for Spam" is recommended reading.

This program substantially improves on Paul's proposal by doing smarter lexical analysis. In particular, hostnames and IP addresses are retained as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are discarded so as not to bloat the word lists. Lex's Swiss-army-knife nature rises again.
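
The idea behind this tokenizing can be sketched in Python with regular expressions. The real lexer is written in lex and its rules are not reproduced here; every pattern and name below is a hypothetical stand-in, chosen only to show hostnames and IP addresses surviving as single tokens while dates and message-IDs are dropped.

```python
import re

# Hypothetical patterns for illustration only.
# Crude: this also removes any other <local@domain> in angle brackets.
MESSAGE_ID = re.compile(r"<[^<>@\s]+@[^<>\s]+>")
DATE_HEADER = re.compile(r"^Date:.*$", re.MULTILINE | re.IGNORECASE)
TOKEN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"              # dotted-quad IP address, kept whole
    r"|[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"   # hostname, kept whole
    r"|[A-Za-z][A-Za-z'$-]*"                # ordinary word
)


def tokenize(text):
    """Return lowercased tokens, discarding dates and message-IDs."""
    text = MESSAGE_ID.sub(" ", text)   # drop <id@host> strings
    text = DATE_HEADER.sub(" ", text)  # drop whole Date: header lines
    return [t.lower() for t in TOKEN.findall(text)]
```

Because the IP and hostname alternatives are tried before the ordinary-word alternative, "mail.example.com" comes out as one token rather than three.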

The input may be one message or many. Messages are broken up on From_ lines, that is, mbox-style lines beginning with "From " followed by a space-separated envelope. The algorithm is relatively insensitive to message miscounts.
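
A minimal Python sketch of splitting on From_ lines follows. It assumes the usual mbox convention that a From_ line starts with the five characters "From " at the beginning of a line; the function name is hypothetical and this is not bogofilter's own splitting code.

```python
def split_messages(text):
    """Split mbox-style text into messages at lines starting with 'From '."""
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        # A From_ line closes the message collected so far, if any.
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages
```

A body line that happens to begin with "From " and was not escaped by the mail software would be counted as a new message here. That is the kind of miscount the text says the algorithm tolerates.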

For speed, MIME and other attachments are not decoded. Experience from watching the token streams suggests that spam with enclosures invariably gives itself away through cues in the headers and non-enclosure parts.