Spam filtering with Bogofilter

I prefer to manage my mail on my own server rather than use a service like Gmail. But I receive a lot of spam every day, which is a real pain. Whatever you do to hide your email address, spammers always somehow manage to get through.

Until now, I’ve been using client-side spam filtering with Thunderbird. It is a simple solution but has a number of drawbacks: it uses more bandwidth and, above all, if you read your mail with a client on another computer or via webmail, spam is not filtered. So a few days ago, I finally installed Bogofilter on my server, which turned out to be very easy. I chose this solution for its statistical approach (Bayesian filtering) and because it is said to be faster than SpamAssassin (Bogofilter is written in C, SpamAssassin in Perl).

My environment was: Debian Sarge, Postfix, courier-imap and procmail. My mail is stored in Maildir format (one file per message). The first step was to feed the database with spam and ham (non-spam). Fortunately, I keep all my spam (deleting spam is generally a bad idea, since it is useful training material). I wrote the following little script (first_time.sh) to do so:

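#!/bin/bash
# Train bogofilter on the existing mail: spam first, then ham.

echo "Feeding db with spam"

# Read one path per line so that unusual file names are not word-split.
find ~/Maildir/.spam.checked/{cur,new,tmp} -type f |
while IFS= read -r mail
do
        bogofilter -s < "$mail"
done

echo "Feeding db with ham"

# Ham is everything outside the spam folders. Matching on the host name
# keeps only real message files (Maildir file names contain the host).
find ~/Maildir/ -type f | grep -v spam | grep -F "bahamut.ffworld.com" |
while IFS= read -r mail
do
        bogofilter -n < "$mail"
done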

Basically, this script feeds the database with all the spam from my spam/checked/ directory and with the ham from the other directories. On my server (2.6 GHz Celeron with 1 GB RAM), it took about 5 hours to process my stock of 30,000 spam and 10,000 ham messages, and the resulting database (in ~/.bogofilter/) was 30 MB.
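
Since a full training run takes hours, it can be worth checking first that the two find selections pick up what you expect. Here is a minimal sketch of a dry run that only counts files and never calls bogofilter. The function name count_training_set and its two optional arguments are my own illustration, not part of Bogofilter; the defaults simply mirror the paths and host name used in the script above.

```shell
#!/bin/bash
# Dry run: count the messages the training script would register.
# count_training_set is a hypothetical helper, not a bogofilter command.
#   $1  Maildir root (assumed default: ~/Maildir)
#   $2  string that identifies real message files (assumed default: host name)
count_training_set() {
    local maildir="${1:-$HOME/Maildir}"
    local pattern="${2:-bahamut.ffworld.com}"
    local spam ham

    # Spam: every file in cur, new and tmp of the checked spam folder.
    spam=$(find "$maildir/.spam.checked/cur" \
                "$maildir/.spam.checked/new" \
                "$maildir/.spam.checked/tmp" -type f 2>/dev/null | wc -l)

    # Ham: files whose path does not contain "spam" and whose name
    # contains the pattern, exactly like the selection in the script.
    ham=$(find "$maildir" -type f 2>/dev/null | grep -v spam | grep -c -F -- "$pattern")

    # $(( )) strips the padding some wc implementations add.
    echo "spam=$((spam)) ham=$((ham))"
}
```

Calling count_training_set with no argument prints a single line of the form spam=N ham=M for ~/Maildir. Note that grep -v spam also drops any ham whose path merely contains the word "spam", which is the same behaviour as the training script.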

Then I used the following rules in my procmailrc:

:0fw
| bogofilter -u -e -p

# if bogofilter failed, return the mail to the queue, the MTA will
# retry to deliver it later
# 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h

:0e
{ EXITCODE=75 HOST }

# file the mail in the waiting folder if it is spam or unsure.

:0:
* ^X-Bogosity: (Spam|Unsure), tests=bogofilter
.spam.waiting/

Bogofilter filters each mail and adds a header to it. Here is an example of the header for a 100% sure spam: X-Bogosity: Spam, tests=bogofilter, spamicity=1.000000, version=0.94.4. I put the procmail rules for mailing lists before those rules so that Bogofilter does not have to process mail coming from lists.
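
To see what Bogofilter decided for a given message, you can read that header back from the file. The sketch below is only an illustration: bogosity_of and would_be_filed are names I made up, and the pattern assumes the header keeps the exact layout shown in the example above.

```shell
#!/bin/bash
# Print "<verdict> <spamicity>" for the message read on stdin.
# Hypothetical helper; assumes the X-Bogosity layout shown above.
bogosity_of() {
    # Stop at the first empty line so that only the headers are scanned.
    sed -n -e '/^$/q' \
        -e 's/^X-Bogosity: \([A-Za-z]*\), tests=bogofilter, spamicity=\([0-9.]*\).*/\1 \2/p' |
    head -n 1
}

# Succeed when the procmail rule for the waiting folder would match,
# that is when the verdict is Spam or Unsure.
would_be_filed() {
    case "$(bogosity_of)" in
        "Spam "*|"Unsure "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, bogosity_of < some_message_file prints "Spam 1.000000" for the header quoted above, and would_be_filed < some_message_file can be used in an if test.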

Spam is then moved to the spam/waiting/ directory thanks to those headers. I use this directory to check that Bogofilter did not make a mistake. I also use two directories, spam/false_spam/ and spam/false_ham/, and move mails that were wrongly classified into them with my email client. Fortunately, Bogofilter usually guesses correctly. Most importantly, so far I have never had a good mail classified as spam. The results are very good.
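
On disk, these folders are the Maildir subdirectories .spam.waiting, .spam.checked, .spam.false_spam and .spam.false_ham that the scripts refer to. If they do not exist yet, something along these lines should create them. This is only a sketch: make_spam_folders is a hypothetical helper, and it does nothing beyond creating the cur, new and tmp subdirectories, so creating the folders from your mail client or with Courier's own tools may be the better choice on a real server.

```shell
#!/bin/bash
# Create the four spam folders with the usual cur/new/tmp layout.
# make_spam_folders is a hypothetical helper; $1 is the Maildir root
# (assumed default: ~/Maildir).
make_spam_folders() {
    local maildir="${1:-$HOME/Maildir}"
    local folder sub

    for folder in waiting checked false_spam false_ham; do
        for sub in cur new tmp; do
            # mkdir -p is harmless if the directory already exists.
            mkdir -p "$maildir/.spam.$folder/$sub" || return 1
        done
    done
}
```

Running make_spam_folders with no argument works on ~/Maildir; passing another path is handy for trying it out on a scratch directory first.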

Finally, I use this script (which I put in cron.daily) to correct the mistakes:

#!/bin/bash
maildir=/home/matt/Maildir
checkeddir=$maildir/.spam.checked/cur/

# False hams: spam that was delivered as ham.
# -Ns unregisters the message as ham and registers it as spam.

find "$maildir"/.spam.false_ham/{cur,new,tmp} -type f |
while IFS= read -r mail
do
        bogofilter -Ns < "$mail"
        mv "$mail" "$checkeddir"
done

# False spams: ham that was filed as spam.
# -Sn unregisters the message as spam and registers it as ham.

find "$maildir"/.spam.false_spam/{cur,new,tmp} -type f |
while IFS= read -r mail
do
        bogofilter -Sn < "$mail"
        mv "$mail" "$maildir/cur/"
done

# Moving waiting mails to checked mails after one week

find "$maildir"/.spam.waiting/{cur,new,tmp} -type f -mtime +7 |
while IFS= read -r mail
do
        mv "$mail" "$checkeddir"
done

In the end, my mail is filtered and I correct the mistakes just as I did with Thunderbird, but without the drawbacks.
