Spam, What Spam?

I get a lot of email, most of it spam. Fortunately, my email setup includes procmail and SpamAssassin, so I don't actually see any of that spam. Working in conjunction with SpamAssassin, my procmail rules move every email that fits the criteria for spam (along with other abusive messages) into a "Spam" mbox file, and I never have to see them. These captured messages are later fed into SpamAssassin's sa-learn to reinforce the Bayesian spam filtering system, and are then discarded, unread.

But neither SpamAssassin nor my additional procmail rules are perfect, and about 1% of the mail in my spamtrap is actually "ham" (wanted, good email). So, at intervals, I have to rescue stray emails from the spamtrap and manually place them in my inbasket. Rather than read through all the spamtrapped email to locate potential "keepers", I let my system check them for me. This is a rudimentary check, which only summarizes each email for me; if I recognize an email as good from that summary, I rescue it from the spamtrap.

Look and Feel

So, what do I report, and how do I do it? The what is simple enough; I report the date that the spam was trapped, the reported sender of the email, the target email address (if it is not my usual address), the Subject, and which component caught the spam. A script runs this report daily, and emails me the results (being careful not to get caught in the spamtrap itself). The resulting daily email looks something like...

Principles of Operation

I wrote these spam archiving and reporting scripts to fit within the framework of email management that I had built for my network. To understand the scripts, you must understand how my email system works with spam.

Email destined for my network arrives at my server's sendmail MTA. Sendmail will reject any connection that tries to send mail to an address not supported on my network, and this eliminates a fair bit of mis-addressed email, along with the usual attempts to use my system as an open relay.

Sendmail hands any local email that passes these initial filter criteria off to procmail for processing and delivery; as my users all have email accounts with this server, sendmail drops most incoming email directly into procmail.

I divide my procmail rules into two parts:

  1. general processing (in /etc/procmailrc) that applies to all emails, and
  2. per-user processing (in $HOME/.procmailrc) that can vary depending on the recipient.

One of the general rules invokes SpamAssassin to classify the incoming email. SpamAssassin also divides its ruleset into general rules and user-specific rules. At this stage, even though SpamAssassin applies both the general rules and the local recipient's user-specific rules, the procmail recipe only marks the email as SPAM or HAM. The general procmail rule for SpamAssassin looks like:

#-------------------------------------------------------------------------
# Grade the mail for spam using user's spamassassin settings
:0 fw
| /usr/bin/spamc -u $LOGNAME
#-------------------------------------------------------------------------

If the user so wishes, their user-specific procmail recipes can contain further processing. In my case, my user-specific procmail rules include a rule to check the SPAM/HAM markings left by SpamAssassin, and perform special handling on those identified as SPAM. For those SPAM emails, my user-specific procmail rule will add an "X-Filter" header indicating that the email was intercepted by SpamAssassin, and reroute the email so that, instead of delivery, it ends up concatenated to the end of an mbox file called $HOME/spam/Today.

This user-specific procmail recipe looks like:

#-------------------------------------------------------------------------
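# File anything SpamAssassin flagged as spam into today's spam mbox.
# The ^ anchors the test to the start of a header line.
:0 H : spam.lock
* ^X-Spam-Flag: YES
|  formail -A "X-Filter: SpamAssassin" >>$HOME/spam/Today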
#-------------------------------------------------------------------------

An additional set of user-specific procmail recipes handles the messages that SpamAssassin has marked as HAM. Although SpamAssassin permits user-specific spam rules, the practice is discouraged because of the security exposure it creates. This means that, while the end user may be able to identify some SPAM messages by specific content, he or she may not be able to add that sort of classification to SpamAssassin's automated rules. To get around this restriction, the user can provide additional filtering procedures that are initiated by their user-level procmail recipes. Should an email match one of these filters, the user-level procmail recipe again adds an "X-Filter" header indicating that the email was intercepted (this time by the user's own test), and concatenates it to the end of the $HOME/spam/Today mbox.

This sort of procmail recipe would look something like:

#-------------------------------------------------------------------------
:0 B : spam.lock
* ? egrep -is -f $HOME/procmail.keywords
|  formail -A "X-Filter: procmail.keywords" >>$HOME/spam/Today
#-------------------------------------------------------------------------

So, why file spam into an mbox? Well, spam filtering isn't 100% accurate; you do get some "false positives". Since I don't want users to lose important emails that just happen to get misclassified as "spam", the spam intercept places the spam in an mbox file, and reports on the contents. From that contents report, the user can determine whether the mbox contains a misclassified "ham" message, extract the email, and adjust their spam filters so that further emails of that nature do not get caught in the spamtrap.

The system keeps two weeks (14 days) of daily spam mbox files, so that, should the user delay in extracting ham from them, the contents won't be lost. The process deletes mbox files that are more than two weeks old, so as to limit the amount of space dedicated to holding spam. However, before deleting the old spam mbox files, the process feeds the spam into SpamAssassin's sa-learn facility, in order to give the SA Bayesian filter more data to work with.

The $HOME/spam/ directory contains a number of files. First off, there are the spam mbox files for each day. They all follow a specific naming convention: spam.YYYY-MM-DD, where YYYY-MM-DD is the numeric date of the day on which the spam was captured. Also in this directory is a symlink, Today, which links to the spam.YYYY-MM-DD file for the current day. Finally, there is the Yesterday symlink, which links to the spam.YYYY-MM-DD file for the previous day.

~/spam $ ls -l 
total 3496
lrwxrwxrwx 1 lpitcher users     15 2010-08-09 00:00 Today -> spam.2010-08-09
lrwxrwxrwx 1 lpitcher users     15 2010-08-09 00:00 Yesterday -> spam.2010-08-08
-rw-r--r-- 1 lpitcher users 234279 2010-07-27 09:58 spam.2010-07-26
-rw-r--r-- 1 lpitcher users 290229 2010-07-28 12:28 spam.2010-07-27
-rw-r--r-- 1 lpitcher users 249098 2010-07-28 22:34 spam.2010-07-28
-rw-r--r-- 1 lpitcher users 244065 2010-07-29 22:56 spam.2010-07-29
-rw-r--r-- 1 lpitcher users 255498 2010-07-31 08:11 spam.2010-07-30
-rw-r--r-- 1 lpitcher users 238583 2010-08-01 10:54 spam.2010-07-31
-rw-r--r-- 1 lpitcher users 140870 2010-08-02 08:30 spam.2010-08-01
-rw-r--r-- 1 lpitcher users 269474 2010-08-03 09:40 spam.2010-08-02
-rw-r--r-- 1 lpitcher users 210030 2010-08-04 08:01 spam.2010-08-03
-rw-r--r-- 1 lpitcher users 292730 2010-08-05 09:08 spam.2010-08-04
-rw-r--r-- 1 lpitcher users 251798 2010-08-06 07:40 spam.2010-08-05
-rw-r--r-- 1 lpitcher users 200868 2010-08-07 09:22 spam.2010-08-06
-rw-r--r-- 1 lpitcher users 165157 2010-08-07 22:29 spam.2010-08-07
-rw-r--r-- 1 lpitcher users 229074 2010-08-08 23:46 spam.2010-08-08
-rw-r--r-- 1 lpitcher users 216626 2010-08-09 19:55 spam.2010-08-09
~/spam $

The code

Prerequisites

The following software packages are general prerequisites for processing email:

  • sendmail or some other MTA, to receive incoming email,
  • procmail (including formail) to permit site-specific processing of local delivery email,
  • SpamAssassin, (including spamd, spamc, and sa-learn), to analyze and classify spam email

The following utilities are used by the MoveSpam.sh and SummarizeSpam.sh scripts:

  • bash shell,
  • date date utility,
  • touch file utility,
  • ln file management utility,
  • rm file management utility,
  • fastmail email batch MUA
  • sa-learn SpamAssassin Bayesian learning tool
  • formail procmail mail formatter
  • awk pattern matching and scripting language

MoveSpam.sh

A user crontab entry runs the MoveSpam.sh script just at midnight each day. This script controls all the archiving, reporting, and cleanup functions related to the management of the daily spam files.

If the user started the script with an externally-set SPAMDIR environment variable, the script uses that value as the path to the spam archive directory. If no SPAMDIR environment variable is present, the script defaults the spam archive directory path to $HOME/spam.
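The usual shell idiom for this kind of fallback is the ${VAR:-default} expansion. The sketch below wraps it in a small function (the name spam_dir is hypothetical) purely to show the behaviour; it is not necessarily how MoveSpam.sh itself is written.

```shell
#!/bin/bash
# spam_dir (hypothetical helper): print the spam archive directory.
# Honours an externally-set SPAMDIR; if SPAMDIR is unset or empty,
# falls back to $HOME/spam.
spam_dir() {
    printf '%s\n' "${SPAMDIR:-$HOME/spam}"
}
```

Inside a script, the same effect can be had in one line with SPAMDIR="${SPAMDIR:-$HOME/spam}". Note that the :- form treats an empty SPAMDIR the same as an unset one.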

First, the script creates a new, empty spam archive, using the current date in the name, and then relinks the Today symlink to this file. From this point on, any email captured to the spamtrap (via the immutable procmail rules that write to $HOME/spam/Today) will wind up in this new file, and not in the previous day's file. At this point, the script also adjusts the Yesterday symlink to point at the (now) previous day's dated spam file.
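As a rough illustration of this rotation step, here is a minimal sketch. It assumes GNU date (for "-d yesterday"), and the function name rotate_spam is made up for this example; the real MoveSpam.sh may go about it differently.

```shell
#!/bin/bash
# rotate_spam DIR (hypothetical helper): start a new dated mbox in DIR
# and repoint the Today and Yesterday symlinks.
# Assumes GNU date, which understands "-d yesterday".
rotate_spam() {
    local dir=$1
    local today yesterday
    today=$(date +%Y-%m-%d)
    yesterday=$(date -d yesterday +%Y-%m-%d)

    # Create today's mbox; touch leaves the contents of an existing file alone.
    touch "$dir/spam.$today"

    # -s makes a symlink, -f replaces an existing link, and -n stops ln from
    # following an existing link. The targets are relative to DIR, matching
    # the directory listing shown earlier.
    ln -sfn "spam.$today" "$dir/Today"
    ln -sfn "spam.$yesterday" "$dir/Yesterday"
}
```

Called as rotate_spam "$SPAMDIR" just after midnight, this leaves Today pointing at a fresh file and Yesterday pointing at the file that was being filled a moment earlier.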

Now, the script invokes the SummarizeSpam.sh script, passing it a parameter of yesterday, and piping the script's stdout to the fastmail program in order to generate the email report. SummarizeSpam.sh will generate the daily spam report (in this case, for the previous day's spam), and fastmail will deliver the report to the end-user.

Finally, the script will locate all spam.YYYY-MM-DD files with dates more than 14 days prior to the current date, and individually feed their contents into SpamAssassin's sa-learn program as spam. Each file given to sa-learn is then deleted, maintaining a 14-day archive of unprocessed spam. This two-week window gives the end user plenty of time to rescue misclassified ham emails from the spam files before they are used for SpamAssassin reinforcement and deleted.
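The expiry step can be sketched along the following lines. This is an illustration under stated assumptions, not the shipped script: it assumes GNU date, relies on the spam.YYYY-MM-DD naming convention, and uses sa-learn's --spam and --mbox options to learn from a whole mbox file. The function name expire_spam and the LEARN override (which lets the loop be tried out without touching the Bayes database) are both hypothetical.

```shell
#!/bin/bash
# expire_spam DIR [DAYS] (hypothetical helper): feed mboxes dated more than
# DAYS days ago (default 14) to sa-learn as spam, then delete them.
# Assumes GNU date and the spam.YYYY-MM-DD naming convention.
# LEARN is a made-up override hook, e.g. LEARN=true for a dry run of the
# learning step.
expire_spam() {
    local dir=$1 days=${2:-14}
    local cutoff file stamp
    cutoff=$(date -d "$days days ago" +%Y-%m-%d)

    for file in "$dir"/spam.[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
        [ -f "$file" ] || continue      # the glob matched nothing
        stamp=${file##*/spam.}          # strip the path and the "spam." prefix
        # YYYY-MM-DD dates sort lexically, so a string comparison is enough.
        if [[ $stamp < $cutoff ]]; then
            # Delete the mbox only if the learning step succeeded.
            ${LEARN:-sa-learn --spam --mbox} "$file" && rm -f -- "$file"
        fi
    done
}
```

Because the comparison is strict, a file dated exactly 14 days ago is kept, which agrees with the directory listing shown earlier (2010-07-26 is still present on 2010-08-09).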

SummarizeSpam.sh

The MoveSpam.sh script invokes SummarizeSpam.sh to summarize any spam messages caught in the previous day's spam mbox file. SummarizeSpam.sh accepts a single command-line argument indicating the date of the mbox to summarize (in YYYY-MM-DD format, or as the string "Yesterday", "yesterday", "Today", or "today").

If the user started the script with an externally-set SPAMDIR environment variable, the script uses that value as the path to the spam archive directory. If no SPAMDIR environment variable is present, the script defaults the spam archive directory path to $HOME/spam. Similarly, if the user started the script with an externally-set AWKDIR, the script uses that value as the path to the awk script directory in which the SummarizeEmail.awk script will be found. If no AWKDIR is present, the script defaults the script directory to /usr/local/share/scripts.

The script first evaluates the given argument to determine the date (in YYYY-MM-DD format) of the target spam mbox ($SPAMDIR/spam.YYYY-MM-DD). If it cannot locate the mbox, the script will issue an error message to stderr, and terminate with an exit code of 1.
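One way to express this argument handling is a case statement, sketched below. The function names resolve_spam_date and find_spam_mbox are invented for the example, GNU date is assumed, and rejecting an unrecognised argument with a non-zero status is my assumption (the text above only specifies the behaviour for a missing mbox).

```shell
#!/bin/bash
# resolve_spam_date ARG (hypothetical helper): print the YYYY-MM-DD date that
# ARG refers to. Assumes GNU date. Returns non-zero for anything else.
resolve_spam_date() {
    case $1 in
        Today|today)
            date +%Y-%m-%d ;;
        Yesterday|yesterday)
            date -d yesterday +%Y-%m-%d ;;
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
            printf '%s\n' "$1" ;;
        *)
            return 1 ;;
    esac
}

# find_spam_mbox ARG (hypothetical helper): print the path of the mbox for
# ARG, or write an error to stderr and return 1 if it cannot be found.
find_spam_mbox() {
    local dir=${SPAMDIR:-$HOME/spam} day mbox
    day=$(resolve_spam_date "$1") || { echo "bad date: $1" >&2; return 1; }
    mbox=$dir/spam.$day
    [ -f "$mbox" ] || { echo "no such mbox: $mbox" >&2; return 1; }
    printf '%s\n' "$mbox"
}
```

A caller would then write something like mbox=$(find_spam_mbox "$1") || exit 1 to get the exit code of 1 described above.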

However, if the mbox file exists, the script writes an appropriate title to stdout, and invokes formail (using the -s postprocessing option and $AWKDIR/SummarizeEmail.awk) to summarize the contents of the selected spam mbox to stdout.

The -s argument to formail causes formail to split the mbox into individual email messages, and feed each message separately into a user-specified program. In our case, we pass "-s awk -f $AWKDIR/SummarizeEmail.awk", which causes formail to pass emails from our spam mbox individually into the SummarizeEmail.awk script.
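Put together, the invocation described here amounts to something like the following sketch. The wrapper function summarize_mbox is hypothetical; the formail arguments and the default AWKDIR are the ones given in the text.

```shell
#!/bin/bash
# summarize_mbox MBOX (hypothetical wrapper): run the awk summarizer once for
# every message in MBOX.
# formail -s splits the mbox it reads on stdin into individual messages and
# starts the command that follows -s once per message, with that message on
# the command's stdin.
summarize_mbox() {
    local awkdir=${AWKDIR:-/usr/local/share/scripts}
    formail -s awk -f "$awkdir/SummarizeEmail.awk" < "$1"
}
```

Since each message gets its own awk process, the awk script only ever has to deal with a single email, which keeps its logic simple.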

Once the summary is complete, the script then terminates with an exit code of 0.

SummarizeEmail.awk

The SummarizeSpam.sh script passes each individual spam email into the SummarizeEmail.awk awk script. This script will process and report just the email headers, and will ignore the email body.

The script sets the default values for the "To" and "Subject" fields. It is entirely possible that a spam email will not contain one (or both) of these fields, and as the script will always report these values, we need reasonable defaults for them. The script will override these values as it encounters the appropriate email header fields.

The script captures the email address associated with both the "envelope" From and header From: fields, and converts each to lower case. It also captures and converts to lowercase the email address in the To: header.

If present, the script captures the entire Subject: header, overwriting the default "Subject".

If any of our home-grown spam detection logic has marked this email, the script will find an X-Filter: header, and capture the value which indicates the name of the filter that this email matched.

When the script runs into the first line of the email body, it will stop capturing information, and generate the email report. This report describes:

  • The sender's email address, and the alias the email was sent under
  • The receiver's email address, if it isn't one of my public email addresses
  • The email Subject, and
  • The name of the filter which caught the email

Typically, this looks something like...

# From: toddferguson@fadmail.com AS santiagobruce@aktabutiken.com
  Subject: Subscribe for a new bonus package
  Caught by: SpamAssassin

# From: ibsn66@gmail.com
  To: Undisclosed-Recipients:;
  Subject: RE: HELLO BRETHREN
  Caught by: SpamAssassin
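To make the header processing described in this section concrete, here is a minimal sketch of such a summarizer, written as awk embedded in a shell function. It is not the SummarizeEmail.awk from the distribution. The function name summarize_message, its MYADDR parameter, the placeholder defaults, and the choice to print the envelope sender first with the header From: after "AS" are all assumptions made for the example. It also ignores folded (multi-line) headers and matches header names case-sensitively.

```shell
#!/bin/bash
# summarize_message [MYADDR] (hypothetical helper): read ONE email on stdin
# and print a short summary of its headers.
# MYADDR is a made-up parameter: the recipient address not worth reporting.
summarize_message() {
    awk -v me="${1:-}" '
    # Pull the address out of "Name <addr>" (or use the bare value), lowercased.
    function addr(s) {
        if (match(s, /<[^>]*>/)) s = substr(s, RSTART + 1, RLENGTH - 2)
        gsub(/^[ \t]+|[ \t]+$/, "", s)
        return tolower(s)
    }
    # Placeholder defaults (assumed), used when a header is missing.
    BEGIN { to = "(no To)"; subject = "(no Subject)"; filter = "(unknown)" }
    /^From /     { envelope = tolower($2); next }    # mbox envelope line
    /^From:/     { sub(/^From:/, ""); from = addr($0); next }
    /^To:/       { sub(/^To:/, ""); to = addr($0); next }
    /^Subject:/  { sub(/^Subject:[ \t]*/, ""); subject = $0; next }
    /^X-Filter:/ { sub(/^X-Filter:[ \t]*/, ""); filter = $0; next }
    /^$/         { exit }    # blank line: the headers are over, go to END
    END {
        line = "# From: " envelope
        if (from != "" && from != envelope) line = line " AS " from
        print line
        if (to != tolower(me)) print "  To: " to
        print "  Subject: " subject
        print "  Caught by: " filter
    }'
}
```

Fed a single message whose envelope sender and From: header differ, this prints a three-line entry in the same shape as the first example above; the To: line appears only when the recipient is not the address passed as MYADDR.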

The distribution files

The SummarizeSpam.1r0.tar.gz archive contains the following files:

Makefile
A rudimentary makefile, usable only to "make install" as root.
MoveSpam.sh
A shell script to archive spam mboxes, rolling old mboxes off to sa-learn. This script will invoke the SummarizeSpam.sh script to summarize the newly filled mbox after archiving it.
README.txt
A file containing some descriptive details about the package.
SummarizeEmail.awk
An awk script that will summarize the from and subject of a single email
SummarizeSpam.sh
A shell script to summarize the contents of a single mbox file. This script will invoke SummarizeEmail.awk on each single email message found in the specified mbox.
licence.txt
The GPL v2 licence that applies to MoveSpam.sh, SummarizeSpam.sh, and SummarizeEmail.awk
filelist.md5
MD5 checksum of all the files in this package, excluding filelist.md5
Attachment: SummarizeSpam.1r0.tar.gz (Binary Data, 9.67 KB)