I am a user, tester, and occasional developer of SpamAssassin. There's
a lot of semi-accurate information about SA floating around the Net.
First of all, SA does not use a single or "arbitrary" method of
identifying spam. Instead, it is an open platform with a number of
different techniques plugged into it (and it is extensible if you want
to write/add your own). SA techniques include:
* Internal pattern-matching on header & body parts
* Second-order (syntactic) analysis of patterns
* Analysis of embedded code (JavaScript, HTML, etc.)
* Automatic (feedback-based) whitelist and blacklist processing
* Use of external blocking lists like MAPS RBL, DUL, Osirus, Ordb.org,
SpamCop, RFCI, et al.
* Use of Vipul's Razor (known spam database)
Future plans include hooks for Bayesian Filtering.
The rulesets are repeatedly run against corpora of spam, nonspam, and
mixed messages, and a genetic algorithm uses the results to determine
the score for each rule. The scores are absolutely not "arbitrary",
i.e., they do not come from a person deciding that a particular word or
phrase is "spam" or not.
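
To make the genetic-algorithm idea concrete, here is a toy sketch in
Python. It is not SA's actual scoring tool; the threshold, the sample
corpus, and the function names (fitness, evolve) are all made up for
illustration. Each candidate is a list of per-rule scores, and its
fitness is simply how many corpus messages it classifies correctly.

```python
import random

THRESHOLD = 5.0  # assumed spam cutoff, chosen only for this sketch

# Hypothetical corpus: (which of 3 rules fired, is the message spam?)
SAMPLE_CORPUS = [
    ((True, True, False), True),
    ((False, True, True), True),
    ((True, False, False), False),
    ((False, False, True), False),
    ((False, False, False), False),
    ((True, True, True), True),
]

def fitness(scores, corpus):
    """Count the messages this score vector classifies correctly."""
    correct = 0
    for hits, is_spam in corpus:
        total = sum(s for s, hit in zip(scores, hits) if hit)
        if (total >= THRESHOLD) == is_spam:
            correct += 1
    return correct

def evolve(corpus, n_rules, generations=50, pop_size=20, seed=0):
    """Return the best score vector found after a few generations."""
    rng = random.Random(seed)
    # Start with one all-zero vector plus random candidates.
    population = [[0.0] * n_rules] + [
        [rng.uniform(-5, 5) for _ in range(n_rules)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, corpus), reverse=True)
        survivors = population[: pop_size // 2]  # keep the best half unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Uniform crossover: take each score from one parent or the other.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Mutation: nudge one randomly chosen score.
            child[rng.randrange(n_rules)] += rng.gauss(0, 1)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda s: fitness(s, corpus))

best = evolve(SAMPLE_CORPUS, n_rules=3)
print(best, fitness(best, SAMPLE_CORPUS))
```

Because the best half of each generation survives unchanged, the best
fitness never gets worse from one generation to the next. No person
picks the numbers; the corpus does.
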
The reason that SA works so well -- and I believe that it's the best at
what it does -- is that it recognizes there is no one "best" way to
identify spam. There are multiple techniques with varying degrees of
success, and if you combine them all and let a self-correcting feedback
technique determine the score (the likelihood of a message being spam),
you get a very high degree of success. On top of that, any user can
override individual rules and scores to meet his/her own needs.
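
As a rough illustration of combining several tests into one score, with
per-user overrides, here is a minimal Python sketch. The rule names,
scores, threshold, and the score_message function are hypothetical and
are not SA's real rules or API; the point is only that each test that
fires adds its score and the total is compared to a cutoff.

```python
import re

# Hypothetical rules: (name, test function, default score).
# Names and scores are invented for this sketch.
RULES = [
    ("SUBJ_ALL_CAPS", lambda m: m["subject"].isupper(), 1.5),
    ("BODY_FREE_MONEY",
     lambda m: re.search(r"free money", m["body"], re.I) is not None, 2.0),
    ("HAS_JAVASCRIPT", lambda m: "<script" in m["body"].lower(), 2.5),
    ("SENDER_WHITELISTED",
     lambda m: m["from"] in m.get("whitelist", ()), -5.0),
]

def score_message(msg, overrides=None, threshold=5.0):
    """Return (total score, names of rules that fired, is_spam)."""
    overrides = overrides or {}
    defaults = {name: score for name, _, score in RULES}
    hits = [name for name, test, _ in RULES if test(msg)]
    # A user override replaces the default score for that rule.
    total = sum(overrides.get(name, defaults[name]) for name in hits)
    return total, hits, total >= threshold

msg = {
    "subject": "FREE MONEY NOW",
    "body": "Get free money <script>x</script>",
    "from": "someone@example.com",
}
print(score_message(msg))
# A user who does not mind JavaScript can zero out that rule:
print(score_message(msg, overrides={"HAS_JAVASCRIPT": 0.0}))
```

A negative score (the whitelist rule here) pulls a message back toward
nonspam, which is how one sketch can cover both blacklist-style and
whitelist-style tests.
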
--
Michael C. Berch
mcb@postmodern.com