Great Circle Associates List-Managers
(October 1997)
 


Subject: Re: validity of Message-Id's
From: "Ronald F. Guilmette" <rfg @ monkeys . com>
Date: Sat, 25 Oct 1997 18:03:39 -0700
To: List-Managers @ GreatCircle . COM
In-reply-to: Your message of 17 Oct 1997 10:42:47 -0400. <w4k9fc1me0.fsf@loiosh.kei.com>
Reply-to: rfg @ monkeys . com


In message <w4k9fc1me0.fsf@loiosh.kei.com>, 
Christopher Davis <ckd@loiosh.kei.com> wrote:

>RFG> == Ronald F Guilmette <rfg@monkeys.com>
>
> RFG> blacklisting this particular header gives essentially ZERO false
> RFG> positives but it kills LOTS of actual spam.
>
>A very good way to cut down on false positives, in my experience, is to
>use a score-based system.

Forgive me, but I have already built and tested a score-based system,
and it did not work well at all.

Is it really necessary to compute a complicated (and potentially ambiguous)
score value from some piece of mail from Cyberpromo or nevwest?  My own
experience is that attempting to do so is overkill and that it in fact
tends to produce more errors (both false positives and false negatives).

>X-UIDL is a strong but not certain indicator;

Having myself checked on the order of 60,000 messages, I can say with some
certainty that it is in fact a certain indicator of spam.  I have gotten
zero false positives checking for this.  (Previously I reported having
one false positive on X-UIDL, but a review of my log files indicates that
I was wrong about that, and the false positive in question was in fact
due to a different header altogether.)

>so are lots of exclamation points

I agree that body text is _always_ ambiguous, and that is why my own
filter does not perform any examination or checking on either body text
or on Subject: line text.  It is far too easy to get false positives
by doing that.
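
To make the header-only approach concrete, here is a minimal sketch in
Python.  It is not the code of my own filter; the function name and the
sample messages are hypothetical, made up for illustration.  It assumes
only that the raw message is available as text, and it looks at nothing
but the header block:

```python
from email import message_from_string


def is_spam_by_header(raw_message: str) -> bool:
    """Return True if the message carries an X-UIDL: header.

    Only the header block is consulted; the body and the Subject: text
    are never inspected.
    """
    msg = message_from_string(raw_message)
    # Header-name lookup on a Message object is case-insensitive.
    return "X-UIDL" in msg


# Hypothetical sample: X-UIDL appears as a real header.
sample_spam = (
    "From: someone@example.com\n"
    "X-UIDL: 1234567890abcdef\n"
    "Subject: hello\n"
    "\n"
    "Perfectly calm body text.\n"
)

# Hypothetical sample: noisy Subject, and X-UIDL appears only in the body.
sample_ham = (
    "From: friend@example.com\n"
    "Subject: FREE!!! lunch on Friday\n"
    "\n"
    "X-UIDL: this line is body text, not a header\n"
)
```

Note that the second sample is not flagged even though its Subject: line
is full of exclamation points and its body mentions X-UIDL, because
neither of those is ever examined.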

>procmail, among others, has a very flexible scoring system.  A snippet:
>
>*175^0 H ?? ^Received:.*\(may be forged\)
>*300^0 H ?? ^Received:.*\<(CLOAKED|unknown host)\>
>*100^1 H ?? ^Received:.*---

I must again emphasize that scoring is not generally necessary in order
to do good spam detection.  In fact your own example illustrates this.
If you see the word `CLOAKED' in a Received: header, do you really
believe that there is _any_ chance whatsoever that the message in question
is anything other than spam?
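
The point can be sketched in Python as follows.  This is only an
approximate translation of the three quoted procmail conditions, under
several assumptions: procmail's \< and \> are rendered as \b (close,
but not identical), the weights follow my reading of procmail's w^x
notation (exponent 0 counts a condition once, exponent 1 counts every
match), and THRESHOLD is a hypothetical cutoff, since the quoted
snippet does not show one.

```python
import re

# Approximate translations of the quoted conditions, applied to the text
# of each Received: header rather than to the whole header block.
MAY_BE_FORGED = re.compile(r"\(may be forged\)")
CLOAKED_OR_UNKNOWN = re.compile(r"\b(CLOAKED|unknown host)\b")
DASHES = re.compile(r"---")

THRESHOLD = 300  # hypothetical cutoff; not given in the quoted snippet


def score(received_headers):
    """Score-based check: sum the weights of the conditions that match."""
    total = 0
    if any(MAY_BE_FORGED.search(h) for h in received_headers):
        total += 175  # 175^0: counted once
    if any(CLOAKED_OR_UNKNOWN.search(h) for h in received_headers):
        total += 300  # 300^0: counted once
    # 100^1: counted once for every matching Received: header
    total += 100 * sum(1 for h in received_headers if DASHES.search(h))
    return total


def hard_rule(received_headers):
    """Single-indicator check: the word CLOAKED alone decides."""
    return any("CLOAKED" in h for h in received_headers)
```

Under these assumptions, any message with CLOAKED in a Received: header
is rejected by both functions, so for that indicator the arithmetic in
score() adds nothing that hard_rule() does not already decide.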

>"Absolutely
>certain" indicators (such as the IEMMC X- headers) are given scores high
>enough to send their messages to the junk bucket without further ado.

Right, and as it turns out, the vast majority of all spam contains more
than enough ``absolutely certain'' indicators that scoring is not even
necessary, and in fact may be counterproductive.

X-UIDL: is one such unambiguous indicator of spam.

-- Ron Guilmette, Roseville, California ---------- E-Scrub Technologies, Inc.
-- Deadbolt(tm) Personal E-Mail Filter demo: http://www.e-scrub.com/deadbolt/
-- Wpoison (web harvester poisoning) - demo: http://www.e-scrub.com/wpoison/

