Gauche:SpamFilter:English

scbayes (Gauche:SpamFilter)

(Rough English translation. More literal than natural, done mostly for the sake of the translator's practice.)

A spam filter based on Paul Graham's "A Plan for Spam" and "Better Bayesian Filtering". It is designed to be embedded in and used with scmail, a mail filter written in Scheme. Naturally, it handles Japanese mail as well.

Requirements

Gauche 0.6.8 or later
scmail 0.1 or later

Download

This is currently in an experimental phase; consider it alpha. In particular, loading the probability tables is unreasonably slow.

http://practical-scheme.net/vault/scbayes.tgz

Installation

Consult the file README.eucjp in the tarball.

Hacking

The present version is based on the results of Gauche:SpamFilter:予備実験 (preliminary experiment) and processes mail in the following manner:

Learning is currently performed in batches. Existing spam must first have been separated out with a different filter. Separate word probability dictionaries are kept for Japanese mail and for other mail.

In use, mail is first sorted according to scmail's rules, and only the remaining messages are judged for spamminess. (This is very important. There are a lot of spammy things in the mail system, so doing this first guards against false positives and also keeps the training data clean.)
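
The ordering described above (rules first, spam scoring only for what is left over) can be sketched roughly as follows. This is a minimal Python illustration, not scmail's actual code: the `refile` function, the way rules are represented, the folder names, and the 0.9 threshold are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical rule type: takes a message (here just a dict of headers)
# and returns a folder name, or None when the rule does not match.
Rule = Callable[[dict], Optional[str]]


def refile(msg: dict,
           rules: list[Rule],
           spam_probability: Callable[[dict], float],
           threshold: float = 0.9) -> str:
    """Return the destination folder for msg."""
    for rule in rules:
        folder = rule(msg)
        if folder is not None:
            # Matched an explicit rule: the message is never scored,
            # so it can neither be a false positive nor reach the spam folder.
            return folder
    # Only mail that no rule claimed is judged for spamminess.
    if spam_probability(msg) > threshold:
        return "spam"
    return "inbox"
```

With this ordering, a mailing-list message that happens to look spammy is filed by its rule before the Bayesian scorer ever sees it.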

MIME messages are tokenized part by part. application/* and image/* parts are skipped. Even text parts are sometimes encoded as quoted-printable, base64, or the like, so the Content-Transfer-Encoding is interpreted properly. If this isn't done, the dictionary accumulates a large quantity of base64 character strings.
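
As an illustration of the per-part handling described above, here is a small sketch using Python's standard `email` package. scbayes itself is written in Gauche Scheme, so this is not its implementation; the function name `text_parts` and the choice to return the decoded bytes of each part are assumptions for the example.

```python
from email import message_from_bytes

# Main MIME types whose parts are skipped entirely, as described in the text.
SKIPPED_MAIN_TYPES = {"application", "image"}


def text_parts(raw: bytes) -> list[bytes]:
    """Return the decoded payload of every leaf part worth tokenizing."""
    msg = message_from_bytes(raw)
    bodies = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.get_content_maintype() in SKIPPED_MAIN_TYPES:
            continue
        # decode=True undoes base64 / quoted-printable according to the
        # part's Content-Transfer-Encoding header.
        payload = part.get_payload(decode=True)
        if payload:
            bodies.append(payload)
    return bodies
```

Because the transfer encoding is undone before tokenizing, the tokenizer sees the actual words rather than long base64 strings.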

For mail with no charset specified, Japanese is assumed first; if that results in an error, the message is processed again as non-Japanese.
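
The try-Japanese-first fallback above might look something like this in Python. It is only a sketch: the list of candidate codecs, the Latin-1 fallback, and the name `decode_without_charset` are assumptions, and scbayes relies on Gauche's own character conversion rather than anything shown here.

```python
# Assumption: the Japanese encodings tried, in this order.
JAPANESE_CODECS = ("iso2022_jp", "euc_jp", "shift_jis")


def decode_without_charset(raw: bytes) -> tuple[str, str]:
    """Return (text, language) for a body that declares no charset."""
    for codec in JAPANESE_CODECS:
        try:
            # Note that pure ASCII also decodes cleanly here, so such mail
            # is handled on the Japanese side.
            return raw.decode(codec), "japanese"
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    # Every Japanese decoding failed: redo the work as non-Japanese.
    # Latin-1 maps every byte to a character, so this cannot fail.
    return raw.decode("latin-1"), "other"
```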

Japanese text is tokenized into character bigrams. (In short, "未承諾広告" (unsolicited advertisement) is handled as the words "未承", "承諾", "諾広", and "広告".) However, a word is considered to break at a transition from a non-kanji character to a kanji, and at periods, commas, and similar punctuation.
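
A rough Python sketch of this bigram tokenization follows. It is an illustration under several assumptions rather than scbayes's actual tokenizer: kanji are taken to be the CJK Unified Ideographs block (U+4E00 to U+9FFF), any Unicode punctuation, separator, or control character counts as a break, and runs of a single character produce no token.

```python
import unicodedata


def is_kanji(ch: str) -> bool:
    # Assumption: "kanji" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def is_break(ch: str) -> bool:
    # Punctuation (P*), separators (Z*) and control characters (C*).
    return unicodedata.category(ch)[0] in "PZC"


def bigrams(text: str) -> list[str]:
    """Split text into runs, then emit overlapping 2-character tokens."""
    runs, current = [], ""
    for ch in text:
        if is_break(ch):
            runs.append(current)
            current = ""
        elif is_kanji(ch) and current and not is_kanji(current[-1]):
            # Transition from non-kanji to kanji starts a new word.
            runs.append(current)
            current = ch
        else:
            current += ch
    runs.append(current)
    return [run[i:i + 2] for run in runs for i in range(len(run) - 1)]


print(bigrams("未承諾広告"))  # ['未承', '承諾', '諾広', '広告']
```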

Following one of the improvements suggested in Paul Graham's "Better Bayesian Filtering", the frequency of words that appear in only one of the two corpora (non-spam or spam) is taken into account. Origin and context are not yet considered.
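
To make the idea concrete, here is a hedged Python sketch of Graham-style scoring. The general formula (non-spam counts doubled, words seen fewer than five times ignored, probabilities clamped to 0.01 and 0.99, the 15 most interesting words combined) follows "A Plan for Spam". The special handling of words seen in only one corpus uses placeholder values (0.9999/0.9998 and a cut-off of 10 occurrences) chosen for illustration; they are assumptions, not necessarily the figures scbayes uses.

```python
from typing import Optional


def word_probability(good: int, bad: int, ngood: int, nbad: int,
                     min_occurrences: int = 5,
                     frequent: int = 10) -> Optional[float]:
    """Spam probability of one word, or None if it is too rare to use.

    good/bad are the word's counts in each corpus; ngood/nbad are the
    number of messages in each corpus.
    """
    g = 2 * good  # non-spam counts are doubled to bias against false positives
    b = bad
    if g + b < min_occurrences:
        return None
    # Words seen in only one corpus: more occurrences, more extreme value.
    # (Placeholder numbers, see the note above.)
    if good == 0:
        return 0.9999 if bad > frequent else 0.9998
    if bad == 0:
        return 0.0001 if good > frequent else 0.0002
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def spam_probability(probs: list[float], interesting: int = 15) -> float:
    """Combine the word probabilities farthest from 0.5."""
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:interesting]
    prod = inv = 1.0
    for p in top:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)
```

The point of the one-corpus branch is that a word seen in twenty spams and no legitimate mail should count as stronger evidence than a word seen in only a handful.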

Usage

Shiro's usage.

Initial Study (2003/3/12 - )

The inbox, trash, and spam folders were used for training. Message counts and registered word counts are as follows:

            nonspam                      spam
Japanese    211178 words / 3609 msgs     99058 words / 1581 msgs
Other       33158 words / 1577 msgs      46513 words / 1854 msgs
Total       244336 words / 5186 msgs     145571 words / 3435 msgs

Judging the spamminess of this same training data, taken from inbox and spam, gave the following results:

false positives: 0/3265
false negatives: 68/3443 (detection ratio 98.0%)
- of these, Japanese mail accounts for 4/1854 (detection ratio 99.8%)
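
For reference, the detection ratios quoted above are simply one minus the false-negative rate. A tiny Python check, with `detection_ratio` being a name made up for this example:

```python
def detection_ratio(false_negatives: int, spam_total: int) -> float:
    """Fraction of spam messages that were correctly caught."""
    return 1.0 - false_negatives / spam_total


print(f"{detection_ratio(68, 3443):.1%}")  # overall: 98.0%
print(f"{detection_ratio(4, 1854):.1%}")   # Japanese mail: 99.8%
```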

Tokenization is good even at this early stage, isn't it? (?)

Next, with this engine called from scmail, automatic sorting by scmail-refile is being tested for a while. scmail-refile first sorts the mail according to the message rules, and then the remaining messages are judged for spam probability. So far, newly arrived mail is being judged just as hoped.