Gauche:SpamFilter:English
(Rough English translation. More literal than natural, mostly done for the sake of the translator's practice.)
Filtering based on Paul Graham's "A Plan for Spam" and "Better Bayesian Filtering". Designed to be included and used with the Scheme mail filter scmail. Of course it works with Japanese mail as well.
Requirements
Download
This is currently experimental; consider it alpha. In particular,
loading the probability tables is unreasonably slow.
Installation
Consult the file README.eucjp in the tarball.
Hacking
The present version is based on the results of Gauche:SpamFilter:予備実験 (preliminary experiment) and processes mail as follows:
- Learning is currently done in batches. Existing spam must already
have been separated out by some other filter. Separate word
probability dictionaries are kept for Japanese mail and for
non-Japanese mail.
- In use, mail is first sorted according to scmail's rules, and only
the remaining messages are judged for their spamminess.
(This is very important. There is a lot of spam-like material in the
mail system, so sorting beforehand guards against false positives
and also keeps the learning data clean.)
- MIME messages are tokenized part by part. application/* and image/*
parts are skipped. Even text parts are sometimes encoded as
quoted-printable, base64, or the like, so the
content-transfer-encoding is decoded properly. Otherwise the
dictionary accumulates a large quantity of base64 character strings.
- Mail with no charset specified is first assumed to be Japanese;
if that results in an error, it is processed again as
non-Japanese.
- Japanese is tokenized into character bigrams. (In short, "未承諾広告"
(unsolicited advertisement) is handled as the words "未承,"
"承諾," "諾広," and "広告".) However, a word is considered broken at a
transition from non-kanji to kanji and at periods, commas, etc.
- Following the points for improvement in Paul Graham's "Better
Bayesian Filtering", the frequency of words that appear in only one
of the non-spam and spam corpora is taken into account. Origin and
context are not yet considered.
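The tokenization steps in the list above can be sketched roughly as follows. This is a Python illustration only, not the filter's actual Scheme code. All function names are hypothetical, and several details are assumptions: "kanji" is taken to mean U+4E00 to U+9FFF, kana to mean U+3040 to U+30FF, ISO-2022-JP is used as the Japanese guess for mail without a charset, and Latin-1 as the non-Japanese fallback.

```python
from email import message_from_bytes

SKIPPED_MAIN_TYPES = {"application", "image"}


def is_kanji(ch):
    # Assumption: "kanji" means the CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def is_japanese(ch):
    # Assumption: kanji plus hiragana and katakana (U+3040..U+30FF).
    return is_kanji(ch) or "\u3040" <= ch <= "\u30ff"


def japanese_bigrams(text):
    """Split Japanese text into overlapping two-character tokens."""
    tokens = []
    for a, b in zip(text, text[1:]):
        if not (is_japanese(a) and is_japanese(b)):
            continue  # punctuation, spaces, ASCII etc. break the word
        if not is_kanji(a) and is_kanji(b):
            continue  # a non-kanji to kanji transition breaks the word
        tokens.append(a + b)
    return tokens


def decode_body(payload, charset=None):
    """Decode bytes, guessing Japanese first when no charset is given."""
    if charset:
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            pass  # unknown charset label: fall back to guessing
    try:
        return payload.decode("iso-2022-jp")  # assumed Japanese guess
    except UnicodeDecodeError:
        return payload.decode("latin-1")  # assumed non-Japanese fallback


def text_parts(raw_message):
    """Yield the decoded text of every part worth tokenizing."""
    for part in message_from_bytes(raw_message).walk():
        if part.is_multipart():
            continue
        if part.get_content_maintype() in SKIPPED_MAIN_TYPES:
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is not None:
            yield decode_body(payload, part.get_content_charset())


print(japanese_bigrams("未承諾広告"))  # ['未承', '承諾', '諾広', '広告']
```

Decoding the transfer encoding before tokenizing is what keeps raw base64 strings out of the dictionary in this sketch.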
Usage
Shiro's usage.
Initial Learning (2003/3/12 - )
The inbox, trash, and spam folders were used for learning. Message
counts and registered word counts are as follows:
| | nonspam | spam |
| Japanese | 211178 words / 3609 msgs | 99058 words / 1581 msgs |
| Other | 33158 words / 1577 msgs | 46513 words / 1854 msgs |
| Total | 244336 words / 5186 msgs | 145571 words / 3435 msgs |
The inbox and spam messages from this learning data were then
themselves judged for spamminess, with the following results:
- false positives: 0/3265
- false negatives: 68/3443 (detection ratio 98.0%)
- of these, Japanese mail accounts for 4/1854 (detection ratio 99.8%).
Tokenization seems to work well right from the start. (?)
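For reference, the detection ratios quoted above follow directly from the false-negative counts. A small Python check (the function name is hypothetical, used only for illustration):

```python
def detection_ratio(false_negatives, total_spam):
    """Percentage of spam messages that were caught."""
    return 100.0 * (1 - false_negatives / total_spam)


print(f"{detection_ratio(68, 3443):.1f}%")  # 98.0%
print(f"{detection_ratio(4, 1854):.1f}%")   # 99.8%
```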
Next, with this engine called from scmail, automatic sorting by
scmail-refile is being tested for a while. scmail-refile first
sorts the mail according to the message rules, then judges the
spam probability of the remaining items. So far, newly arrived mail
is being judged just as one would wish.
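The two-stage flow described above (explicit rules first, spam judgement only for what remains) can be sketched as follows. This is a Python illustration under stated assumptions, not scmail's actual code or API: the rule format, the folder names, and the function names are hypothetical. The constants (15 most interesting tokens, 0.4 for unknown words, 0.9 threshold) and the combining formula are the ones given in Paul Graham's "A Plan for Spam"; whether this filter uses exactly those values is not stated here.

```python
from math import prod

UNKNOWN_WORD_PROB = 0.4   # Graham's value for never-seen tokens
INTERESTING_TOKENS = 15   # Graham uses the 15 most interesting tokens
SPAM_THRESHOLD = 0.9      # Graham treats > 0.9 as spam


def spam_probability(tokens, word_probs):
    """Combine per-word spam probabilities as in "A Plan for Spam".

    Assumes every value in word_probs lies strictly between 0 and 1
    (Graham clamps them to 0.01..0.99), so the sum below is never zero.
    """
    probs = [word_probs.get(t, UNKNOWN_WORD_PROB) for t in set(tokens)]
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:INTERESTING_TOKENS]
    spam = prod(top)
    ham = prod(1 - p for p in top)
    return spam / (spam + ham)


def refile(headers, tokens, rules, word_probs):
    """Return a destination folder: rules first, spam check second."""
    for matches, folder in rules:
        if matches(headers):
            return folder  # matched an explicit rule; never scored
    if spam_probability(tokens, word_probs) > SPAM_THRESHOLD:
        return "spam"
    return "inbox"


# Hypothetical example data.
rules = [(lambda h: "gauche-devel" in h.get("List-Id", ""), "ml/gauche")]
word_probs = {"広告": 0.99, "承諾": 0.95, "gauche": 0.01}

print(refile({"List-Id": "gauche-devel"}, ["広告"], rules, word_probs))
print(refile({}, ["広告", "承諾"], rules, word_probs))
print(refile({}, ["gauche"], rules, word_probs))
```

Because a message that matches a rule returns early, spam-like but legitimate mail never reaches the probability check, which is how sorting first guards against false positives.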
Last modified : 2013/04/28 11:07:34 UTC