(Rough English translation. More literal than natural, done mostly for the translator's practice.)
This is currently in an experimental phase; consider it alpha. In particular, loading the probability tables is unreasonably slow.
Consult the file README.eucjp in the tarball.
The present version is based on the results of Gauche:SpamFilter:予備実験 (preliminary experiment) and processes mail in the following manner:
- Learning is at present performed in batches. The spam used for learning must first have been separated out by some other filter. Separate word-probability dictionaries are kept for Japanese and for everything else (?).
- In use, mail (?) is first partitioned according to scmail's rules, and only the remaining items are judged for spamminess. (This is very important. There are a lot of spammy things in the mail system, so partitioning beforehand guards against false positives and also keeps the learning data clean.)
- MIME messages are tokenized part by part. application/* and image/* parts are skipped. Even text parts are sometimes encoded as quoted-printable, base64, or the like, so the Content-Transfer-Encoding is interpreted properly. If this isn't done, the dictionary accumulates a large quantity of base64 character strings.
- Mail with no charset specified is first assumed to be Japanese; if this results in an error, it is processed over again as non-Japanese.
- Japanese is tokenized into character bigrams. (In short, "未承諾広告" (unsolicited advertisement) is handled as the words "未承", "承諾", "諾広", and "広告".) However, a word is considered to break at a transition from a non-Kanji character to a Kanji character, and at periods, commas, and the like.
- Following one of the points for improvement listed in Paul Graham's "Better Bayesian Filtering", the frequency of words that appear in only one of the two corpora (non-spam or spam) is taken into account. The origin and context of words are not yet considered.
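The tokenization steps above can be sketched roughly as follows. This is only an illustration in Python using the standard `email` package, not scmail's actual Gauche code. The names `part_texts`, `decode_body`, `bigrams` and `message_tokens` are hypothetical, and several details are assumptions: ISO-2022-JP stands in for "Japanese", latin-1 stands in for "non-Japanese", Kanji means the basic CJK Unified Ideographs block only, whitespace is treated as a break, and a character left alone between two breaks is kept as a one-character token.

```python
import unicodedata
from email import message_from_bytes
from email.message import EmailMessage

SKIPPED_MAINTYPES = {"application", "image"}
ASSUMED_JAPANESE = "iso-2022-jp"  # assumption: the source names no encoding
FALLBACK = "latin-1"              # assumption: stand-in for "non-Japanese"


def decode_body(raw, charset):
    """Use the declared charset; with none, try Japanese, then fall back."""
    try:
        return raw.decode(charset or ASSUMED_JAPANESE)
    except (UnicodeDecodeError, LookupError):
        return raw.decode(FALLBACK)  # latin-1 accepts any byte sequence


def part_texts(msg):
    """Yield the decoded text of every leaf part that is not skipped."""
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_maintype() in SKIPPED_MAINTYPES:
            continue
        raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if raw is not None:
            yield decode_body(raw, part.get_content_charset())


def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def is_break(ch):
    # Whitespace and Unicode punctuation (categories P*) end a word.
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def bigrams(text):
    """Character bigrams, broken at punctuation and at non-Kanji -> Kanji."""
    segments, current = [], ""
    for ch in text:
        if is_break(ch):
            if current:
                segments.append(current)
            current = ""
        elif current and is_kanji(ch) and not is_kanji(current[-1]):
            segments.append(current)
            current = ch
        else:
            current += ch
    if current:
        segments.append(current)
    tokens = []
    for seg in segments:
        if len(seg) == 1:
            tokens.append(seg)  # assumption: keep lone characters as tokens
        else:
            tokens.extend(seg[i:i + 2] for i in range(len(seg) - 1))
    return tokens


def message_tokens(msg):
    return [tok for text in part_texts(msg) for tok in bigrams(text)]


# A text part plus an image attachment; only the text part is tokenized.
msg = EmailMessage()
msg.set_content("未承諾広告")
msg.add_attachment(b"\x89PNG", maintype="image", subtype="png")
print(message_tokens(msg))  # ['未承', '承諾', '諾広', '広告']

# No charset declared: the body is first tried as Japanese.
bare = message_from_bytes(b"Subject: x\r\n\r\n" + "未承諾広告".encode("iso-2022-jp"))
print(message_tokens(bare))  # ['未承', '承諾', '諾広', '広告']
```

The sketch only covers Japanese text; how non-Japanese words are tokenized is not described above, so it is left out here.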
Initial Learning (2003/3/12 - )
The inbox, trash, and spam folders were used for learning. Message counts and registered word counts are as follows:
Based on this learning data, the messages in inbox and spam were then judged for spamminess, with the following results:
- false positives: 0/3265
- false negatives: 68/3443 (detection ratio 98.0%)
- of these, Japanese mail accounts for 4/1854 (detection ratio 99.8%)
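The detection ratios quoted above follow directly from the false-negative counts, as one minus the fraction of spam that was missed. A small check of the arithmetic (the function name `detection_ratio` is just for illustration):

```python
def detection_ratio(false_negatives, total_spam):
    """Percentage of spam messages that were correctly detected."""
    return 100.0 * (1 - false_negatives / total_spam)


print(f"{detection_ratio(68, 3443):.1f}%")  # 98.0%
print(f"{detection_ratio(4, 1854):.1f}%")   # 99.8%
```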
Even this first rough attempt at tokenization does rather well, doesn't it? (?)
Well then, with this engine called from scmail, automatic partitioning by scmail-refile will be tested for a while. scmail-refile first partitions the mail according to the message rules, and then judges the spam probability of the remaining items. So far, newly arrived mail is being judged just as one would wish.
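The two-stage flow described here (rules first, spam judgment only for what is left over) might look something like the sketch below. It is not scmail-refile's real interface: the `refile` function, the rule representation as (predicate, folder) pairs, the folder names, the 0.9 threshold, and the toy scoring function are all assumptions made for illustration.

```python
def refile(message, rules, spam_probability, threshold=0.9):
    """Return a folder name: rules first, spam judgment only for leftovers."""
    for matches, folder in rules:
        if matches(message):
            return folder  # rule-matched mail never reaches the spam judge
    return "spam" if spam_probability(message) > threshold else "inbox"


# Hypothetical rule: mailing-list traffic goes to its own folder.
rules = [(lambda m: "gauche-devel" in m.get("List-Id", ""), "gauche-devel")]


def toy_score(m):
    # Toy stand-in for the Bayesian engine, for illustration only.
    return 0.99 if "未承諾広告" in m.get("Subject", "") else 0.01


print(refile({"List-Id": "<gauche-devel.example>", "Subject": "未承諾広告"}, rules, toy_score))  # gauche-devel
print(refile({"Subject": "未承諾広告"}, rules, toy_score))  # spam
print(refile({"Subject": "hello"}, rules, toy_score))  # inbox
```

Because rule-matched mail returns early, it can never be misjudged as spam, which is the guard against false positives that the ordering is meant to provide.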