scbayes (Gauche:SpamFilter)

(Rough English translation. More literal than natural, done mostly for the sake of the translator's practice.)

A spam filter based on Paul Graham's "A Plan for Spam" and "Better Bayesian Filtering". It is designed to be loaded into and used from scmail, the mail filter written in Scheme. Naturally, it handles Japanese mail as well.
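For readers who have not seen the essay, here is a minimal sketch of the scoring scheme published in "A Plan for Spam", written in Python for illustration only. It is not scbayes's code (scbayes is written in Scheme, and its actual tokenizer and parameters are not shown on this page). The function names and the dictionary-based count tables are assumptions; the constants (doubling good counts, the minimum of 5 occurrences, clamping to 0.01 and 0.99, 0.4 for unknown tokens, the 15 most interesting tokens) are the ones given in Graham's essay.

```python
from math import prod


def token_probability(good, bad, ngood, nbad):
    """Spam probability of a single token, after "A Plan for Spam".

    good, bad   -- occurrences of the token in the nonspam / spam corpus
    ngood, nbad -- number of nonspam / spam messages in the corpus
    Returns None when the token is too rare to be trusted.
    """
    g = 2 * good  # good counts are doubled to bias against false positives
    b = bad
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def spam_probability(tokens, good_counts, bad_counts, ngood, nbad,
                     unknown=0.4, top_n=15):
    """Combine the top_n most 'interesting' token probabilities."""
    probs = []
    for tok in set(tokens):
        p = token_probability(good_counts.get(tok, 0),
                              bad_counts.get(tok, 0), ngood, nbad)
        probs.append(unknown if p is None else p)
    # Most interesting = furthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

In the essay, a message is treated as spam when the combined probability exceeds 0.9. Whether scbayes uses the same threshold is not stated here.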



This is currently experimental; consider it alpha quality. In particular, loading the probability tables is unreasonably slow.


Consult the file README.eucjp in the tarball.


The present version is based on the results of Gauche:SpamFilter:予備実験 (preliminary experiments) and processes mail in the following manner:


Shiro's usage.

Initial training (2003/3/12 - )

The inbox, trash, and spam folders were used for training. The message counts and registered word counts are as follows:

          nonspam                    spam
Japanese  211178 words / 3609 msgs    99058 words / 1581 msgs
Other      33158 words / 1577 msgs    46513 words / 1854 msgs
Total     244336 words / 5186 msgs   145571 words / 3435 msgs

The original training data from inbox and spam was then itself judged for spamminess, with the following results:

tokenization is even good at the beginning, isn't it? (?)

With this engine hooked into scmail, automatic sorting with scmail-refile is now being tested for a while. scmail-refile first sorts mail according to the ordinary message rules, then judges the spam probability of whatever remains. So far, newly arrived mail is being judged just as one would wish.
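The order of operations described above (explicit rules first, spam judgment only on what is left) can be sketched as follows. This is an illustration in Python, not scmail's configuration, which is written in Scheme and not reproduced on this page. The function `refile`, the folder names, the rule format, and the 0.9 threshold are all hypothetical.

```python
def refile(message, rules, spam_score, threshold=0.9):
    """Pick a destination folder for one message.

    rules      -- list of (predicate, folder) pairs, tried in order
    spam_score -- function mapping a message to a probability in [0, 1]
    threshold  -- hypothetical cutoff above which a message counts as spam
    """
    # Stage 1: ordinary rules win outright; matched mail is never scored.
    for matches, folder in rules:
        if matches(message):
            return folder
    # Stage 2: only mail that no rule claimed is judged for spamminess.
    if spam_score(message) > threshold:
        return "spam"
    return "inbox"
```

One consequence of this ordering is that mail matched by a rule, such as mailing-list traffic, cannot be misfiled as spam, because the filter never sees it.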

