MoinMoin Spam Classification Project

The MoinMoin Text Classification Project consists of two parts:

Spam Detection

The Spam Detection part is primarily an interface to a full-fledged statistical spam classifier. The supported classifiers are SpamBayes and a generic built-in statistical classifier; SpamBayes is recommended for general use. The classifier can be switched using the multiconfig option "spam_classifier". Valid options are:

Local SpamBayes classifier

This interface was developed using SpamBayes 1.0.4 and SpamBayes 1.1a4 and requires a local installation of SpamBayes. It might work with older versions, but this has not been tested. Enabling the SpamBayes interface requires the following changes to wikiconfig.py:

# Import the classifier and the associated tokenizer

from MoinMoin.classifier.SpamBayes import SBClassifier, SBTokenizer

# Set the classifier and the required options
spam_classifier = SBClassifier(dbfile="spambayes.db") # dbfile - the filename of the internal token database

# The tokenizer that will extract data for SpamBayes
# SBTokenizer is primarily a modified version of the internal SB Classifier
spam_classifier.tokenizer = SBTokenizer

# The page that will host the trained spam pages
spam_classifier.spam_page = "SpamPages"

# The page that will host the trained ham pages
spam_classifier.ham_page = "HamPages"

# Replace high-bit chars and control chars with '?'
classify_replace_nonascii_chars = True

# The probability above which the system classifies a page as spam
classify_spamtreshold = 0.6

# Enable classification caching
classify_usecache = True

Please see the section Speed issues for some performance considerations.
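The two text-related options above can be illustrated with a minimal sketch. This is not the MoinMoin or SpamBayes implementation: the function names replace_nonascii_chars and is_spam are hypothetical, and the sketch assumes that printable ASCII plus tab, carriage return and newline are kept while everything else becomes '?'. It also assumes that "above" the threshold means strictly greater.

```python
def replace_nonascii_chars(text):
    """Replace high-bit and control characters with '?'.

    Assumption: printable ASCII (32..126) and the whitespace
    characters tab, CR and LF are kept unchanged.
    """
    keep_whitespace = "\t\r\n"
    return "".join(
        ch if (32 <= ord(ch) < 127 or ch in keep_whitespace) else "?"
        for ch in text
    )


def is_spam(probability, threshold=0.6):
    """Return True when the spam probability is above the threshold."""
    return probability > threshold


if __name__ == "__main__":
    # Hypothetical page text containing a high-bit and a control character
    print(replace_nonascii_chars("caf\u00e9 \x07 menu"))
    print(is_spam(0.75), is_spam(0.4))
```

With the threshold of 0.6 used in the configuration above, a page scored at exactly 0.6 would not be treated as spam under this reading.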

Installing SpamBayes

SpamBayes is available in the repositories of most Linux distributions. For Ubuntu and Debian use:

apt-get install spambayes

For Fedora use:

yum install spambayes

To install SpamBayes manually, download it from http://spambayes.org/download.html, unpack the source and run:

python setup.py install

Remote SpamBayes Classifier

This is primarily a proxy for the SpamBayes 1.1a4 XMLRPCPlugin. It has no local dependencies and uses XML-RPC to interact with SpamBayes. The main requirement is a working SpamBayes 1.1a4 instance with XMLRPCPlugin enabled.

Installing SpamBayes 1.1a4

To install SpamBayes manually, download spambayes-1.1a4.tar.gz from http://spambayes.org/download.html, unpack the source and patch it with the following patch. After patching the source run:

python setup.py install

Enabling SpamBayes's XML-RPC plug-in

To use the SpamBayes XML-RPC interface, start core_server.py and instruct it to load the XML-RPC plug-in:

python scripts/core_server.py -P XMLRPCPlugin

Once core_server.py is running, a web interface is available on port 8880. The UI allows checking the status of the classifier and reporting spam/ham. SpamBayes serves the XML-RPC interface on port 5001 by default.
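Before configuring MoinMoin it can help to confirm that both ports are actually listening. The following is a minimal sketch using only the Python standard library. The helper name port_is_open is hypothetical, and the host "localhost" is an assumption that SpamBayes runs on the same machine. The port numbers are the ones given above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable
        return False


if __name__ == "__main__":
    # Ports taken from the text: 8880 (web UI), 5001 (XML-RPC)
    for name, port in (("web interface", 8880), ("XML-RPC", 5001)):
        state = "listening" if port_is_open("localhost", port) else "not reachable"
        print("%s on port %d: %s" % (name, port, state))
```

This check only shows that something accepts connections on the port. It does not verify that the service behind it is SpamBayes.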

Configuring MoinMoin to use the remote classifier

The configuration of the SpamBayes XML-RPC interface is similar to that of the local classifier. The required wikiconfig.py changes are:

# Import the classifier

from MoinMoin.classifier.SpamBayesRPC import RemoteSBClassifier

# Set the classifier and the required options
# The port must match the SpamBayes XML-RPC port (5001 by default)
spam_classifier = RemoteSBClassifier("localhost", 5001, "sbrpc")

# The page that will host the trained spam pages
spam_classifier.spam_page = "SpamPages"

# The page that will host the trained ham pages
spam_classifier.ham_page = "HamPages"

# Enable classification caching
classify_usecache = True

Speed issues

Like almost any system that analyzes the entire text of pages, the classifier might be slow. This can be addressed by caching classification results. To enable caching, add the following to wikiconfig.py:

classify_usecache = True

Caching is disabled by default.

The remote classifier interface can be slow when the classifier and MoinMoin communicate over a slow connection. This can be partially mitigated by enabling caching.
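The idea behind result caching can be sketched as follows. This is an illustration only, not MoinMoin's actual cache: the class name CachingClassifier and the score_func parameter are hypothetical, and the sketch assumes that a result can be reused whenever the page text is byte-for-byte identical.

```python
import hashlib


class CachingClassifier:
    """Hypothetical wrapper that remembers scores for already seen text."""

    def __init__(self, score_func):
        self._score_func = score_func  # the slow local or remote classifier call
        self._cache = {}               # maps text digest -> score

    def score(self, text):
        # Key the cache on a digest so large pages are not stored twice
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._score_func(text)
        return self._cache[key]


if __name__ == "__main__":
    calls = []

    def slow_score(text):
        # Stand-in for a real classifier; records each invocation
        calls.append(text)
        return 0.9 if "viagra" in text.lower() else 0.1

    classifier = CachingClassifier(slow_score)
    classifier.score("Buy Viagra now")
    classifier.score("Buy Viagra now")  # served from the cache
    print("classifier invoked %d time(s)" % len(calls))
```

A cache of this kind saves one classifier call, and for the remote classifier one network round trip, for every repeated classification of unchanged text. It does not help the first time a page revision is seen.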

Known Issues

MoinMoin: MarianNeagul/SpamFilter (last edited 2007-10-29 19:08:48 by localhost)