MoinMoin Spam Classification Project
Contents
- Spam Detection
- General Text Classification
Spam Detection
The Spam Detection part is primarily an interface to a full-fledged statistical spam classifier. The supported classifiers are SpamBayes and a generic built-in statistical classifier. For general usage, the recommended classifier is SpamBayes. The classifier can be switched using the multiconfig option "spam_classifier". Valid options are:
- SBClassifier classifier (local SpamBayes classifier)
- SpamBayesRPC classifier (remote SpamBayes classifier)
- ReverendClassifier classifier (local, built-in classifier)
Local SpamBayes classifier
This interface was developed using SpamBayes 1.0.4 and SpamBayes 1.1a4 and requires a local installation of SpamBayes. It might work with older versions, but this has not been tested. Enabling the SpamBayes interface requires the following changes to wikiconfig.py:
# Import the classifier and the associated tokenizer
from MoinMoin.classifier.SpamBayes import SBClassifier, SBTokenizer

# Set the classifier and the required options
# dbfile - the filename of the internal token database
spam_classifier = SBClassifier(dbfile="spambayes.db")

# The tokenizer that will extract data for SpamBayes
# SBTokenizer is primarily a modified version of the internal SB Classifier
spam_classifier.tokenizer = SBTokenizer

# The page that will host the trained spam pages
spam_classifier.spam_page = "SpamPages"

# The page that will host the trained ham pages
spam_classifier.ham_page = "HamPages"

# Replace high-bit chars and control chars with '?'
classify_replace_nonascii_chars = True

# The probability above which the system classifies a page as spam
classify_spamtreshold = 0.6

# Enable classification caching
classify_usecache = True
Please see the section Speed issues for some performance considerations.
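The role of classify_spamtreshold can be illustrated with a minimal sketch. The function name is_spam is hypothetical and not part of MoinMoin. It only assumes that the classifier returns a spam probability between 0 and 1 and that, as the configuration comment says, a page is treated as spam when that probability is above the threshold.

```python
def is_spam(probability, threshold=0.6):
    """Return True when the spam probability is above the threshold.

    Hypothetical helper for illustration; 0.6 mirrors the
    classify_spamtreshold value used in the configuration above.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability > threshold
```

Raising the threshold makes the wiki more tolerant (fewer pages flagged), lowering it makes it stricter.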
Installing SpamBayes
SpamBayes is available in the repositories of most Linux distributions. For Ubuntu and Debian use:
apt-get install spambayes
For Fedora use:
yum install spambayes
To install SpamBayes manually, download it from http://spambayes.org/download.html, unpack the source and run:
python setup.py install
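After installation, it can be useful to confirm that the package is importable by the Python interpreter that runs MoinMoin. The following is a minimal sketch. The helper name has_module is hypothetical, and the check assumes the package installs under the module name spambayes.

```python
import importlib


def has_module(name):
    """Return True if the named module can be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Run this with the same interpreter that serves the wiki.
    print("spambayes importable: %s" % has_module("spambayes"))
```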
Remote SpamBayes Classifier
This is primarily a proxy for the SpamBayes 1.1a4 XMLRPCPlugin. It has no local dependencies and uses XML-RPC to interact with SpamBayes. The main requirement of the interface is a working SpamBayes 1.1a4 instance with XMLRPCPlugin enabled.
Installing SpamBayes 1.1a4
To install SpamBayes manually, download spambayes-1.1a4.tar.gz from http://spambayes.org/download.html, unpack the source and patch it with the following patch. After patching the source run:
python setup.py install
Enabling SpamBayes's XML-RPC plug-in
In order to use SpamBayes XML-RPC, you need to start core_server.py and instruct it to load the XML-RPC plug-in:
python scripts/core_server.py -P XMLRPCPlugin
After core_server.py has started, a web interface is available on port 8880. The UI allows checking the status of the classifier and reporting spam/ham. SpamBayes serves the XML-RPC interface on the default port 5001.
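Before configuring MoinMoin, it may help to check that both ports accept connections. The sketch below uses only the Python standard library. The ports 8880 and 5001 come from the text above, while the host localhost and the helper name port_is_open are assumptions. It only tests that a TCP connection can be opened and does not issue any XML-RPC call.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, OSError):
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Assumed host; adjust if SpamBayes runs on another machine.
    for name, port in (("web interface", 8880), ("XML-RPC", 5001)):
        state = "open" if port_is_open("localhost", port) else "closed"
        print("%s (port %d): %s" % (name, port, state))
```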
Configuring MoinMoin to use the remote classifier
The configuration of the SpamBayes XML-RPC interface is similar to the configuration of the local classifier. The required wikiconfig.py changes are:
# Import the classifier
from MoinMoin.classifier.SpamBayesRPC import RemoteSBClassifier

# Set the classifier and the required options
# The port must match the SpamBayes XML-RPC port (5001 by default, see above)
spam_classifier = RemoteSBClassifier("localhost", 5001, "sbrpc")

# The page that will host the trained spam pages
spam_classifier.spam_page = "SpamPages"

# The page that will host the trained ham pages
spam_classifier.ham_page = "HamPages"

# Enable classification caching
classify_usecache = True
Speed issues
Like almost any system that analyzes the entire text of pages, the classifier might be slow. This can be addressed by caching classification results. To enable caching, add the following to wikiconfig.py:
classify_usecache = True
Caching is disabled by default.
The remote classifier interface could be slow when the classifier and MoinMoin are connected over a slow connection. This issue can be partially mitigated by enabling caching.
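The general idea behind result caching can be sketched as follows. This is an illustration only and does not describe how MoinMoin implements classify_usecache. The class CachedClassifier and the scoring function passed to it are hypothetical. The sketch assumes that identical page text always produces the same score, so the score can be stored under a hash of the text and reused.

```python
import hashlib


class CachedClassifier(object):
    """Wrap a scoring function and remember results by content hash."""

    def __init__(self, score_func):
        self._score = score_func  # the slow call, e.g. a remote classifier
        self._cache = {}          # maps content hash -> score

    def score(self, text):
        # Identical text yields the same key, so the slow call runs once.
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._score(text)
        return self._cache[key]


calls = []


def slow_score(text):
    """Stand-in for an expensive classifier call (hypothetical)."""
    calls.append(text)
    return 0.9 if "cheap pills" in text.lower() else 0.1


classifier = CachedClassifier(slow_score)
classifier.score(u"Buy cheap pills now")
classifier.score(u"Buy cheap pills now")  # answered from the cache
```

A cache of this kind helps most when the same page revision is classified repeatedly. It does not help with the first classification of new text, which is why the mitigation for slow remote connections is only partial.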