Text Classification Proposal

Warning: this is a work in progress.

The corpus

The corpus should be maintained the standard wiki way:

  1. A standard group page (e.g. AutoCategories) that lists all the possible categories. For two-way classification it would contain only the SpamPages and HamPages categories.

  2. Each subpage of AutoCategories contains a list of pages that fit in the specific category.

When a user classifies a page into a category, the corresponding category subpage is updated as needed.
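The update step can be sketched as follows. This is only an illustration: the corpus is modelled as a plain dictionary from category subpage name to a list of page names, the function name `assign_category` is hypothetical, and the sketch assumes a page belongs to at most one category, which the proposal does not state explicitly.

```python
def assign_category(corpus, page_name, category):
    """Record that page_name belongs to category, updating the lists in place.

    corpus maps a category subpage name (e.g. "SpamPages") to the list of
    page names filed under it. Assumption: one category per page.
    """
    if category not in corpus:
        raise KeyError("unknown category: %s" % category)
    # Drop the page from any other category it was filed under before.
    for name, members in corpus.items():
        if name != category and page_name in members:
            members.remove(page_name)
    # Add it to the chosen category, avoiding duplicates.
    if page_name not in corpus[category]:
        corpus[category].append(page_name)
    return corpus


# Hypothetical two-way corpus, mirroring the SpamPages/HamPages subpages.
corpus = {"SpamPages": [], "HamPages": ["FrontPage"]}
assign_category(corpus, "CheapPills", "SpamPages")
assign_category(corpus, "FrontPage", "SpamPages")  # reclassified by a user
```

In a real deployment the lists would be read from and written back to the wiki subpages rather than held in memory.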


Classification

For Text Classification we would have two main cases:

  1. Two-way classification
  2. n-way classification

To implement these features we need to design a system that allows us to perform either of the classification types mentioned above. The system should be modular, so that users can deploy any type of classifier (based on a framework of their choice) through the existing plugin system. MoinMoin should ship at least one such plugin in the vanilla package: SpamClassifier.
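One possible shape for such a modular design is sketched below. Everything here is an assumption made for illustration: the `BaseClassifier` interface, the `CLASSIFIERS` registry and the toy `WordOverlapClassifier` are hypothetical names, and the registry merely stands in for MoinMoin's plugin system, whose actual loading mechanism is not shown.

```python
class BaseClassifier:
    """Hypothetical interface that every classifier plugin would implement."""
    name = None

    def train(self, text, category):
        raise NotImplementedError

    def classify(self, text):
        raise NotImplementedError


# Stand-in for the plugin system: maps a plugin name to its class.
CLASSIFIERS = {}


def register(cls):
    """Class decorator that makes a classifier available by name."""
    CLASSIFIERS[cls.name] = cls
    return cls


def load_classifier(name):
    """Instantiate the classifier plugin registered under name."""
    return CLASSIFIERS[name]()


@register
class WordOverlapClassifier(BaseClassifier):
    """Toy plugin: picks the category whose training text shares most words."""
    name = "WordOverlap"

    def __init__(self):
        self.words = {}  # category -> set of words seen in training

    def train(self, text, category):
        self.words.setdefault(category, set()).update(text.lower().split())

    def classify(self, text):
        tokens = set(text.lower().split())
        return max(self.words, key=lambda c: len(tokens & self.words[c]))


clf = load_classifier("WordOverlap")
clf.train("buy cheap pills", "SpamPages")
clf.train("wiki release notes", "HamPages")
print(clf.classify("cheap pills online"))  # prints "SpamPages"
```

The point of the interface is that the wiki only ever calls `train` and `classify`, so a SpamBayes-backed or SVM-backed plugin could be swapped in without touching the calling code.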

Two-way classification

This is the simplest case and consists mainly of classifying pages as ham or spam. For this type of classification the best choice is to use the SpamBayes project.
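To make the ham/spam idea concrete, here is a minimal sketch of a two-way classifier. It is a plain multinomial naive Bayes with add-one smoothing written from scratch; it does not use SpamBayes or reproduce its scoring scheme, and the class name `TinyBayes` and the training sentences are invented for the example.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Minimal naive Bayes ham/spam classifier (a stand-in, not SpamBayes)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def learn(self, text, label):
        """Add one training page with a known label."""
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def score(self, text, label):
        """Log of the smoothed prior times the smoothed word likelihoods."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        logp = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for word in self.tokenize(text):
            # The extra +1 in the denominator reserves mass for unseen words.
            count = self.word_counts[label][word]
            logp += math.log((count + 1) / (total_words + len(vocab) + 1))
        return logp

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.score(text, label))


bayes = TinyBayes()
bayes.learn("buy cheap pills now", "spam")
bayes.learn("cheap pills online casino", "spam")
bayes.learn("meeting notes for the wiki release", "ham")
bayes.learn("how to install the wiki plugin", "ham")
print(bayes.classify("cheap casino pills"))  # prints "spam"
```

A production plugin would delegate tokenizing and scoring to SpamBayes instead; the sketch only shows the train-then-classify flow that such a plugin would wrap.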

Useful readings:

Problems:

n-way classification

This is the general case of text classification. Here we need to decide whether we want to classify into:

  1. a fixed number of categories
  2. a variable number of categories

Fixed number of categories

This is the ideal case, in which the user has a predefined list of categories and has to choose one of them (e.g. Technical/Financial/Fun/Spam/Ham/etc.). For this type of classification a good approach is to use SVMs (possibly through libsvm's Python bindings, or by using a specialized framework such as Elefant).
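The fixed-category case can be illustrated with a small one-vs-rest linear SVM. This sketch is hand-rolled in pure Python, trained with a Pegasos-style subgradient method on the hinge loss, and has no bias term; it is a stand-in and does not use libsvm's or Elefant's API. The function names, the bag-of-words features and the toy pages are all assumptions made for the example.

```python
import re


def features(text):
    """Bag-of-words counts for a page, as a word -> count dictionary."""
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts


def dot(weights, vec):
    return sum(weights.get(k, 0.0) * v for k, v in vec.items())


def train_binary_svm(samples, lam=0.01, epochs=20):
    """Linear SVM on (vector, +1/-1) samples via Pegasos-style updates."""
    weights = {}
    t = 0
    for _ in range(epochs):
        for vec, y in samples:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            violated = y * dot(weights, vec) < 1.0  # inside the margin?
            for k in weights:  # regularization shrink, factor 1 - eta * lam
                weights[k] *= 1.0 - 1.0 / t
            if violated:  # hinge-loss subgradient step
                for k, v in vec.items():
                    weights[k] = weights.get(k, 0.0) + eta * y * v
    return weights


def train_one_vs_rest(labelled_pages, categories):
    """Train one binary SVM per category: that category against the rest."""
    vectors = [(features(text), cat) for text, cat in labelled_pages]
    return {
        c: train_binary_svm([(v, 1.0 if cat == c else -1.0) for v, cat in vectors])
        for c in categories
    }


def classify(models, text):
    """Pick the category whose SVM gives the highest score."""
    vec = features(text)
    return max(models, key=lambda c: dot(models[c], vec))


pages = [
    ("install python plugin server", "Technical"),
    ("server crash python traceback", "Technical"),
    ("budget invoice payment quarter", "Financial"),
    ("quarter revenue budget forecast", "Financial"),
    ("party games pizza friday", "Fun"),
    ("friday movie pizza night", "Fun"),
]
models = train_one_vs_rest(pages, ["Technical", "Financial", "Fun"])
print(classify(models, "python server upgrade"))  # prints "Technical"
```

With real wiki pages the vocabularies overlap heavily, so a tuned library implementation with proper feature weighting would be the sensible choice; the sketch only shows how a fixed list of categories maps onto a set of binary SVMs.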

Variable number of categories

TODO


User Interaction

TODO


Integration With MoinMoin

TODO


Comments

MoinMoin: MarianNeagul/ProjectIdeas (last edited 2007-10-29 19:19:23 by localhost)