preamble

Thanks for this PoC. It shows that, with your help, it is possible to integrate spambayes and probably other categorizers into the MM codebase. Well done! :) :) :)

All my thoughts are comments; please check whether they can be solved easily. If some of them would take much time, move them to the end and solve the easy ones first.

I know it's only a proof of concept, and I know myself that a PoC doesn't need all of this. But I'd like to list everything I want or came across so we don't miss anything (probably one of my wishes) ;-)

principle things

.hgrc

[ui] username=user and email address (email <name AT address DOT address>)

documentation

Start writing documentation as early as you can, e.g. a README. It does not matter if it looks like a draft. You can use your wiki, or MoinMoin and one of your pages there. If you use MM, others can contribute.

Extending

LOG and repeat

I believe a log file that records which pages were set to ham and which to spam, by whom, and for which revision would be nice to have. It should be possible to replay it (with a moin subcommand), and perhaps it should also show how spambayes scored the page before. Perhaps the standard log file could be used for that too. (Needs to be checked)

I want this for testing purposes too, because with it we should be able to see which addition makes a big or small change, and when the system starts to become clever.

If someone made a mistake and entered ham for spam, it should be easy to correct with that.
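A minimal sketch of such a log, assuming a tab-separated text file with one line per training decision. The field layout and the names `log_training` and `replay` are hypothetical, not part of MoinMoin or spambayes; `train` stands in for whatever call the categorizer offers.

```python
import io
import time

def log_training(log, pagename, revision, verdict, user, score=None):
    """Append one training decision; verdict is 'ham' or 'spam'."""
    if verdict not in ("ham", "spam"):
        raise ValueError("verdict must be 'ham' or 'spam'")
    fields = [str(int(time.time())), pagename, str(revision), verdict, user,
              "" if score is None else "%.4f" % score]
    log.write("\t".join(fields) + "\n")

def replay(log, train):
    """Re-run the logged decisions. A later entry for the same page
    revision overrides an earlier one, so a mistake can be corrected
    by logging the right verdict and replaying."""
    latest = {}
    for line in log:
        stamp, pagename, revision, verdict, user, score = \
            line.rstrip("\n").split("\t")
        latest[(pagename, int(revision))] = verdict
    for (pagename, revision), verdict in sorted(latest.items()):
        train(pagename, revision, verdict == "spam")
    return len(latest)

# Usage with an in-memory log and a stand-in train callback.
log = io.StringIO()
log_training(log, "FrontPage", 3, "spam", "SomeAdmin", 0.12)
log_training(log, "FrontPage", 3, "ham", "SomeAdmin")  # correction
log.seek(0)
calls = []
replayed = replay(log, lambda page, rev, is_spam: calls.append((page, rev, is_spam)))
```

Replaying against a fresh database after each change to the tokenizer would also give the before/after comparison mentioned above.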

dbase filename

While spambayes is a two-way categorizer, it could work very well when trained for monolingual wikis. So I want to be able to tell from the name of the database what it was trained for.

CodeBase

coding

MoinMoin/Page.py

resolving path for dbfile

   event_log = self.request.rootpage.getPagePath('event-log', isfile=1)

You may want to use self.request.cfg.data_dir directly.

And if we want to be able to exchange the categorizer, it may be a good idea to save the dbase file in a directory named after its categorizer,

e.g. self.request.cfg.data_dir /categorizer/spambayes/en.db
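One way the path could be built, as a sketch only: `classifier_db_path` is a hypothetical helper, and the `data_dir` value below is a made-up example standing in for `self.request.cfg.data_dir`.

```python
import os

def classifier_db_path(data_dir, categorizer, language):
    """Build <data_dir>/categorizer/<categorizer>/<language>.db."""
    for part in (categorizer, language):
        # Refuse anything that could escape the data directory.
        if not part.replace("-", "").replace("_", "").isalnum():
            raise ValueError("unexpected path component: %r" % part)
    return os.path.join(data_dir, "categorizer", categorizer, language + ".db")

# Example with an assumed data directory.
path = classifier_db_path("/srv/wiki/data", "spambayes", "en")
```

With this layout, the language the database was trained for is visible from the file name, and a second categorizer gets its own directory next to spambayes.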

   if not self.classifier:

could instead be written as

   if self.classifier is None:
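The difference matters whenever an object can be falsy without being missing. The sketch below uses a hypothetical `EmptyClassifier` that defines `__len__`; whether the real spambayes classifier behaves this way is not claimed here.

```python
class EmptyClassifier(object):
    """Hypothetical classifier that reports how many messages it has seen."""

    def __init__(self):
        self.trained = 0

    def __len__(self):
        return self.trained

classifier = EmptyClassifier()

# Truth testing falls back to __len__, so an untrained classifier
# looks the same as "no classifier at all".
treated_as_missing = not classifier

# The identity test only fires when nothing has been assigned.
really_missing = classifier is None
```

Here `not classifier` is true although the object exists, so a check written that way would create a new classifier on every call.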

-- ReimarBauer 2007-05-01 08:24:54


Ideas/Problems


Further Extensions for classifications


Questions

  1. I guess I don't understand how code development is managed for Moin. Where is the code for the proof of concept? -- SkipMontanaro

  2. How do you propose to save spammish submissions for later retraining of the SBClassifier? -- SkipMontanaro

    1. The pages submitted by users (e.g. admins) would be automatically appended to the correct category: SpamPage HamPages. This way the filter knows which pages have been trained on, and the admin has a view of the training corpus. -- MarianNeagul 2007-05-06 17:19:18

  3. What pages will be trained on? -- SkipMontanaro

    1. E.g. TrainOnErrors -- MarianNeagul 2007-05-06 17:19:18

  4. Reimar, what is the proper way of updating category pages? Please take a look at my repo for the way I have done it. -- MarianNeagul 2007-05-13 20:01:20

    • pg.saveText(pg.get_raw_body()+" * %s\n" % (thispage.page_name, ), 0) We had better talk about ham and spam cataloged pages, because the term "category" is used differently for wiki pages. ;-) Maybe a Dict write could be added later on. Both pages need an acl line. -- ReimarBauer 2007-05-14 01:18:27

  5. Skip, could we do some spam checking on the differences between page revisions? I think this would be useful for detecting spammers who append their text to a page and keep the original text. With this technique spammers avoid Bayesian text classification, because the weights of the spam text features are unlikely to influence the global classification of the page. Would this be the place where we could use Bayesian Noise Reduction to identify the text that does not fit in a specific page? -- MarianNeagul 2007-05-13 20:01:20

    • Do you mean to check edits? I think the only thing that should be checked is the new content. Sorry, I'm not familiar with Bayesian Noise Reduction. 2007-05-16 17:47:00 Skip, see the logs at MoinMoinChat/Logs/moin-dev/2007-05-14 -- ReimarBauer 2007-05-16 20:06:40

      • Reimar: I don't understand this comment:
        • 2007-05-14T23:26:48 <dreimark> it would be nicer to see them in percent, and to use 0 for unknown and then we could have 100.00 % ham and 100.00 % spam

      • The SpamBayes classifier when properly trained produces a strongly bimodal distribution with ham scoring at or very near 0.0 and spam scoring at or very near 1.0. Unsure falls in the middle. 2007-05-16 20:47:00

        • The current version shows ham 0.0000 or spam 1.000, with a lot of digits, in the info line of the page. I would prefer this to be shown differently; yesterday we discussed with starshine using symbols to visualize the categorisation state. -- ReimarBauer 2007-05-17 09:24:03

    • Yes, exactly! By "checking only new content", do you mean checking only the newly added text and not the resulting page? -- MarianNeagul 2007-05-16 20:15:49

      • I think you might want to experiment with generating a MIME multipart "message" from the submission to feed to the classifier. Any attachments to the page would be attached as the appropriate MIME type. The primary advantage of this is that it would simply feed into a minimally modified tokenizer (since SpamBayes already deals with MIME email messages). The first text/plain section would include either the full submission for page creates or just the insertion text for page edits (the ">" lines from a simple diff would probably work just fine). You would probably also put in any synthetic tokens you generate that you believe will help distinguish ham from spam: were they logged in? how long since they first created their profile? what fraction of the page was deleted (approximately diff's "<" lines / original number of lines)? etc.

      • Checking only new content enables a subtle way to fool the spam detection: use 2 edits. The first one adds reasonable-looking text (like "My wiki commercialisation is here: www.exemple.org"). The second one removes some portions of the previous text (like "My wiki commer" and "ation"). -- JeanPhilippeGuérard 2007-05-16 22:12:44

        • If you use a line diff, I don't think it is a problem, since links can't span lines. If you are not sure, always add some context: a few lines before and after the new content.
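The ideas from this thread (a line diff with optional context, wrapped in a MIME multipart message with synthetic tokens) could be sketched as below. This is an illustration under assumptions, not the PoC code: `diff_edit`, `build_message` and the `X-Wiki-*` header names are hypothetical, and the SpamBayes tokenizer would probably need to be told about such headers before they count as tokens.

```python
import difflib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def diff_edit(old_text, new_text, context=2):
    """Return (changed lines of the new text plus context, deleted fraction)."""
    old, new = old_text.splitlines(), new_text.splitlines()
    keep, removed = set(), 0
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed += i2 - i1  # old lines deleted or replaced
        # Keep the touched lines of the new text and some lines around them.
        keep.update(range(max(0, j1 - context), min(len(new), j2 + context)))
    fraction = removed / float(len(old)) if old else 0.0
    return [new[j] for j in sorted(keep)], fraction

def build_message(old_text, new_text, logged_in, context=2):
    """Wrap an edit in a multipart message for an email-oriented tokenizer."""
    lines, fraction = diff_edit(old_text, new_text, context)
    msg = MIMEMultipart()
    # Synthetic tokens, carried as (hypothetical) headers.
    msg["X-Wiki-Logged-In"] = "yes" if logged_in else "no"
    msg["X-Wiki-Deleted-Fraction"] = "%.2f" % fraction
    # Page creates send the full text, edits only the changed lines.
    body = "\n".join(lines) if old_text else new_text
    msg.attach(MIMEText(body, "plain", "utf-8"))
    return msg

# The second step of the two-edit trick described above.
old = "Welcome\nMy wiki commercialisation is here: www.exemple.org\nBye"
new = "Welcome\ncialis is here: www.exemple.org\nBye"
msg = build_message(old, new, logged_in=False, context=0)
body = msg.get_payload(0).get_payload(decode=True).decode("utf-8")
```

Because the diff works on whole lines, the shortened line is classified in full together with its link, even though the second edit only deleted characters.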

MoinMoin: MarianNeagul/ProofOfConcept (last edited 2007-10-29 19:12:06 by localhost)