See also: AntiSpamGlobalSolution/ReFactoring

Description

Moin was too slow when checking saved content against BadContent. This patch adds a cache for the compiled blacklist regular expressions.

Steps to reproduce

Create a page, hit the save button, drink a coffee, and wait until the save is finished ...

Example

http://pipapo.org/iowiki/SystemInfo

Improvements

Caching antispam regular expressions

--- antispam.py.org     2005-02-26 00:03:50.000000000 +0100
+++ antispam.py 2005-02-26 23:46:34.000000000 +0100
@@ -21,6 +21,7 @@
 from MoinMoin.security import Permissions
 from MoinMoin import caching, wikiutil
 
+mmblcache = []
 
 # Errors ---------------------------------------------------------------
 
@@ -123,6 +124,10 @@
                                            response)
                     p._write_file(response)
 
+                # clear the cache
+                global mmblcache
+                mmblcache = []
+
             except (timeoutsocket.Timeout, timeoutsocket.error), err:
                 # Log the error
                 # TODO: check if this does not fill the logs!
@@ -158,13 +163,23 @@
             for pn in BLACKLISTPAGES:
                 do_update = (pn != "LocalBadContent")
                 blacklist += getblacklist(request, pn, do_update)
+            global mmblcache
             if blacklist:
-                for blacklist_re in blacklist:
-                    try:
-                        match = re.search(blacklist_re, newtext, re.I)
-                    except re.error, err:
-                        dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
-                        continue
+                request.clock.start('mmblcache')
+                if not mmblcache:
+                    for blacklist_re in blacklist:
+                        try:
+                            mmblcache.append(re.compile(blacklist_re, re.I))
+                        except re.error, err:
+                            dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
+                            continue
+                request.clock.stop('mmblcache')
+
+                for blacklist_re in mmblcache:
+                    request.clock.start('regexmatch')
+                    match = blacklist_re.search(newtext)
+                    request.clock.stop('regexmatch')
+
                     if match:
                         # Log error and raise SaveError, PageEditor
                         # should handle this.

Note: will only improve persistent servers, not for CGI.
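
For readers who don't want to dig through the diff, the core idea is a module-level cache of compiled regexes that lives as long as a persistent server process. A minimal standalone sketch (the function names are placeholders for illustration, not MoinMoin APIs):

import re

# Module-level cache; lives as long as the (persistent) server process.
_compiled_blacklist = []

def compiled_blacklist(patterns):
    """Compile the blacklist patterns once and reuse them on later saves.

    `patterns` is a list of regex strings, e.g. the lines of BadContent.
    """
    if not _compiled_blacklist:
        for pattern in patterns:
            try:
                _compiled_blacklist.append(re.compile(pattern, re.I))
            except re.error:
                # Skip broken patterns, like the dprint() branch in the patch.
                continue
    return _compiled_blacklist

def invalidate_blacklist():
    """Call this after BadContent / LocalBadContent has been refreshed."""
    global _compiled_blacklist
    _compiled_blacklist = []

def is_spam(newtext, patterns):
    for scanner in compiled_blacklist(patterns):
        if scanner.search(newtext):
            return True
    return False

As noted above, a CGI process throws this cache away after every request, so only persistent servers benefit.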

Discussion

Gives a huge speedup on persistent servers. I don't know how much it affects CGI servers; it probably gives a small slowdown.

Could be a workaround for 1.3 until the final persistent caching of compiled regexes is implemented in 1.4 (and hopefully backported to 1.3).

This cache saves the compile time, but most of the time is spent searching the page. For a big page, the search can take 10s on a fast machine, while compiling takes about 1s. For small pages, like this one, it does help. See AntiSpamGlobalSolution/ReFactoring, section "Searching only in links". If you combine the two, antispam will run fine even on low-end machines.
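
For context, a rough sketch of what "searching only in links" means, assuming a simplified URL regex (the real refactoring on AntiSpamGlobalSolution/ReFactoring uses MoinMoin's own link parsing, not this pattern):

import re

# Simplified URL extractor -- an assumption for illustration only.
_url_re = re.compile(r'(?:http|https|ftp)://[^\s\'"<>]+', re.I)

def is_spam_in_links(newtext, compiled_blacklist):
    """Run the blacklist only over the links, not the whole page body."""
    links = '\n'.join(_url_re.findall(newtext))
    for scanner in compiled_blacklist:
        if scanner.search(links):
            return True
    return False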

I think that your first patch could be much simpler using Python's built-in re cache. When you create a regex object (using re.compile(pattern)), Python caches the object (probably keyed by pattern), and the next compile of the same pattern costs nothing. So to get this cache, all we have to do is create a list of re objects and then use each object's search method instead of re.search. Check how this is done in the time_antispam.py script in AntiSpamGlobalSolution/ReFactoring, and check the example timings on that page. -- NirSoffer 2005-02-27 02:05:04
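
Roughly what that means in code (the patterns and timings here are illustrative only; see time_antispam.py on AntiSpamGlobalSolution/ReFactoring for the real measurements):

import re, time

patterns = [r'cheap-\w+\.example\.com', r'casino\d+', r'free-download-\w+']
newtext = "some harmless page body " * 10000

# Variant 1: re.search() with pattern strings; Python recompiles each pattern
# (or fetches it from the re module's internal cache) on every call.
start = time.time()
for p in patterns:
    re.search(p, newtext, re.I)
print("pattern strings:      %.4f s" % (time.time() - start))

# Variant 2: compile once, keep the objects, call their search() method.
compiled = [re.compile(p, re.I) for p in patterns]
start = time.time()
for scanner in compiled:
    scanner.search(newtext)
print("pre-compiled objects: %.4f s" % (time.time() - start))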

Another problem with this patch is that you empty the whole cache when either BadContent or LocalBadContent changes, but you don't need to. Here is a patch that holds a cache per page and updates each page's cache separately. It is not ready for merging; I don't have time to test it now.

* looking for arch@arch.thinkmo.de--2003-archives/moin--main--1.3--patch-633 to compare with
* comparing to arch@arch.thinkmo.de--2003-archives/moin--main--1.3--patch-633
M  MoinMoin/util/antispam.py

* modified files

--- orig/MoinMoin/util/antispam.py
+++ mod/MoinMoin/util/antispam.py
@@ -21,6 +21,8 @@
 from MoinMoin.security import Permissions
 from MoinMoin import caching, wikiutil
 
+BLACKLISTPAGES = ["BadContent", "LocalBadContent"]
+_cache = {}
 
 # Errors ---------------------------------------------------------------
 
@@ -53,14 +55,22 @@
 
 
 def makelist(text):
-    """ Split text into lines, strip them, skip # comments """
+    """ Split text into lines, strip them, skip # comments
+
+    Return list of re objects.
+    """
     lines = text.splitlines()
     list = []
     for line in lines:
         line = line.split(' # ', 1)[0] # rest of line comment
         line = line.strip()
         if line and not line.startswith('#'):
-            list.append(line)
+            try:
+                scanner = re.compile(line, re.I)
+                list.append(scanner)
+            except re.error, err:
+                dprint("Error in regex '%s': %s. Please check the pages %s." % (
+                    line, str(err), ', '.join(BLACKLISTPAGES)))
     return list
 
 
@@ -72,6 +82,8 @@
     @rtype: list
     @return: list of blacklisted regular expressions
     """
+    global _cache
+
     from MoinMoin.PageEditor import PageEditor
     p = PageEditor(request, pagename, uid_override="Antispam subsystem")
     if do_update:
@@ -122,6 +134,8 @@
                         raise WikirpcError("failed to get BadContent data",
                                            response)
                     p._write_file(response)
+                    # Delete cache for this page
+                    _cache[pagename] = None
 
             except (timeoutsocket.Timeout, timeoutsocket.error), err:
                 # Log the error
@@ -138,34 +152,37 @@
 
             # set back socket timeout
             timeoutsocket.setDefaultSocketTimeout(old_timeout)
-
-    blacklist = p.get_raw_body()
-    return makelist(blacklist)
+
+    # Return cached blacklist or create new
+    if not pagename in _cache:
+        blacklist = p.get_raw_body()
+        _cache[pagename] = makelist(blacklist)
+
+    return _cache[pagename]
 
 
 class SecurityPolicy(Permissions):
     """ Extend the default security policy with antispam feature """
 
     def save(self, editor, newtext, rev, **kw):
-        BLACKLISTPAGES = ["BadContent", "LocalBadContent"]
         if not editor.page_name in BLACKLISTPAGES:
             request = editor.request
 
             # Start timing of antispam operation
             request.clock.start('antispam')
-
+
+            request.clock.start('antispam-get-blacklist')
             blacklist = []
             for pn in BLACKLISTPAGES:
                 do_update = (pn != "LocalBadContent")
                 blacklist += getblacklist(request, pn, do_update)
+            request.clock.stop('antispam-get-blacklist')
             if blacklist:
-                for blacklist_re in blacklist:
-                    try:
-                        match = re.search(blacklist_re, newtext, re.I)
-                    except re.error, err:
-                        dprint("Error in regex '%s': %s. Please check the pages %s." % (blacklist_re, str(err), ', '.join(BLACKLISTPAGES)))
-                        continue
+                request.clock.start('antispam-match')
+                for scanner in blacklist:
+                    match = scanner.search(newtext)
                     if match:
+                        request.clock.stop('antispam-match')
                         # Log error and raise SaveError, PageEditor
                         # should handle this.
                         _ = editor.request.getText
@@ -175,6 +192,7 @@
                             }
                         dprint(msg)
                         raise editor.SaveError(msg)
+                request.clock.stop('antispam-match')
             request.clock.stop('antispam')
 
         # No problem to save if my base class agree
cache-blacklist.patch

About sharing the cache between processes: this introduces new problems and probably needs locking. As processes are not created and killed often, it's not very important.
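
For the record, a minimal sketch of what cross-process sharing with locking could look like, assuming a plain file in the temp directory and fcntl (so Unix only); the file name and functions below are hypothetical, and in MoinMoin itself the caching module would be the natural home for this. Only the raw blacklist text is shared; each process still compiles its own regex objects.

import os, fcntl, tempfile

# Hypothetical location; a real version would live in the wiki's cache directory.
CACHE_FILE = os.path.join(tempfile.gettempdir(), 'badcontent.cache')

def write_shared_blacklist(text):
    """Store the raw BadContent text for other processes to reuse.

    The exclusive lock keeps a concurrent reader from seeing a half-written file.
    """
    f = open(CACHE_FILE, 'w')
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        f.write(text)
    finally:
        f.close()  # closing the file releases the lock

def read_shared_blacklist():
    """Return the cached BadContent text, or None if nothing is cached yet."""
    if not os.path.exists(CACHE_FILE):
        return None
    f = open(CACHE_FILE, 'r')
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_SH)
        return f.read()
    finally:
        f.close()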

Check only diffs

Here is the patch that checks only the newly added lines against the BadContent regexps. This gives a HUGE speed improvement even on HUGE pages (a rough sketch of the idea follows below).

<!> Note:

Nice, but where are the numbers that show the HUGE difference?

addendum: http://cehteh.homeunix.org/pipawiki/PasteBin/absmi ... a really big page (297156 bytes) that takes ages to render in my browser
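
The attached patch itself is not inlined above, but the idea can be sketched with difflib (an illustration, not the actual patch): compute the lines added by this save and run the blacklist only over those.

import difflib

def added_lines(oldtext, newtext):
    """Return only the lines that this edit adds."""
    diff = difflib.ndiff(oldtext.splitlines(), newtext.splitlines())
    return [line[2:] for line in diff if line.startswith('+ ')]

def is_spam_in_diff(oldtext, newtext, compiled_blacklist):
    """Match the blacklist against the new material only, not the whole page."""
    new_material = '\n'.join(added_lines(oldtext, newtext))
    for scanner in compiled_blacklist:
        if scanner.search(new_material):
            return True
    return False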

Plan


CategoryMoinMoinBugFixed CategoryRelease1.3.5
