Basics

Some basic ideas about improved filesystem storage.

Highlights:

avoids storing pages by name, stores by an arbitrary ID instead
- no problem with long utf-8 pagenames
- no rename problem

I have implemented this and it passes all current tests.

Filesystem layout:

pages/
    \
     +- name-mapping
     |
     +- name-mapping.lock
     |
     +- news
     |
     +- 1/
     | \...
     |
     +- 2/
     | \...
     |
     +- 3/
     | \...
     |
     +- 123/
     | \
     |  +- name
     |  +- meta
     |  +- rev.1
     |  +- rev.2
     |  +- rev.3
     |  +- ...
     |
     +- tmp-XXXXXX-rev

With the file contents being:

name-mapping: Item name -> ID mapping, can change under renames and be appended to
name-mapping.lock: name-mapping lock dir for multiple concurrent writers
news: news file
name: the item name of the item, in utf-8, needed for id->name backward mapping
meta: metadata of the item
rev.N: metadata and data of the revision

Details

name-mapping database

Just use cdb, needs write locking only and rename-over (or rename, retry on windows); fast enough to rewrite entire database on a change, ~100ms to change a record in a 10000 record database; very quick access, <1ms for a lookup.

There's pycdb.py (which I took from http://www.unixuser.org/~euske/doc/tcdb/index.html and cleaned up) as well as python-cdb.

Lookup code:

   1 import cdb
   2 
   3 key = 'lwaekdvaaoalhwwvmbej'
   4 
   5 c = cdb.init('test')
   6 
   7 c.get(key) # returns None if not found

Renaming an item

Write-lock the database and change the entry, aborting if it doesn't exist, still under write-lock update the item's name file.

   1 import cdb
   2 
   3 key = 'lwaekdvaaoalhwwvmbej'
   4 
   5 c = cdb.init('test')
   6 maker = cdb.cdbmake('test', "test.tmp")
   7 
   8 r = c.each()
   9 while r:
  10     i, v = r
  11     if i == key:
  12         maker.add(i, 'newid')
  13     else:
  14         maker.add(i, v)
  15     r = c.each()
  16 
  17 maker.finish()

Committing an item / publishing item metadata

To create an item initially, first assign it an ID (see below) and write the revision or metadata into it. Lock the database for writing and try to add an entry for the desired name, this can fail if it already exists, in which case all associated files are deleted.

ID assignment

I think ID assignment should be random to avoid having to readdir() or keep a counter. Start with a 1e4 order of magnitude and try to find a random ID that doesn't exist yet between 0 and the current magnitude. If doing that three times doesn't succeed because the ID already exists, increase order of magnitude and try again.

Creating a revision

On create_revision(), create a tmp-XXXXXX-rev temporary file to hold the revision data (to avoid holding it in memory), write the four bytes 0x00 0x00 0x02 0x00 into it, pickle an empty dict into it and then seek to byte position 512. Then write revision data into the file. When the new revision is committed, pickle the metadata, check that it doesn't exceed 508 bytes and store it into the temporary file starting at offset 4. If it does exceed 508 bytes, create a new temporary file, write the length into it (in big endian), write the metadata into it and copy the data into it. Remove the original. Then fsync() the data file (new or old) and rename() it into the right directory aborting and deleting it instead on errors (revision already exists.)

This means the file format looks as follows:

bytes	data
0..3	data offset (N bytes)
3..N-1	pickled metadata
N ..	revision data

512 is a number that can be changed, many filesystems have 4k blocksize so using 512 means that many files will still only occupy a single block while keeping enough space for most metadata initially to avoid the data copy. If moin ever stores huge amounts of metadata, this value might need to be increased, and it could be configurable if the wiki owner expects very long ACL strings.

Reading revision metadata

Seek to file offset 4 and unpickle the metadata dict.

Reading revision data

Read the first four bytes, seek to that offset and read until the end of the file.

Reading item metadata

Unpickle the metadata from the 'meta' file.

Writing item metadata

Locks are kept to avoid concurrent writers, so this isn't a concern, but concurrent readers are possible. Therefore, pickle the new metadata into a meta-new file, fsync() it and rename() it over the old one. Where that isn't possible with concurrent readers, employ retry logic.

News file

When a revision or item is created, append (using open(..., 'ab')) a fixed-length record to the news file, consisting of the 32-bit item ID, 32-bit revision number and 64-bit timestamp (in seconds since ...)

To read, read it backwards.

MoinMoin: JohannesBerg/FilesystemStorage (last edited 2011-03-25 20:40:30 by PaulBoddie)