Contents
Basics
Some basic ideas about improved filesystem storage.
Highlights:
- avoids storing pages by name, stores by an arbitrary ID instead
- no problem with long utf-8 pagenames
- no rename problem
I have implemented this and it passes all current tests.
Filesystem layout:
pages/ \ +- name-mapping | +- name-mapping.lock | +- news | +- 1/ | \... | +- 2/ | \... | +- 3/ | \... | +- 123/ | \ | +- name | +- meta | +- rev.1 | +- rev.2 | +- rev.3 | +- ... | +- tmp-XXXXXX-rev
With the file contents being:
- name-mapping
Item name -> ID mapping, can change under renames and be appended to
- name-mapping.lock
- name-mapping lock dir for multiple concurrent writers
- news
- news file
- name
the item name of the item, in utf-8, needed for id->name backward mapping
- meta
- metadata of the item
- rev.N
- metadata and data of the revision
Details
name-mapping database
Just use cdb, needs write locking only and rename-over (or rename, retry on windows); fast enough to rewrite entire database on a change, ~100ms to change a record in a 10000 record database; very quick access, <1ms for a lookup.
There's pycdb.py (which I took from http://www.unixuser.org/~euske/doc/tcdb/index.html and cleaned up) as well as python-cdb.
Lookup code:
Renaming an item
Write-lock the database and change the entry, aborting if it doesn't exist, still under write-lock update the item's name file.
Committing an item / publishing item metadata
To create an item initially, first assign it an ID (see below) and write the revision or metadata into it. Lock the database for writing and try to add an entry for the desired name, this can fail if it already exists, in which case all associated files are deleted.
ID assignment
I think ID assignment should be random to avoid having to readdir() or keep a counter. Start with a 1e4 order of magnitude and try to find a random ID that doesn't exist yet between 0 and the current magnitude. If doing that three times doesn't succeed because the ID already exists, increase order of magnitude and try again.
Creating a revision
On create_revision(), create a tmp-XXXXXX-rev temporary file to hold the revision data (to avoid holding it in memory), write the four bytes 0x00 0x00 0x02 0x00 into it, pickle an empty dict into it and then seek to byte position 512. Then write revision data into the file. When the new revision is committed, pickle the metadata, check that it doesn't exceed 508 bytes and store it into the temporary file starting at offset 4. If it does exceed 508 bytes, create a new temporary file, write the length into it (in big endian), write the metadata into it and copy the data into it. Remove the original. Then fsync() the data file (new or old) and rename() it into the right directory aborting and deleting it instead on errors (revision already exists.)
This means the file format looks as follows:
bytes |
data |
0..3 |
data offset (N bytes) |
3..N-1 |
pickled metadata |
N .. |
revision data |
512 is a number that can be changed, many filesystems have 4k blocksize so using 512 means that many files will still only occupy a single block while keeping enough space for most metadata initially to avoid the data copy. If moin ever stores huge amounts of metadata, this value might need to be increased, and it could be configurable if the wiki owner expects very long ACL strings.
Reading revision metadata
Seek to file offset 4 and unpickle the metadata dict.
Reading revision data
Read the first four bytes, seek to that offset and read until the end of the file.
Reading item metadata
Unpickle the metadata from the 'meta' file.
Writing item metadata
Locks are kept to avoid concurrent writers, so this isn't a concern, but concurrent readers are possible. Therefore, pickle the new metadata into a meta-new file, fsync() it and rename() it over the old one. Where that isn't possible with concurrent readers, employ retry logic.
News file
When a revision or item is created, append (using open(..., 'ab')) a fixed-length record to the news file, consisting of the 32-bit item ID, 32-bit revision number and 64-bit timestamp (in seconds since ...)
To read, read it backwards.