Why quote WikiNames?

Wiki names can use any charset and any character. When saving a wiki page to the file system, we must use a safe charset the file system supports. The safest solution is to "asciify" the name, so it can be saved on any file system.

Up to and including moin 1.2.x, moin converted non-ascii characters in names to _xx, where xx is the config.charset encoding of the character. This format used at least 6 characters for each non-ascii character in the wiki name, limiting the longest wiki name to 44 characters. This is very problematic for Hebrew and even worse for Asian languages.

Since "_" should be freed for wiki_names like this, the format changed to (xx) in early 1.3 development versions. This format is even worse, since each Hebrew character uses 8 (!) characters: (xx)(xx). This limits the longest wiki name to only 31 (255/8) characters.

Testing 1.3 code shows that the practical longest name for Hebrew is only 30 characters. A name with 31 characters causes IOError: [Errno 63] File name too long.

This 30-character name is converted to a 240-character file name, and with the time stamp in the backup directory it becomes a 251-character name.
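The arithmetic above can be checked with a short sketch. It assumes names are encoded as utf8 and that the file system allows at most 255 characters per file name; NAME_MAX and quoted_length are hypothetical names used only for illustration.

```python
# Assumption: the file system allows at most 255 characters per file name.
NAME_MAX = 255


def quoted_length(name, chars_per_byte):
    """Length of a quoted name in which every character is non-ascii and
    every utf8 byte is written using chars_per_byte characters."""
    return len(name.encode('utf-8')) * chars_per_byte


hebrew_name = '\u05d0' * 30                      # 30 Hebrew letters, 2 utf8 bytes each
pre13_length = quoted_length(hebrew_name, 3)     # _xx for each byte
early13_length = quoted_length(hebrew_name, 4)   # (xx) for each byte
longest_early13 = NAME_MAX // (2 * 4)            # theoretical limit for Hebrew

print(pre13_length, early13_length, longest_early13)
```

The 240 characters of the early 1.3 format leave only 15 characters of headroom, which is why the time stamp in the backup directory brings the name so close to the limit.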

Mac OS X notes

On Mac OS X, the file system uses utf8, with only one illegal character: ":". Maybe we should add a configuration option to avoid filename quoting. This could be a much better solution for the wiki administrator, especially if there are many wiki pages in non-western languages. The downside is less portability for the wiki data. If we allow this, we might need to create a converter for such wikis. This can be a big problem, since very long file names cannot be converted.

Proposal

To save some space while freeing the "_" character, the format will be (xx...), with whole sequences of non-ascii characters quoted and enclosed in parentheses.

Examples

WikiName              File Name

WikiName              WikiName
free link             free(20)link
underscore_link       underscore(5f)link
Page/SubPage          Page(2f)SubPage
\u05e0\u05d9\u05e8    (d7a0d799d7a8)
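The examples above can be reproduced with a minimal sketch of the proposed format. This is not the code from quote.py. It assumes that only ascii letters and digits are left unquoted, and quote_wiki_name and UNSAFE_RUN are hypothetical names.

```python
import re

# Assumption: only ascii letters and digits are safe; anything else is quoted.
UNSAFE_RUN = re.compile(r'[^a-zA-Z0-9]+')


def quote_wiki_name(name):
    """Quote each whole run of unsafe characters as (xx...), where xx... are
    the hex digits of the run's utf8 bytes."""
    def quote_run(match):
        return '(' + match.group(0).encode('utf-8').hex() + ')'
    return UNSAFE_RUN.sub(quote_run, name)


for wiki_name in ['WikiName', 'free link', 'underscore_link',
                  'Page/SubPage', '\u05e0\u05d9\u05e8']:
    print(quote_wiki_name(wiki_name))
```

Quoting a whole run at once is what saves space: the two parentheses are paid once per run instead of once per byte.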

Limits

For Hebrew:

Languages which use more than 2 bytes per character in utf8 will be limited to shorter names.
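As a rough sketch of how the limit depends on the number of utf8 bytes per character, the following assumes a 255-character file name limit and a name made of a single quoted run. NAME_MAX, longest_name and the reserved parameter are hypothetical. The value 11 used for the time stamp is an assumption derived from the 240 and 251 character figures above.

```python
# Assumption: the file system allows at most 255 characters per file name.
NAME_MAX = 255


def longest_name(bytes_per_char, reserved=0):
    """Longest name made of one run of non-ascii characters. The run costs
    2 characters for the parentheses plus 2 hex digits per utf8 byte."""
    return (NAME_MAX - reserved - 2) // (2 * bytes_per_char)


print(longest_name(2))                # 2-byte characters such as Hebrew
print(longest_name(3))                # 3-byte characters
print(longest_name(2, reserved=11))   # leaving room for a backup time stamp
```

Names that mix several runs with ascii text pay for one pair of parentheses per run, so their real limit is somewhat lower.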

Code

Here is the new code for quoting, please test it and add more test cases in different languages, so we can make sure it's really working for everyone - before it is merged into moin.

The code includes functions for quoting, unquoting and converting from both pre 1.3 names and the current 1.3 names, and tests for each function.

The unquoting function checks for invalid file names. This should not happen in normal use, but if it does, it will create a big mess if we don't catch it. The code includes versions using .find(), regex, and a valid only regex \([a-fA-F0-9]+\), along with timing tests.
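A safe unquoting function along these lines might look like the sketch below. It is not the code from quote.py. It uses the valid only regex mentioned above and treats any leftover parenthesis, odd number of hex digits or undecodable byte sequence as an invalid file name. InvalidFileNameError and unquote_wiki_name are hypothetical names.

```python
import re

# The "valid only" pattern: a parenthesized run of hex digits.
QUOTED_RUN = re.compile(r'\(([a-fA-F0-9]+)\)')


class InvalidFileNameError(ValueError):
    """Hypothetical error for file names that are not validly quoted."""


def _check_literal(text, filename):
    # Outside a quoted run, a parenthesis means the name is malformed.
    if '(' in text or ')' in text:
        raise InvalidFileNameError(filename)
    return text


def unquote_wiki_name(filename):
    """Convert a quoted file name back to the wiki name, or raise
    InvalidFileNameError if the file name is not validly quoted."""
    parts = []
    pos = 0
    for match in QUOTED_RUN.finditer(filename):
        parts.append(_check_literal(filename[pos:match.start()], filename))
        try:
            # Fails on an odd number of hex digits or on invalid utf8.
            parts.append(bytes.fromhex(match.group(1)).decode('utf-8'))
        except ValueError:
            raise InvalidFileNameError(filename)
        pos = match.end()
    parts.append(_check_literal(filename[pos:], filename))
    return ''.join(parts)


print(unquote_wiki_name('free(20)link'))
```

Raising an error instead of returning a partly unquoted name keeps a damaged file name from silently turning into a different page.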

These are the timings for 100 iterations on mix of 10 names (G5 2G, Mac OS X 10.3.3, python 2.3):

Time unquoteWikiname using find: ... 0.0573s ok
Time unquoteWikiname using regex: ... 0.0635s ok
Time unquoteWikiname using find safely: ... 0.0571s ok
Time unquoteWikiname using regex safely: ... 0.0641s ok
Time unquoteWikiname using valid only regex: ... 0.0622s ok

It seems all perform similarly, so the safer code is better (I predicted this beforehand, but it was interesting to test).

quote.py

I opened a branch moin--quote--1.0 on my public read only archive. I hope you can register it using http. http://nirs.dyndns.org/nirs@freeshell.org--2004/

You can get the archive by:

% tla register-archive nirs@freeshell.org--2004 \
            http://nirs.dyndns.org/nirs@freeshell.org--2004
% tla get http://nirs.dyndns.org/nirs@freeshell.org--2004/moin--quote--1.0

Progress

Current code review

I searched the current code to find code that tries to work on quoted file names:

Test Wiki

Here is a test wiki using the new quoting scheme: http://nirs.dyndns.org/quote . The new code did not introduce new problems.

Example of a long page name in this wiki (63 characters, 52 Hebrew characters, 11 spaces)

File name too long errors

There is no error checking for file names that are too long, and the resulting exceptions are not handled. This also happens on 1.1 and 1.2 when trying to use very long names. It seems that we get errors when os.mkdir() is called even when the name is shorter than the theoretical maximum length.
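One possible way to add the missing check is sketched below. It is not existing moin code. It rejects names that are obviously too long before touching the file system, and also turns the error from os.mkdir() into a dedicated exception, since the error can appear below the theoretical limit. NAME_MAX, PageNameTooLong, make_page_dir and the reserve parameter are hypothetical.

```python
import errno
import os

# Assumption: the file system allows at most 255 characters per file name.
NAME_MAX = 255


class PageNameTooLong(ValueError):
    """Hypothetical error raised when a quoted page name cannot be stored."""


def make_page_dir(parent, quoted_name, reserve=0):
    """Create the directory for a quoted page name. reserve is the number of
    characters to keep free, for example for a backup time stamp."""
    if len(quoted_name) + reserve > NAME_MAX:
        raise PageNameTooLong(quoted_name)
    path = os.path.join(parent, quoted_name)
    try:
        os.mkdir(path)
    except OSError as err:
        # The file system may refuse names shorter than NAME_MAX.
        if err.errno == errno.ENAMETOOLONG:
            raise PageNameTooLong(quoted_name) from err
        raise
    return path
```

The caller can then catch PageNameTooLong and show the user a readable message instead of a traceback.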

Example error

Converting old file names

More testing

Encoding Problems

The current code uses config.charset for file name encoding. What will happen if a Unicode name cannot be encoded in a non-Unicode encoding like iso-8859-1?

But what if someone installed the wiki, then decided to move to iso-8859-1? All the existing utf8 encoded and quoted names will not be readable, because the system will try to decode them using config.charset instead of 'utf8'.

If you want to change config.charset - you must convert the existing file names in the wiki, and possibly the log files, from utf8 to another encoding - and this might be impossible.

Something as important as the storage implementation should not be left to the admin. If we use only utf8 or utf16 for file names, it will work for everyone, because you can recode everything into utf8 and back. -- NirSoffer 2004-06-05 11:47:54

We should move to Unicode-only file name encoding; any other option is dangerous. The file name encoding does not have to be the same as the encoding used for the wiki text.
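Both problems can be seen with a few lines of Python. This is only an illustration using the Hebrew example name; the variable names are hypothetical.

```python
name = '\u05e0\u05d9\u05e8'  # the Hebrew example name used above

# utf8 can encode any Unicode name, and the round trip is lossless.
utf8_bytes = name.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# iso-8859-1 has no Hebrew letters, so encoding the name fails.
try:
    name.encode('iso-8859-1')
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Decoding the utf8 bytes with the wrong charset does not fail.
# It silently produces a different, unreadable name.
misread = utf8_bytes.decode('iso-8859-1')

print(round_trip == name, encodable, misread == name)
```

The second case is the dangerous one: after a change of config.charset, nothing raises an error, the page names are simply wrong.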

Existing links on various sites and in search engines that use the pre 1.3 quoting structure will break. Here is MacMac at the top of a Google search:

pre13links.png

This link will break when converting to 1.3.

We can prevent these broken links by also unquoting URLs the pre 1.3 way. I don't know if we should do it, since search engines will index the new pages quite fast.
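Unquoting the pre 1.3 way could be sketched as follows. This is not the converter from quote.py. It assumes that each encoded byte is written as _xx with two hex digits, and that the charset of the old wiki is known. unquote_pre13, OLD_QUOTED_RUN and the charset parameter are hypothetical.

```python
import re

# Assumption: pre 1.3 names write each encoded byte as _xx (two hex digits).
OLD_QUOTED_RUN = re.compile(r'(?:_[0-9a-fA-F]{2})+')


def unquote_pre13(name, charset='utf-8'):
    """Decode runs of _xx bytes in a pre 1.3 name. charset stands in for
    the config.charset the old wiki was using (an assumption here)."""
    def decode_run(match):
        # Join the bytes of the whole run before decoding, so that
        # multi-byte characters are decoded correctly.
        hexdigits = match.group(0).replace('_', '')
        return bytes.fromhex(hexdigits).decode(charset)
    return OLD_QUOTED_RUN.sub(decode_run, name)


print(unquote_pre13('Mac_20Mac'))
```

A request handler could try this on a URL that matches no existing page and redirect to the new name if the result exists.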

MoinMoin: QuotingWikiNames (last edited 2007-10-29 19:07:44 by localhost)