Project: Tree-based output formatter

This project is part of GoogleSoc2008.

This project adds a new tree-based interface between the different parts of the output rendering. The tree can be modified in different ways during rendering. It gets a distinct MIME type so that it fits into the conversions described below. All conversions done by this project use MIME types as identifiers. The input can be of any type, e.g. a page with MIME type text/x-moin1.7, an image with image/jpeg or raw data with application/octet-stream. Each converter may support several input and output MIME types.
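A minimal sketch of how converters keyed by MIME type could be looked up. The registry, the `register` decorator, `find_converter` and the two placeholder converters are hypothetical names invented for illustration; they are not part of MoinMoin.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: (input pattern, output type, converter function).
_converters = []

def register(input_pattern, output_type):
    """Decorator that registers a converter for one input/output type pair."""
    def decorator(func):
        _converters.append((input_pattern, output_type, func))
        return func
    return decorator

def find_converter(input_type, output_type):
    """Return the first registered converter matching both MIME types."""
    for pattern, out, func in _converters:
        if out == output_type and fnmatchcase(input_type, pattern):
            return func
    raise LookupError("no converter from %s to %s" % (input_type, output_type))

@register("text/x-moin1.7", "application/x-moin-document")
def wiki_to_tree(data):
    return ("tree", data)  # placeholder for real parsing

# One converter may serve several input types.
@register("image/*", "application/x-moin-document")
@register("application/octet-stream", "application/x-moin-document")
def blob_to_tree(data):
    return ("object", data)  # placeholder for an object reference node
```

With this sketch, `find_converter("image/jpeg", "application/x-moin-document")` returns `blob_to_tree`, and an unknown type pair raises `LookupError`.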

The rendering of a wiki page is done in several steps:

The same can be used for image/* (and also application/octet-stream):

Types

text/x-moin1.7

Wiki source as used in MoinMoin 1.7.

application/x-moin-document

New intermediate tree format.

Like any spec, it should reuse appropriate standards. What comes to mind is DC (Dublin Core) for the author and similar information (see [DCMI]). The text part may be a proper subset of OpenDocument (see [ODF]) or DocBook. It will also allow HTML in its own (real) namespace (see [XHTML1.1]).

Includes are done with XInclude and XPointer (See [XInclude] and [XPointer]). It may need a special XPointer function to support anything the current Include macro supports.
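To illustrate the idea, here is a small sketch that resolves an XInclude element with the standard library's `xml.etree.ElementInclude`. The in-memory `PAGES` store, the `load_page` loader and the element names `page` and `p` are assumptions made for the example; the real tree format and page storage would differ, and XPointer is not covered here.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical page store standing in for the wiki's storage backend.
PAGES = {"OtherPage": "<p>Included text</p>"}

def load_page(href, parse, encoding=None):
    """Loader callback: resolve an include target to a parsed element."""
    if parse != "xml":
        raise ValueError("only XML includes are handled in this sketch")
    return ET.fromstring(PAGES[href])

doc = ET.fromstring(
    '<page xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<p>Intro</p><xi:include href="OtherPage"/></page>'
)

# Replaces every xi:include element in place with the loaded subtree.
ElementInclude.include(doc, loader=load_page)

texts = [p.text for p in doc.iter("p")]
```

After the call, `texts` holds the text of both paragraphs and no `xi:include` element is left in `doc`.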

application/x-xhtml-moin-page

An XHTML subset without the html, head (and its contents) and body elements. Unlike application/xhtml+xml, it specifies div as the root element, so it can be embedded into the theme to generate the real output. This type is only used internally.

Macro handling

The internal tree will use a slightly different macro definition from the wiki input. Some macros, such as BR, Include and TOC, will be promoted to pseudo-macros and interpreted (_not_ expanded) by the wiki parser.

Macros need to know the context (block vs. inline) they are used in.

BR
It needs to be represented in the tree anyway because it is highly output-dependent. HTML implements it as a br element, ODF as text:line-break.
Include
It needs to be handled specially because normal macro results should not be macro-expanded again. May use XInclude (see [XInclude]).
TOC
This is not yet decided, but many output formats support automatic TOC generation.
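The BR case can be sketched as a generic line-break node that each output converter maps to its own element. The namespace `urn:example:moin-document` and the node name `line-break` are placeholders chosen for this example; the project text does not define them.

```python
import xml.etree.ElementTree as ET

MOIN_NS = "urn:example:moin-document"  # hypothetical namespace
HTML_NS = "http://www.w3.org/1999/xhtml"
ODF_TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

LINE_BREAK = "{%s}line-break" % MOIN_NS

# Output-specific element for the generic line break.
OUTPUT_TAGS = {
    "html": "{%s}br" % HTML_NS,
    "odf": "{%s}line-break" % ODF_TEXT_NS,
}

def render_line_breaks(root, output):
    """Rename every generic line-break node to the output-specific element."""
    for elem in root.iter(LINE_BREAK):
        elem.tag = OUTPUT_TAGS[output]
    return root

# What a parser might emit for "first line<<BR>>second line".
para = ET.Element("{%s}p" % MOIN_NS)
para.text = "first line"
br = ET.SubElement(para, LINE_BREAK)
br.tail = "second line"

render_line_breaks(para, "odf")
```

Because the parser only records that a line break exists, the same tree can later be rendered to HTML or ODF without re-parsing the wiki source.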

Plugin compatibility

The modifications affect three types of MoinMoin plugins: parser, macro and formatter. Parser and macro plugins which only use the public formatter API should work using a special implementation of this API which produces a tree instead of complete output; plugins which directly generate output or even use request.write will not work. Compatibility support for formatter plugins will not be provided.
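A rough sketch of such a compatibility layer: a formatter-like object whose calls build a tree as a side effect and return empty strings, so that plugins which concatenate formatter results keep working. The class `TreeFormatter` and its simplified method signatures are assumptions for illustration; the method names merely follow the on/off calling style of the existing formatter API.

```python
import xml.etree.ElementTree as ET

class TreeFormatter:
    """Formatter-like object that builds a tree instead of returning markup."""

    def __init__(self):
        self.root = ET.Element("page")
        self._stack = [self.root]

    def _open(self, tag):
        elem = ET.SubElement(self._stack[-1], tag)
        self._stack.append(elem)
        return ""  # legacy callers concatenate the return values

    def _close(self):
        self._stack.pop()
        return ""

    def paragraph(self, on):
        return self._open("p") if on else self._close()

    def strong(self, on):
        return self._open("strong") if on else self._close()

    def text(self, text):
        current = self._stack[-1]
        if len(current):
            # Text after a child element is stored in that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + text
        else:
            current.text = (current.text or "") + text
        return ""

fmt = TreeFormatter()
output = (fmt.paragraph(True) + fmt.text("Hello ") + fmt.strong(True)
          + fmt.text("world") + fmt.strong(False) + fmt.text("!")
          + fmt.paragraph(False))
dumped = ET.tostring(fmt.root, encoding="unicode")
```

The plugin receives an empty string as its "output", while the actual content ends up in `fmt.root`.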

Macros

AbandonedPages

See RecentChanges

Action
unused?
AdvancedSearch
Raw HTML
Anchor
Formatter only, unknown
AttachInfo
unknown
AttachList
unknown
BR
Move to parser
Date, DateTime
Formatter only
EditedSystemPages
Formatter only
EditTemplates
unknown
EmbedObject
Raw HTML
FootNote
Move to parser
FullSearch
Raw HTML
FullSearchCached

See FullSearch

GetText
Formatter only
GetText2
Formatter only
GetVal
Formatter only
GoTo
Raw HTML
Hits
Raw text
Icon
Formatter only
Include
Move to parser
InterWiki
Formatter only
LikePages
Formatter only, recheck
MailTo
Formatter only
MonthCalendar
Raw HTML
Navigator
Raw HTML
NewPage
Raw HTML
OrphanedPages
Formatter only
PageCount
Formatter only
PageHits
Formatter only
PageList
Raw HTML
PageSize
Formatter only
RandomPage
Formatter only
RandomQuote
unknown
RecentChanges
Raw HTML
ShowSmileys

widget.browser.DataBrowserWidget

StatsChart
unknown
SystemAdmin
Formatter, raw HTML
SystemInfo
Raw HTML
TableOfContents
Move to parser
TemplateList
Formatter only
TitleIndex
unknown
TeudView
Raw HTML
TitleSearch
unknown
Verbatim
Formatter only
WantedPages
Formatter, raw HTML
WordIndex
unknown

Parser

text/*
Formatter only
text/cplusplus

ParserBase based.

text/creole
Own tree, formatter
text/csv

widget.browser.DataBrowserWidget

text/diff

ParserBase based.

text/docbook
Reuses wiki parser.
text/html
Raw HTML
text/irssi
Formatter
text/java

ParserBase based

text/moin-wiki
Formatter only
text/pascal

ParserBase based

text/python
To be removed (compiles into Python code)
text/rst
unknown
text/xslt
unknown, raw HTML

In-memory tree format

There are two approaches to implementing tree structures. One is low-level like DOM, which only defines elementary types like text, comment and node. The other is a high-level tree which includes nodes for paragraphs, links and so on. My intention was to use a low-level tree because I want extensibility. In the discussion I wrote the following:

There is no usual way. DOM uses a low-level set of items (node, attribute, text and some more) which can represent the whole set of inputs, see [DOM]. Encoding the node types into the classes only works if you know all possible inputs; otherwise you end up with a catch-all node again.

Let's take an example: I want to include MathML, see [MathML]. MathML is an XML application. There are two ways to do that:

I also think the tree should have a stable "dump" format. XML would be a standardized option. Someone also mentioned that it may be easier to compare the dumps in unit tests than to inspect the tree directly.

(!) Or use xpath / other xml tools for the tests.
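Both testing styles can be sketched with the standard library's ElementTree. The element names and the `dump` helper are made up for the example; ElementTree only supports a limited XPath subset, which is enough here.

```python
import xml.etree.ElementTree as ET

def dump(tree):
    """Serialize a tree to an XML string for comparison in tests."""
    return ET.tostring(tree, encoding="unicode")

page = ET.fromstring(
    "<page><p>See <a href='FrontPage'>the front page</a>.</p></page>"
)

# Style 1: compare the whole dump against an expected string.
expected = '<page><p>See <a href="FrontPage">the front page</a>.</p></page>'
dump_matches = dump(page) == expected

# Style 2: query only the part the test cares about.
links = page.findall(".//a[@href='FrontPage']")
```

Comparing dumps catches any unintended change in the tree, while a path query keeps a test focused on a single property.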

There are several XML and tree implementations: xml, xml.etree, xml.dom.minidom and lxml.

xml
AFAIK unmaintained, libxml as dependency.
xml.etree (ElementTree)
Actively maintained, one large API problem: no text nodes.
xml.dom.minidom
Old, was never really usable.
lxml
libxml as dependency.

Because of this, I think the best solution is an ElementTree fork which fixes the API problem. I don't really like forking software, but anything else would introduce compiled extensions.
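The "no text nodes" problem can be seen by parsing the same fragment with ElementTree and with a DOM implementation. This is only an illustration using the standard library; it does not show the proposed fork.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

source = "<p>Hello <b>world</b>!</p>"

# ElementTree: text is not a node. It lives in the .text attribute of the
# enclosing element or in the .tail attribute of the preceding sibling.
p = ET.fromstring(source)
b = p[0]
et_view = (p.text, b.text, b.tail)

# DOM: text is an ordinary child node between elements.
dom_p = minidom.parseString(source).documentElement
dom_view = [node.nodeName for node in dom_p.childNodes]
```

In ElementTree the trailing "!" belongs to the `b` element's `tail`, so code that moves or replaces `b` has to take care of that text as well; in the DOM it is a separate child of `p`.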

Cacheability

The initial tree only depends on the input page. It should be cached directly after the edit. It is also possible to already expand all "stable" (non-volatile) macros at this time. The tree can be converted to HTML in this half-expanded state and cached.

/!\ How can we convert to HTML without fully expanding? E.g. if there are Include and TOC macros, this could be a problem IMHO.

The converter to HTML may be applied to the tree several times; it only touches the things it knows and leaves already existing HTML intact.
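A minimal sketch of such a repeatable converter pass, assuming a hypothetical `urn:example:moin-document` namespace and made-up element names. Elements the converter knows are turned into HTML, while HTML elements and unexpanded macros pass through unchanged, so a second run has nothing left to do.

```python
import xml.etree.ElementTree as ET

MOIN = "{urn:example:moin-document}"  # hypothetical namespace
HTML = "{http://www.w3.org/1999/xhtml}"

# Elements this converter knows how to turn into HTML.
KNOWN = {MOIN + "p": HTML + "p", MOIN + "emphasis": HTML + "em"}

def to_html(root):
    """Convert known elements; leave HTML and unknown elements untouched."""
    for elem in root.iter():
        if elem.tag in KNOWN:
            elem.tag = KNOWN[elem.tag]
    return root

page = ET.Element(HTML + "div")
para = ET.SubElement(page, MOIN + "p")
ET.SubElement(para, MOIN + "emphasis").text = "stable"
# A volatile macro that stays unexpanded in the cached, half-expanded tree.
macro = ET.SubElement(para, MOIN + "macro", name="RandomQuote")

to_html(page)
first = ET.tostring(page, encoding="unicode")
to_html(page)
second = ET.tostring(page, encoding="unicode")
```

Here `first` and `second` are identical, and the `macro` element is still present for a later expansion step.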

Project stages

Plan for GSOC 2008, Plan for extending this project

Further possible projects

Refs

MoinMoin: BastianBlank/TreeOutputFormatter (last edited 2009-08-28 18:11:24 by BastianBlank)