Challenges getting MoinMoin to output XHTML

As of 1.5.0, many things inside MoinMoin prevent it from outputting strict XHTML, or for that matter even well-formed XML (i.e., markup whose tags are properly nested and closed).

The patch MoinMoinPatch/FormatterApiConsistencyForHtmlAttributes goes a long way toward allowing the basic output formatters (text_html.py in particular) to output well-formed XHTML.

Simple syntax stuff

Self-closing tags: Other HTML fragments sitting around in various Python files still need to be changed to conform, for example <link> to <link />, <br> to <br />, and <input> to <input />. This change is easy but tedious. The commonly self-closing tags to watch for include:

Also watch out for <p> without a closing </p>.

The <script> tag should always have a closing </script>, even if it has no contents (such as if it only links to an external file via the src attribute), since IE will ignore it otherwise.
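A quick way to catch unclosed tags in the HTML fragments scattered through the code is to run each fragment through an XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical and not part of MoinMoin. It only checks well-formedness, not validity against the XHTML DTD, and fragments containing named entities such as &nbsp; will be rejected because no DTD is loaded.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a root element.

    Hypothetical helper for illustration; it is not part of MoinMoin.
    """
    try:
        # Wrap the fragment so that several sibling elements are accepted.
        ET.fromstring("<root>%s</root>" % fragment)
        return True
    except ET.ParseError:
        return False
```

For example, is_well_formed('<br />') is true, while is_well_formed('<br>') and is_well_formed('<p>text') are both false.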

Tables: Every <table> element that is output should also include a <tbody> element. This is not only correct XHTML, but can also have an important influence on CSS (some browsers render CSS incorrectly if there is no tbody element). As Moin rarely uses other table components such as thead, tfoot, or caption, the simple solution is to always output tbody together with the table element as one unit:

{{{<table><tbody>

</tbody></table> }}}

Other simple things: The DOCTYPE must be changed, the xmlns attribute must be set on the <html> element, and so on.

{{{<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...
</html>
}}}

Content type

True XHTML should be served with a content type (MIME type) of "application/xhtml+xml", not "text/html". All modern browsers except IE support it. Fortunately, in almost all cases the same output can be used for either MIME type: as long as it is valid XHTML, serving it as text/html will almost always work too. The exceptions are:

Determining which content type to serve is quite easy: look for "application/xhtml+xml" in the HTTP request's Accept header. If it is not present, fall back to text/html.
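The Accept header check could look roughly like the sketch below. The function name choose_content_type is hypothetical and this is not how MoinMoin's request classes are actually structured; it only illustrates the rule. As an added assumption, an explicit q=0 on the XHTML type is treated as "not accepted", and wildcards such as */* are deliberately not counted as support for XHTML.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def choose_content_type(accept_header):
    """Pick the response content type from the request's Accept header.

    Hypothetical helper: returns XHTML_TYPE only if the client lists it
    explicitly with a non-zero quality value, otherwise HTML_TYPE.
    """
    for item in (accept_header or "").split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != XHTML_TYPE:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not accepted"
        if q > 0:
            return XHTML_TYPE
    return HTML_TYPE
```

A browser that sends "text/html,application/xhtml+xml,*/*;q=0.8" would get application/xhtml+xml, while one that sends only "*/*" would get text/html.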

Line number anchors

The wiki parsers put invisible anchors into the output wherever the wiki source line number increases. For the HTML formatter, this results in a <span> element (with no visible content). However, this span can be output at any place, even where a span must not occur (such as inside a <table> but outside a <td>, or inside a <ul> but not inside a <li>).

The formatter probably needs to delay writing anchors until the next legal place for a span element.
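One way to picture the delayed writing is a small writer that buffers anchors while the innermost open element does not allow a span as a direct child. This is only a sketch: the class AnchorDeferringWriter, its methods, and the NO_INLINE_CHILDREN set are all hypothetical and are not the MoinMoin formatter API. It also assumes the writer can track open elements reliably, which the caching behaviour described further down may prevent.

```python
# Elements that may not contain <span> as a direct child (illustrative subset).
NO_INLINE_CHILDREN = {"table", "tbody", "thead", "tfoot", "tr", "ul", "ol", "dl"}


class AnchorDeferringWriter:
    """Hypothetical writer that holds back anchors until a span is legal."""

    def __init__(self):
        self.out = []      # emitted markup fragments
        self.stack = []    # currently open element names
        self.pending = []  # anchor ids waiting for a legal position

    def _span_allowed(self):
        # A span needs some open parent that accepts inline content.
        return bool(self.stack) and self.stack[-1] not in NO_INLINE_CHILDREN

    def _flush(self):
        if self._span_allowed():
            for anchor_id in self.pending:
                self.out.append('<span class="anchor" id="%s"></span>' % anchor_id)
            self.pending = []

    def anchor(self, anchor_id):
        self.pending.append(anchor_id)
        self._flush()

    def open(self, tag):
        self.stack.append(tag)
        self.out.append("<%s>" % tag)
        self._flush()

    def close(self, tag):
        self.out.append("</%s>" % tag)
        self.stack.pop()
        self._flush()

    def text(self, content):
        self.out.append(content)

    def result(self):
        return "".join(self.out)
```

With this sketch, an anchor requested directly inside a <ul> is emitted only after the next <li> has been opened.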

Uniqueness of ids: When the [[Include()]] macro is used to include the source of one wiki page inside another, the line number ids can be duplicated. This is not allowed in XHTML, where every id attribute value must be unique within the document.

{{{<span class="anchor" id="line-1"></span>This is in the parent document.
<span class="anchor" id="line-1"></span>This is in the included document.
}}}
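One possible remedy, shown here only as a sketch, is to keep a per-document set of ids already issued and to add a numeric suffix on collision. The function unique_id and the suffix scheme are assumptions for illustration, not existing MoinMoin behaviour.

```python
def unique_id(base, used):
    """Return an id based on `base` that is not yet in the set `used`.

    Hypothetical helper: `used` is one set shared by the whole output
    document, including any pages pulled in by an include.
    """
    candidate = base
    n = 1
    while candidate in used:
        n += 1
        candidate = "%s-%d" % (base, n)
    used.add(candidate)
    return candidate
```

With a shared set, the parent page would get "line-1" and the included page would get "line-1-2" for its first line.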

Caching and dynamic content

Right now the weirdest obstacle seems to be the way the page caching system works. Caching is done by the Formatter class in formatter/text_python.py, which acts as a front-end to the actual formatter being used, such as formatter/text_html.py. It has a nasty problem: it can re-order the calls to formatter methods. See DeronMeranda/DiscussPythonFormatter.

The way it currently works is that it may delay calling some formatter methods, depending on whether it thinks the output is static or dynamic. When creating a cache for the page it actually generates Python code. It goes ahead and calls the real formatter for all static content and puts the generated HTML fragments in the cache. For dynamic content it instead inserts Python code that calls the formatter methods when the cached page is retrieved (not when the cache is created). This is how, for instance, macros like [[FullSearch]] always output the latest results, even though the wiki source of the page containing the macro has not changed.

However, this mix of dynamic and static content means that the real formatter gets its methods called out of order (time-wise), even though the final output effectively has everything pasted back together in the correct document order. It also means that the formatter class cannot hope to keep track of any state during formatting; for example, it cannot maintain an accurate stack of nested HTML tags.

As a real case, the formatter will currently get called on to output an </h1> before it gets called to output the corresponding <h1> tag!
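The reordering can be reproduced with a toy model. This is a sketch only and not the actual text_python.py code: RecordingFormatter, build_cache, and render are hypothetical names, and the cache is modelled as a list of strings and callables rather than generated Python source.

```python
class RecordingFormatter:
    """Stand-in for a real formatter that logs the order of its calls."""

    def __init__(self):
        self.calls = []

    def heading(self, on):
        self.calls.append("heading(%d)" % on)
        return "<h1>" if on else "</h1>"

    def text(self, content):
        self.calls.append("text(%r)" % content)
        return content


def build_cache(formatter, ops):
    """ops is a list of (method_name, args, is_dynamic) in document order."""
    cache = []
    for name, args, is_dynamic in ops:
        method = getattr(formatter, name)
        if is_dynamic:
            # Dynamic content: store a call to be made at page-view time.
            cache.append(lambda m=method, a=args: m(*a))
        else:
            # Static content: call the formatter now and store the markup.
            cache.append(method(*args))
    return cache


def render(cache):
    """Assemble the page from the cache, running the deferred calls."""
    return "".join(item() if callable(item) else item for item in cache)
```

If the opening heading is treated as dynamic and the rest as static, the rendered output is still "<h1>Title</h1>", but the formatter sees text, then heading(0), and only at view time heading(1).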

One possible compromise, which may partially help, would be to ensure that openers and closers are always called in the correct order (e.g., if <h1> is delayed to page-view time, then </h1> should be as well).
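As a rough sketch of that compromise, a pass over the sequence of open and close events could force every closer to be delayed whenever its opener was. The event tuples and the function pair_dynamic_flags are hypothetical; the real caching formatter does not expose its decisions in this form.

```python
def pair_dynamic_flags(events):
    """Make each closer dynamic if its matching opener is dynamic.

    events is a list of (tag, is_open, is_dynamic) tuples in document
    order; a new list with adjusted flags is returned.
    """
    stack = []   # (tag, is_dynamic) for each opener not yet closed
    result = []
    for tag, is_open, is_dynamic in events:
        if is_open:
            stack.append((tag, is_dynamic))
        elif stack and stack[-1][0] == tag:
            _, opener_dynamic = stack.pop()
            # A delayed opener drags its closer to page-view time as well.
            is_dynamic = is_dynamic or opener_dynamic
        result.append((tag, is_open, is_dynamic))
    return result
```

The opposite case, a static opener with a dynamic closer, is left alone, because the opener is then called at cache-creation time and the closer later, which is already the correct order.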

Javascript

There are two basic issues with Javascript code that is output:

  1. Use of document.write and such. XHTML can only be actively modified via standard DOM methods.

  2. Simple escaping. Common Javascript operators such as && and < contain characters that are reserved in XML. They need to be escaped.

For the escaping case, the right approach depends on whether the document is served as text/html or application/xhtml+xml, and on whether the Javascript is inside a <script> element or directly inline, such as in an onchange event handler attribute.
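The two cases can be sketched with the standard xml.sax.saxutils module. The helper names inline_handler and script_body are hypothetical. The attribute case is escaped unconditionally, since attribute values are entity-decoded under either content type; for script bodies the sketch assumes that entity escaping is applied only when the page is served as application/xhtml+xml, because text/html browsers do not decode entities inside <script>.

```python
from xml.sax.saxutils import escape, quoteattr


def inline_handler(name, js_code):
    """Build an event handler attribute with the Javascript XML-escaped."""
    # quoteattr escapes &, < and > and adds the surrounding quotes.
    return "%s=%s" % (name, quoteattr(js_code))


def script_body(js_code, served_as_xml):
    """Return the text to place inside a <script> element.

    Assumption: escape only for application/xhtml+xml; for text/html the
    code is passed through unchanged.
    """
    return escape(js_code) if served_as_xml else js_code
```

For example, inline_handler("onchange", "if (a && b < 2) go()") produces onchange="if (a &amp;&amp; b &lt; 2) go()".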

CDATA sections would look like this:

{{{<script type="text/javascript"><![CDATA[

]]></script> }}}

Putting most Javascript in external *.js files is probably a good thing. Also, any complex code inside attribute values should probably be converted into function calls, with the functions defined inside a <script> element instead.

CSS stylesheet namespace

Furthermore, for inline CSS stylesheets in <style> elements in the <head> section, it can be quite useful (especially for mixed XML, such as with MathML or SVG) to set the default namespace via the @namespace at-rule, using the same value as the xmlns attribute of the <html> element:

{{{<style type="text/css"><![CDATA[

]]></style> }}}

This is not terribly important, but is probably easy to do.
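If the theme code builds the <style> element as a string, the at-rule could be prepended along the following lines. This is a sketch with a hypothetical helper name, namespaced_style; the only fixed parts are the CSS @namespace syntax and the XHTML namespace URI. Note that @namespace has to precede the ordinary style rules.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"


def namespaced_style(css_rules, namespace=XHTML_NS):
    """Wrap CSS rules in a <style> element that declares a default namespace."""
    lines = [
        '<style type="text/css"><![CDATA[',
        # The default namespace declaration comes before the other rules.
        "@namespace url(%s);" % namespace,
        css_rules,
        "]]></style>",
    ]
    return "\n".join(lines)
```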

Discussion

MoinMoin: DeronMeranda/ChallengesWithXHTML (last edited 2007-10-29 19:13:39 by localhost)