[html5] Tidy and HTML5

Thu Dec 2 02:14:19 PST 2010

Keryx Web wrote:
> X-posting to WHATWG help from html-tidy at w3.org
> 
> 2010-11-26 09:36, Adrian Sandor skrev:
> 
> > As I mentioned before, my main concern is about bug fixes. I don't
> > care much
> > about HTML5 support at this time.
> > (But if somebody else has a patch, I will be happy too)

FWIW, I expect supporting HTML5 in Tidy would be an undertaking of the 'rewrite' kind rather than a matter of 'a patch'.

> Thus, I do not see any future in a tool that does not rely on the
> HTML5 parsing algorithm.

I'd agree if Tidy were primarily a markup consumer. I think it's primarily a markup generator. The HTML5 parsing algorithm doesn't aim to fix author typos with the best DWIM imaginable. It just makes HTML consumption interoperable. A tool whose job is to mash invalid input into valid output on the author's side rather than the consumer's side could well compete on how good its DWIM is compared to the HTML5 parsing algorithm. If I were writing a new Tidy-like tool, I'd probably hack the HTML tokenizer to treat '<' inside a tag token more like legacy Gecko and WebKit treated it (ending the current tag token, emitting it and starting a new one) instead of treating it the way HTML5, IE and Opera treat it.
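
To make the '<' difference concrete, here is a toy sketch in Python. It is not a real HTML tokenizer: it only knows about attribute-less tags and text, and the function name and the `legacy` flag are made up for illustration. With `legacy=True` a '<' inside a tag ends the current tag token and starts a new one; with `legacy=False` the '<' simply becomes part of the tag name, which is roughly what the HTML5 tag name state does.

```python
def tokenize_tags(markup, legacy=False):
    """Toy tokenizer: returns a list of ("tag", name) and ("text", data) tuples.

    legacy=True:  '<' inside a tag ends the current tag token (old Gecko/WebKit style).
    legacy=False: '<' inside a tag is just another tag name character (HTML5 style).
    """
    tokens = []
    i, n = 0, len(markup)
    while i < n:
        if markup[i] != "<":
            # Text runs up to the next '<' or the end of input.
            j = markup.find("<", i)
            if j == -1:
                j = n
            tokens.append(("text", markup[i:j]))
            i = j
            continue
        # Inside a tag: collect the name until '>' (or until '<' in legacy mode).
        j = i + 1
        while j < n and markup[j] != ">" and not (legacy and markup[j] == "<"):
            j += 1
        tokens.append(("tag", markup[i + 1:j]))
        # Consume the closing '>' if present; leave a '<' for the next iteration.
        i = j + 1 if j < n and markup[j] == ">" else j
    return tokens
```

For the input `<p<b>hi`, the HTML5-style mode produces a single tag named `p<b`, while the legacy mode produces a `p` tag followed by a `b` tag, which is more likely what an author who forgot a '>' meant.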

I think there isn't a future for Tidy if it doesn't preserve valid HTML5 as valid, though.

> A basic "Tidy5" implementation would thus look like this:
> 1. Parse the tag soup into a DOM
> 2. Serialize HTML from that DOM
> 3. Compare the start and the end result.

As I understand what Tidy does, that's not sufficient. The above steps could still result in invalid output. I think the appropriate steps for doing what Tidy aims to do would be:
 1) Parse input into a tree while reporting Parse Errors as defined by the HTML5 spec.
 2) Drop all unknown attributes emitting an error message for each one.
 3) Remove all unknown elements by replacing them with their children and reporting an error message for each such removal.
 4) While the tree has machine-detectable conformance errors, transform the tree to remove a machine-detectable error and emit an error message when doing so. (This is the hard part and the core of the value Tidy currently offers.)
 5) Serialize the tree.
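
A rough Python sketch of steps 2, 3 and 5 on a toy tree, to show the shape of the thing. Everything here is an assumption for illustration: the `Node` class, the tiny allowlists and the message wording are invented, step 1 (an HTML5 parser) is assumed to have produced the tree already, and step 4 is left out entirely since it is the hard part. Steps 2 and 3 are folded into one pass, and attributes on an element that gets removed are not reported separately.

```python
from dataclasses import dataclass, field

# Hypothetical allowlists; a real tool would derive these from the HTML5 spec.
KNOWN_ELEMENTS = {"div", "p", "em", "a"}
KNOWN_ATTRIBUTES = {"*": {"id", "class"}, "a": {"href"}}


@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node objects or str (text)


def clean(node, errors):
    """Return the list of nodes that should stand in for `node`."""
    if isinstance(node, str):
        return [node]
    # Clean the children first so unknown descendants are already unwrapped.
    kids = [r for child in node.children for r in clean(child, errors)]
    if node.name not in KNOWN_ELEMENTS:
        # Step 3: replace an unknown element with its children.
        errors.append(f"removed unknown element <{node.name}>")
        return kids
    # Step 2: drop unknown attributes.
    allowed = KNOWN_ATTRIBUTES["*"] | KNOWN_ATTRIBUTES.get(node.name, set())
    for attr in list(node.attrs):
        if attr not in allowed:
            errors.append(f"dropped attribute {attr!r} on <{node.name}>")
            del node.attrs[attr]
    node.children = kids
    return [node]


def serialize(node):
    """Step 5, heavily simplified: no escaping, no void elements."""
    if isinstance(node, str):
        return node
    attrs = "".join(f' {k}="{v}"' for k, v in node.attrs.items())
    inner = "".join(serialize(c) for c in node.children)
    return f"<{node.name}{attrs}>{inner}</{node.name}>"
```

Note that even after this pass the output can still be invalid (for example, a required attribute may be missing or the content model may be violated), which is why step 4 is needed.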

> Perhaps any error reporting can be made *during* the parsing process.
> Henri Sivonen could probably answer the question if that is possible.

Adding implied wrapper elements and dropping stuff could be done using a SAX pipeline without an actual in-memory tree. You need a tree somewhere to recover from all possible HTML errors, though.
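
As a sketch of the "dropping stuff" half without a tree, here is an event-driven filter in Python. It uses the standard library's `html.parser.HTMLParser` as a stand-in event source, which is neither SAX nor the HTML5 parsing algorithm, and the allowlist and class name are invented for illustration. Unknown elements are "replaced with their children" simply by not re-emitting their start and end tags. Comments, doctypes and the like are silently dropped in this toy version.

```python
from html import escape
from html.parser import HTMLParser

KNOWN_ELEMENTS = {"div", "p", "em", "a"}  # hypothetical allowlist


class UnwrapUnknown(HTMLParser):
    """Re-emits known tags and text as they arrive; skips unknown tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_ELEMENTS:
            self.errors.append(f"dropped unknown start tag <{tag}>")
            return
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in KNOWN_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text passes through, re-escaped for output.
        self.out.append(escape(data, quote=False))


def stream_filter(markup):
    parser = UnwrapUnknown()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out), parser.errors
```

The filter keeps no state about what is open, so it cannot fix misnesting or unclosed elements; that is where the tree comes in.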

> - HTML 4 style type attributes on <script> and <style> - tolerate,
> require or drop?

Why would anyone want to require those?

> - Security. This will require the possibility of white and/or
> blacklisting elements and attributes. And preferably also attribute
> values.

Only whitelisting would work for security.
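
A small Python illustration of why, with made-up lists and function names: a blacklist only catches the names its author thought of, so anything it has never heard of passes through, whereas a whitelist fails closed.

```python
BLOCKED_ATTRIBUTES = {"onclick", "onload"}  # hypothetical blacklist
ALLOWED_ATTRIBUTES = {"href", "title"}      # hypothetical whitelist


def filter_blacklist(attrs):
    """Keep everything that is not explicitly blocked."""
    return {k: v for k, v in attrs.items() if k.lower() not in BLOCKED_ATTRIBUTES}


def filter_whitelist(attrs):
    """Keep only what is explicitly allowed."""
    return {k: v for k, v in attrs.items() if k.lower() in ALLOWED_ATTRIBUTES}


attrs = {"href": "/x", "onclick": "a()", "onpointerdown": "b()"}
leaked = filter_blacklist(attrs)   # still contains "onpointerdown"
safe = filter_whitelist(attrs)     # contains only "href"
```

Filtering names alone is not enough either; attribute values such as the URL in href would need the same whitelist treatment.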

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/