[html5] Tidy and HTML5

Fri Nov 26 06:02:17 PST 2010

X-posting to WHATWG help from html-tidy at w3.org

2010-11-26 09:36, Adrian Sandor wrote:

> As I mentioned before, my main concern is about bug fixes. I don't care much
> about HTML5 support at this time.
> (But if somebody else has a patch, I will be happy too)
>

Here is the deal with HTML5. Pretty soon every browser will have an 
HTML5 parser. Except for IE, browsers do not have multiple parsers.

This means that tokenization and DOM tree building will follow the rules 
defined in HTML5 - as opposed to not really following any rules at all, 
since HTML 4 never defined them.

Simply put, there is no "opt out" of HTML5. An HTML 4 or XHTML 1.x
doctype is nothing more than a contract between developers. Technically,
all it does is put the browser into standards mode.

Thus, I do not see any future in a tool that does not rely on the HTML5
parsing algorithm. Tidy cannot grow from its current code base, but
needs to have at its core the same HTML5 parser library that is in the
HTML5 validator, which is basically the same as the one used in Firefox 4.

A basic "Tidy5" implementation would thus look like this:
1. Parse the tag soup into a DOM
2. Serialize HTML from that DOM
3. Compare the start and the end result.
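
The three steps above can be sketched roughly as follows. This is only an illustration of the pipeline, not a real "Tidy5": it uses Python's standard `html.parser` as a stand-in, which does NOT implement the HTML5 tree construction algorithm, and the names `Node`, `TreeBuilder`, `serialize` and `tidy5` are made up for this example. Comments and the doctype are silently dropped to keep it short.

```python
import difflib
import html
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=()):
        self.tag, self.attrs, self.children = tag, list(attrs), []


class TreeBuilder(HTMLParser):
    """Step 1: tag soup -> small tree (stand-in, not the HTML5 algorithm)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def serialize(node, raw=False):
    """Step 2: tree -> explicit HTML with every end tag written out."""
    if isinstance(node, str):
        return node if raw else html.escape(node, quote=False)
    attrs = "".join(
        f" {name}" if value is None else f' {name}="{html.escape(value)}"'
        for name, value in node.attrs
    )
    # Text inside <script> and <style> must not be entity-escaped.
    inner = "".join(
        serialize(child, node.tag in ("script", "style")) for child in node.children
    )
    if node.tag == "#root":
        return inner
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def tidy5(source):
    """Step 3: return the cleaned markup and a diff against the input."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    cleaned = serialize(builder.root)
    diff = list(difflib.unified_diff(
        source.splitlines(), cleaned.splitlines(), "input", "output", lineterm=""))
    return cleaned, diff


cleaned, diff = tidy5("<p>One<br><b>bold</p>")
print(cleaned)
print("\n".join(diff))
```

A real implementation would swap `TreeBuilder` for an actual HTML5 parser; the serialize and compare steps would stay conceptually the same.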

Perhaps error reporting can be done *during* the parsing process.
Henri Sivonen could probably say whether that is possible.

However, there is *much* talk about having a lint tool for HTML that
goes beyond what the validator does. So in addition to the above, there
could be settings for things like:

- Implicit close of elements. Tolerate, require or drop all closing tags?
- Implicit elements - tolerate, require or drop (maybe require body but 
drop tbody...)?
- Shortened attributes - tolerate, require or drop?
- HTML 4 style type attributes on <script> and <style> - tolerate, 
require or drop?
- Explicit closing of void elements - tolerate, require or drop?
- Full XHTML syntax (convert both ways)
- Indentation. Preferably with an option so that block elements with
very short text content are not broken up into 3 rows as in Tidy today.
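
To make the "tolerate, require or drop" idea concrete, here is one way such settings might be modelled. Everything in it is hypothetical: the names `Policy`, `LintSettings` and `apply_type_policy`, the field names and the defaults are assumptions for illustration, not options of any existing tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    TOLERATE = "tolerate"  # leave the markup as the author wrote it
    REQUIRE = "require"    # add (or complain about) the construct when missing
    DROP = "drop"          # remove the construct when present


@dataclass
class LintSettings:
    # One hypothetical field per setting in the list above.
    closing_tags: Policy = Policy.TOLERATE
    implicit_elements: Policy = Policy.TOLERATE
    # Per-element exceptions, e.g. "require body but drop tbody".
    implicit_element_overrides: dict = field(
        default_factory=lambda: {"body": Policy.REQUIRE, "tbody": Policy.DROP})
    shortened_attributes: Policy = Policy.TOLERATE
    type_attributes: Policy = Policy.TOLERATE
    void_element_slash: Policy = Policy.TOLERATE
    xhtml_syntax: bool = False
    indent_width: int = 2
    keep_short_blocks_on_one_row: bool = True

    def implicit_policy(self, tag):
        """Policy for one implicit element, honouring the overrides."""
        return self.implicit_element_overrides.get(tag, self.implicit_elements)


def apply_type_policy(tag, attrs, policy):
    """Apply the type-attribute setting to the attributes of one start tag."""
    default = {"script": "text/javascript", "style": "text/css"}.get(tag)
    result = dict(attrs)
    if default is None or policy is Policy.TOLERATE:
        return result
    if policy is Policy.DROP and result.get("type") == default:
        del result["type"]  # only the redundant HTML 4 default is removed
    elif policy is Policy.REQUIRE:
        result.setdefault("type", default)
    return result


settings = LintSettings(type_attributes=Policy.DROP)
print(apply_type_policy("script", {"type": "text/javascript", "src": "a.js"},
                        settings.type_attributes))
```

Using the same three-valued policy for every setting keeps the configuration uniform, and the override table covers cases where one element needs different treatment.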

Besides purification and linting, such a tool/library can be used for:

- Security. This will require the ability to whitelist and/or
blacklist elements and attributes, and preferably also attribute values.

- HTML post processing. This will let authors work with indented,
explicit code, while at the same time such "waste" can be removed
before gzipping. This would be akin to JS minification and it could be
performed on the fly from within PHP, Python, Java, Ruby, C#, server 
side JS or whatever. It can also be done manually before uploading from 
the development environment to production - or it could be integrated 
into the uploading tool!
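
For the security use case, a whitelist filter could look something like the sketch below. It is built on Python's standard `html.parser`, which is not an HTML5 parser, so it should be read as an illustration of the whitelisting idea and not as a safe sanitizer. The whitelist contents and the names `ALLOWED`, `SAFE_URL_PREFIXES`, `Sanitizer` and `sanitize` are assumptions made up for this example.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> allowed attribute names.
ALLOWED = {"p": set(), "b": set(), "i": set(), "br": set(), "a": {"href"}}
VOID = {"br"}
# Elements whose text content is dropped together with the tag.
DROP_WITH_CONTENT = {"script", "style"}
# Attribute value check: only these URL schemes survive in href.
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown element: drop the tag but keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag]:
                continue
            value = value or ""
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{html.escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED and tag not in VOID:
            # Stray end tags are passed through; a real tool would balance them.
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))


def sanitize(source):
    sanitizer = Sanitizer()
    sanitizer.feed(source)
    sanitizer.close()
    return "".join(sanitizer.out)


print(sanitize('<p onclick="x()">Hi <a href="javascript:alert(1)">link</a>'
               '<script>evil()</script></p>'))
```

A blacklist would invert the membership tests, but a whitelist fails closed when new elements or attributes appear, which is usually the safer default.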

The *main* feature that Tidy has today is the ability to handle
templates, by preserving/ignoring PHP or other server side code. To
what extent the HTML5 parser can be modified to handle that feature I do
not know.
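
One approach that avoids modifying the parser at all would be to mask server side blocks before parsing and restore them afterwards. The sketch below shows that idea for PHP only. The placeholder format and the names `PHP_BLOCK`, `mask_templates` and `unmask_templates` are assumptions for illustration; it does not describe how Tidy does it today.

```python
import re

# Matches <?php ... ?> and <?= ... ?>; an unterminated block runs to end of input.
# Limitation: a "?>" inside a PHP string literal would end the match too early.
PHP_BLOCK = re.compile(r"<\?(?:php|=).*?(?:\?>|\Z)", re.DOTALL)
# Lowercase placeholder, so it survives parsers that lowercase attribute names.
PLACEHOLDER = re.compile(r"__tpl(\d+)__")


def mask_templates(source):
    """Replace each PHP block with a numbered placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__tpl{len(saved) - 1}__"

    return PHP_BLOCK.sub(stash, source), saved


def unmask_templates(text, saved):
    """Put the original PHP blocks back in place of the placeholders."""
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], text)


source = '<p class="<?= $cls ?>">Hello <?php echo $name; ?></p>'
masked, saved = mask_templates(source)
print(masked)
print(unmask_templates(masked, saved) == source)
```

The masked text is ordinary markup and can be handed to any parser. The obvious weakness is that a placeholder could collide with real content, and a block that straddles tag boundaries may still confuse tree building.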

From a maintenance and bug fixing POV, I see *huge* wins in having a
common base for Tidy, the HTML5 validator and HTML parsing in Gecko.

But whether that is actually feasible is beyond my technical knowledge
to evaluate.

-- 
Keryx Web (Lars Gunther)
http://keryx.se/
http://twitter.com/itpastorn/
http://itpastorn.blogspot.com/