[html5] Tidy and HTML5
Keryx Web
webmaster at keryx.se
Fri Nov 26 06:02:17 PST 2010
Cross-posting to WHATWG help from html-tidy at w3.org
2010-11-26 09:36, Adrian Sandor wrote:
> As I mentioned before, my main concern is about bug fixes. I don't care much
> about HTML5 support at this time.
> (But if somebody else has a patch, I will be happy too)
>
Here is the deal with HTML5. Pretty soon every browser will have an
HTML5 parser. Except for IE, browsers do not have multiple parsers.
This means that tokenization and DOM tree building will follow the rules
defined in HTML5 - as opposed to not really following any rules at all,
since HTML 4 never defined them.
Simply put, there is no "opt out" of HTML5. An HTML 4 or XHTML 1.x
doctype is nothing more than a contract between developers. Technically,
all it does is put the browser into standards compliance mode.
Thus, I do not see any future in a tool that does not rely on the HTML5
parsing algorithm. Tidy cannot grow from its current code base, but
needs to have at its core the same HTML5 parser that is in the HTML5
validator, which is basically the same as the one being used in Firefox 4.
A basic "Tidy5" implementation would thus look like this:
1. Parse the tag soup into a DOM
2. Serialize HTML from that DOM
3. Compare the start and the end result.
Perhaps errors could be reported *during* the parsing process.
Henri Sivonen could probably say whether that is possible.
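
The three steps above can be sketched roughly as follows. This is only
an illustration: Python's standard html.parser is used as a stand-in
and it does NOT implement the HTML5 parsing algorithm, so a real
"Tidy5" would swap in a conforming parser. The names Node, TreeBuilder
and tidy5 are made up for this example.

```python
import difflib
import html
from html.parser import HTMLParser

# Void elements never have content or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM-like node (hypothetical, for illustration only)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = attrs or []
        self.children = []  # holds Node objects and plain strings


class TreeBuilder(HTMLParser):
    """Step 1: turn tag soup into a tree. Stand-in, not an HTML5 parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything up to the matching open element;
        # stray end tags with no match are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def serialize(node):
    """Step 2: write the tree back out as explicit HTML."""
    if isinstance(node, str):
        return html.escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#document":
        return inner
    attrs = "".join(
        f" {name}" if value is None else f' {name}="{html.escape(value)}"'
        for name, value in node.attrs
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def tidy5(source):
    """Steps 1-3: parse, serialize, and diff input against output."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    cleaned = serialize(builder.root)
    diff = list(difflib.unified_diff(
        source.splitlines(), cleaned.splitlines(),
        "input", "output", lineterm=""))
    return cleaned, diff
```

An empty diff means the input already matched what the serializer would
write; a non-empty diff is the raw material for a report to the author.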
However, there is *much* talk about having a lint tool for HTML that
goes beyond what the validator does. So in addition to the above, there
could be settings for things like:
- Implicit close of elements. Tolerate, require or drop all closing tags?
- Implicit elements - tolerate, require or drop (maybe require body but
drop tbody...)?
- Shortened attributes - tolerate, require or drop?
- HTML 4 style type attributes on <script> and <style> - tolerate,
require or drop?
- Explicit closing of void elements - tolerate, require or drop?
- Full XHTML syntax (convert both ways)
- Indentation. Preferably with an option so that block elements with
very short text content are not broken up into 3 rows, as Tidy does today.
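
One way to picture the "tolerate, require or drop" settings is as a
three-valued policy per rule. The sketch below is hypothetical: the
names Policy, LintSettings and emit_void are invented here, and only
the void element rule is worked out.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    TOLERATE = "tolerate"  # leave the author's choice alone
    REQUIRE = "require"    # always write the explicit form
    DROP = "drop"          # always write the short form


@dataclass
class LintSettings:
    """One policy per rule from the list above (names are made up)."""
    closing_tags: Policy = Policy.TOLERATE
    implicit_elements: Policy = Policy.TOLERATE
    shortened_attributes: Policy = Policy.TOLERATE
    legacy_type_attributes: Policy = Policy.TOLERATE
    void_element_closing: Policy = Policy.TOLERATE


def emit_void(tag, was_self_closed, policy):
    """Return (markup, message) for a void element under a policy.

    message is None when the source already satisfied the policy.
    """
    if policy is Policy.REQUIRE:
        message = None if was_self_closed else (
            f"<{tag}> lacks an explicit closing slash")
        return f"<{tag} />", message
    if policy is Policy.DROP:
        message = (f"dropped closing slash on <{tag}>"
                   if was_self_closed else None)
        return f"<{tag}>", message
    # TOLERATE: keep whatever the author wrote, report nothing.
    return (f"<{tag} />" if was_self_closed else f"<{tag}>"), None
```

For instance, emit_void("br", False, LintSettings().void_element_closing)
leaves "<br>" untouched, because the default policy is to tolerate.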
Besides purification and linting, such a tool/library can be used for:
- Security. This will require the possibility of white and/or
blacklisting elements and attributes. And preferably also attribute values.
- HTML post processing. This will enable authors to see indented code,
that is explicit, while at the same time such "waste" can be removed
before gzipping. This would be akin to JS minification and it could be
performed on the fly from within PHP, Python, Java, Ruby, C#, server
side JS or whatever. It can also be done manually before uploading from
the development environment to production - or it could be integrated
into the uploading tool!
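
To make the whitelisting idea concrete, here is a minimal sketch of an
element/attribute/value filter. It is an illustration under loud
assumptions: it uses Python's html.parser rather than an HTML5 parser,
the ALLOWED table and SAFE_URL_PREFIXES are arbitrary examples, and it
has not been reviewed as a real security tool.

```python
import html
from html.parser import HTMLParser

# Example whitelist: element name -> allowed attribute names.
ALLOWED = {"p": set(), "em": set(), "strong": set(), "a": {"href"}}
# Example attribute value rule: only these URL schemes survive in href.
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
# Elements whose text content is thrown away along with the tags.
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED:
            return  # unknown element: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(
                    SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{html.escape(value)}"')
        self.out.append("<" + tag + "".join(kept) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip:
                self.skip -= 1
            return
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(html.escape(data, quote=False))


def sanitize(source):
    """Return source with everything outside the whitelist removed."""
    parser = WhitelistFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

A blacklist variant would invert the membership tests, but a whitelist
fails closed when new elements or attributes appear, which is usually
what a security filter wants.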
The *main* feature that Tidy has today is the ability to handle
templates, by preserving/ignoring PHP or other server side code. To
what extent the HTML5 parser can be modified to handle that feature I do
not know.
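
One possible workaround that leaves the parser itself unmodified is to
mask server side blocks before parsing and restore them afterwards.
The sketch below is an assumption, not how Tidy does it: the function
names and the @@TPLn@@ marker are invented, the marker is assumed not
to occur in the source, and the simple regex breaks if a PHP string
literal itself contains "?>".

```python
import re

# Matches <?php ... ?> and <?= ... ?> blocks (non-greedy, multi-line).
PHP_BLOCK = re.compile(r"<\?(?:php|=).*?\?>", re.DOTALL)
MARKER = re.compile(r"@@TPL(\d+)@@")


def mask_server_code(source):
    """Replace each PHP block with a marker; return (masked, blocks)."""
    blocks = []

    def stash(match):
        blocks.append(match.group(0))
        # The marker has no markup characters, so it survives both in
        # text content and inside attribute values.
        return f"@@TPL{len(blocks) - 1}@@"

    return PHP_BLOCK.sub(stash, source), blocks


def unmask_server_code(markup, blocks):
    """Put the original PHP blocks back in place of their markers."""
    return MARKER.sub(lambda m: blocks[int(m.group(1))], markup)
```

The masked string would be run through the parse and serialize steps,
and unmask_server_code applied to the result.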
From a maintenance and bug fixing POV, I see *huge* wins in having a
common base for Tidy, the HTML5 validator and HTML parsing in Gecko.
But the actual possibility thereof is beyond my technical knowledge to
evaluate.
--
Keryx Web (Lars Gunther)
http://keryx.se/
http://twitter.com/itpastorn/
http://itpastorn.blogspot.com/