[whatwg] Thoughts on HTML 5

Wed Dec 17 13:43:48 PST 2008

On Wed, 17 Dec 2008, Giovanni Campagna wrote:
>
> I don't write browser code, honestly, but I think that XML parsers don't 
> need to check for attribute types (they're all quoted strings),

XML parsers still have to check for quotes (" vs '), which takes no less 
time than HTML's checking for quotes (" vs ' vs nothing).
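
As a rough illustration of why the two checks cost about the same, here is a minimal sketch of reading one attribute value. The function name read_attr_value and the allow_unquoted flag are hypothetical, invented for this example; the sketch ignores character references and the other details a real tokeniser handles.

```python
def read_attr_value(text, pos, allow_unquoted):
    """Return (value, next_pos) for the attribute value starting at text[pos].

    allow_unquoted=False approximates XML (" or ' only);
    allow_unquoted=True approximates HTML (" or ' or nothing).
    """
    first = text[pos]
    if first in "\"'":
        # Quoted value: both XML and HTML must inspect the opening quote
        # and scan forward to the matching one.
        end = text.index(first, pos + 1)  # ValueError if unterminated
        return text[pos + 1:end], end + 1
    if not allow_unquoted:
        raise ValueError("attribute value must be quoted")
    # Unquoted value (HTML only in this sketch): one extra branch,
    # then scan to whitespace or '>'.
    end = pos
    while end < len(text) and text[end] not in " \t\n\f\r>":
        end += 1
    return text[pos:end], end
```

Both modes branch on the first character either way; HTML simply has one more branch to take.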

> don't check for element type (whether there can or must be closing tag), 

This doesn't cost any time in HTML either: the tokeniser doesn't need to 
worry about which tags have end tags, and the tree construction side 
just drops unexpected end tags on the floor.

> don't check for insertion modes

Having an insertion mode isn't particularly a performance cost. (It 
affects code footprint, but that's about it.)

> just parses the input completely, without any semantic or particular 
> behaviour associated with any tag. Then, when the DOMElement or DOMAttr 
> or DOM-whatever are built, they get the appropriate interface (eg. 
> HTMLElement) depending on the namespace.

That's the same as HTML.

> I think that the latter algorithm can be faster, but I actually haven't 
> got any benchmark (I cannot have, since no browser implements completely 
> HTML5 parse algorithm yet).

There are a number of HTML5 parser implementations, and the data suggests 
that XML parsing has no particular performance gain over them.

> Secondly, stricter to me means: every warning is an error. Look in the
> following code:
> <div><p>some text</div>
> When the HTML parser finds char 'd', I can imagine it expects char 'p' (as in
> </p>) and falls back to "quirk mode" otherwise, although no assertions are made
> in the official HTML spec.

Not at all; the </p> end tag is completely optional in HTML, so that's 
not a problem. Also, it doesn't switch to quirks mode. The HTML5 spec 
defines how to handle these cases in excruciating detail.

> When parsing as XML, though, the parser can simply get the char: is it a
> 'p'? then go forward, else stop parsing
> no quirks, no trying to guess author intentions

There's no guessing in HTML either; all input streams have very specific 
and required results.

> what about <div><p>some text<p>some more text</div>?
> is it this: <div><p>some text</p><p>some more text</p></div>

Both of those are valid.

> or this: <div><p>some text<p>some more text</p></p></div>

All three of these have very well-defined results. There's no ambiguity or 
guessing involved.
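
To make "well-defined" concrete, here is a toy tree builder run over the three inputs quoted above. It is only a sketch under loud assumptions: it uses Python's html.parser as the tokeniser, the class name SketchTreeBuilder and the CLOSES_P set are invented for this example, and it implements just three rules in simplified form (a start tag such as <p> or <div> closes an open <p>; a stray </p> yields an empty p element; any other unmatched end tag is dropped). The real algorithm also tracks scopes, insertion modes and much more.

```python
from html.parser import HTMLParser


class SketchTreeBuilder(HTMLParser):
    """Toy tree builder; NOT the HTML5 algorithm, just three of its rules."""

    # Hypothetical, heavily reduced set of start tags that close an open <p>.
    CLOSES_P = {"p", "div"}

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]  # stack of open elements

    def _open_names(self):
        return [node[0] for node in self.stack]

    def _pop_through(self, tag):
        # Pop open elements up to and including the nearest `tag`.
        while self.stack.pop()[0] != tag:
            pass

    def handle_starttag(self, tag, attrs):
        if tag in self.CLOSES_P and "p" in self._open_names():
            self._pop_through("p")  # implied </p>
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in self._open_names():
            self._pop_through(tag)  # also closes anything left open inside
        elif tag == "p":
            self.stack[-1][1].append(("p", []))  # stray </p>: empty <p></p>
        # Any other unmatched end tag is dropped on the floor.

    def handle_data(self, data):
        self.stack[-1][1].append(data)


def serialize(node):
    if isinstance(node, str):
        return node
    tag, children = node
    inner = "".join(serialize(child) for child in children)
    return inner if tag == "#root" else f"<{tag}>{inner}</{tag}>"


def parse(html):
    builder = SketchTreeBuilder()
    builder.feed(html)
    builder.close()
    return serialize(builder.root)


for source in (
    "<div><p>some text</div>",
    "<div><p>some text<p>some more text</div>",
    "<div><p>some text<p>some more text</p></p></div>",
):
    print(source, "->", parse(source))
```

The same input always produces the same tree. In this sketch the third input ends up with a trailing empty <p></p> inside the div, because the second </p> has no open p element left to close.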

> And most of the time strict checking means fewer errors caused by distraction 
> (misspelling of an end tag, missing '/' when self-closing elements not 
> always self-closing, etc.)

Validating code is certainly an important QA point, but once you've 
shipped code, the presence of an error is not helpful to the end user. 
Often errors in XML files weren't present when the file was created, but 
appear later when new text is merged in automatically.

> Lastly, you said that deprecating HTML is vain. Well, IMO, if you 
> deprecate it now, maybe this year, or next year, or even the year after, 
> nothing will move. But (according to WHATWG Wiki) HTML spec will be 
> ready in 2020.
> 
> Do you think that in 12 years everybody will just ignore the string 
> "HTML is deprecated and should no longer be used"?

Well, they've ignored it for the past 7 years, so why would they change?

> By the way, XHTML1.0 / 1.1 said nothing about HTML4, they were 
> independent specifications.

They were positioned as replacements.

Anyway, it isn't clear that we would _want_ to deprecate HTML, even if we 
had any real choice in the matter.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'