[whatwg] Wasn't there going to be a strict spec?

Tab Atkins Jr. jackalmage at gmail.com
Fri Aug 10 13:05:36 PDT 2012


On Fri, Aug 10, 2012 at 12:45 PM, Erik Reppen <erik.reppen at gmail.com> wrote:
> My understanding of the general philosophy of HTML5 on the matter of
> malformed HTML is that it's better to define specific rules concerning
> breakage rather than overly strict rules about how to do it right in the
> first place, but this is really starting to create pain points in
> development.
>
> Modern browsers are so good at hiding breakage in rendering now that I
> sometimes run into things that are just nuking the DOM-node structure on
> the JS side of things while everything looks hunky-dory in rendering and
> no errors are being thrown.
>
> It's like the HTML equivalent of wrapping every function in an empty
> try/catch statement. For the last year or so I've been using IE8 as my
> HTML canary when I run into weird problems, and I'm not the only dev I've
> heard of doing this. But what happens when we're no longer supporting IE8
> and using tags that it doesn't recognize?
>
> Why can't we opt into stricter rules that cause rendering to cease, or at
> least make browsers throw a non-interpreter-halting error, when the HTML is
> broken from a nesting/XML-strict-tag-closing perspective? Until most of the
> vendors started lumping XHTML 1.0 Strict into a general "standards" mode
> that basically worked the same for any declared doctype, I thought it was an
> excellent feature from a development perspective to just let bad XML syntax
> break the page.
>
> And if we were able to set such rules, wouldn't it be less work to parse?
> How difficult would it be to add some sort of opt-in strict mode for HTML5
> that didn't require juggling of doctypes (since that seems to be what the
> vendors want)?
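To make that silent repair concrete, here's a rough sketch outside the
browser, using html5lib (a Python implementation of the HTML5 parsing
algorithm; the library choice and the markup are just for illustration):

    import html5lib  # pip install html5lib

    # Mis-nested markup: a <div> inside a <p>, which HTML doesn't allow.
    broken = "<p>one<div>two</div>three</p>"

    # No error is raised; the tree is silently repaired instead.
    tree = html5lib.parse(broken, namespaceHTMLElements=False)
    for el in tree.find("body").iter():
        print(el.tag, repr(el.text), repr(el.tail))

    # The <div> start tag force-closes the <p>, "three" ends up as bare
    # text in <body> rather than inside any <p>, and the stray </p>
    # becomes an empty <p> element.  The page renders roughly as
    # intended, but a script expecting "three" inside the paragraph
    # quietly finds nothing.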

The parsing rules of HTML aren't set to accommodate old browsers;
they're set to accommodate old content (which was written for those
old browsers).  There is an *enormous* corpus of content on the web
which is officially "invalid" according to various strict definitions,
and which would thus not be displayable in a browser that enforced
them.

Experience also shows that this isn't an accident, or just due to
"bad authors".  If you analyze the XML sent as text/html on the web,
something like 95% of it is invalid XML, for lots of different
reasons.  Even when authors *know* they're using something that's
supposed to be strict, they screw it up.  Luckily, we ignore the fact
that it's nominally XML and use forgiving parsing rules that usually
extract what the author meant.
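For instance (a stdlib Python sketch, not anyone's real markup), a
snippet that a strict XML parser rejects outright sails straight
through a forgiving HTML tokenizer:

    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    # Everyday "XML as served": a bare ampersand and void elements
    # (<br>, <img>) that are never closed.
    snippet = '<p>Fish & chips<br><img src="x.png"></p>'

    # Strict XML parsing: one error and the whole document is unusable.
    try:
        ET.fromstring(snippet)
    except ET.ParseError as err:
        print("XML parser gave up:", err)

    # Forgiving HTML parsing: no exception, and the author's intent is
    # still recoverable from the reported tags and text.
    class Dump(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("start:", tag, attrs)
        def handle_data(self, data):
            print("text: ", repr(data))

    Dump().feed(snippet)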

There are several ongoing efforts to extend this kind of non-strict
parsing to XML itself, such as the XML-ER (error recovery) Community
Group in the W3C.  XML failed on the web in part because of its
strictness - it's very non-trivial to keep your page well-formed at
all times when you're also lumping in arbitrary user content.
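That failure mode is easy to reproduce (again a stdlib sketch; the
template and the comment text are invented):

    import xml.etree.ElementTree as ET
    from xml.sax.saxutils import escape

    page = "<div class='comment'>{}</div>"  # hypothetical comment wrapper
    comment = "I <3 fish & chips"           # perfectly ordinary user input

    # Interpolated raw, the page is no longer well-formed XML at all.
    try:
        ET.fromstring(page.format(comment))
    except ET.ParseError as err:
        print("whole page dead:", err)

    # It only stays well-formed if every interpolation point is escaped,
    # correctly, every single time; one slip anywhere kills the page.
    ET.fromstring(page.format(escape(comment)))
    print("escaped version parses")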

Making the parser stricter (and therefore simpler) would not have any
significant impact on performance.  The vast majority of pages go down
the fast common path anyway, and most of the "fixes" are very simple
and fast to apply as well.  Additionally, doing something naive like
saying "just use strict XML parsing" is actually *worse* - XML all by
itself is relatively simple to parse, but the addition of namespaces
actually makes it *slower* to parse than HTML.
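To make the namespace point concrete (a stdlib sketch of the extra
bookkeeping involved, not a benchmark):

    import xml.etree.ElementTree as ET

    doc = """<svg xmlns="http://www.w3.org/2000/svg"
                  xmlns:xlink="http://www.w3.org/1999/xlink">
               <use xlink:href="#icon"/>
             </svg>"""

    # An XML parser has to track the in-scope xmlns bindings and expand
    # every element and attribute name against them.
    root = ET.fromstring(doc)
    print(root.tag)               # {http://www.w3.org/2000/svg}svg
    for el in root:
        print(el.tag, el.attrib)  # namespaced <use> and xlink:href

Every one of those expansions is per-element prefix-resolution work
that an HTML parser never has to do.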

~TJ


