[whatwg] We should not throw DOM Consistency and Infoset compatibility under the bus

Henri Sivonen hsivonen at iki.fi
Fri Jan 11 02:29:42 PST 2013

Hixie wrote in https://www.w3.org/Bugs/Public/show_bug.cgi?id=18669#c31 :
> I think it's fine for this not to work in XML, or require XML changes,
> or use an attribute like xml:component="" in XML. It's not going to
> be used in XML much anyway in practice. I've already had browser
> vendors ask me how they can just drop XML support; I don't think
> we can, at least not currently, but that's the direction things are
> going in, not the opposite.

This attitude bothers me. A lot.

I understand that supporting XML alongside HTML is mainly a burden for
browser vendors and I understand that XML currently doesn't get much
love from browser vendors. (Rewriting Gecko's XML load code path has
been on my to-do list since July 2010 and I even have written a design
document for the rewrite, but actually implementing it is always of
lower priority than something else.) Still, I think that as long as
browsers to support XHTML, we'd be worse off with the DOM-and-above
parts of the HTML and XML implementations diverging. Especially after
we went through the trouble of making them converge by moving HTML
nodes into the XHTML namespace.

But I think it's wrong to just consider XML in browsers, observe that
XML in browsers is a burden and then conclude that it's fine for stuff
not to work in XML, to require XML changes or to have a different
representation in XML. XML has always done better on the server side
than on the browser side. I think it's an error to look only at the
browser side and decide not to care about XML compatibility anymore.

When designing Validator.nu, inspired by John Cowan’s TagSoup, I
relied on the observation that XML and valid HTML shared the data
model, so it was possible to write an HTML parser that exposed an API
that looked like the API exposed by XML parsers and then build the
rest of the application on top of XML tooling. In the process of
writing the parser, in addition to supporting the XML API I needed for
Validator.nu I also added support for a couple of other Java XML APIs
to make it easy for others to drop the parser into their XML-oriented
Java applications. Then I got my implementation experience documented
in the spec as the infoset coercion section. I also advocated for the
DOM Consistency design principle in the HTML Design Principles. (I
also advocated this approach in the HTML–XML Task Force, and I believe
that the feasibility of using an HTML parser to feed into an XML
pipeline in addition to making good technical sense for real software
has been useful in calming down concerns about HTML among XML-oriented

Interestingly, the first ideas that were conceived unaware of these
efforts to make HTML parsing feed into an XML-compatible data model
and that threatened the consistency of the approach came from the XML
side: ARIA with colons (aria:foo instead of aria-foo) and RDFa with
prefix mappings relying on the namespace declarations (xmlns:foo). We
were successful at getting ARIA to change not to break the data model
unification. Since then, RDFa has downplayed the use of xmlns:foo even
though it hasn't completely eradicated it from the processing model.

Now it seems that threats to DOM Consistency and Infoset compatibility
come from the HTML side.

The template element radically changes the data model and how the
parser interacts with the data model by introducing wormholes.
However, this is only browser-side radicalness and a complete
non-issue for server-side processes that don't implement browser-like
functionality and only need to be able to pass templates through or to
modify them as if they were normal markup. These systems don't need to
extend the data model with wormholes—they can simply insert the stuff
that in browsers would go into the document fragment on the other side
of the wormhole as children of the template element.

The idea to stick a slash into the local name of an element in order
to bind Web Components is much worse. Many people probably agree that
the restrictions on what characters you can have in an XML name where
a bad idea. In fact, even XML Core thought the restrictions were a bad
idea to the extent they relaxed them for the fifth edition. But for
better or worse, existing software can and does enforce the fourth
edition NCNameness of local names. This isn't about whether the
restrictions on XML Names were a good or bad idea in the first place.
This isn't about whether it's okay to make changes to the HTML parsing
algorithm. This isn't about whether the error handling policy of XML
parsing is a bad idea and should be replaced with XML5/XML-ER. This is
about how *existing* XML data model *implementations* behave. Sure,
the reason why they behave the way they do is that they try to enforce
the serializability of the data model as XML 1.0 (4th ed. or earlier)
+ Namespaces, but that's not the key point. The key point is that
NCName enforcement exists out there in software that would be useful
for people working with HTML on the server side as long as HTML fits
into the XML data model.

I think it would be a mistake to change HTML in such a way that it
would no longer fit into the XML data model *as implemented* and
thereby limit the range of existing software that could be used
outside browsers for working with HTML just because XML in browsers is
no longer in vogue. Please, let's not make that mistake.

Henri Sivonen
hsivonen at iki.fi

More information about the whatwg mailing list