[whatwg] Conformance class for XML tool-based non-browsers (was: Re: Unsafe SGML minimizations)

Fri Apr 14 04:29:18 PDT 2006

On Mar 11, 2006, Ian Hickson wrote:
> On Thu, 8 Sep 2005, Henri Sivonen wrote:
>> I am assuming that those situations do not arise if the document is
>> conforming and the loss of details that are lost in XML c14n does not
>> count as data loss. It would be very nice if you defined conformance
>> in such a way that this assumption held true. :-)
>
> Yes, conformant documents will be such that a conformant HTML5 document
> can always be serialised to a conforming XHTML5 document, I think. If
> that ever turns out not to be the case, please raise the issue! I think
> this is important because people use XML tools then serialise to HTML,
> and vice versa (e.g. with CMSes that store data in custom formats).
[...]
>> I think the HTML5 spec should allow TagSoup to be updated for HTML5 or
>> an equivalent of TagSoup for HTML5 to be written. TagSoup guarantees to
>> the application that it acts as if it was an XML parser parsing XHTML.
>> Therefore, XML and, by extension, the SAX2 API contract restrict the
>> attribute names to legal XML attribute names. If HTML5 required "/bar/"
>> to be reported as an attribute name, TagSoup would have to violate that
>> constraint and could not claim conformance.
>
> Well, <foo/bar/> gets parsed as <foo bar="">, but there are plenty of
> other ways to get non-XML-well-formed output from an HTML5 stream. For
> example, <foo \=> tokenises to a start tag token with an attribute "\".
> I'm not convinced we don't want to do that.
>
> But then the HTML parsing model already requires the parser to sometimes
> actively go in and modify the DOM on the fly, so I don't think it's
> possible to guarantee that it will look like an XML parser at all.

The spec says:
"User agents are not free to handle non-conformant documents as they  
please; the processing model described in this specification applies  
to implementations regardless of the conformity of the input documents."

I think this view of conformance is too restrictive. Considering the
availability of XML tools and how easy it would be to handle
*conforming* HTML5 documents with those tools as well, I think the
cost of implementing the full HTML5 processing model is not justified
by the benefit in many (or most) non-browser scenarios.

Consider a non-browser UA that does not do scripting and is a  
conforming XHTML5 UA and is built on off-the-shelf XML tools. It  
would be nice for such an app to be able to ingest HTML5.

Dealing with conforming HTML5 documents would be simple: Just convert  
them into equivalent conforming XHTML5 documents and send them down  
the XML code path. If the UA does not run scripts delivered in  
content, this kind of conversion is pretty harmless. Even if the UA  
is a special-purpose CSS formatter, chances are that requiring the  
style sheets to be authored for XHTML is acceptable.
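
As a rough illustration of this conversion path, here is a minimal
Python sketch. It is not the HTML5 parsing algorithm: it uses the
standard library's html.parser as a stand-in tokenizer, assumes every
non-void element has an explicit end tag, and drops comments and the
doctype. The names XhtmlSerializer, html_to_xhtml and process_as_xml
are made up for this example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Elements that never have content or an end tag in text/html.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class XhtmlSerializer(HTMLParser):
    """Re-serialise tokenizer events as XML text (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.seen_root = False

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        if not self.seen_root:
            # Put the XHTML namespace on the first element seen.
            parts.append("xmlns=" + quoteattr(XHTML_NS))
            self.seen_root = True
        for name, value in attrs:
            # A minimized attribute such as "disabled" has value None.
            parts.append(name + "=" + quoteattr(value or ""))
        closer = "/>" if tag in VOID else ">"
        self.out.append("<" + " ".join(parts) + closer)

    def handle_endtag(self, tag):
        if tag not in VOID:  # void elements were already self-closed
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(escape(data))


def html_to_xhtml(html_text):
    serializer = XhtmlSerializer()
    serializer.feed(html_text)
    serializer.close()
    return "".join(serializer.out)


def process_as_xml(html_text):
    # The "XML code path": any off-the-shelf XML parser would do here.
    return ET.fromstring(html_to_xhtml(html_text))
```

With this sketch, input whose structure the converter cannot make
well-formed (for example an unclosed <p>) ends up being rejected by
the XML parser with xml.etree.ElementTree.ParseError rather than being
repaired.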

The benefits of hooking a TagSoup-like parser into the UA would be
huge compared to the little work required to glue in the additional
off-the-shelf parser.

But what about non-conforming HTML5 docs? On the XHTML5 side,
ill-formed docs would be rejected outright. On the HTML5 side, however,
the spec requires an elaborate error-handling model that leaks into the
internals of the app and does not stay at the parser level. This has
adverse effects:
  * Leaking to the app internals means that the off-the-shelf XML
tools might need patching. XML 1.0 places rather arbitrary
(but mandatory) restrictions, e.g. on attribute names. Having to pass
on names that do not conform would potentially violate every
documented XML-based API contract throughout the app (much like the
infamous XML 1.1...).
  * Because the adoption agency algorithm requires changes to parts
of the doc that have already been passed to the app, it would either
prevent the use of streaming XML APIs or require full-DOM buffering
first, defeating the performance benefits of streaming APIs. Full-DOM
buffering would be unwise in e.g. a high-throughput spider. :-)
  * Implementing the HTML5 processing model incurs a tremendous  
opportunity cost, because whoever has to do it could instead hook up  
a TagSoup-like parser and then work on new features.
  * If whoever has to implement it does it for money, there will be a  
significant accounting cost.
  * The adverse effects on software architecture will incur costs, too.
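
To make the attribute-name point concrete, the following Python sketch
checks names against an ASCII-only approximation of the XML 1.0 Name
production. The real production also admits many non-ASCII characters,
so this is a simplification; the helper names is_xml_name and
illegal_attribute_names are made up for this example. The two
problematic names are the ones from the quoted discussion above.

```python
import re

# ASCII-only approximation of the XML 1.0 Name production (assumption:
# the full production also allows a large set of non-ASCII characters).
_XML_NAME_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_xml_name(name):
    """True if name could be carried as an XML attribute name."""
    return _XML_NAME_ASCII.fullmatch(name) is not None


def illegal_attribute_names(names):
    """Return the names an XML-based API could not legally report."""
    return [n for n in names if not is_xml_name(n)]


# "bar" is fine; "\" and "/bar/" cannot be XML attribute names.
print(illegal_attribute_names(["bar", "\\", "/bar/"]))
```

A parser front end that must honour an XML-based API contract would
have to drop, rename or reject such attributes before handing them on.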

I think the model of processing HTML with XML tools is just too  
useful to be abandoned just because the spec requires a particular  
browser-oriented processing model. It would be nice for the spec to  
acknowledge this.

Therefore, I suggest calling apps that have the following properties  
conforming with some qualifications:
  * The UA is a conforming XHTML5 UA.
  * The UA is able to convert any conforming HTML5 document into a  
conforming XHTML5 document and process the result.
  * text/html byte streams that are not conforming HTML5 documents  
are not handled per spec (e.g. they are shoehorned into well-formed  
XHTML by means other than the prescribed parsing algorithm or they  
are rejected in a Draconian way or some combination of those).
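
The three properties above could be sketched as a small dispatch
function. This is only one possible reading, in Python; the function
ingest, the exception RejectedDocument and the html_to_xhtml callable
(whatever converter the app plugs in) are hypothetical names, not
anything defined by the spec.

```python
import xml.etree.ElementTree as ET


class RejectedDocument(Exception):
    """Raised when input is refused in a Draconian way."""


def ingest(media_type, text, html_to_xhtml):
    """Route a document to the XML code path.

    html_to_xhtml is a caller-supplied converter (hypothetical): it
    must turn conforming HTML5 into conforming XHTML5, and may do
    anything, including raising RejectedDocument, for other input.
    """
    if media_type == "application/xhtml+xml":
        xml_text = text
    elif media_type == "text/html":
        xml_text = html_to_xhtml(text)
    else:
        raise RejectedDocument("unsupported type: " + media_type)
    try:
        # Everything after this point is ordinary XML processing.
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Converter output that is still ill-formed is rejected here.
        raise RejectedDocument(str(exc)) from exc
```

The point of the design is that non-conforming text/html never reaches
the rest of the app: it is either shoehorned into well-formed XHTML by
the converter or refused at this boundary.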

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/