[whatwg] several messages about XML syntax and HTML5

Mon Dec 4 05:55:17 PST 2006

Ian Hickson wrote:
> I've been having a lot of trouble following this discussion, because I 
> can't work out what it is that is being asked for. There seem to be 
> multiple discussions going on, and it isn't clear to me that everybody 
> really knows what they are arguing for or against.
> 
> I've changed the spec to allow a (meaningless) "xmlns" attribute on the 
> root <html> element, for the same reasons /> is allowed on void elements 
> now. I don't think it's a particularly useful thing, but I'm curious to 
> see what people think. (Like anything in the spec, we might remove it in 
> due course, based on real world experiences with the spec.)

EX-CELL-ENT!

> * Possible Request A: We want a way to add proprietary markup to HTML 
> documents, and have them be usable by text/html browsers.
> 
> This won't work, because the browsers won't support that proprietary 
> markup. This has nothing to do with the specs. (The same problem exists in 
> XML.) For the same reason, proprietary markup is poor for accessibility. 
> HTML actually has a mechanism to add custom/proprietary semantics to 
> general HTML semantics, which works hand-in-hand with good accessibility 
> techniques and _does_ work in existing browsers, namely the "class", 
> "rel", and (for now) "profile" attributes. This is how microformats.org 
> work. This doesn't require any sort of XML markup.

s/usable by text\/html browsers/ignored by pure text\/html browsers/

While "pure" text/html browsers are a substantial and unquestionably 
important use case for /X?HTML\d/, people always find inventive ways to 
(mis-)use data.

A clear (good) example is autodiscovery links: they were originally 
created for use by non-text/html browsers; support by text/html 
browsers came later.

A considerably less good example: the RDF for trackback autodiscovery, 
embedded in structured comments that are common across the web, 
including your weblog.

But, on balance, allowing others to "poach" on the xhtml namespace (can 
anyone say <blink> tag?  Better yet, how about <object>?) places 
practical limits on how the HTML specification can evolve, hence:

> * Possible Request B: We want a way to add markup representing standard 
> vocabularies other than HTML (e.g. MathML, SVG, DocBook, RDF) to HTML 
> documents, and have them be usable by text/html browsers.
> 
> These should be raised as distinct feature requests. We're already looking 
> at adding Math markup to HTML (probably in a way compatible with MathML 
> renderer implementations). SVG is not semantically rich (it's 
> presentational), and so probably belongs not in the document layer (HTML) 
> but in the presentation layer (CSS+XBL) or the embedding layer (external 
> documents using <object> and fallback content for accessibility).

Yes, it would be great if SVG were hooked into the presentation layer 
and the embedding layer.  But one of my frustrations with XHTML 2.0 is 
that it tries to enforce layering.

<hr> is presentational, and exists in HTML5.  People use <br> for lists, 
and <table>s for layout.

MathML defines both presentational markup and content markup.  Guess 
which one people are clamoring to have supported by HTML5?  Yup, the 
presentational markup.

There are a number of reasons for this.  If you are writing an itex2mml 
processor, the last thing you want to do is create multiple artifacts 
that will inevitably get mismanaged.  Additionally, there are always 
going to be contexts where you literally only get one shot: the 
<description> element inside <rss> is a perfect example.

And tying back into the example above, designed-in extensibility is 
important for non-browser applications.  We've all seen what MS Word 
does in the interest of supporting round tripping; but a (small) part of 
the blame has to be placed on HTML itself.  By not designing in 
extensibility, it practically invites these perversions.  (And, yes, I'm 
blaming the victim, which is morally equivalent to saying "but she was 
wearing a red dress, and therefore practically asking for it").

By designing in extensibility, you at least provide a basis for saying 
"put your crap here so at the very least I can step around it".  HTML 
has a rich history of "ignore tags you don't understand" which sometimes 
even works (though I'm sure you remember when <table> was new, and 
existing browsers at the time produced rather unreadable renderings).

HTML5 can do one better.  Instead of handling presentational MathML as a 
special case, this support can be generalized.  When a non-HTML element 
is encountered inside an HTML document, the parser could make one 
additional check: does this element have an xmlns attribute defined? 
If so, it can enter a "consume foreign markup" stage whereby these 
elements are simply placed into the resulting DOM.  Such elements would 
therefore be made available to processors like JavaScript, which could 
enable some cool applications.
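
As a rough illustration of that check, here is a minimal Python sketch. 
It is not the HTML5 parsing algorithm: it leans on the tokenizer in the 
standard library's html.parser module, and the KNOWN_HTML set, the 
ForeignAwareParser class, and the tuple-based tree are all hypothetical 
names invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical, heavily abbreviated list of element names treated as HTML.
KNOWN_HTML = {"html", "head", "title", "body", "p", "div", "span", "a", "br", "hr"}


class ForeignAwareParser(HTMLParser):
    """Collect subtrees rooted at unknown elements that carry xmlns."""

    def __init__(self):
        super().__init__()
        self.foreign_roots = []  # finished trees as (tag, attrs, children)
        self._stack = []         # currently open foreign elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._stack:
            # Already in the "consume foreign markup" stage: keep everything.
            node = (tag, attrs, [])
            self._stack[-1][2].append(node)
            self._stack.append(node)
        elif tag not in KNOWN_HTML and "xmlns" in attrs:
            # The one additional check: unknown element with an xmlns attribute.
            node = (tag, attrs, [])
            self.foreign_roots.append(node)
            self._stack.append(node)
        # Otherwise: ordinary HTML or an unknown tag without xmlns; not tracked.

    def handle_endtag(self, tag):
        # Simplification: only a matching end tag closes a foreign element.
        if self._stack and self._stack[-1][0] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Text is kept only while inside foreign markup.
        if self._stack and data.strip():
            self._stack[-1][2].append(data)


parser = ForeignAwareParser()
parser.feed(
    '<p>x<math xmlns="http://www.w3.org/1998/Math/MathML">'
    "<mi>a</mi><mo>+</mo><mn>1</mn></math></p><foo>ignored</foo>"
)
parser.close()
for tag, attrs, children in parser.foreign_roots:
    print(tag, attrs["xmlns"], [child[0] for child in children])
```

The foreign subtree is merely recorded, with no meaning assigned to it. 
An unknown element without xmlns (the <foo> above) is not collected at 
all.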

Just to be clear: HTML5's definition of interoperability for such 
elements begins and ends at the DOM produced.  Nothing more.  Nothing 
less.  In particular, it would assign no further meaning to these elements.

Standard browsers would be advised to ignore extensions that they don't 
understand, including any text they contain, so we don't have a repeat 
of the <table> problem.  Browsers could, however, choose to look for 
such extensions: "I see that this page utilizes MathML, do you want me 
to download the plugin that supports this for you?".

People who design embeddable markup languages would be advised to avoid 
names that already have meaning in HTML, and furthermore, be advised to 
place all text in attributes, where practical.  And even if they don't, 
people who use these languages would be advised to avoid those features 
(for example, I avoid SVG's title).

Finally (whew!) unlike Microsoft's mis-advertised and undocumented XML 
data islands, this "architected HTML extension syntax" would clearly and 
unabashedly be parsed by HTML5 parser rules for things like comments and 
attributes.  Namespace prefixes could not be used for elements. 
However, some thought would need to be given to things like xlink in 
MathML attributes.

> * Possible Request C: We want XML-style draconian error handling for 
> text/html.

Egads!  No.

> * Possible Request D: We want HTML-style graceful error handling for XML 
> content.
> 
> This is out of scope of the HTML5 specification.

+1

> * Possible Request E: We want to use XML syntactic sugar in HTML.
> 
> This wouldn't work, because new syntactic sugar in HTML would have to be 
> compatible with legacy content and legacy browsers. The XML syntactic 
> sugar (like <![CDATA[]]>) doesn't really work well in HTML (what with it 
> becoming a comment and all). Some things -- xmlns="", />, ' -- are 
> already allowed in HTML5. We could theoretically add PIs as well, I guess. 
> I don't see what else we could add. (And PIs are a bad idea in both XML 
> and HTML anyway, except for stylesheets, where for backwards compatibility 
> reasons we have to rely on <link> in HTML anyway.)

XHTML2 and HTML5 evolved from common ancestors.  Much of that evolution 
(in hindsight) did not give careful thought to migration and 
co-existence.  The inevitable result is that some XMLisms appear to 
work sometimes in some browsers, but not consistently.  And given 
browser vendors' desire to not break content once it is created -- even 
if it is not to spec -- they quickly become powerless to change this.

The WHATWG certainly didn't create this problem.  But since HTML5 
explicitly recognizes XHTML5 as a valid alternate serialization, these 
differences should not be left as an exercise to the student.

At this point, I feel compelled to say that I am pleased with the rapid 
progress that is being made towards addressing this.

Certain XMLisms should simply be accepted as they are widely supported 
and cause no harm.  Empty/void is clearly an example.  Certain XMLisms 
would cause great harm, like disallowing "]]>" as a sequence of 
characters in attribute values or open text.

IMHO, it should be a goal of this work group to strive to clearly 
separate these two cases; and furthermore, to strive to further separate 
the latter into two buckets: a (as large as humanly possible) bucket in 
which the XMLism is identified as a (recoverable) parse error, and a (as 
small as humanly possible) bucket in which the XMLism is not only valid 
as HTML, but interpreted in a different way.

> * Possible Request F: We want a powerful tool chain like the XML one.
> 
> By introducing HTML5 parsers and serialisers that plug onto the ends of 
> existing XML toolchains, we can leverage the XML tool chain without having 
> to force authors to use XML. Work is already in progress to enable this. 
> The parser specification (new in HTML5) enables this. The lack of such a 
> spec for previous versions of HTML is, IMHO, the reason why there has 
> never been a strong HTML tool chain.

This works only to the extent that the DOMs are compatible.

And, yes, a firm spec and HTML5 parsers and serializers would be most 
welcome.

But as to building up an entire HTML tool chain that rivals XML's?  I 
would say that that is unlikely until there is some thought put into 
architected extensions.  Until then, the preferred technique for 
extracting things like trackback metadata will continue to be screen 
scraping with regular expressions.
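
For anyone who hasn't had the pleasure, the screen scraping in question 
looks roughly like the Python sketch below.  The sample page, its 
example.org URLs, and the scrape_trackback helper are made up for 
illustration; only the general shape (an rdf:RDF block hidden inside an 
HTML comment) is taken from what is found in the wild.

```python
import re

# An HTML comment that wraps an <rdf:RDF> block.
RDF_COMMENT = re.compile(r"<!--\s*(<rdf:RDF\b.*?</rdf:RDF>)\s*-->", re.DOTALL)
# One of the attributes of interest, e.g. trackback:ping="...".
ATTRIBUTE = re.compile(
    r'\b(rdf:about|dc:identifier|dc:title|trackback:ping)="([^"]*)"'
)


def scrape_trackback(html):
    """Return one dict of attribute values per commented-out RDF block."""
    return [dict(ATTRIBUTE.findall(block)) for block in RDF_COMMENT.findall(html)]


# Hypothetical page fragment, for illustration only.
SAMPLE_PAGE = """
<p>Post body</p>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
<rdf:Description
    rdf:about="http://example.org/post/1"
    dc:identifier="http://example.org/post/1"
    dc:title="Example post"
    trackback:ping="http://example.org/tb/1" />
</rdf:RDF>
-->
"""

for entry in scrape_trackback(SAMPLE_PAGE):
    print(entry["trackback:ping"])
```

Note that nothing here understands either HTML or RDF: single-quoted 
attributes, a different prefix, or an entity in a value would all slip 
past it, which is rather the point.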

The primary reason why that metadata was placed as a comment is that 
otherwise it would be flagged by the W3C validator.  It was not that it 
caused any interoperability problems, it is just that it would bother 
people who cared about things like spec compliance.  Specs would be well 
advised to avoid creating such tension.

> Down to specific e-mails sent over the weekend:
> 
> The problem is that the common subset would be just that -- a subset. The 
> common subset of HTML and XHTML has very few useful features!

Would you be willing to concede that that is open to debate?  The fact 
that my weblog and my planet are usefully viewable on Lynx is a counter 
example that is meaningful to me.

> It would be better to have hard data to work with, rather than having to 
> rely on our opinions of this. My own research does not suggest that most 
> authors use tools. That over three quarters of pages have major syntactic 
> errors leads me to suspect that tools are not going to save the syntax.

+1.

I'll add that most tools are created by fallible humans with only a 
shallow understanding of the relevant specifications.

> On Sat, 2 Dec 2006, Robert Sayre wrote:
>> It would not take much to add an "if the element has an 'xmlns' 
>> attribute" to the "A start tag token not covered by the previous 
>> entries" state in "How to handle tokens in the main phase" section of 
>> the document.
> 
> This would break millions of pages, sadly. There are huge volumes of pages 
> that have bogus xmlns="" attributes with all kinds of bogus values on the 
> Web today. I worked for a browser vendor in the past few years that tried 
> to implement xmlns="" in text/html content, and found that huge amounts of 
> the Web, including many major sites, broke completely. We can't introduce 
> live xmlns="" attributes to text/html.

All I ask is that you keep an open mind while we collectively explore 
whether there are extremely selective and surgical changes that can be 
made to html5 -- like the change to allow empty element syntax only on a 
handful of elements.

> On Sat, 2 Dec 2006, Sam Ruby wrote:
>> The question is: what would the HTML5 serialization be for the DOM which is
>> internally produced by the script in the following HTML5 document?
>>
>>   http://intertwingly.net/stories/2006/12/02/whatwg.logo
> 
> Currently, there wouldn't be one. We could extend HTML5 to have some sort 
> of way of doing this, in the future. (It isn't clear to me that we'd want 
> to allow inline SVG, though. It's an external embedded resource, not a 
> semantically-rich part of the document, IMHO.)

When you couple this answer with the concept of a generalized [X]HTML 
toolchain, the inevitable tendency would be to want an HTML5 deserializer 
on one end and an XHTML5 serializer on the other end.  And not just any 
XML serializer, but one that limited itself to a subset of XML that 
could safely be processed by HTML5 deserializers.
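
To make the shape of such a toolchain concrete, here is a minimal 
Python sketch: tag soup in, well-formed XML out.  It is only a stand-in. 
The standard library's html.parser is a tokenizer, not an implementation 
of the HTML5 parsing algorithm, and the VOID set, the TreeBuilder class, 
and html_to_xml are names invented for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Abbreviated, assumed list of void elements (no end tag, no content).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Build an ElementTree from start-tag, end-tag, and text events."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._open = []  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) become empty strings.
        attrib = {name: value if value is not None else "" for name, value in attrs}
        if self._open:
            element = ET.SubElement(self._open[-1], tag, attrib)
        else:
            element = ET.Element(tag, attrib)
            if self.root is None:
                self.root = element  # later top-level elements are dropped
        if tag not in VOID:
            self._open.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore strays.
        for index in range(len(self._open) - 1, -1, -1):
            if self._open[index].tag == tag:
                del self._open[index:]
                break

    def handle_data(self, data):
        if not self._open:
            return
        parent = self._open[-1]
        if len(parent):
            # Text after a child element is stored as that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(source):
    """Deserialize HTML-ish text and reserialize it as well-formed XML."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


print(html_to_xml("<html><body><p>one<br>two</p><hr></body></html>"))
```

The output is well-formed XML, but nothing in this sketch keeps it 
inside the subset that an HTML5 deserializer could safely read back.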

If the spec explicitly disallows things useful to this toolchain, then 
the opportunity exists for somebody to move the discussion of what 
constitutes interop from "what does the spec say" to "what does this 
toolchain support".

As the set of DOMs that have a defined and interoperable HTML5 
serialization grows, this picture changes to one in which having an 
HTML5 deserializer on one end and an HTML5 serializer on the other is 
increasingly attractive.

> On Sun, 3 Dec 2006, Sam Ruby wrote:
>> In the hopes that it will bring focus to this discussion:
>>
>> http://wiki.whatwg.org/wiki/HtmlVsXhtml
> 
> This has now been updated with a more complete list of differences.

Thanks!

- Sam Ruby