[whatwg] several messages about XML syntax and HTML5

Ian Hickson ian at hixie.ch
Mon Dec 4 12:34:04 PST 2006

As far as I can tell, the various problems that have been raised have now 
been addressed. It is now possible to make an HTML5 document that is 
parseable as XHTML5, such that you can't tell which it is without looking 
at the MIME type (which of course you could ignore in your own toolchain). 
HTML5 has an extension mechanism (used to great effect by the 
microformats.org people, for example).
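A minimal sketch of what "parseable as both" means in practice, assuming Python's standard library as a stand-in toolchain. The sample document and the helper names (xml_tags, html_tags) are illustrative, and html.parser is a lenient tokenizer rather than a conforming HTML5 parser, so this only shows that one byte stream can yield the same element sequence under both kinds of parser:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Illustrative document: well-formed XML that is also acceptable as text/html.
DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body><p>Hello<br/>world</p></body>
</html>"""


class TagCollector(HTMLParser):
    """Records start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_tags(text):
    """Element names seen by an XML parser, with the namespace stripped."""
    root = ET.fromstring(text)
    return [el.tag.split("}")[-1] for el in root.iter()]


def html_tags(text):
    """Element names seen by the lenient HTML tokenizer."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


print(xml_tags(DOC) == html_tags(DOC))
```

Neither parser consults a MIME type here; which one is used is purely the toolchain's choice.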

The other issue, supporting other vocabularies in HTML5, is an open issue, 
but it will be addressed in due course. We need more implementation 
experience first, and there are far more pressing problems.

There were various proposals involving how to process documents using 
multiple parsers with fallback options, etc, but based on my conversations 
with browser vendors, that wouldn't ever be widely supported. If you wish 
to propose this for the spec, please get the browser vendors to implement 
it first (as an experimental mode, e.g.), to demonstrate that they are 
willing to do so.

If there are specific proposals that I missed, please let me know. I think 
I replied to everything I saw.

On Mon, 4 Dec 2006, Mihai Sucan wrote:
> Doesn't [<html:* lang="">] validate in XHTML? If not, this is news to 
> me.

It doesn't, because if it did, there would be two separate ways of 
specifying the language in XHTML. xml:lang="" is part of the XML 
standards, it's how you do languages in XML. If you want to use XML, then 
you use the XML tools, and that means xml:lang="". Otherwise, why bother 
using XML? You wouldn't get any of the advantages. But see below:

On Mon, 4 Dec 2006, Michel Fortin wrote:
> The HTML5 spec currently gives the following authoring requirements regarding
> lang and xml:lang:
> > The lang attribute only applies to HTML documents. Authors must not use the
> > lang attribute in XML documents. Authors must instead use the xml:lang
> > attribute, defined in XML. [XML]
> I'd change it for this:
> "Only the lang attribute applies to HTML documents. For XHTML documents, 
> authors should instead use the xml:lang attribute as defined in XML, 
> although the lang attribute is also allowed for backward compatibility 
> reasons. If an element has both the lang and the xml:lang attributes 
> set, both attributes must have the same value."

The problem is the other way around. If you specify both, that means that 
you're intending to treat the document as both HTML and XML. But if you 
treat it as both, that means you're going to send xml:lang="" to the HTML 
processor, which will ignore it (since there's no "xml:lang" attribute in 
the null namespace). But then if you serialise that as XML, then you'll 
end up with three attributes. This is a mess. Much better is to have one 
attribute for each format, and then the parser can parse the document and 
set the language appropriately. When you output HTML, use lang="", when 
you output XML, use xml:lang="".
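As a rough sketch of that rule, assuming the language has already been resolved by the parser and is stored once in the in-memory model: the function name language_attribute and the "html"/"xml" format labels are made up for illustration.

```python
from xml.sax.saxutils import quoteattr


def language_attribute(language, output_format):
    """Return the single language attribute appropriate to the output format.

    The language is stored once; only the serialiser decides which
    attribute name to emit, so lang and xml:lang never appear together.
    """
    if output_format == "html":
        name = "lang"
    elif output_format == "xml":
        name = "xml:lang"
    else:
        raise ValueError("unknown output format: %r" % (output_format,))
    return "%s=%s" % (name, quoteattr(language))


print("<p %s>" % language_attribute("en", "html"))
print("<p %s>" % language_attribute("en", "xml"))
```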

On Mon, 4 Dec 2006, Mike Schinkel wrote:
> >
> > I've been having a lot of trouble following this discussion. Are there 
> > other requests? What are they?
> 1.) Minimize the changes *required* for existing documents to validate 
> as HTML5

This is already one of the (many) concerns being taken into consideration.

> 2.) Provide strategies that make transitionality possible, and provide 
> incentives for moving in that direction.

This, again, is being done. It should be especially easy to transition 
from any of XHTML, HTML4, or tag soup to HTML5.

> For example, it could be a new media type that encompasses many different 
> media types, first trying XHTML and then trying HTML5, where the user 
> agent allows users to try different subset media types if the one chosen 
> by the browser did not display well.

Several browser vendors have told me that this is not something they would 
consider implementing, so this isn't really an option.

> Even if the spec said they MUST do it, they wouldn't?  Even if the user 
> had the option to toggle through different renderings?

Correct. I believe Opera is the only browser that said they might do this 
(and indeed, they have a browser that does something like this now, from 
what I hear). However, other browsers have different target markets and 
therefore different design decisions. I have to make sure I listen to all 
of them.

> Of course the fastest to display would be XHTML, giving site owners a 
> reason to go with that. OR the user agent could be given the authority 
> to try the different ones whenever it sees text/html. (

It isn't clear to me why you think XHTML would be fastest. In practice, 
HTML is considerably better optimised in browsers than XHTML.

> 3.) Not to incorporate additions to HTML5 which cannot be added to 

This is already a design policy.

> 4.) Minimize the number of differences for people to have to learn and 
> implement HTML and XHTML.  That would mean avoid divergence whenever 
> possible.  This could even mean planning to change XHTML at some point 
> in the future.  Or it could mean having the W3C deprecate XHTML and 
> withdraw it from recommended use.

XHTML5 is not really intended to be used, it's only defined for the 
purposes of making sure XML users don't try to each invent their own 
version, resulting in dozens of incompatible versions.

HTML5 as text/html is the main serialisation format for HTML5.

> > I have huge doubts that this would pass even elementary usability 
> > testing, because most users would just say "I don't care".
> But that's the thing; usability wouldn't matter; let users ignore it.  
> But site owners would fear that it turned off users and they would then 
> be motivated to fix it, especially if the echo chamber that clamors for 
> standards makes a big stink about it.  You need a forcing factor to 
> empower change; Google (and Yahoo and MSN) could make it happen.  Hell, 
> why not test it for a while (like you tested Google Answers) and see if 
> it works or not. Certainly it couldn't be worse than not doing it.

Well, this is out of scope for the WHATWG, but I encourage you to speak to 
browser vendors and search engines and see what they say.

> Why not have text/html5?  (If that's a stupid question, please realize 
> I'm writing this really late.)

There are billions of documents today that use text/html. That's the 
legacy that we're trying to be compatible with. A new MIME type is what 
XHTML attempted to require, and I think it's clear that that didn't work.

> How about *real* XML Data Islands then?

What would those be?

> > Drop the string concatenation, and move to outputting HTML5 using an 
> > XML pipeline with an HTML5 serialiser on the end. (This would 
> > basically mean dropping the HTML4 code, simplifying the CMS, and 
> > making very few changes to the XML serialiser.)
> >
> > Move to HTML5 with an XML pipeline. (This is basically the same as 
> > number 6, except that there's no code to drop first.)
> You do realize that this will happen over a period of many years if it 
> happens at all?  And in the mean time...?

In the mean time... what? HTML5 won't be "complete" for decades; I don't 
see what the problem is here. Everything we're doing here is on a large 

> > > That's an excellent point. My answer is that I was sold on the 
> > > benefits of XHTML, and I still believe in them so I don't want to 
> > > give up on the hope that I can eventually get there.
> >
> > Just out of interest, could you say what those are? It's likely that 
> > HTML5 actually has the same benefits, so that you don't lose them by 
> > using HTML5 instead of XHTML5.
> * A single direction, not multiple (Fan in, not fan out)

XHTML, which introduced a new format, provides a single direction? I'm 
confused. I thought it was the introduction of XHTML that introduced 
multiple formats!

Anyway, with HTML5, you have a single direction: HTML5-as-text/html.

> * Reduction in the number of ways to do things (XHTML vs. HTML) so we 
> don't have to ponder over too many options.

I'm confused as to how XHTML, which introduced a new way of doing things, 
reduced the number of ways of doing things.

Anyway. Just consider HTML5-as-text/html to be your only language, and 
you'll be set. (Some people, who still want XML for some reason, can use 
XHTML5 in their pipeline, but that's not relevant for text/html, and you 
don't need to worry about it or use it.)

> Okay, as I'm reading all the writing about there being no point to XHTML 
> vs. HTML 5 let me make this distinction between XHTML and HTML:
> * For XHTML there is mostly one good way to do this.
> * Because HTML is so lax, there is no standardized and universally 
> accepted advice (AFAICT) for what constitutes the best way to code an 
> HTML document.
> Will there (could there be) a subset of HTML5 that is presented as the 
> preferred way to code HTML5? And if XHTML is going to continue to exist, 
> can that preferred way be as close to XHTML as possible?

The "one way" is the way described here:


On Mon, 4 Dec 2006, Anne van Kesteren wrote:
> > 
> > You should ask yourself, though, why is it that you want to use XML, 
> > if you don't like what it implies?
> XML is currently the only way to distribute a feed. I know there's 
> hAtom, but it's unclear to me how well supported it is by various feed 
> readers. Given that all feed formats support some kind of HTML tag soup 
> it would seem indeed better to just have an HTML format for feeds but 
> currently there isn't any.

Feed formats like Atom and hAtom are outside the scope of the WHATWG.

> XML is currently the only way to create an SVG DOM without having to 
> write your graphic using ECMAScript and DOM methods (and then it 
> wouldn't work for background-image etc. unless you allow HTML content 
> there which could arguably be allowed).

SVG is a presentational-level concern, so probably not relevant to HTML. 
But that's a separate discussion and doesn't have to have anything to do 
with XML itself.

> XML seems to be the only way to create XBL2 content in the future even 
> though XBL2 is described in terms of the DOM and not in terms of a 
> particular markup language.

XBL is out of scope for HTML.

> Those are the reasons why I want to use XML even though I don't like 
> what it implies.

Ok... but none of those have anything to do with HTML-as-XML. :-)

On Mon, 4 Dec 2006, Sam Ruby wrote:
> > 
> > * Possible Request A: We want a way to add proprietary markup to HTML 
> > documents, and have them be usable by text/html browsers.
> > 
> > This won't work, because the browsers won't support that proprietary 
> > markup. This has nothing to do with the specs. (The same problem 
> > exists in XML.) For the same reason, proprietary markup is poor for 
> > accessibility. HTML actually has a mechanism to add custom/proprietary 
> > semantics to general HTML semantics, which works hand-in-hand with 
> > good accessibility techniques and _does_ work in existing browsers, 
> > namely the "class", "rel", and (for now) "profile" attributes. This is 
> > how microformats.org work. This doesn't require any sort of XML 
> > markup.
> s/usable by text\/html browsers/ignored by pure text\/html browsers/

Sure. As noted above, HTML does allow this, using the microformats.org 
architecture. This is unrelated to XML.

> > * Possible Request B: We want a way to add markup representing 
> > standard vocabularies other than HTML (e.g. MathML, SVG, DocBook, RDF) 
> > to HTML documents, and have them be usable by text/html browsers.
> > 
> > These should be raised as distinct feature requests. We're already 
> > looking at adding Math markup to HTML (probably in a way compatible 
> > with MathML renderer implementations). SVG is not semantically rich 
> > (it's presentational), and so probably belongs not in the document 
> > layer (HTML) but in the presentation layer (CSS+XBL) or the embedding 
> > layer (external documents using <object> and fallback content for 
> > accessibility).
> Yes, it would be great if SVG were hooked into the presentation layer 
> and the embedding layer.  But one of my frustrations with XHTML 2.0 is 
> that it tries to enforce layering.

XHTML2 is also out of scope for WHATWG. :-)

> <hr> is presentational, and exists in HTML5.  People use <br> for lists, 
> and <table>s for layout.

<hr> is not presentational in HTML5 (check how it is defined).

People abuse markup, yes, but I don't see how that is relevant here.

Anyway, as I said, SVG in HTML is something being looked at. We'll see. It 
doesn't have to involve doing anything with XML, and if it does, would be 
done on its own merits, not as part of a larger "support XML" campaign.

> By designing in extensibility [...]

HTML has a well-defined extensibility model, as used by the 
Microformats community. It's even got a good accessibility story.

> HTML5 can do one better.  Instead of handling presentational MathML as a 
> special case, this support can be generalized.  When a non-HTML element 
> is encountered inside a HTML document, the parser could make one 
> additional check: does this element have an xmlns attribute defined? If 
> so, it can enter a "consume foreign markup" stage whereby these elements 
> are simply placed into the resulting DOM.  Such elements would therefore 
> be made available to processors like JavaScript, which could enable some 
> cool applications.

This, unfortunately, would break a significant number of existing pages 
(there is a LOT of bogus xmlns=""-using content on the Web today, and 
implementation experience with trying to do what you describe was very 

> But as to building up an entire HTML tool chain that rivals XML's?

No need, just sticking HTML5 parsers and serialisers onto the end of an 
XML toolchain is enough.
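A minimal sketch of the "serialiser on the end" idea, assuming Python's ElementTree as the XML toolchain. Its method="html" output mode is only a stand-in here: it knows about empty elements such as <br> and <hr>, but it is not an HTML5 serialiser, and the function name pipeline_output is made up for illustration.

```python
import xml.etree.ElementTree as ET


def pipeline_output(xml_source, output_format):
    """Run content through an XML pipeline, choosing the syntax only at the end."""
    root = ET.fromstring(xml_source)

    # Any XML-side processing would happen here, on the tree.
    for paragraph in root.iter("p"):
        paragraph.set("class", "body-text")

    # The only format-specific step is the final serialisation.
    return ET.tostring(root, encoding="unicode", method=output_format)


SOURCE = "<div><p>Hello<br/>world</p><hr/></div>"
print(pipeline_output(SOURCE, "html"))
print(pipeline_output(SOURCE, "xml"))
```

Everything before the last line of the function is ordinary XML processing; only the final call decides between the two syntaxes.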

> Until then, the preferred technique for extracting things like trackback 
> metadata will continue to be screen scraping with regular expressions.

I believe pingback shows quite clearly that extension mechanisms for such 
things already exist and that the fact that trackback doesn't use them is 
not a fault of HTML.
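For illustration, a sketch of pingback-style discovery that checks an X-Pingback response header first and then a <link rel="pingback"> element, using a parser rather than regular expressions. The names PingbackLinkFinder and discover_pingback are hypothetical, the headers argument is assumed to be a plain mapping, and html.parser stands in for a real HTML parser:

```python
from html.parser import HTMLParser


class PingbackLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="pingback"> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "pingback" in rel_tokens and self.href is None:
            self.href = attributes.get("href")


def discover_pingback(headers, body):
    """Return the advertised pingback endpoint, or None if there is none."""
    # The HTTP header takes priority over anything in the markup.
    for name, value in headers.items():
        if name.lower() == "x-pingback":
            return value
    finder = PingbackLinkFinder()
    finder.feed(body)
    finder.close()
    return finder.href


page = '<html><head><link rel="pingback" href="http://example.org/xmlrpc"></head></html>'
print(discover_pingback({}, page))
```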

> > The problem is that the common subset would be just that -- a subset. 
> > The common subset of HTML and XHTML has very few useful features!
> Would you be willing to concede that that is open to debate?

Well, yes, we're debating it. :-)

> The fact that my weblog and my planet are usefully viewable on Lynx is a 
> counter example that is meaningful to me.

My point is that if you used HTML5 instead, you would have _more_ features 
available to you, and at the same time, you would still be compatible with 
Lynx. And with an HTML5 parser on the front, you'd still be compatible 
with your XML pipeline.

> All I ask is that you keep an open mind while we collectively explore 
> whether there are extremely selective and surgical changes that can be 
> made to html5 -- like the change to allow empty element syntax only on a 
> handful of elements.

I think it should be obvious by now that I am indeed keeping an open mind. 
I'd ask you to keep an open mind as well. :-)

On Mon, 4 Dec 2006, Elliotte Harold wrote:
> Ian Hickson wrote:
> > 
> > * Possible Request A: We want a way to add proprietary markup to HTML
> > documents, and have them be usable by text/html browsers.
> > 
> > This won't work, because the browsers won't support that proprietary markup.
> > This has nothing to do with the specs. (The same problem exists in XML.) 
> That depends on the antecedent of "them".

The features.

> My request is that I be able to add proprietary markup to HTML 
> documents, and have the *documents* be usable by text/html browsers. My 
> own non-browser applications can make use of the proprietary markup.

Ok. HTML supports this today, in both the HTML and XML serialisations, 
using class values and rel types. Microformats.org is the community that 
is most actively working with these mechanisms, but the mechanisms are 
open to anyone to use. As I mentioned above, this even has a pretty decent 
accessibility story (which is unusual for extension mechanisms).
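A small sketch of how a non-browser application might read such extensions, assuming the custom semantics are carried in class values and rel types. The class name ExtensionScanner and the sample markup are illustrative, and html.parser is used only as a convenient tokenizer:

```python
from html.parser import HTMLParser


class ExtensionScanner(HTMLParser):
    """Collects class names and (rel, href) pairs from a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # class and rel are both space-separated token lists.
        for class_name in (attributes.get("class") or "").split():
            self.classes.add(class_name)
        for rel in (attributes.get("rel") or "").split():
            self.links.append((rel.lower(), attributes.get("href")))


SAMPLE = (
    '<p class="vcard"><span class="fn">Jane Doe</span> '
    '<a rel="license" href="http://example.org/license">terms</a></p>'
)

scanner = ExtensionScanner()
scanner.feed(SAMPLE)
scanner.close()
print(sorted(scanner.classes), scanner.links)
```

A browser that knows nothing about these class or rel values still renders the paragraph normally, which is the property being asked for.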

On Mon, 4 Dec 2006, Elliotte Harold wrote:
> Well, no. I do use XML parsers to process HTML documents on a regular 
> basis. I don't do it for all documents, but I profitably do it for quite 
> a few. As long as it's possible to make the documents well-formed, it's 
> possible to parse them with an XML parser because they are XML.

Ok, well, it is now possible to use a subset of HTML and XML that is 
compatible, so you can now enjoy whatever flexibility this gives you. :-)

On Mon, 4 Dec 2006, Elliotte Harold wrote:
> >
> > What is it about XML that you like, that you don't get with HTML, that 
> > makes you request that we make HTML more like XML?
> I'm not sure which HTML you're talking about here, but
> 1. A reliable, practical tool chain including XSLT

This will be available once we have HTML5<->XHTML5 convertors. Work on 
this is actively happening as we speak.

> 2. Extensibility. I want to embed the markup I need without blowing up 
> browsers.

This is already supported in HTML5, as mentioned above.

On Mon, 4 Dec 2006, Mike Schinkel wrote:
> The irony is I'm not proposing much; just have as a design axiom that 
> the trajectory of HTML5 and XHTML should be aimed toward convergence when 
> technically possible.

This is already the case (but it's not often technically possible).

On Mon, 4 Dec 2006, Sander Tekelenburg wrote:
>>> HTML 5 can include an ESP engine spec;
>> ESP engine spec?
> The "ESP engine" is that part of the browser that guesses what authors 
> might have meant when the parser runs into crap posing as HTML.

Yes, HTML5 includes this. It isn't ESP. It's a very carefully and 
unambiguously defined specification.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
