[whatwg] several messages about XML syntax and HTML5

Mon Dec 4 11:33:55 PST 2006

Sam Ruby wrote:
[snip]
> HTML5 can do one better.  Instead of handling presentational MathML as a 
> special case, this support can be generalized.  When a non-HTML element 
> is encountered inside an HTML document, the parser could make one 
> additional check: does this element have an xmlns attribute defined? If 
> so, it can enter a "consume foreign markup" stage whereby these elements 
> are simply placed into the resulting DOM.  Such elements would therefore 
> be made available to processors like JavaScript, which could enable some 
> cool applications.
> 
[snip]
> 
> Finally (whew!) unlike Microsoft's mis-advertised and undocumented XML 
> data islands, this "architected HTML extension syntax" would clearly and 
> unabashedly be parsed by HTML5 parser rules for things like comments and 
> attributes.
[snip]

An HTML parser relies on knowledge of the schema, so it's not easy to 
parse an arbitrary, unknown schema with an HTML parser.

For example, HTML offers no syntactic way to differentiate between 
"void" elements like <br> and normal elements like <div>. The parser 
just "knows" that BR is void.
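
To illustrate the point, here is a minimal Python sketch. The void list is 
abbreviated, the names VOID_ELEMENTS, TAG and nesting are made up for this 
example, and the regex is a stand-in for a real tokenizer. It only shows 
that the nesting depth depends on a table the parser carries around, not 
on anything in the syntax.

```python
import re

# Assumption: an abbreviated table of void elements, for illustration only.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

# Simplified tag matcher; ignores comments and ">" inside attribute values.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def nesting(markup):
    """Return (name, depth) for each start tag found in markup."""
    depth = 0
    result = []
    for match in TAG.finditer(markup):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()
        if is_end_tag:
            depth -= 1
        else:
            result.append((name, depth))
            # Only the lookup table tells us whether to descend.
            if name not in VOID_ELEMENTS:
                depth += 1
    return result
```

With this sketch, nesting("<div><br><p>text</p></div>") puts <p> at depth 
1, a sibling of <br>, while nesting("<div><foo><p>text</p></div>") puts 
<p> at depth 2, inside the unknown <foo>, even if the author meant <foo> 
to be void.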

Likewise, the content model of the <script> element is "hardcoded" into 
the parser; there's no way to discover it from the syntax alone. (I'll 
admit that there's no similar construct to the content model of <script> 
in XML, however, so this particular difference doesn't pose a problem.)

To handle custom elements in HTML while still allowing them to appear in 
the DOM, you'd have to impose some rules, for example that custom markup 
may not contain void elements. Elements that would otherwise be void 
would then have to be written as, say, <img></img> in order to be handled 
correctly by the parser.

Even if you aren't constructing a DOM of these unknown elements, you 
need to be able to count opens and closes so that you can detect the end 
of the root custom element and resume normal parsing.
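
A rough Python sketch of that counting, assuming the "no void elements" 
rule holds inside the custom markup so that every start tag has a 
matching end tag. The function name end_of_foreign_root and the regex are 
made up for this example; a real parser would also have to deal with 
comments and with ">" inside attribute values.

```python
import re

# Simplified matcher for any start or end tag (assumption: no comments,
# no ">" inside attribute values).
ANY_TAG = re.compile(r"<(/?)[a-zA-Z][^>]*>")

def end_of_foreign_root(markup, start=0):
    """Return the index just past the end tag that closes the element
    whose start tag begins at `start`, or None if it is never closed."""
    depth = 0
    for match in ANY_TAG.finditer(markup, start):
        if match.group(1):      # end tag: one level up
            depth -= 1
        else:                   # start tag: one level down
            depth += 1
        if depth == 0:
            return match.end()  # normal HTML parsing resumes here
    return None
```

For '<p>a<x xmlns="urn:example"><y></y><img></img></x>b</p>', starting at 
the index of "<x", this returns the position right before "b</p>". If the 
custom markup contains an unclosed <img>, the count never returns to zero 
and the function returns None, which is why the rule against void 
elements would be needed.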

> Standard browsers would be advised to ignore extensions that they
> don't understand.  Including any text, so we don't have a repeat of
> the <table> problem again.

I'm sure you realise this, but there are already browsers out there that 
*don't* ignore extensions that they don't understand, so a mandate such 
as this would be meaningless.