[whatwg] Namespaces and tag names in the HTML parser

Thu May 30 07:34:08 PDT 2013

On Wed, May 29, 2013 at 3:19 PM, Ian Hickson <ian at hixie.ch> wrote:
> On Wed, 27 Feb 2013, Adam Klein wrote:
>>
>> Consider the following script:
>>
>> tr = document.createElement('tr')
>> tr.innerHTML = '<math><tr><mo><td>';
>>
>> That is, the fragment is parsed with tr as the context element. What
>> should the generated DOM be?
>
> Up to the <td> it's unambiguous and uncontroversial, I hope; and should
> be:
>
>    <html:tr>
>     <math:math>
>      <math:tr>
>       <math:mo>
>
> At the "<td>", you clear the stack back to a table row context, which pops
> all the nodes from the stack except the root one (the <html> one,
> representing the original <tr> element on which innerHTML was invoked).
>
> It thus results in:
>
>    <html:tr>
>     <math:math>
>      <math:tr>
>       <math:mo>
>     <html:td>
>
>
>> Note that <mo> is a "MathML text integration point", which causes the
>> <td> to be processed not as foreign content but as a normal HTML token.
>> This leads to the following DOM in WebKit:
>>
>> <tr>
>>     <math math>
>>         <math tr>
>>             <math mo>
>>     <td>
>>
>> (the "math" prefixes denote that these are elements with the MathML
>> namespace.)
>
> That is correct.
>
>
>> In Gecko, I instead get:
>>
>> <tr>
>>     <math math>
>>         <math tr>
>>             <math mo>
>>             <td>
>
> That is not.
>
>
>> The spec for what should happen to that <td> is the first step of
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-intr
>>
>> This case clearly seems like a bug in Gecko: it's treating the <math tr>
>> as if it's an HTML <tr>. That is, it's comparing only the local name (or
>> "tag name" as the spec usually refers to it).
>
> Right, that's wrong. The spec isn't ambiguous here, it explicitly says
> that the current node must be a <tr> or <html> element, not an element
> with a "tr" or "html" tag name, and <tr> and <html> elements are in the
> HTML namespace (they're even hyperlinked to their definitions).
>
>
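
To make the difference concrete, here is a minimal sketch of the "clear
the stack back to a table row context" step run against the stack from
Adam's example. It is not taken from any browser; the { ns, name }
element model and every function name below are assumptions made up for
illustration. The only thing that varies is whether the check looks at
the namespace.

```javascript
// Hypothetical model: an element is just { ns, name }.
const HTML = "html", MATHML = "math";
const el = (ns, name) => ({ ns, name });
const fmt = (stack) => stack.map((n) => `${n.ns}:${n.name}`);

// "Clear the stack back to a table row context": pop until the predicate
// accepts the current node (the last entry). The root is never popped.
function clearToTableRowContext(stack, isRowContext) {
  const out = stack.slice();
  while (out.length > 1 && !isRowContext(out[out.length - 1])) out.pop();
  return out;
}

// Element identity: tag name plus HTML namespace.
const byIdentity = (n) => n.ns === HTML && (n.name === "tr" || n.name === "html");
// Tag name only: a MathML <tr> passes for an HTML <tr>.
const byTagName = (n) => n.name === "tr" || n.name === "html";

// Stack when "<td>" arrives for tr.innerHTML = '<math><tr><mo><td>'.
const stack = [el(HTML, "html"), el(MATHML, "math"), el(MATHML, "tr"), el(MATHML, "mo")];

console.log(fmt(clearToTableRowContext(stack, byIdentity)));
// [ 'html:html' ]
console.log(fmt(clearToTableRowContext(stack, byTagName)));
// [ 'html:html', 'math:math', 'math:tr' ]
```

With the identity check the <td> is appended to the root, as a sibling
of <math math> (the WebKit tree). With the tag-name check it lands
inside <math tr>, next to <math mo> (the Gecko tree).
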
>> But this same ambiguity exists elsewhere in the spec. For example, the
>> very next item under "in row" says "If the stack of open elements does
>> not have an element in table scope with the same tag name as the token"
>> (in this case, it's looking for a <tr>).
>
> Yeah, that text is wrong, because part of the rules look for <*:tr>, and
> part assume that only <html:tr> was matched. In fact, it means that
> tr.innerHTML = '<math><tr><mo></tr>' has no parse error and pops the root
> <html> off the tree! That's clearly bogus.
>
>
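
A rough sketch of the mixed comparison Ian describes, for
tr.innerHTML = '<math><tr><mo></tr>'. The scope check compares tag names
only, while the popping that follows assumes an HTML <tr> was found.
The { ns, name } model and the function names are hypothetical, and the
scope boundary is simplified to the HTML <html> and <table> elements.

```javascript
// Hypothetical model: an element is just { ns, name }.
const HTML = "html", MATHML = "math";
const el = (ns, name) => ({ ns, name });
const isHtml = (n, ...names) => n.ns === HTML && names.includes(n.name);

// Scope check that compares the tag name only, ignoring the namespace.
function hasInTableScopeByTagName(stack, tagName) {
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i].name === tagName) return true;
    if (isHtml(stack[i], "html", "table")) return false; // scope boundary
  }
  return false;
}

// </tr> in "in row": tag-name scope check, then identity-based popping.
function endTagTrMixed(stack) {
  const out = stack.slice();
  if (!hasInTableScopeByTagName(out, "tr")) {
    return { stack: out, parseError: true }; // token would be ignored
  }
  // Clear back to a table row context: stop at an HTML <tr> or <html>.
  while (out.length > 0 && !isHtml(out[out.length - 1], "tr", "html")) out.pop();
  out.pop(); // "pop the current node", assumed to be an HTML <tr>
  return { stack: out, parseError: false };
}

// Stack when "</tr>" arrives.
const stack = [el(HTML, "html"), el(MATHML, "math"), el(MATHML, "tr"), el(MATHML, "mo")];
const result = endTagTrMixed(stack);
console.log(result.parseError, result.stack.length); // false 0
```

The MathML <tr> satisfies the scope check, the clearing step stops only
at the root, and the final pop then removes the root itself.
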
>> I think the HTML parser ought to specify more precisely how to deal with
>> namespaces in the stack of open elements, given that that stack can
>> contain elements of varying namespaces.
>
> It's not so much that it has to do it precisely (it does), it's that it
> has to do it accurately...
>
> There's a huge number of places in the spec that do tag name comparisons
> rather than element identity (tag+namespace) comparisons, and it's not at
> all clear to me that they should all change. Consider:
>
> On Fri, 15 Mar 2013, Rafael Weinstein wrote:
>>
>> I just opened another similar bug:
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=21292 which has a similar
>> root cause.
>>
>> I agree with Adam that it seems wrong that the stack of open elements
>> can contain elements in disparate namespaces, but its operation (at
>> times) only examines the local name (e.g. checking if an element is in a
>> specific scope, popping elements from the stack of open elements until
>> an element with the same tag name...)
>
> Well, as noted in the bug, I don't think we should check the namespace in
> _every_ case. The case in the bug is this:
>
>    <body><table><tr><td><svg><td><foreignObject></td>Foo<foo>
>
> This is clearly invalid; the question is, what <td> did the author mean to
> match, if any? It makes sense to me to match the most recent one. In

Not that I care very much to attempt to support DWIM in this way,
because I think allowing parser implementations to maintain a sane
invariant here is more important, but...

I think it's more likely that the author was being lazy about closing
all the svg tags and simply wanted a quick way to say "I'm done with my
table cell".

> particular, consider these variations:
>
>    <body><table><tr><td><svg><zz><foreignObject></td>Foo<foo>
>    <body><table><tr><td><svg><zz><foreignObject></zz>Foo<foo>
>    <body><table><tr><zz><svg><zz><foreignObject></zz>Foo<foo>
>
>
>
> The cases in the spec that are bogus now are the ones where I mix the two
> kinds of comparison (tag name in one step, element identity in the next).
> Fixing that actually means the opposite kind of change from the one
> proposed above: for example, it would mean changing the "table" end tag
> steps from what they say now (popping an HTML <table> element), to popping
> any "table" element regardless of namespace. This would make the algorithm
> more consistent, and remove the bugs mentioned above.
>
> Is this what people want to do? It's not what you (Adam) implemented, as I
> understand it.
>
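
For comparison, a sketch of what the tag-name-everywhere option could
look like for the tr.innerHTML = '<math><tr><mo></tr>' case quoted
above. This is only my reading of the proposal, not spec text; the
{ ns, name } model and the function name are assumptions, and the scope
boundary is simplified to the HTML <html> and <table> elements.

```javascript
// Hypothetical model: an element is just { ns, name }.
const HTML = "html", MATHML = "math";
const el = (ns, name) => ({ ns, name });

// </tr> handled with tag-name comparisons in every step.
function endTagTrByTagName(stack) {
  const out = stack.slice();
  // Scope check by tag name; HTML <html> and <table> still bound the scope.
  let found = false;
  for (let i = out.length - 1; i >= 0 && !found; i--) {
    const n = out[i];
    if (n.name === "tr") found = true;
    else if (n.ns === HTML && (n.name === "html" || n.name === "table")) break;
  }
  if (!found) return { stack: out, parseError: true }; // token ignored
  // Pop until an element named "tr" has been popped, whatever its namespace.
  let popped;
  do {
    popped = out.pop();
  } while (popped.name !== "tr");
  return { stack: out, parseError: false };
}

// Stack when "</tr>" arrives.
const stack = [el(HTML, "html"), el(MATHML, "math"), el(MATHML, "tr"), el(MATHML, "mo")];
const result = endTagTrByTagName(stack);
console.log(result.stack.map((n) => `${n.ns}:${n.name}`));
// [ 'html:html', 'math:math' ]
```

Here the end tag closes the MathML <tr> it matched by name and the root
stays on the stack.
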
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'