[whatwg] Another bug in the HTML parsing spec?

Tue Oct 18 11:28:10 PDT 2011

On 10/17/11 5:47 PM, Ian Hickson wrote:
> On Mon, 17 Oct 2011, David Flanagan wrote:
>> In the HTML spec, "The rules for parsing tokens in foreign content"
>> include an algorithm for "any other end tag".  This is the algorithm at
>> the very end of
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html.
>>
>> I think there are some problems with this algorithm and would appreciate
>> any insight anyone has:
>>
>> 1) Step 3 includes an instruction to jump to the last step in the list
>> of steps.  But the last step begins "Otherwise", which sounds like it is
>> an else clause.  Jumping into an else clause is confusing enough that I
>> wonder if there is an error in the algorithm wording.
> Yeah, that's bogus. The "last step" it's referring to has been removed (it
> used to reset the insertion mode). I've fixed the spec.
Thanks.  With that change, my problem #3 below goes away, as you suspected.
>> 2) I can't get all of the parser tests from html5lib to pass with this
>> algorithm as it is currently written.  In particular, there are 5 tests in
>> testdata/tree-construction/tests9.dat of this basic form:
>>
>> <!DOCTYPE html><body><table><math><mi>foo</mi></math></table>
>>
>> As the spec is written, the <mi> tag is a text integration point, so the "foo"
>> text token is handled like regular content, not like foreign content.
> Oh, my, yeah, that's all kinds of wrong. The text node should be handled
> as if it was in the "in body" mode, not as if it was "in table". I'll have
> to study this closer.
>
> I think this broke when we moved away from using an insertion mode for
> foreign content.
Here's my current workaround:

In 13.2.5, in the rules that decide whether to process a token with the
current insertion mode or as foreign content: if the token goes to the
current insertion mode because the current node is a MathML text
integration point (or an HTML integration point, but I'm not sure about
that), first set a text_integration_mode flag, then invoke the current
insertion mode, then clear the flag.

Then, in the "in table" insertion mode, when a character token arrives
and the text_integration_mode flag is set, process the token using the
"in body" mode; otherwise follow the directions that are there now.

I'm not sure that is the best way to fix the spec, but it works for me, 
in the sense that my parser now passes the tests.
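
A minimal Python sketch of that workaround follows. It is illustrative
only, not spec text or real parser code: the class and method names are
hypothetical, the current insertion mode is hard-wired to "in table",
and the current node is reduced to a three-valued string.

```python
class ParserSketch:
    """Toy model of the text_integration_mode workaround (names are hypothetical)."""

    def __init__(self):
        self.text_integration_mode = False
        self.handled_by = []  # (character, rule set that handled it)

    def dispatch(self, ch, current_node):
        # current_node is "html", "integration_point" or "foreign".
        if current_node == "foreign":
            self.handled_by.append((ch, "foreign content"))
        elif current_node == "integration_point":
            # Set the flag, invoke the current insertion mode, clear the flag.
            self.text_integration_mode = True
            try:
                self.in_table(ch)
            finally:
                self.text_integration_mode = False
        else:
            self.in_table(ch)

    def in_table(self, ch):
        # Assumption: "in table" is the current insertion mode throughout.
        if self.text_integration_mode:
            self.in_body(ch)
        else:
            self.handled_by.append((ch, "in table"))

    def in_body(self, ch):
        self.handled_by.append((ch, "in body"))


parser = ParserSketch()
parser.dispatch("f", "integration_point")  # e.g. text inside <mi> under <table>
parser.dispatch("x", "html")               # ordinary text directly in <table>
```

The try/finally is only there so the flag cannot stay set if the
insertion mode raises; the workaround itself is just set, invoke, clear.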

     David

> Henri, do you know how Gecko gets this right currently?
>
>
>> The workaround I've found (I'm not confident that this is the correct
>> workaround) is to change step 3 of the algorithm so that it only pops
>> the stack if there is no pending table text.  Another potential
>> workaround is to use the existence of pending table text as a condition
>> for sending tokens to the regular insertion mode rather than treating
>> them as foreign content.
> We shouldn't be ending up with pending table text here at all. It should
> go straight into the mi element.
>
>
>> 3) In this set of tests
>> http://code.google.com/p/html5lib/source/browse/testdata/tree-construction/webkit01.dat
>> there is this test:
>>
>> <math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>
>>
>> When the first </mrow> tag is parsed, it is handled as foreign content,
>> and gets popped off the stack in step 3. Then, the token is reprocessed
>> in body mode.  It is treated in the "any other end tag" case.  Since the
>> top of the stack happens to be another mrow tag, that one gets popped
>> too.  (Other tests don't fail here because they don't happen to have two
>> of the same tags on the stack).  This means that the <mi> element ends
>> up as a child of the <math> element instead of the outer <mrow> element.
> That should be fixed with the updated spec text now, right?
>
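
For reference, the double pop described in problem 3 can be reproduced
with a toy model of the stack of open elements. This is a simplified
sketch under stated assumptions, not the spec algorithm: the function
name and the reprocess_in_body parameter are hypothetical, elements are
plain tag-name strings, and the "in body" handling is reduced to "pop
up to the nearest matching element".

```python
def any_other_end_tag(stack, tag, reprocess_in_body):
    """Toy model of a foreign-content end tag; mutates and returns stack."""

    def pop_to_match():
        # Walk down from the current node; pop up to and including a match.
        for i in range(len(stack) - 1, -1, -1):
            if stack[i] == tag:
                del stack[i:]
                return True
        return False

    if pop_to_match() and reprocess_in_body:
        # Old wording as described in problem 3: the same end tag is then
        # handled again by the in-body "any other end tag" case.
        pop_to_match()
    return stack


# Stack when the first </mrow> arrives (the <mn> element is already closed).
old = any_other_end_tag(["math", "mrow", "mrow"], "mrow", reprocess_in_body=True)
new = any_other_end_tag(["math", "mrow", "mrow"], "mrow", reprocess_in_body=False)
```

With reprocessing, the outer mrow is popped as well, so the following
<mi> would be inserted under <math>; without it, the outer mrow stays
on the stack.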