[whatwg] <!DOCTYPE html><body><table><math><mi>foo</mi></math></table> and other parser questions

Wed Dec 14 10:58:31 PST 2011

On Tue, Dec 13, 2011 at 2:32 PM, Ian Hickson <ian at hixie.ch> wrote:
> On Mon, 12 Dec 2011, Adam Barth wrote:
>> I'm trying to understand how the HTML parsing spec handles the following case:
>>
>> <!DOCTYPE html><body><table><math><mi>foo</mi></math></table>
>>
>> According to the html5lib test data, we should parse that as follows:
>>
>> | <!DOCTYPE html>
>> | <html>
>> |   <head>
>> |   <body>
>> |     <math math>
>> |       <math mi>
>> |         "foo"
>> |     <table>
>>
>> However, I'm not sure whether that's what the spec actually does.
>>
>> Consider the point at which we parse the "f" character token (from "foo").
>>  The insertion mode will be "in table".  The spec will execute as
>> follows:
>>
>> -> If the current node is a MathML text integration point and the
>> token is a character token
>>   * Process the token according to the rules given in the section
>> corresponding to the current insertion mode in HTML content.
>>
>> -> A character token
>>   * Let the pending table character tokens be an empty list of tokens.
>>   * Let the original insertion mode be the current insertion mode.
>>   * Switch the insertion mode to "in table text" and reprocess the token.
>>
>> -> Any other character token
>>   * Append the character token to the pending table character tokens list.
>>
>> ... the "o" and "o" will be processed similarly and end up in the
>> pending table character tokens list.
>>
>> Now, consider the </mi> token.  We're still at a MathML text
>> integration point, but the current token is neither a start token
>> (with certain names) nor a character token, so we process the token
>> according to the rules given in the section for parsing tokens in
>> foreign content.
>>
>> -> Any other end tag
>>   * Run these steps:
>>     ...
>>
>> The net result of which is popping the stack of open elements, but not
>> flushing out the pending table character tokens list.  The list will
>> eventually be flushed when we process the </table> token, resulting in
>> these character tokens getting foster parented:
>>
>> | <!DOCTYPE html>
>> | <html>
>> |   <head>
>> |   <body>
>> |     <math math>
>> |       <math mi>
>> |     "foo"
>> |     <table>
>
> On Tue, 18 Oct 2011, David Flanagan wrote:
>>
>> Here's my current workaround:
>>
>> In 13.2.5, in the rules for whether to use the current insertion mode or
>> to insert the token as foreign content, if the token is being inserted
>> because the current node is a math (or HTML, but I'm not sure about
>> that) integration point, then first set a text_integration_mode flag,
>> then invoke the current insertion mode, then clear the flag.
>>
>> And in the in table insertion mode, when a character token is inserted,
>> and the text_integration_mode flag is set, then just process the token
>> using in body mode, and otherwise follow the directions that are there
>> now.
>>
>> I'm not sure that is the best way to fix the spec, but it works for me,
>> in the sense that my parser now passes the tests.
>
> I think the real problem is that there's no need to go into the "table
> text" mode if the current node is not a table model element. So I've
> changed the spec at that point.
>
> Please let me know if that doesn't fix the test case or causes any other
> regressions.

That fix seems to work great.
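
For anyone following along, here is a rough Python sketch of the difference
between the old and the changed rule for character tokens in "in table".
It is a toy model, not the spec algorithm and not html5lib code: the
function text_parents, the check_current_node flag, and the (kind, value)
token tuples are all made up for illustration, and the only state tracked
is the stack of open elements and the pending table character tokens list.
The set of table model elements is my reading of "table model element"
(table, tbody, tfoot, thead, tr).

```python
# Toy model only: all names here are hypothetical, not spec or html5lib APIs.
TABLE_MODEL_ELEMENTS = {"table", "tbody", "tfoot", "thead", "tr"}


def text_parents(tokens, check_current_node):
    """Return {parent: text} for a token stream parsed inside <body><table>.

    check_current_node=False models the old rule (always buffer characters);
    check_current_node=True models the changed rule (buffer only when the
    current node is a table model element).
    """
    stack = ["body", "table"]  # stack of open elements, current node last
    pending = []               # pending table character tokens
    placed = {}                # where each run of text ended up

    def insert_text(text):
        current = stack[-1]
        # Text aimed at a table model element is foster parented instead.
        if current in TABLE_MODEL_ELEMENTS:
            parent = "foster parent of table"
        else:
            parent = current
        placed[parent] = placed.get(parent, "") + text

    for kind, value in tokens:
        if kind == "char":
            if check_current_node and stack[-1] not in TABLE_MODEL_ELEMENTS:
                insert_text(value)      # changed rule: no "in table text" detour
            else:
                pending.append(value)   # buffer, as "in table text" does
            continue
        # Tag tokens reach the insertion mode (and so flush the buffer) only
        # when the current node is an HTML table element; inside <math> they
        # go through the foreign content rules and leave the buffer alone.
        if stack[-1] in TABLE_MODEL_ELEMENTS and pending:
            insert_text("".join(pending))
            pending.clear()
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
    return placed


# Tokens following <!DOCTYPE html><body><table> in the test case.
tokens = [("start", "math"), ("start", "mi"),
          ("char", "f"), ("char", "o"), ("char", "o"),
          ("end", "mi"), ("end", "math"), ("end", "table")]

print(text_parents(tokens, check_current_node=False))
# {'foster parent of table': 'foo'}
print(text_parents(tokens, check_current_node=True))
# {'mi': 'foo'}
```

With the check in place, "foo" lands inside the mi element, which matches
the html5lib expectation quoted above; without it, the text stays buffered
until </table> and is foster parented.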

Thanks!
Adam