[whatwg] [webcomponents] Template element parser changes => Proposal for adding DocumentFragment.innerHTML

Sat May 5 04:12:57 PDT 2012

[This time from the right email]

On Sat, May 5, 2012 at 3:39 AM, Rafael Weinstein <rafaelw at google.com> wrote:
> Let me back up here and say that I'm fine accomplishing the goal in a
> variety of ways. If this way isn't the best, I'm happy to go another
> way -- I'd just like help understanding the reasons why.
>
> On Fri, May 4, 2012 at 3:26 PM, Ian Hickson <ian at hixie.ch> wrote:
>> On Fri, 4 May 2012, Rafael Weinstein wrote:
>>> On Fri, May 4, 2012 at 2:46 PM, Ian Hickson <ian at hixie.ch> wrote:
>>> > On Fri, 4 May 2012, Rafael Weinstein wrote:
>>> >>
>>> >> This is the current proposal:
>>> >>
>>> >> http://lists.w3.org/Archives/Public/public-webapps/2012AprJun/0334.html
>>> >
>>> > I don't really understand the proposal.
>>> >
>>> > How does it relate to the template feature?
>>>
>>> The contents of <template> need to parse context-free (or implied
>>> context, or whatever). This adds the notion to HTML parsing so that
>>> <template> can use it.
>>>
>>> e.g. <template><tr><td>Foo</td></tr></template>
>>
>> I don't understand how this would work in the parser. The parser doesn't
>> have a "context element" concept, that's only for fragment parsing. If you
>> reset the insertion mode in the parser, it uses the stack of open
>> elements, which would always be a <template> element in this case when
>> you parse the <tr>.
>
> It would essentially be nested fragment parsing. As soon as the tree
> construction encounters a <template>, it goes into a nested fragment
> case. Conceptually, it pushes a DocumentFragment onto the stack of
> open elements, (leaves the tokenizer in the DATA state), then queues
> tokens which cannot change the state (DOCTYPE, endTag, comment,
> character) until it finds the first start tag. It then sets the context
> element, resets the insertion mode appropriately, then processes the
> queued tokens and continues processing from the input stream.
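
The queue-until-first-start-tag step described above can be sketched roughly as follows. This is only an illustration: the token objects, the IMPLIED_CONTEXT table, the function names, and the "body" fallback are all assumptions made for the example, and none of it is taken from the WebKit patch or from the proposal text.

```javascript
// Sketch only: token shape, table contents, and the "body" fallback are
// illustrative assumptions, not the WebKit patch or the proposal text.
const IMPLIED_CONTEXT = new Map([
  ["caption", "table"],
  ["colgroup", "table"],
  ["thead", "table"],
  ["tbody", "table"],
  ["tfoot", "table"],
  ["col", "colgroup"],
  ["tr", "tbody"],
  ["td", "tr"],
  ["th", "tr"],
  ["option", "select"],
  ["optgroup", "select"],
]);

// Look up the assumed context element for a start tag name.
function impliedContextFor(tagName) {
  return IMPLIED_CONTEXT.get(tagName.toLowerCase()) || "body";
}

// Queue tokens until the first start tag, then report the context element
// implied by that tag, the queued tokens, and the tokens still to process.
function resolveContext(tokens) {
  const queued = [];
  for (let i = 0; i < tokens.length; i++) {
    if (tokens[i].type === "startTag") {
      return {
        context: impliedContextFor(tokens[i].name),
        queued,
        rest: tokens.slice(i),
      };
    }
    queued.push(tokens[i]);
  }
  // End of input without any start tag: fall back to the assumed default.
  return { context: "body", queued, rest: [] };
}

const result = resolveContext([
  { type: "character", data: "  " },
  { type: "comment", data: "row" },
  { type: "startTag", name: "tr" },
  { type: "endTag", name: "tr" },
]);
console.log(result.context, result.queued.length);
```

In a real parser the queued tokens would be replayed through tree construction once the insertion mode has been reset for the chosen context; the sketch only returns them.
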
>
> That said, I was intending to focus on DocumentFragment.innerHTML as a
> first step because I think the <template> element is more complicated
> and less certain, so it kind of confuses the issue. I feel confident that
> whatever solution we come up with for this will work for the
> <template> element and the other issues with <template> element are
> orthogonal.
>
>>
>>
>>> > What does it do in the case of:
>>> >
>>> >   var frag = document.createDocumentFragment();
>>> >   frag.innerHTML = 'bla bla .. 1GB of text .. bla <caption> bla' ?
>>>
>>> Queue up pending tokens until you see the first start tag token or the
>>> end of file. The webkit implementation is here:
>>>
>>> https://bugs.webkit.org/attachment.cgi?id=140125&action=review
>>
>> So:
>>
>>   frag.innerHTML = 'bla bla .. 1GB of text .. bla <caption> bla';
>>
>> ...results in a document fragment with one node containing " bla", while:
>>
>>   frag.innerHTML = 'bla bla .. 1GB of text .. bla <caqtion> bla';
>>
>> ...results in a document fragment with a 1GB text node, an unknown element
>> <caqtion>, and another text node?
>>
>> That seems pretty weird.
>
> This isn't introducing the weirdness. It's already in the HTML parser.
>
> Show me any solution that uses the HTML parser and I'll show you input
> that produces weird output.
>
>>
>>
>>> > Why do we imply a tbody if the input is "<tr></tr><div></div>"?
>>>
>>> Because there's nothing better to do.
>>
>> I think almost anything else would be better. :-)
>>
>> In particular, I think having the output be a <tr> element and <div>
>> element as siblings would be better, as would having the output be just a
>
> That is what you get. The output is equivalent to:
>
> document.createElement('tbody').innerHTML = "<tr></tr><div></div>";
>
> which is a <tr> with a <div> nextSibling
>
>> <tr> element or just a <div> element.
>>
>>
>>> > Since you need the context element to know how to initialise the
>>> > tokeniser, how do you find the first tag?
>>>
>>> You always start in the DATA state. Can you think of a case where this
>>> won't work?
>>
>> You describe the change as a "mere addition", but it sounds much more
>> invasive than that if you're going to assume a context element and then
>> change it later.
>>
>> It sounds like what you're really proposing is not to change the context
>> element but to have the parser start off in some new mode where we just
>> wait for the first open tag, and then we do some substitution to get a
>> surrogate node, and try to reset based on that surrogate node's name
>> instead of the stack of open elements.
>
> These seem like subjective evaluations. I defer to your judgement,
> but it'd be helpful to understand the objection at a more concrete
> level.
>
> For example, if there is a technical problem with the approach, what
> is it? Does it introduce inflexibility in extending the parser later?
> Honestly, the parser is pretty dense -- I'm just trying to be the
> midwife of a sensible solution here, and I'm not married at all to
> this approach.
>
> As far as how invasive the change is, I'd like to think that the
> WebKit patch above points to its non-invasiveness, in that it changes
> no logic in the tokenizer, and no construction logic -- only adding
> the notion that the tree construction is currently waiting to know
> what the next start tag is before it can continue construction.
>
>>
>> That seems pretty weird to me, but certainly isn't the weirdest thing
>> that's been proposed.
>>
>> Do we have a page or e-mail somewhere that documents all the cases we're
>> trying to support?
>
> The best write up is Yehuda's initial request:
>
> http://lists.w3.org/Archives/Public/public-webapps/2011OctDec/0663.html
>
> I'm happy to build a complete list of input examples that need to
> produce specific output, if that will be helpful.
>
> Another thing to note that may not be apparent from Yehuda's request
> or the <template> element spec that Dimitri wrote is that, while the
> context element isn't known programmatically at innerHTML or parse
> time, it *is* known by the author of the content. In other words, the
> input markup is always intended to be children of a specific element.
> That's why jQuery implements this exactly the way I'm proposing:
> Generally, the context element will be implied by *all* of the
> top-level start tags -- picking the first one is just a sensible way
> to have deterministic output. I know of no use cases for attempting to
> do something "useful" with input that has a "mixed implied context
> element".
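
For readers unfamiliar with the library-side workaround, picking a context from the first start tag can be sketched at the string level like this. The WRAPPERS table, the regular expression, and both function names are hypothetical and written for this example; this is not jQuery's actual code, only a rough picture of the wrap-then-unwrap idea.

```javascript
// Hypothetical wrapper table: [depth, opening markup, closing markup].
// Contents are assumptions for illustration, not copied from any library.
const WRAPPERS = new Map([
  ["caption", [1, "<table>", "</table>"]],
  ["thead", [1, "<table>", "</table>"]],
  ["tbody", [1, "<table>", "</table>"]],
  ["tr", [2, "<table><tbody>", "</tbody></table>"]],
  ["td", [3, "<table><tbody><tr>", "</tr></tbody></table>"]],
  ["th", [3, "<table><tbody><tr>", "</tr></tbody></table>"]],
  ["option", [1, "<select multiple>", "</select>"]],
]);
const NO_WRAPPER = [0, "", ""];

// Find the name of the first start tag in the markup, or null if none.
// Limitation: a "<" followed by a letter inside a comment or text would
// also match, so this is a heuristic rather than real tokenization.
function firstStartTagName(markup) {
  const match = /<([a-zA-Z][^\s\/>]*)/.exec(markup);
  return match ? match[1].toLowerCase() : null;
}

// Wrap the markup in the ancestors implied by its first start tag and
// report how many levels to descend after parsing to reach the content.
function wrapForParsing(markup) {
  const name = firstStartTagName(markup);
  const [depth, open, close] = (name && WRAPPERS.get(name)) || NO_WRAPPER;
  return { depth, wrapped: open + markup + close };
}

console.log(wrapForParsing("<tr><td>Foo</td></tr>"));
```

In a browser, the `wrapped` string would typically be assigned to a scratch element's innerHTML, after which the caller walks down `depth` levels to collect the intended nodes. That DOM step is left out here so the sketch runs without a document.
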
>
>>
>> --
>> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
>> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
>> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'