[imps] Problem with the tree-construction test cases and implied body

Tue Sep 11 03:24:41 PDT 2007

2007/9/11, Anne van Kesteren:
> On Tue, 11 Sep 2007 10:00:55 +0200, Thomas Broyer <t.broyer at gmail.com>
> wrote:
> > FYI, I've fixed it in Twintsam by testing for "head" in addition to
> > "body" in the EOF case of the main phase. The spec could read (changes
> > marked with <ins>):
>
> FWIW, I would like the specification to reflect html5lib where we did away
> with insertion modes and turned them all into phases (as the note in the
> specification suggests). I don't feel too strongly about it, but I think
> it would make the specification easier to read and maybe also more
> straightforward to implement.

Well, having separate phases and insertion modes allows for switching
from any phase back to the main phase without losing the insertion
mode (for instance, I implemented the "general CDATA/RCDATA parsing
algorithm" as an additional phase) and without having to deal with
storing the "phase you were in when you were switched to the XXX
phase", which doesn't make the specification easier to read (YMMV).
There's such a "switch back to the attribute value state that you were
in when you were switched into this state" in the tokenisation section
which is a bit of a mess: why doesn't the "consume an entity"
algorithm deal with the "if nothing is returned" case itself, so that
the "entity in attribute value" and "entity data" states can just go
away?
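
To illustrate the point about keeping the phase and the insertion mode
separate, here is a minimal Python sketch. It is not Twintsam or
html5lib code; the TreeBuilder class and its method names are
hypothetical and only show that leaving an extra phase needs no saved
"previous phase" when the insertion mode lives in its own field.

```python
from dataclasses import dataclass


@dataclass
class TreeBuilder:
    # Hypothetical names, for illustration only.
    phase: str = "main"
    insertion_mode: str = "before head"

    def enter_cdata_rcdata(self) -> None:
        # Only the phase changes; the insertion mode is left untouched.
        self.phase = "cdata/rcdata"

    def leave_cdata_rcdata(self) -> None:
        # Going back to the main phase needs no stored "previous phase",
        # because the insertion mode was never overwritten.
        self.phase = "main"


tb = TreeBuilder(insertion_mode="in head")
tb.enter_cdata_rcdata()
phase_inside = tb.phase
tb.leave_cdata_rcdata()
```

If phases and insertion modes were merged into a single field, the
enter step would overwrite "in head" and something would have to
remember it.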

On the other hand, adapting the "global" EOF case in the main phase to
always build head and body elements is trivial, at least for the head,
since we have a "head element pointer". It's a bit less easy for the
body because of the body/frameset duality, but it could be solved by
just looking at the insertion mode: the insertion mode is never
switched back to "before head", "in head" or "after head" (there are
only "process as if we were in the XXX insertion mode" instructions),
so if, at EOF, the insertion mode is one of these three values, it
means the tree has no body or frameset element, and we can safely
append a body element without attributes to the root node.

Proposed wording:
<<<
An end-of-file token:

    Generate implied end tags.

    If there are more than two nodes on the stack of open elements, or
if there are two nodes but the second node is not a head node or a
body node, this is a parse error.

    Otherwise, if the parser was originally created as part of the
HTML fragment parsing algorithm, and there's more than one element in
the stack of open elements, and the second node on the stack of open
elements is not a head node or a body node, then this is a parse
error. (fragment case)

    <ins>
    If the head element pointer is null, create an element node with
the tag name "head" and append it to the first element in the stack of
open elements (the html element).

    If the insertion mode is one of "before head", "in head", "in head
noscript" or "after head", create an element node with the tag name
"body" and append it to the first element in the stack of open
elements (the html element).
    </ins>

    Stop parsing.
>>>
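
The two inserted steps above can be sketched as follows. This is only
an illustration under simplifying assumptions: the Element class and
the finish_at_eof function are hypothetical, the parse-error checks and
the fragment case are left out, and insertion modes are plain strings.

```python
class Element:
    """Hypothetical minimal element node: a tag name and child nodes."""

    def __init__(self, name):
        self.name = name
        self.children = []


# Modes in which no body or frameset element can have been created yet.
MODES_WITHOUT_BODY = {"before head", "in head", "in head noscript", "after head"}


def finish_at_eof(html, head_pointer, insertion_mode):
    """Apply the two <ins> steps and return the head element pointer."""
    if head_pointer is None:
        # No head element was ever created: imply one.
        head_pointer = Element("head")
        html.children.append(head_pointer)
    if insertion_mode in MODES_WITHOUT_BODY:
        # Neither body nor frameset exists: imply an attribute-less body.
        html.children.append(Element("body"))
    return head_pointer


# Empty document: both head and body get implied.
empty_doc = Element("html")
finish_at_eof(empty_doc, None, "before head")
empty_names = [child.name for child in empty_doc.children]

# Document that already reached "in body": nothing is added.
full_doc = Element("html")
existing_head = Element("head")
full_doc.children.extend([existing_head, Element("body")])
finish_at_eof(full_doc, existing_head, "in body")
full_names = [child.name for child in full_doc.children]
```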

It could also be solved with "act as if a XXX token with the tag name
YYY and no attributes had been seen and reprocess the current token"
(which would be more accurate given that the argument of not
generating a parse error is that head, body and html start and end
tags are optional):
<<<
    If the insertion mode is "before head", act as if a start tag
token with the tag name "head" and no attributes had been seen and
reprocess the current token.

    Otherwise, if the insertion mode is "in head noscript", act as if
an end tag token with the tag name "noscript" had been seen and
reprocess the current token.

    Otherwise, if the insertion mode is "in head" or "after head", act
as if a start tag token with the tag name "body" and no attributes had
been seen and reprocess the current token.
>>>
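
A rough sketch of how this "act as if ... and reprocess" variant plays
out at EOF, again with hypothetical names. The mode transitions in the
code are simplified assumptions on my part (an implied head start tag
leads to "in head", an implied noscript end tag leads back to "in
head", and an implied body start tag leads to "in body"); a real tree
builder would get them from processing the implied token normally.

```python
def implied_tokens_at_eof(insertion_mode):
    """Return the (kind, tag name) tokens implied before parsing stops."""
    tokens = []
    while True:
        if insertion_mode == "before head":
            tokens.append(("start", "head"))
            insertion_mode = "in head"  # assumed transition
        elif insertion_mode == "in head noscript":
            tokens.append(("end", "noscript"))
            insertion_mode = "in head"  # assumed transition
        elif insertion_mode in ("in head", "after head"):
            tokens.append(("start", "body"))
            insertion_mode = "in body"  # assumed transition
        else:
            # Any other mode: nothing left to imply, stop parsing.
            return tokens
        # Falling through to the next iteration is the "reprocess the
        # current token" step: EOF is looked at again in the new mode.


from_before_head = implied_tokens_at_eof("before head")
from_noscript = implied_tokens_at_eof("in head noscript")
from_in_body = implied_tokens_at_eof("in body")
```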

No need to duplicate the whole thing into the fifteen insertion modes
with only small variations in four of them.

N.B.: there probably needs to be some special handling for the
"fragment case", in which, I suppose, the head element shouldn't
always be implied.

-- 
Thomas Broyer