[whatwg] [WA1] Formatting elements

Wed Jul 19 03:55:11 PDT 2006

Ian Hickson <ian at hixie.ch> wrote:

> On Mon, 17 Jul 2006, Stewart Brodie wrote:
> > 
> > I tried dry-running the algorithm for handling mis-nested formatting 
> > elements, but I ended up with a tree that looked very odd.  I can't 
> > believe that the output I ended up with is what the desired result of 
> > the algorithm is, so there is a mistake somewhere: either in my 
> > execution of the algorithm or in the algorithm itself.  I took the 
> > following fragment of HTML:
> > 
> > <DIV> abc <B> def <I> ghi <P> jkl </B> mno </I> pqr </P> stu
> >
> > the result I ended up with was equivalent to:
> > 
> > <DIV> abc <B> def <I> ghi </I> </B> <I> </I> <P> <I> <B> jkl </B> mno
> > </I> pqr </P> stu </DIV>
> 
> Looks right.  With that as input, my implementation outputs:
> 
>    5: Parse error: missing document type declaration.
>    38: Parse error: mismatched b element end tag (misnested tags).
>    47: Parse error: mismatched i element end tag (misnested tags).
>    57: Parse error: mismatched body element end tag (premature end of 
>    file?).
>    <html><head></head><body><div> abc <b> def <i> ghi 
>    </i></b><i></i><p><i><b> jkl </b> mno </i> pqr </p> 
>    stu</div></body></html>

Good - we do end up with exactly the same thing.

> > I know it's hard to see when written out textually, but note that for 
> > the text node 'jkl', the I and B elements are the wrong way around!
> 
> Wrong way with respect to what? They're the "right way" if you look at the
> end tags: </b> closes first, so it must be innermost! ;-)

I disagree, because the 'jkl' is the bit I'm interested in here.  Are you
saying that the desirable tree order is defined only in terms of the closing
tags rather than the open tags?  In the original source, there haven't been
any close tags at all at the time the 'jkl' is parsed.  Ignoring the other
text nodes, the tree at that point is:

<DIV> <B> <I> <P> jkl

(I don't really like the P being there, though, to be honest).  At this
point, jkl has a logical element hierarchy above it in the DOM tree that
matches what was in the original HTML source.  In CSS selector terms, "DIV >
B > I".  The subsequent processing of the </B> token causes such a selector
to no longer match (it has now changed to "DIV > I > B"):

<DIV> <B> <I> </I> </B> <P> <I> <B> jkl

Surely it is reasonable to expect the jkl to retain its ancestry - i.e. be a
child of the cloned I, which is a child of the cloned B, regardless of the
tag closure (of the B) that's about to occur, which would convert it to ...

<DIV> <B> <I> </I> </B> <P> <B> <I> jkl </I> </B> <I> (mno...)

I suppose the root of my concern is how to apply CSS selector matching in a
reasonable looking manner to the DOM tree if the parser has reversed the
parentage of the formatting elements.
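
To make the reversal concrete, here is a small Python sketch that compares
the ancestor chain of the 'jkl' text node in the two trees quoted in this
message.  The tuple representation and the helper names (ancestors_of,
nesting_order) are made up purely for illustration and have nothing to do
with the spec text.

```python
# Hypothetical representation: an element is (tag, [children]);
# a text node is a plain string.

def ancestors_of(tree, text, path=()):
    """Return the element names from the root down to the text node, or None."""
    tag, children = tree
    path = path + (tag,)
    for child in children:
        if isinstance(child, str):
            if child.strip() == text:
                return path
        else:
            found = ancestors_of(child, text, path)
            if found is not None:
                return found
    return None

def nesting_order(chain, outer, inner):
    """True if `outer` appears above `inner` in the ancestor chain."""
    return (outer in chain and inner in chain
            and chain.index(outer) < chain.index(inner))

# Tree produced by the algorithm as currently specified.
spec_tree = ("div", [
    " abc ",
    ("b", [" def ", ("i", [" ghi "])]),
    ("i", []),
    ("p", [("i", [("b", [" jkl "]), " mno "]), " pqr "]),
    " stu",
])

# Tree that Firefox builds for the same input.
firefox_tree = ("div", [
    " abc ",
    ("b", [" def ", ("i", [" ghi "])]),
    ("p", [("b", [("i", [" jkl "])]), ("i", [" mno "]), " pqr "]),
    " stu",
])

spec_chain = ancestors_of(spec_tree, "jkl")
firefox_chain = ancestors_of(firefox_tree, "jkl")
print(spec_chain)     # ('div', 'p', 'i', 'b')
print(firefox_chain)  # ('div', 'p', 'b', 'i')
print(nesting_order(spec_chain, "b", "i"))     # False: source order reversed
print(nesting_order(firefox_chain, "b", "i"))  # True: source order kept
```

A descendant selector such as "B I" would therefore match an ancestor pair
of 'jkl' in the Firefox tree but not in the tree the algorithm produces.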

> The point is this is error-correction logic, there is no "right way" 
> (well, until the spec is a standard, I guess).

Indeed, I suspect that it may not be possible to define the one true way in
a manner that satisfies all content.

> > It all seems to start going wrong for me in step 7 of the algorithm.  
> > During the handling of the </B> tag, the clone of I gets created and 
> > that's the node that ends up being the childless I node that has the DIV
> > as its parent (during step 5 of handling the </I> tag when the I is 
> > cloned for a second time to be the child of the P and adopt the original
> > children of the P).  Firefox generates what I think I would expect and 
> > prefer:
> > 
> > <DIV> abc <B> def <I> ghi </I> </B> <P> <B> <I> jkl </I> </B> <I> mno
> > </I> pqr </P> stu </DIV>
> 
> It's the same number of tags, in this case.
> 
> It gets more obviously bad to do what Mozilla does when you consider a 
> case like:
> 
>    <b><p>...<p>...<p>...<p>...<p>...<p>...
> 
> ...which is very common. With that exact markup, Safari, IE7, and the spec
> all end up with the exact same DOM tree (from the <body> down, at least), 
> and with the same number of element nodes (from <body> down, 8).
> 
> Mozilla ends up with 13 nodes (from the body down). That doesn't scale -- 
> there are pages with hundreds of nodes like this.

And it gets much worse if it is all wrapped in a <u> and an <em> too.  The key
point, as you mention in one of the blog entries linked below, is that the
behaviour differs depending on whether or not the start and end tags in the
content are correctly nested.

> > For comparison, Internet Explorer 6 on the other hand treats the P no
> > differently to the B or I and ends up with:  <DIV> abc <B> def <I> ghi
> > <P> jkl </P> </I> </B> <I> <P> mno </P> </I> <P> pqr </P> stu </DIV>
> 
> Actually IE has only one P element (and only one B and only one I). Look 
> closer and you'll find that the P element isn't closed -- it's just that 
> the "mno" and "pqr" text nodes' parentNodes point to the P, while the DIV 
> element's childNodes array actually also mentions those text nodes. Yes, 
> IE generates DOM trees that aren't trees. See also:
> 
>    http://ln.hixie.ch/?start=1037910467&count=1
>    http://ln.hixie.ch/?start=1138169545&count=1
>    http://ln.hixie.ch/?start=1137740632&count=1
>    http://ln.hixie.ch/?start=1026485588&count=1
>    http://ln.hixie.ch/?start=1137799947&count=1

Yes, I have already read many of your blog entries on this topic.  I got the
impression that some of the behavioural discrepancies you were discovering
were driving you somewhat mad, which was in itself very amusing :-))  I
suspect that that is just the sort of amusement that can only really be
shared by those of us who have actually had to implement HTML parsers to
parse real-world web content in a similar-enough way to the most widely used
desktop web browsers.

I just don't like the idea of having to detach nodes from the DOM tree once
they have been attached.  The current algorithm allows (pretty much) any
element inside any other until a problem crops up, at which point a
reorganisation is required, and that almost always requires detachment.

> > The problem here may simply be that appending any node due to opening 
> > any non-formatting/non-phrasing open tag when in "in body" should cause 
> > any formatting/phrasing elements to be popped off the stack of open 
> > elements, and then NOT execute "reconstruct the active formatting 
> > elements" (because it'll be executed automatically when opening the next
> > formatting/phrasing element or text node anyway)
> 
> Isn't that already the case? You only reconstruct for inline elements and 
> text nodes, as far as I can tell.

No, on both counts.  Firstly, you just append the new node regardless of
what's already on the stack; secondly, the algorithm as stated causes the
reconstruction to happen for P too.  That may be an error?

I'm also wondering about a change of behaviour for the formatting elements
that would remove the additional child-less I clone that ends up under the
DIV.  This is doable, but it leads to some additional complexity in the
handling of the list of active formatting elements.  The change would be
that an open tag for a,b,big... does NOT reconstruct and does NOT insert an
HTML element for the token.  Instead, it creates the node for the token and
appends it to the list of active formatting elements.  In other words, its
creation is deferred until a suitable point in the future.  Thus a
reconstruct would create it (and the node would have to be copied into the
stack of open elements rather than a new one created for these cases).  An
attempt to remove it from the AFE list would also create it (and then remove
it again immediately).  However, I think that this may affect the start and
end tag handling for many other elements too, so may not be worthwhile - I
haven't gone through the idea in detail.
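
To make the idea slightly more concrete, here is a toy sketch in Python.
It is only an illustration under heavy assumptions: the names (ToyBuilder,
Entry, _suspend_formatting) are invented, only b and i count as formatting
elements, deferred creation is combined with popping formatting elements
off the stack when a block element opens, every end tag is assumed to have
a matching open element, and scoping, markers, attributes and the rest of
the real algorithm are ignored.

```python
class Element:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Element instances or text strings

    def serialize(self):
        inner = "".join(c if isinstance(c, str) else c.serialize()
                        for c in self.children)
        return f"<{self.tag}>{inner}</{self.tag}>"


class Entry:
    """One item in the list of active formatting elements (AFE)."""
    def __init__(self, tag):
        self.tag = tag
        self.node = None  # None means "not yet created in the tree"


class ToyBuilder:
    FORMATTING = {"b", "i"}  # assumption: a tiny subset of a, b, big...

    def __init__(self, root_tag):
        self.root = Element(root_tag)
        self.stack = [self.root]  # stack of open elements
        self.afe = []             # list of active formatting elements

    def _insert(self, tag):
        node = Element(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)
        return node

    def _reconstruct(self):
        # Create every deferred formatting element, outermost first.
        for entry in self.afe:
            if entry.node is None:
                entry.node = self._insert(entry.tag)

    def _suspend_formatting(self):
        # Pop open formatting elements; their AFE entries become deferred.
        while self.stack[-1].tag in self.FORMATTING:
            self.stack.pop()
        for entry in self.afe:
            entry.node = None

    def start(self, tag):
        if tag in self.FORMATTING:
            self.afe.append(Entry(tag))  # deferred: nothing inserted yet
        else:
            self._suspend_formatting()
            self._insert(tag)

    def text(self, data):
        self._reconstruct()
        self.stack[-1].children.append(data)

    def end(self, tag):
        if tag in self.FORMATTING:
            matches = [i for i, e in enumerate(self.afe) if e.tag == tag]
            if not matches:
                return  # stray end tag: ignored in this toy
            index = matches[-1]
            self._reconstruct()  # an empty element still gets created
            target = self.afe[index].node
            while self.stack.pop() is not target:
                pass
            del self.afe[index]
            for entry in self.afe[index:]:
                entry.node = None  # inner elements closed too; recreate lazily
        else:
            self._suspend_formatting()
            while self.stack.pop().tag != tag:
                pass


# <DIV> abc <B> def <I> ghi <P> jkl </B> mno </I> pqr </P> stu
builder = ToyBuilder("div")
builder.text(" abc ")
builder.start("b"); builder.text(" def ")
builder.start("i"); builder.text(" ghi ")
builder.start("p"); builder.text(" jkl ")
builder.end("b");   builder.text(" mno ")
builder.end("i");   builder.text(" pqr ")
builder.end("p");   builder.text(" stu")
result = builder.root.serialize()
print(result)
```

For the fragment discussed above, this toy yields a tree without the
childless I, and it happens to be the same tree that Firefox builds.  That
says nothing about how such a scheme would behave on other content.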

> BTW while looking at this stuff this page may be of use:
> 
>    http://software.hixie.ch/utilities/js/live-dom-viewer/

Now I'll have to work out why that doesn't work in my browser ;-)

-- 
Stewart Brodie
Software Engineer
ANT Software Limited