[whatwg] [WA1] Formatting elements

Fri Jul 21 19:17:47 PDT 2006

On Wed, 19 Jul 2006, Stewart Brodie wrote:
> > >
> > > I know it's hard to see when written out textually, but note that 
> > > for the text node 'jkl', the I and B elements are the wrong way 
> > > around!
> > 
> > Wrong way with respect to what? They're the "right way" if you look at 
> > the end tags: </b> closes first, so it must be innermost! ;-)
> 
> I disagree because the 'jkl' is the bit I'm interested in here.  Are you
> saying that the desirable tree order is defined only in terms of the closing
> tags rather than the open tags?

No, I'm saying that it doesn't really matter. The content is malformed, so 
what we do with it doesn't really matter -- so long as it is well-defined, 
works with existing content, and isn't an undue burden on implementations 
for the correct case and the common case (if that's not the correct case).

> In the original source, there haven't been any close tags at all at the 
> time the 'jkl' is parsed. Ignoring the other text nodes, the tree is:
> 
> <DIV> <B> <I> <P> jkl
>
> (I don't really like the P being there, though, to be honest).

What would you do instead? (Considering the performance concerns given 
below?)

> At this point, jkl has a logical element hierarchy above it in the DOM 
> tree that matches what was in the original HTML source.  In CSS selector 
> terms, "DIV > B > I".  The subsequent processing of the </B> token 
> causes such a selector to no longer match (it has now changed to "DIV > 
> I > B"):
> 
> <DIV> <B> <I> </I> </B> <P> <I> <B> jkl
> 
> Surely it is reasonable to expect the jkl to retain its ancestry - i.e. 
> be a child of the cloned I, which is a child of the cloned B, regardless 
> of the tag closure (of the B) that's about to occur, which would convert 
> it to ...
> 
> <DIV> <B> <I> </I> </B> <P> <B> <I> jkl </I> </B> <I> (mno...)
> 
> I suppose the root of my concern is how to apply CSS selector matching 
> in a reasonable looking manner to the DOM tree if the parser has 
> reversed the parentage of the formatting elements.

The entire basis of the Adoption Agency algorithm is that in fact the 
ancestry is not kept. I don't know of an alternative that works in as many 
cases. I agree that it isn't optimal, but the problem is that the input is 
ill-formed in the first place, so any attempt to make it into a tree will 
be flawed in some way.

> > It gets more obviously bad to do what Mozilla does when you consider a 
> > case like:
> > 
> >    <b><p>...<p>...<p>...<p>...<p>...<p>...
> > 
> > ...which is very common. With that exact markup, Safari, IE7, and the 
> > spec all end up with the exact same DOM tree (from the <body> down, at 
> > least), and with the same number of element nodes (from <body> down, 
> > 8).
> > 
> > Mozilla ends up with 13 nodes (from the body down). That doesn't scale 
> > -- there are pages with hundreds of nodes like this.
> 
> And it gets much worse if it was all wrapped in a <u> and <em> too. The 
> key is, as you mention in one of the blog entries linked below, that the 
> behaviour differs depending on whether or not the content is well-formed 
> in terms of matching order of start and end tags, or not.

In the Mozilla case, it depends on more than just whether the document is 
well-formed -- it depends on where the TCP packet boundaries lie. This is, 
IMHO, completely unacceptable, far less acceptable than moving the nodes 
around after their birth.
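
As a rough illustration of the scaling concern in the <b><p>...<p>... case
quoted above, the two element counts (8 and 13 for six paragraphs) can be
reproduced by counting nodes in two tree shapes. The shared-B shape
(body > b > p, p, ...) is what 8 corresponds to. The per-paragraph shape below
is only an assumption chosen because it yields 13; it is not a claim about the
exact tree Mozilla builds. All function names are hypothetical.

```python
def count_elements(node):
    """Count element nodes in a (name, children) tree, including the root."""
    _name, children = node
    return 1 + sum(count_elements(child) for child in children)


def shared_b_tree(paragraphs):
    # body > b > p, p, p, ...  (one B shared by every paragraph)
    return ("body", [("b", [("p", []) for _ in range(paragraphs)])])


def per_paragraph_b_tree(paragraphs):
    # ASSUMPTION for illustration: every paragraph carries its own B element.
    return ("body", [("p", [("b", [])]) for _ in range(paragraphs)])


for n in (6, 300):
    print(n,
          count_elements(shared_b_tree(n)),
          count_elements(per_paragraph_b_tree(n)))
```

With n paragraphs the first shape has n + 2 elements and the second 2n + 1, so
the gap grows with every extra paragraph and with every extra wrapping
formatting element.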

> I just don't like the idea of having to detach nodes from the DOM tree 
> once they have been attached.  The current algorithm is to allow any 
> element inside any other (pretty much) until a problem crops up at which 
> point there's a reorganisation required and that requires detachment 
> (almost always).

Right. I'm not a huge fan of it either, but it works (Safari does it), and 
it doesn't have the (IMHO much worse) problems that the other algorithms 
have.

Note that it actually is compatible with non-tree parsing modes (where the 
parser doesn't construct a DOM but instead marks the start and end of each 
tag, with tags possibly overlapping). The handling of broken content in 
<table> tags isn't compatible with such modes. This, to me, is a much worse 
situation to be in, and there we really have no choice (all browsers are 
basically interoperable on that case).
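
To make "marks the start and end of each tag, with tags possibly overlapping"
concrete, here is a minimal sketch of such a non-tree mode. It is not taken
from any browser; the function name, the regular expression, and the rule of
pairing an end tag with the most recent open tag of the same name are all
assumptions made for illustration.

```python
import re

# Deliberately simplistic tag matcher; real HTML tokenisation is far richer.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def mark_ranges(markup):
    """Return (text, ranges); each range is (tag, start, end) in text offsets."""
    text_parts = []
    text_len = 0
    open_tags = []  # (name, start offset) for tags not yet closed
    ranges = []
    pos = 0
    for match in TAG.finditer(markup):
        chunk = markup[pos:match.start()]
        text_parts.append(chunk)
        text_len += len(chunk)
        pos = match.end()
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            open_tags.append((name, text_len))
            continue
        # Pair the end tag with the most recent open tag of the same name,
        # regardless of what was opened in between.
        for i in range(len(open_tags) - 1, -1, -1):
            if open_tags[i][0] == name:
                _, start = open_tags.pop(i)
                ranges.append((name, start, text_len))
                break
    tail = markup[pos:]
    text_parts.append(tail)
    text_len += len(tail)
    # Tags that were never closed run to the end of the text.
    for name, start in open_tags:
        ranges.append((name, start, text_len))
    return "".join(text_parts), ranges


text, ranges = mark_ranges("<b>ab<i>cd</b>ef</i>")
print(text, ranges)
```

For the misnested input above, the B range covers "abcd" and the I range
covers "cdef"; the two simply overlap and no node ever has to be moved.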

> > > The problem here may simply be that appending any node due to 
> > > opening any non-formatting/non-phrasing open tag when in "in body" 
> > > should cause any formatting/phrasing elements to be popped off the 
> > > stack of open elements, and then NOT execute "reconstruct the active 
> > > formatting elements" (because it'll be executed automatically when 
> > > opening the next formatting/phrasing element or text node anyway).
> > 
> > Isn't that already the case? You only reconstruct for inline elements 
> > and text nodes, as far as I can tell.
> 
> No, on both counts.  Firstly, you just append the new node regardless of 
> what's already on the stack; secondly, the algorithm as stated causes 
> the reconstruction to happen for P too.  That may be an error?

I don't understand what you are describing here. Could you explain 
further?

> I'm also wondering about a change of behaviour for the formatting 
> elements that would remove the additional child-less I clone that ends 
> up under the DIV.  This is doable, but it leads to some additional 
> complexity in the handling of the list of active formatting elements.  
> The change would be that an open tag for a,b,big... does NOT reconstruct 
> and does NOT insert an HTML element for the token.  Instead, it creates 
> the node for the token and appends it to the list of active formatting 
> elements.  In other words, its creation is deferred until a suitable 
> point in the future.  Thus a reconstruct would create it (and the node 
> would have to be copied into the stack of open elements rather than a 
> new one created for these cases).  An attempt to remove it from the AFE 
> list would also create it (and then remove it again immediately).  
> However, I think that this may affect the start and end tag handling for 
> many other elements too, so may not be worthwhile - I haven't gone 
> through the idea in detail.

As I understand it, this would break cases like:

   <b> <p> ... </p> </b>

...where you absolutely must have the DOM:

   |
   +-- B
       |
       +-- P

...for compatibility with (all) existing browsers.
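
A toy model of why deferral would change that tree, under one reading of the
proposal: it assumes deferred formatting elements are only created when a text
node arrives, not when a <p> start tag is seen. The builder below is a
hypothetical sketch, not the spec's tree construction algorithm, and the
FORMATTING set is a small illustrative subset.

```python
FORMATTING = {"a", "b", "big", "i"}  # illustrative subset only


def build(tokens, defer_formatting):
    """Build a (name, children) tree from a flat list of tag/text tokens."""
    root = ("#root", [])
    stack = [root]
    pending = []  # formatting elements whose creation has been deferred
    for tok in tokens:
        if tok.startswith("</"):
            name = tok[2:-1]
            if name in pending:
                pending.remove(name)  # closed before it was ever created
            elif any(node[0] == name for node in stack[1:]):
                while stack.pop()[0] != name:
                    pass
            # an end tag with no matching open element is ignored
        elif tok.startswith("<"):
            name = tok[1:-1]
            if defer_formatting and name in FORMATTING:
                pending.append(name)
            else:
                node = (name, [])
                stack[-1][1].append(node)
                stack.append(node)
        else:
            # ASSUMPTION: text is the only trigger that creates deferred nodes.
            for name in pending:
                node = (name, [])
                stack[-1][1].append(node)
                stack.append(node)
            pending = []
            stack[-1][1].append(tok)
    return root


def element_outline(node):
    """Render only the element nesting, e.g. '#root(b(p))'."""
    name, children = node
    inner = [element_outline(c) for c in children if isinstance(c, tuple)]
    return name + ("(" + ",".join(inner) + ")" if inner else "")


tokens = ["<b>", "<p>", "...", "</p>", "</b>"]
print(element_outline(build(tokens, defer_formatting=False)))
print(element_outline(build(tokens, defer_formatting=True)))
```

In this model the eager builder nests P inside B, while the deferred one
creates B inside P. If deferred elements were instead created when the <p>
start tag is processed, the outcome would differ.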

> > BTW while looking at this stuff this page may be of use:
> > 
> >    http://software.hixie.ch/utilities/js/live-dom-viewer/
> 
> Now I'll have to work out why that doesn't work in my browser ;-)

Let me know if it's my fault.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'