[whatwg] [wa1] Status of tree construction section

Wed Jul 12 03:32:53 PDT 2006

Ian Hickson <ian at hixie.ch> wrote:

> On Mon, 10 Jul 2006, Stewart Brodie wrote:
> > 
> > In the main phase, section 'If the insertion mode is "in row"', the last
> > option for 'anything else' says "process ... as if ... in table".  I 
> > think that should say "as if ... in table body" instead.  That case will
> > re-throw the token out to "in table" in any case if it doesn't handle 
> > it.
> 
> There'd be no difference. Any token that isn't handled by the "row" mode 
> will not be handled by the "table body" mode.

Yes, I noticed that and agree, except that it just seemed to me that it
would be more natural to expect unhandled things to be thrown to the next
level of scope (table body) rather than bypassing it and going directly to
the table.

> > I've come to the conclusion that you need pictures to accompany the 
> > "adoption agency algorithm".  However, I'm not an artist.  Indeed, I'm 
> > so bad at drawing pictures, that in the past, users often sent me 
> > replacement bitmap graphics for my programs because they found my 
> > attempts so distressing :-)
> 
> Yeah, I completely agree. Diagrams and examples. If someone wants to do a 
> diagram here I'd be most happy. Failing that, I'll probably get around to 
> it in due course (e.g. once I'm convinced it actually works).

It is the most complex part of the tree construction.  Perhaps in lieu of
pictures in the short term, a short non-normative summary could be added
describing what the algorithm is doing, because reverse engineering it from
the 14-step plan is hard.

> I think I radically rewrote that step since you last looked at it, because
> your comment above doesn't match the current text. I found a massive 
> glaring bug in the algorithm about a week or two ago that I fixed which 
> required a big rewrite of step 1 of that algorithm, so that may be why.

Yes, OK.  Now that my company's firewall allows it, I can get at the history
and update it all.  The current text (12 July) is much better.

> > [Suggested common code]
> 
> What would this replace in the current text?

The now-removed part of the old step 1 of the adoption agency algorithm, so
it's no longer relevant.

> > The "parsing quirks" box lists several issues that I think are
> > important. The <script> one in particular is so very common.
> > Unfortunately, I had to cave in eventually and support that because it
> > broke some customers' own sites.
> 
> Can you describe what exactly the quirk is? I have yet to see an
> algorithmic description of how to parse <script> blocks in quirks mode. In
> my research and the research that other people have done, it was found
> that every UA does it slightly differently. This is why I'd really rather
> not do this.  If you can tell me exactly what it is, I might be more
> convinced to do it.

Yes, it's hard to pin down.  In effect, it's a new value for the content
model flag which is like some sort of combination of RCDATA and PLAINTEXT. 
I'm not sure it's just a quirk, to be honest.  I've tried the following
snippet in Firefox, Opera & IE6 and they behave the same way regardless of
the presence of a strict HTML4 doctype declaration before the <html> tag:

<html><title>The <!-- comment with a </title> in the --> title</title><body
onload="document.body.appendChild(document.createTextNode(document.title))">

In all cases, the window title and the text shown in the document body was:
  The <!-- comment with a </title> in the --> title

The same behaviour appears to apply to TEXTAREA, SCRIPT, NOSCRIPT, NOFRAMES,
NOEMBED.  STYLE works differently in Firefox (it thinks that the content
property's value terminates the style tag):

  <style> <!-- h1:after { content: '</style>'; color: red } --> </style>

The rule seems to be that whilst you are lexing the contents of one of these
magical elements, you have an additional flag, initialised to false, that
indicates that you are inside a pseudo-comment.  You continue to accumulate
character tokens, but if you see the sequence <!--, you set the flag to
true, and if you see the sequence -->, you set the flag to false.  Whilst
the flag is true, finding the < does not switch to the open tag state.  The
character tokens are all accumulated into the content of the element,
regardless of whether they match the <!-- and --> markers.
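
A minimal sketch of the rule described above, in Python.  It is not taken
from any UA's source; the function name lex_magic_content, its parameters and
the treatment of a missing ">" are assumptions made for illustration only.

```python
from typing import Tuple


def lex_magic_content(text: str, tag_name: str) -> Tuple[str, str]:
    """Hypothetical helper: `text` starts just after the open tag.

    Returns (element content, input remaining after the close tag).
    """
    in_pseudo_comment = False          # the additional flag, initialised to false
    close_tag = "</" + tag_name.lower()
    content = []
    i = 0
    while i < len(text):
        if text.startswith("<!--", i):
            in_pseudo_comment = True
            content.append("<!--")     # the markers stay in the content
            i += 4
        elif text.startswith("-->", i):
            in_pseudo_comment = False
            content.append("-->")
            i += 3
        elif (not in_pseudo_comment
              and text[i:i + len(close_tag)].lower() == close_tag
              and text[i + len(close_tag):i + len(close_tag) + 1] in (">", " ", "\t", "\n", "/", "")):
            # A real close tag: skip to its ">" (assumption: if there is
            # no ">", the rest of the input is swallowed).
            end = text.find(">", i)
            rest = "" if end == -1 else text[end + 1:]
            return "".join(content), rest
        else:
            content.append(text[i])    # ordinary character token
            i += 1
    return "".join(content), ""        # ran out of input before a close tag


title, rest = lex_magic_content(
    "The <!-- comment with a </title> in the --> title</title><body>", "title")
print(title)
```

Fed the TITLE snippet from above, this yields the same string that the three
browsers displayed, because the </title> inside the pseudo-comment is treated
as ordinary characters.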

> > I have come across never-opened </br> and </p> too.
> 
> I'm currently doing a study to determine how common these are. Preliminary
> results suggest they are indeed far too common to be left out.
> 
> 
> > I've never heard of <% ... %> before.  Sometimes, it's really quite 
> > depressing the rubbish that people (and programs!) write out.
> 
> Seriously.

No, I've never come across it.  Co-workers tell me it's some sort of
server-side thing, which would explain why I've never come across it.

> > I spent a long time trying to work out what I needed to store for each 
> > entry on both the stack of open elements and the list of active 
> > formatting elements.  I think it should be stated up front because this 
> > is often an area of confusion, in my experience.  I frequently get upset
> > with co-workers over misuse of the terms "element", "tag" and "node", 
> > for example :-)
> 
> What do you think you have to store? I'm not sure what answer would 
> satisfy you here.

I think it's probably more confusion on my part.  I actually use two
separate data structures whilst building the tree: a stack of tag names and
the DOM tree itself.  The tree building code itself expects a series of
callbacks that describe a well-formed tree, and it constructs the DOM tree
using little more than the current node for its state (it handles a stream
of tokens from the XML parser too).  The HTML parser delivers its tokens to
this intermediate code that attempts to re-order things into a well-formed
tree and then pass them on.  For the most part that code doesn't need to
deal with node objects at all - just the tag names.  To implement the
algorithms in WA1, I think they have to be much less independent.
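
A rough sketch of that two-layer arrangement, in Python.  All class and
method names are hypothetical, and the recovery policy shown (ignore a
never-opened end tag, otherwise close everything down to the matching name)
is an assumption for illustration, not a description of the real code or of
the WA1 algorithms.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


class TreeBuilder:
    """Expects well-formed callbacks; its only state is the current node."""

    def __init__(self):
        self.root = Node("#document")
        self.current = self.root

    def start_element(self, name):
        node = Node(name, self.current)
        self.current.children.append(node)
        self.current = node

    def end_element(self, name):
        assert self.current.name == name, "callbacks must be well-formed"
        self.current = self.current.parent


class Reorderer:
    """Intermediate layer: works on a stack of tag names, not node objects."""

    def __init__(self, builder):
        self.builder = builder
        self.open_tags = []

    def start_tag(self, name):
        self.open_tags.append(name)
        self.builder.start_element(name)

    def end_tag(self, name):
        if name not in self.open_tags:
            return                      # assumption: drop never-opened end tags
        while True:
            top = self.open_tags.pop()  # assumption: close intervening elements
            self.builder.end_element(top)
            if top == name:
                break


builder = TreeBuilder()
html = Reorderer(builder)
for kind, name in [("start", "p"), ("start", "b"), ("end", "p"), ("end", "b")]:
    (html.start_tag if kind == "start" else html.end_tag)(name)
```

The point of the sketch is that the reordering layer never touches Node
objects.  Something like the adoption agency algorithm moves existing nodes
around in the tree, which this split cannot express.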

> > Finally (for now ;-), right at the beginning of the tree construction 
> > section, it says that DOM Mutation events must not fire for changes 
> > caused by the UA parsing the document.  I cannot decide whether or not I
> > agree with that statement.  My experimentation appears to show that this
> > is indeed what happens in Firefox, at least. I put a script in the head 
> > of my document that attaches a listener for DOMNodeInserted on the 
> > document.documentElement node (i.e. the HTML element) and it never gets 
> > called due to nodes being added by the parser.  Internally, for me, it's
> > a PITA though, because my node tree construction code and DOM 
> > implementation code use the same internal APIs - and these automatically
> > trigger the DOM events, which, in turn, get dispatched to the various 
> > internal default event handlers to deal with the special types of node 
> > that require additional behaviour (like IMG, LINK, META etc.).
> 
> In Web browsers it's simply not an option. Having to fire mutation events 
> for every mutation according to the complete DOM3 Events model is 
> prohibitively expensive.

To be honest, I've not found it a burden even on the sorts of low-end
devices that our software runs (typically 300MHz CPUs, 8MB RAM, that sort of
thing).  Then again, I have a highly optimised event dispatcher that takes
steps to minimise the work, particularly when there are no DOM listeners for
the event being raised, which will almost always be the case for the events
concerned (DOMNodeInserted and DOMNodeInsertedIntoDocument and the Removed
counterparts).  The internal default event handlers have similar filtering
to eliminate any unnecessary processing quickly.
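
As an illustration only, the "no listeners" short cut could look something
like the Python sketch below.  The names (EventTarget, Dispatcher,
slow_path_runs) are hypothetical, and the walk from the target up through its
ancestors is a simplification of the real DOM event flow, not the actual
dispatcher.

```python
from collections import Counter


class EventTarget:
    def __init__(self, parent=None):
        self.parent = parent
        self.listeners = {}             # event type -> list of callbacks


class Dispatcher:
    def __init__(self):
        self.registered = Counter()     # event type -> number of listeners anywhere
        self.slow_path_runs = 0         # instrumentation for the example only

    def add_listener(self, target, event_type, callback):
        target.listeners.setdefault(event_type, []).append(callback)
        self.registered[event_type] += 1

    def dispatch(self, event_type, target):
        # Fast path: nobody anywhere listens for this event type, so do
        # not walk the ancestors or call anything.
        if self.registered[event_type] == 0:
            return False
        self.slow_path_runs += 1
        node = target
        while node is not None:         # simplified: target, then ancestors
            for callback in list(node.listeners.get(event_type, ())):
                callback(event_type, target)
            node = node.parent
        return True


root = EventTarget()
child = EventTarget(parent=root)
dispatcher = Dispatcher()
dispatcher.dispatch("DOMNodeInserted", child)   # fast path, nothing happens
```

With a per-type count like this, a parser that inserts thousands of nodes
pays one counter lookup per insertion until a script actually registers a
mutation listener.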

In the "in body" section, WBR doesn't really belong with a,b,big,em...
because it never had content.  It probably ought to go in with
area,basefont,bgsound... a bit further down, or in its own section.  There's
no real point bothering with putting it in the list of active formatting
elements, since it comes off the stack again straight away.

-- 
Stewart Brodie
Software Engineer
ANT Software Limited