[imps] 24 June 2010 HTML 5 spec: bug when emitting tokenizer start tags

Ian Hickson ian at hixie.ch
Mon Aug 9 17:03:20 PDT 2010

On Wed, 23 Jun 2010, Rob Jellinghaus wrote:
> The 24 June 2010 working draft of the HTML5 spec has, I believe, a bug 
> with tokenizer state update when emitting start tags.  The bug is an 
> ordering problem between the tokenizer state update performed by the 
> tokenizer itself, and the tokenizer state update sometimes performed by 
> the tree construction stage.
> http://dev.w3.org/html5/spec/Overview.html currently links to 
> http://www.w3.org/TR/2010/WD-html5-20100624/ as the latest version, but 
> the latter link is broken at the moment.  Looking at the former, for 
> instance:
> The section on the tag name state says:
> 	Emit the current tag token. Switch to the data state.
> The "Emit the current tag token" step is defined in section 8.2.4 as:
> 	When a token is emitted, it must immediately be handled by the
> 	tree construction stage. The tree construction stage can affect
> 	the state of the tokenization stage, and can insert additional
> 	characters into the stream.
> So let us consider the following HTML:
> 	<html>
> 	<head>
> 	<script><!-- window.alert(); --></script>
> 	</head>
> 	<body></body>
> 	</html>
> At the closing '>' of '<script>', the tokenizer is in tag name state.  
> It emits the current tag token, which is a 'script' start tag.
> The tree construction stage, in the section on the "in head" insertion 
> mode, specifies:
> 	↪A start tag whose tag name is "script"
> 	Run these steps:
> 	...
> 	5. Switch the tokenizer to the script data state.
> The tree construction stage therefore resets the tokenizer state 
> immediately.
> After completing, the tree construction stage returns to the tokenizer.  
> *And at that point, the tokenizer is specified to reset to the data 
> state!* This state update overwrites the state update from the tree 
> construction stage, and the script is not parsed as script.
> I encountered this bug in my own implementation.  The identical bug 
> exists in all the other states that can emit start tags which can 
> contain content ( through, and

For the record, this was fixed a few weeks ago. Let me know if anything is 
still broken here.
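
For anyone reading this in the archive, the ordering problem Rob describes can be sketched roughly as below. This is only an illustration: the class and function names (Tokenizer, TreeBuilder, finish_start_tag, state_after_script_tag) are made up for the example and do not come from the spec, and the "switch first, then emit" ordering is an assumption about one way to avoid the overwrite, not a statement of how the spec text was changed.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    SCRIPT_DATA = auto()


class TreeBuilder:
    """Hypothetical stand-in for the tree construction stage."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def handle_start_tag(self, tag_name):
        # Mirrors the quoted "in head" step for a "script" start tag:
        # switch the tokenizer to the script data state.
        if tag_name == "script":
            self.tokenizer.state = State.SCRIPT_DATA


class Tokenizer:
    """Hypothetical tokenizer reduced to the closing '>' of a start tag."""

    def __init__(self, emit_before_switch):
        self.state = State.DATA
        self.emit_before_switch = emit_before_switch
        self.tree_builder = None

    def finish_start_tag(self, tag_name):
        if self.emit_before_switch:
            # Ordering as quoted: "Emit the current tag token.
            # Switch to the data state." The second step overwrites
            # whatever state the tree builder selected during the emit.
            self.tree_builder.handle_start_tag(tag_name)
            self.state = State.DATA
        else:
            # Assumed alternative ordering: switch first, then emit,
            # so the tree builder's state change is the last write.
            self.state = State.DATA
            self.tree_builder.handle_start_tag(tag_name)


def state_after_script_tag(emit_before_switch):
    """Return the tokenizer state after a 'script' start tag is emitted."""
    tokenizer = Tokenizer(emit_before_switch)
    tokenizer.tree_builder = TreeBuilder(tokenizer)
    tokenizer.finish_start_tag("script")
    return tokenizer.state
```

With emit_before_switch=True the tokenizer ends up back in the data state, which is the behaviour reported above; with emit_before_switch=False it stays in the script data state chosen by the tree builder.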

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
