[imps] 24 June 2010 HTML 5 spec: bug when emitting tokenizer start tags

Rob Jellinghaus rjelling at microsoft.com
Wed Jun 23 11:54:34 PDT 2010


I believe the 24 June 2010 working draft of the HTML5 spec has a bug in how the tokenizer state is updated when a start tag is emitted.  It is an ordering problem between the state switch performed by the tokenizer itself and the state switch that the tree construction stage sometimes performs while handling the emitted token.

http://dev.w3.org/html5/spec/Overview.html currently links to http://www.w3.org/TR/2010/WD-html5-20100624/ as the latest version, but the latter link is broken at the moment.  Looking at the former, for instance:

Section 8.2.4.10 (Tag name state) says

	↪U+003E GREATER-THAN SIGN (>)
	Emit the current tag token. Switch to the data state.

The "Emit the current tag token" step is defined in section 8.2.4 as:

	When a token is emitted, it must immediately be handled by the
	tree construction stage. The tree construction stage can affect
	the state of the tokenization stage, and can insert additional
	characters into the stream.

So let us consider the following HTML:

	<html>
	<head>
	<script><!-- window.alert(); --></script>
	</head>
	<body></body>
	</html>

At the closing '>' of '<script>', the tokenizer is in tag name state.  It emits the current tag token, which is a 'script' start tag.

The tree construction stage, in section 8.2.5.7 ("in head" insertion mode), specifies:

	↪A start tag whose tag name is "script"
	Run these steps:
	...
	5. Switch the tokenizer to the script data state.

The tree construction stage therefore switches the tokenizer to the script data state immediately, while the emitted token is still being handled.

After completing, the tree construction stage returns to the tokenizer.  *And at that point, the tokenizer is specified to switch to the data state!*  This second state update overwrites the one made by the tree construction stage, so the contents of the script element are tokenized in the data state rather than the script data state, and the script is not parsed as script.

I encountered this bug in my own implementation.  The same bug exists in every other state that can emit a start tag for an element that can have content (8.2.4.34 through 8.2.4.37, and 8.2.4.42).

The fix is to reverse the order of the state update and the token emission:

	↪U+003E GREATER-THAN SIGN (>)
	Switch to the data state. Emit the current tag token.
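
To make the ordering issue concrete, here is a minimal sketch in Python.  It is a toy model, not my implementation and not spec text: the names Tokenizer, tree_construction, on_greater_than and state_after_start_tag are hypothetical, and only the three states discussed above are modelled.

```python
# Toy model of the tag name state handling U+003E (>).
# All names here are hypothetical and exist only for illustration.
DATA = "data"
TAG_NAME = "tag name"
SCRIPT_DATA = "script data"


class Tokenizer:
    def __init__(self, emit_first):
        self.state = TAG_NAME
        # True models the draft's order (emit, then switch state);
        # False models the proposed order (switch state, then emit).
        self.emit_first = emit_first

    def tree_construction(self, tag_name):
        # Stand-in for the "in head" insertion mode: a "script" start
        # tag switches the tokenizer to the script data state.
        if tag_name == "script":
            self.state = SCRIPT_DATA

    def on_greater_than(self, tag_name):
        if self.emit_first:
            self.tree_construction(tag_name)  # emit the current tag token
            self.state = DATA                 # overwrites the switch above
        else:
            self.state = DATA                 # switch to the data state first
            self.tree_construction(tag_name)  # tree construction has the last word
        return self.state


def state_after_start_tag(tag_name, emit_first):
    """Return the tokenizer state after the closing '>' of a start tag."""
    return Tokenizer(emit_first).on_greater_than(tag_name)
```

In this model, state_after_start_tag("script", True) leaves the tokenizer in the data state, while state_after_start_tag("script", False) leaves it in the script data state.  Tags that the tree construction stage does not treat specially end up in the data state under either ordering.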

I have applied this fix in my implementation and satisfied myself that it is (more) correct.  Please advise.

Sincerely, 
Rob Jellinghaus
rjelling at microsoft.com

