[whatwg] NUL characters in CDATA?

Fri Oct 14 21:19:11 PDT 2011

The HTML parsing spec says this about tokenizing CDATA sections:

> Consume every character up to the next occurrence of the three 
> character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE 
> BRACKET U+003E GREATER-THAN SIGN (|]]>|), or the end of the file 
> (EOF), whichever comes first. Emit a series of character tokens 
> consisting of all the characters consumed except the matching three 
> character sequence at the end (if one was found before the end of the 
> file).

By my reading, any NUL (U+0000) characters that appear inside a CDATA 
section in the input are left unchanged.
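
To make that reading concrete, here is a minimal Python sketch of the 
quoted paragraph. The function name and signature are my own invention 
for illustration, not anything from the spec or from html5lib, and it 
returns the consumed text as one string rather than emitting individual 
character tokens.

```python
from typing import Tuple


def consume_cdata(stream: str, pos: int) -> Tuple[str, int]:
    """Consume a CDATA section body, starting just after '<![CDATA['.

    Hypothetical helper: returns the consumed characters and the
    position at which tokenizing would resume.
    """
    end = stream.find("]]>", pos)
    if end == -1:
        # No terminator found: consume everything up to EOF.
        return stream[pos:], len(stream)
    # Drop the ']]>' terminator itself and resume after it.
    return stream[pos:end], end + 3


body = "\u0000filler\u0000text\u0000]]>"
data, next_pos = consume_cdata(body, 0)
```

Nothing in this reading touches U+0000, so `data` still contains all 
three NUL characters.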

But the html5lib test suite includes this test case in 
testdata/tree-construction/plain-text-unsafe.dat:

#data
<svg><![CDATA[\u0000filler\u0000text\u0000]]>
#errors
#document
| <html>
| <head>
| <body>
| <svg svg>
|       "\uFFFDfiller\uFFFDtext\uFFFD"

To copy this test into my email window, I had to change the 
non-printing characters to Unicode \u escapes, but this is otherwise the 
test data as it stands, and it seems to contradict the spec: the expected 
tree has every U+0000 replaced by U+FFFD.

Which is right?  Should the spec be modified so that the CDATA section 
state is like the bogus comment state and includes the text " with any 
U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER 
characters."?
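
For comparison, here is a rough Python sketch of what the CDATA section 
state would do with that wording added. Again, the function name is 
hypothetical and this is only my reading of the proposed change, not 
text from the spec.

```python
def consume_cdata_replacing_nul(stream, pos):
    """Like the quoted spec text, but with U+0000 mapped to U+FFFD.

    Hypothetical helper: returns the emitted characters and the
    position at which tokenizing would resume.
    """
    end = stream.find("]]>", pos)
    if end == -1:
        # No terminator found: consume everything up to EOF.
        body_end, next_pos = len(stream), len(stream)
    else:
        # Drop the ']]>' terminator itself and resume after it.
        body_end, next_pos = end, end + 3
    # The proposed addition: replace NUL with REPLACEMENT CHARACTER.
    text = stream[pos:body_end].replace("\u0000", "\uFFFD")
    return text, next_pos


source = "\u0000filler\u0000text\u0000]]>"
emitted, resume_at = consume_cdata_replacing_nul(source, 0)
```

With this variant, `emitted` matches the text node that the html5lib 
test expects.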

Thanks,

     David