[whatwg] NUL characters in CDATA?

Mon Oct 24 10:06:29 PDT 2011

I'm responding to my own email here.  Please disregard this thread: 
the bug was in my code, not in the tests or the spec.  The tokenizer 
does leave NUL characters in CDATA sections unchanged, but CDATA 
sections are only tokenized in foreign content, so the resulting 
character tokens are always inserted by the foreign content rules, 
which replace each U+0000 with U+FFFD as the tests expect.  My code 
wasn't doing that replacement correctly.
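
The two stages described above can be sketched roughly as follows. This 
is only an illustration in Python, not the spec's algorithm and not 
html5lib's code: the function names tokenize_cdata and 
insert_foreign_characters are made up for this example, and the real 
parser works on one character token at a time rather than on whole 
strings.

```python
from typing import Tuple

NUL = "\u0000"
REPLACEMENT = "\uFFFD"


def tokenize_cdata(source: str, start: int) -> Tuple[str, int]:
    """Hypothetical CDATA section state.

    `start` is the index just past '<![CDATA['. Returns the consumed
    characters, NULs included and unchanged, plus the index at which
    tokenizing should resume.
    """
    end = source.find("]]>", start)
    if end == -1:
        # No terminator: consume everything up to EOF.
        return source[start:], len(source)
    # Drop the matching ']]>' itself, as the spec text requires.
    return source[start:end], end + 3


def insert_foreign_characters(chars: str) -> str:
    """Hypothetical stand-in for the foreign content insertion step:
    each U+0000 becomes U+FFFD before it reaches the tree."""
    return chars.replace(NUL, REPLACEMENT)


# The input from plain-text-unsafe.dat, with the NULs written as escapes.
source = "<svg><![CDATA[\u0000filler\u0000text\u0000]]>"
start = source.index("<![CDATA[") + len("<![CDATA[")

tokens, resume = tokenize_cdata(source, start)
text = insert_foreign_characters(tokens)
```

Here tokens still contains the three NUL characters, while text holds 
the U+FFFD version that the test's #document section expects, so the 
spec text and the test agree once both stages are applied.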

     David

On 10/14/11 9:19 PM, David Flanagan wrote:
> The HTML parsing spec says this about tokenizing CDATA sections:
>
>> Consume every character up to the next occurrence of the three 
>> character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE 
>> BRACKET U+003E GREATER-THAN SIGN (|]]>|), or the end of the file 
>> (EOF), whichever comes first. Emit a series of character tokens 
>> consisting of all the characters consumed except the matching three 
>> character sequence at the end (if one was found before the end of the 
>> file).
> By my reading, if there are NUL \u0000 characters in the input inside 
> a CDATA section they will be left unchanged.
>
> But the html5lib test suite includes this test case in 
> testdata/tree-construction/plain-text-unsafe.dat:
>
> #data
> <svg><![CDATA[\u0000filler\u0000text\u0000]]>
> #errors
> #document
> | <html>
> | <head>
> | <body>
> | <svg svg>
> |       "\uFFFDfiller\uFFFDtext\uFFFD"
>
> In order to copy this test into my email window, I had to change the 
> non-printing characters to Unicode \u escapes, but this is the basic 
> test data and it seems to contradict the spec.
>
> Which is right?  Should the spec be modified so that the CDATA section 
> state is like the bogus comment state and includes the text " with any 
> U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER 
> characters."?
>
> Thanks,
>
>     David