[whatwg] Null characters

Mon Oct 8 20:36:29 PDT 2012

On Tue, 9 Oct 2012, Cameron Zemek wrote:
>
> I noticed the specification usually treats null characters (U+0000) by 
> replacing them with the replacement character (U+FFFD). In the other 
> cases they are ignored by the tree construction stage, when the 
> insertion mode is 'in body', 'in table text', or 'in select'.
> 
> Would it not be simpler and more consistent to just have the Input 
> Stream Preprocessor replace all null characters with the replacement 
> character?

Yes. In fact that's what the spec used to do.

Turns out it's not Web-compatible. :-(

> If the Input Stream Preprocessor converted them, it would result in 
> minimal changes to the output, as I believe most HTML documents in the 
> wild do not include null characters.

This assumption is unfortunately less correct than one would hope.
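
For concreteness, the two behaviours being compared can be sketched as 
follows. This is only an illustration in Python: the function names and 
the IGNORING_MODES set are hypothetical, and the per-mode rule is a 
simplification of the behaviour as described in the quoted message, not 
spec text.

```python
REPLACEMENT_CHARACTER = "\uFFFD"

# Insertion modes in which, per the quoted message, tree construction
# ignores a U+0000 character instead of replacing it. (Assumption: this
# set mirrors the message above, not the full spec.)
IGNORING_MODES = frozenset({"in body", "in table text", "in select"})


def preprocess_replace_nulls(text: str) -> str:
    """The proposed simplification: replace every U+0000 up front."""
    return text.replace("\u0000", REPLACEMENT_CHARACTER)


def null_output_for_mode(insertion_mode: str) -> str:
    """Simplified per-mode rule: what a single U+0000 contributes."""
    if insertion_mode in IGNORING_MODES:
        return ""  # dropped by tree construction
    return REPLACEMENT_CHARACTER


if __name__ == "__main__":
    sample = "a\u0000b"
    print(repr(preprocess_replace_nulls(sample)))
    print(repr(null_output_for_mode("in body")))
```

The two differ exactly in the listed modes: up-front replacement leaves 
a U+FFFD in the output where the per-mode rule leaves nothing.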

> On a similar note, why have the other invalid Unicode characters,
> U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to
> U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
> U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
> U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
> U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
> U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
> U+10FFFE, and U+10FFFF
> as part of the input stream to the tokenizer and tree construction?

It's my understanding that, for compatibility with legacy content, this 
test (and similar ones for the other characters) is required to log "1":

   http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1824
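
As an aside, the long list of code points quoted above reduces to four 
ranges, U+000B, and the last two code points of each of the 17 planes. 
A sketch of that membership test in Python, where the function name is 
made up for illustration and the set is taken from the quoted list 
rather than from any spec text:

```python
def is_listed_code_point(cp: int) -> bool:
    """Return True if cp is in the list quoted in this message.

    U+0000 is deliberately not included; it is the separate case
    discussed at the top of the thread.
    """
    # Control-character and noncharacter ranges from the quoted list.
    if 0x0001 <= cp <= 0x0008 or 0x000E <= cp <= 0x001F:
        return True
    if 0x007F <= cp <= 0x009F or 0xFDD0 <= cp <= 0xFDEF:
        return True
    if cp == 0x000B:
        return True
    # U+nFFFE and U+nFFFF for n = 0x0 .. 0x10: the low 16 bits are
    # FFFE or FFFF, and the value is within the Unicode range.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for cp in (0x0009, 0x000B, 0xFDD0, 0xFFFD, 0x1FFFE, 0x10FFFF):
        print(f"U+{cp:04X}", is_listed_code_point(cp))
```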

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'