[whatwg] Null characters

Mon Oct 8 21:09:57 PDT 2012

On Tue, Oct 9, 2012 at 1:36 PM, Ian Hickson <ian at hixie.ch> wrote:
> On Tue, 9 Oct 2012, Cameron Zemek wrote:
>>
>> I noticed the specification usually treats null characters U+0000 by
>> replacing them with the replacement character U+FFFD . The other cases
>> it will be ignored by the tree construction stage when the mode is 'in
>> body', 'in table text', 'in select'.
>>
>> Would it not be simpler and more consistent to just have the Input
>> Stream Preprocessor replace all null characters with the replacement
>> character.
>
> Yes. In fact that's what the spec used to do.
>
> Turns out it's not Web-compatible. :-(

How is it not Web-compatible? PS: Maybe a note explaining this should
be added to the specification.

>
>> If the Input Stream Preprocessor convert them it would result in minimal
>> changes to the output as I believe most HTML documents in the wild do
>> not include null characters.
>
> This assumption is unfortunately less correct than one would hope.

Yeah, I don't have any numbers to tell whether this is the case or not.
It would be interesting to see a study done on this. But thinking about
it logically, what issues would there be in showing a null character as
the replacement character instead? Readers would see some extra
characters if the document author had left null characters in. What is
the big deal with that? Why do authors even have null characters in
their HTML documents? Should they not be removing them?

I assume I'm probably missing some historical reason for this; it
just struck me as needless complexity. In other words, what good
reasons exist for ignoring null characters in certain portions of the
HTML specification?
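
To make the comparison concrete, here is a rough Python sketch of the
two behaviours discussed above. It is only an illustration and not the
specification's algorithm: the function names are made up, the real
parser works on tokens rather than whole strings, and the only mode
names used are the three quoted earlier in this message.

```python
NULL = "\u0000"
REPLACEMENT = "\ufffd"

# Insertion modes that, per the message above, ignore U+0000.
IGNORING_MODES = {"in body", "in table text", "in select"}


def preprocess_replace_all(text: str) -> str:
    """Proposed behaviour: the preprocessor replaces every U+0000 up front."""
    return text.replace(NULL, REPLACEMENT)


def handle_by_mode(text: str, mode: str) -> str:
    """Simplified model of the described behaviour:
    drop U+0000 in the listed modes, replace it with U+FFFD elsewhere."""
    if mode in IGNORING_MODES:
        return text.replace(NULL, "")
    return text.replace(NULL, REPLACEMENT)


if __name__ == "__main__":
    sample = "a" + NULL + "b"
    print(repr(preprocess_replace_all(sample)))
    print(repr(handle_by_mode(sample, "in body")))
```

The visible difference is confined to the three listed modes: the
proposal would show an extra U+FFFD where the character is currently
dropped without a trace.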