[whatwg] Null characters

Ian Hickson ian at hixie.ch
Tue Oct 9 11:47:21 PDT 2012

On Tue, 9 Oct 2012, Cameron Zemek wrote:
> On Tue, Oct 9, 2012 at 1:36 PM, Ian Hickson <ian at hixie.ch> wrote:
> > On Tue, 9 Oct 2012, Cameron Zemek wrote:
> >>
> >> I noticed the specification usually treats null characters (U+0000) 
> >> by replacing them with the replacement character (U+FFFD). In the 
> >> other cases they are ignored by the tree construction stage, when the 
> >> insertion mode is 'in body', 'in table text', or 'in select'.
> >>
> >> Would it not be simpler and more consistent to just have the Input 
> >> Stream Preprocessor replace all null characters with the replacement 
> >> character?
> >
> > Yes. In fact that's what the spec used to do.
> >
> > Turns out it's not Web-compatible. :-(
> How is it not web-compatible? PS: Maybe a note should be added to the 
> specification that explains this.

I could add a note... based on what Boris described, what would you want 
the note to say and where would you want it placed, such that you would 
have seen it when your original reading caused you to e-mail the list?

(This part of the spec is rather large, and the NULL handling happens all 
over the place, so I don't know where would be best.)

On Tue, 9 Oct 2012, Boris Zbarsky wrote:
> > 
> > But just thinking about it logically, what issues would there be in 
> > showing the null character as the replacement character instead? 
> > Visually, one would see some extra characters if the document author 
> > had null characters. What is the big deal with doing that?
>
> It makes text unreadable.  Consider text that's actually UTF-16 but 
> being declared as ISO-8859-1.  If you strip the nulls, it all works out.  
> But if you don't, every other character is a replacement character.
> This is not a rare situation on the web, unfortunately.
>
> > Why do authors even have null characters in their HTML documents?
>
> Because they have UTF-16 text in their database that they dump into an 
> ISO-8859-1 document.  They have no idea there are any "null characters" 
> involved.
>
> > I assume I'm probably missing some historical reason for this
>
> Yes, that reason is "the browsers all do it this way, so web sites 
> depend on it".

Yup. :-(
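
For anyone who wants to see the effect concretely, here is a minimal sketch 
of the UTF-16-declared-as-ISO-8859-1 case Boris describes. It is only an 
illustration, not how any browser or the spec's parser is structured: the 
helper names strip_nulls and replace_nulls are hypothetical, and the sample 
text is assumed to be plain ASCII so that every second byte is 0x00.

```python
# Illustration only: helper names are hypothetical, not taken from the spec.
# Assumes ASCII-range text, so each UTF-16 code unit has a 0x00 high byte.

def strip_nulls(s: str) -> str:
    """Drop U+0000 characters (the Web-compatible behaviour discussed)."""
    return s.replace("\u0000", "")

def replace_nulls(s: str) -> str:
    """Turn every U+0000 into U+FFFD (the 'simpler' preprocessor idea)."""
    return s.replace("\u0000", "\ufffd")

# Text stored as UTF-16 (little-endian, no BOM) ...
raw = "Hello".encode("utf-16-le")        # b'H\x00e\x00l\x00l\x00o\x00'

# ... but served in a document declared as ISO-8859-1.
mislabelled = raw.decode("iso-8859-1")   # 'H\x00e\x00l\x00l\x00o\x00'

stripped = strip_nulls(mislabelled)      # readable again
replaced = replace_nulls(mislabelled)    # every other character is U+FFFD

print(repr(stripped))
print(repr(replaced))
```

With stripping, the text comes out as "Hello"; with replacement, each letter 
is followed by a U+FFFD, which is the unreadable result described above.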

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
