[whatwg] Surrogate pairs and character references

Ian Hickson ian at hixie.ch
Thu Sep 24 02:06:13 PDT 2009


On Thu, 17 Sep 2009, Øistein E. Andersen wrote:
>
> It is much clearer now.  Thanks.  Just a few minor issues:
> 
> > "Bytes or sequences of bytes in the original byte stream that could not be
> > converted to Unicode characters must be converted to U+FFFD REPLACEMENT
> > CHARACTER code points."
> 
> With the new definition of Unicode characters as Unicode scalar values, this
> excludes surrogate code points, which are also handled separately (and cause a
> parse error) in the step quoted below.  You may want to say "Unicode code
> points" rather than "Unicode characters".

Fixed.
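
For readers following the terminology, here is a minimal sketch of the distinction at issue: every integer from 0 to 0x10FFFF is a code point, while a Unicode scalar value is any code point outside the surrogate range U+D800 to U+DFFF. The function names are hypothetical, chosen for illustration; they are not taken from the spec.

```python
# Hypothetical helper names, for illustration only.

def is_code_point(n: int) -> bool:
    """True for any Unicode code point, surrogates included."""
    return 0 <= n <= 0x10FFFF


def is_surrogate(n: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """True for code points that are not surrogates."""
    return is_code_point(n) and not is_surrogate(n)
```

Under these definitions U+D800 is a code point but not a scalar value, which is why "Unicode code points" is the wider term.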


> "U+FFFD REPLACEMENT CHARACTERs" is sufficient, used elsewhere and probably
> reads better than "U+FFFD REPLACEMENT CHARACTER code points".

Fixed.


> > All U+0000 NULL characters and code points in the range U+D800 to U+DFFF in
> > the input must be replaced by U+FFFD REPLACEMENT CHARACTERs. Any occurrences
> > of such characters and code points are parse errors.
> 
> The phrase "characters and code points" (in the second sentence) is awkward
> given that all characters are in fact code points.

Yeah, but if I change it, it sounds even more awkward, because then it
doesn't match the previous sentence. I'd rather have it be technically
redundant than confuse people into thinking that I meant something more
subtle than the spec actually says.
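
To make the quoted requirement concrete, the replacement step could be sketched roughly as follows. This is an illustration under stated assumptions, not spec text: the function name and the idea of returning an error count are made up for the example, and the input is assumed to be already decoded into a Python str, which may hold lone surrogates.

```python
from typing import Tuple

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def replace_null_and_surrogates(text: str) -> Tuple[str, int]:
    """Replace U+0000 and U+D800..U+DFFF with U+FFFD.

    Returns the cleaned text and the number of replacements made;
    each replacement corresponds to one parse error.
    (Hypothetical helper, not an API defined by the spec.)
    """
    out = []
    parse_errors = 0
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)
            parse_errors += 1
        else:
            out.append(ch)
    return "".join(out), parse_errors
```

A supplementary character such as U+1047E passes through untouched, since in this sketch it is a single code point rather than a surrogate pair.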

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

