[whatwg] Surrogate pairs and character references
ian at hixie.ch
Mon Sep 14 19:03:54 PDT 2009
On Tue, 8 Sep 2009, Øistein E. Andersen wrote:
> According to the spec, character references may cause surrogate characters
> (0xD800 to 0xDFFF) to be inserted into the DOM. Assuming that the DOM is a
> UTF-16 environment, &#xD800;&#xDC00; and &#x10000; will both result in
> \xD800\xDC00 or U+1,0000. This should probably be pointed out explicitly
> since extra processing has to be done to achieve the same result in a parser
> that is not built atop UTF-16.
Actually it's the other way around. Extra work has to be done in UTF-16
environments to make sure that Unicode characters in the surrogate
character range don't get processed as surrogate characters. (That is,
regardless of the environment, &#xD800;&#xDC00; and &#x10000; are not the
same -- the first has two invalid characters U+D800 and U+DC00, the second
has one character U+10000.)
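
To make the distinction concrete, here is a minimal Python sketch that works
on lists of code points rather than on UTF-16 code units. The name
expand_hex_refs and the regular expression are illustrative assumptions; the
sketch handles only hexadecimal numeric character references and is not the
spec's tokenizer algorithm.

```python
import re

# Hypothetical pattern: hexadecimal numeric character references only,
# e.g. &#xD800; (a deliberate simplification of the real tokenizer).
HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")

def expand_hex_refs(text):
    """Return the code points of text, expanding each hex reference
    to the single code point it names."""
    out = []
    pos = 0
    for m in HEX_REF.finditer(text):
        out.extend(ord(c) for c in text[pos:m.start()])
        out.append(int(m.group(1), 16))
        pos = m.end()
    out.extend(ord(c) for c in text[pos:])
    return out

pair = expand_hex_refs("&#xD800;&#xDC00;")
single = expand_hex_refs("&#x10000;")
print([hex(cp) for cp in pair])    # ['0xd800', '0xdc00']
print([hex(cp) for cp in single])  # ['0x10000']
print(pair == single)              # False
```

A UTF-16 implementation that simply stored each reference as one 16-bit code
unit would make the first form indistinguishable from the second, which is
the extra work mentioned above.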
I'm not really sure how to make that clearer in the spec. I suppose we
could just change the spec and say that surrogate characters (whether
literal characters, e.g. in UTF-8, or from character references) all get
converted to U+FFFD.
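
As a rough sketch of what such a rule would amount to (Python; the function
names are made up for illustration, and the input is assumed to be a list of
code points already produced by decoding and character reference expansion):

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

def is_surrogate(cp):
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def replace_surrogates(code_points):
    """Map every surrogate code point to U+FFFD; leave the rest alone."""
    return [REPLACEMENT if is_surrogate(cp) else cp for cp in code_points]

print([hex(cp) for cp in replace_surrogates([0xD800, 0xDC00])])  # ['0xfffd', '0xfffd']
print([hex(cp) for cp in replace_surrogates([0x10000])])         # ['0x10000']
```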
> Furthermore, it is not entirely clear whether a mixed form like \xD800&#xDC00;
> encoded in UTF-16BE should give \xD800\xDC00 or \xFFFD\xDC00.
It should give U+FFFD U+DC00. It's not clear to me why that is not clear. :-)
Could you walk me through the spec interpreting it in such a way that you
get any other result?
> Not all browsers convert unpaired surrogates in UTF-16 to U+FFFD, so the
> mixed form may be interpreted as U+1,0000.
The spec says "Bytes or sequences of bytes in the original byte stream
that could not be converted to Unicode characters must be converted to
U+FFFD REPLACEMENT CHARACTER code points".
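
The mixed form can be traced through a two-stage sketch: first the byte
stream is converted to code points, then character references are expanded.
This is a simplified Python illustration, not the spec's algorithm; the names
decode_utf16be and expand_hex_refs are assumptions, only hexadecimal
references are handled, and surrogate references are passed through as the
thread describes the spec doing at the time.

```python
import re

REPLACEMENT = 0xFFFD
HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")

def decode_utf16be(data):
    """Stage 1: UTF-16BE bytes -> code points. Code units that cannot be
    converted (unpaired surrogates, a trailing odd byte) become U+FFFD."""
    units = [(data[i] << 8) | data[i + 1] for i in range(0, len(data) - 1, 2)]
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if 0xD800 <= u <= 0xDBFF and has_low:
            # Well-formed pair: combine into one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(REPLACEMENT)  # unpaired surrogate code unit
            i += 1
        else:
            out.append(u)
            i += 1
    if len(data) % 2:
        out.append(REPLACEMENT)  # leftover single byte
    return out

def expand_hex_refs(code_points):
    """Stage 2: expand hex character references in decoded text."""
    text = "".join(chr(cp) for cp in code_points)
    out, pos = [], 0
    for m in HEX_REF.finditer(text):
        out.extend(ord(c) for c in text[pos:m.start()])
        out.append(int(m.group(1), 16))
        pos = m.end()
    out.extend(ord(c) for c in text[pos:])
    return out

# Literal code unit D800 followed by the text "&#xDC00;", all in UTF-16BE.
mixed = b"\xd8\x00" + "&#xDC00;".encode("utf-16-be")
print([hex(cp) for cp in expand_hex_refs(decode_utf16be(mixed))])
# ['0xfffd', '0xdc00']
```

The literal D800 is already gone by the time the reference is expanded, so
the two halves never meet to form U+10000.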
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.