[whatwg] Surrogate pairs and character references

Mon Sep 14 19:03:54 PDT 2009

On Tue, 8 Sep 2009, Øistein E. Andersen wrote:
>
> According to the spec, character references may cause surrogate characters
> (0xD800 to 0xDFFF) to be inserted into the DOM.  Assuming that the DOM is a
> UTF-16 environment, &#xD800;&#xDC00; and &#x10000; will both result in
> \xD800\xDC00 or U+10000.  This should probably be pointed out explicitly
> since extra processing has to be done to achieve the same result in a parser
> that is not built atop UTF-16.

Actually it's the other way around. Extra work has to be done in UTF-16 
environments to make sure that Unicode characters in the surrogate 
character range don't get processed as surrogate characters. (That is, 
regardless of the environment, &#xD800;&#xDC00; and &#x10000; are not the 
same -- the first has two invalid characters U+D800 and U+DC00, the second 
has one character U+10000.)
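
To illustrate the distinction, here is a rough Python sketch. It is not 
taken from the spec: reference_code_points is a hypothetical helper that 
only handles hexadecimal references, and the "naive" packing step is my 
assumption about what a careless UTF-16 implementation might do.

```python
import re
import struct

# Hypothetical helper for illustration: one code point per reference,
# with no attempt to combine surrogates.
HEX_REF = re.compile(r"&#x([0-9A-Fa-f]+);")

def reference_code_points(text):
    return [int(m.group(1), 16) for m in HEX_REF.finditer(text)]

pair = reference_code_points("&#xD800;&#xDC00;")  # two surrogate code points
single = reference_code_points("&#x10000;")       # one supplementary code point

# Written out naively as 16-bit units, the two become indistinguishable,
# which is why a UTF-16 implementation has to track the difference itself.
pair_units = struct.pack(">2H", *pair)
single_units = chr(single[0]).encode("utf-16-be")
```

At the code point level the two lists differ (two entries against one), 
yet pair_units and single_units hold the same four bytes.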

I'm not really sure how to make that clearer in the spec. I suppose we 
could just change the spec and say that surrogate characters (whether 
literal characters, e.g. in UTF-8, or from character references) all get 
converted to U+FFFD.

> Furthermore, it is not entirely clear whether a mixed form like \xD800&#xDC00;
> encoded in UTF-16BE should give \xD800\xDC00 or \xFFFD\xDC00.

It should give U+FFFD U+DC00. It's not clear to me why that is not clear. :-)
Could you walk me through the spec interpreting it in such a way that you 
get any other result?

> Not all browsers convert unpaired surrogates in UTF-16 to U+FFFD, so the 
> mixed form may be interpreted as U+10000.

The spec says "Bytes or sequences of bytes in the original byte stream 
that could not be converted to Unicode characters must be converted to 
U+FFFD REPLACEMENT CHARACTER code points".
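
As a sketch of how that rule applies to the mixed form (Python again; the 
codec's "replace" error handler is standing in for the spec's rule here, 
and resolve_hex_refs is a hypothetical helper, not a real parser):

```python
import re

# The mixed form: a lone high surrogate code unit followed by the text
# "&#xDC00;", all encoded as UTF-16BE.
raw = b"\xd8\x00" + "&#xDC00;".encode("utf-16-be")

# Step 1: decode the byte stream. The unpaired D800 cannot be converted
# to a Unicode character, so it is replaced with U+FFFD.
decoded = raw.decode("utf-16-be", errors="replace")

# Step 2: resolve character references afterwards, one code point each.
def resolve_hex_refs(text):
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

result = resolve_hex_refs(decoded)
code_points = [ord(c) for c in result]
```

Because decoding happens before the reference is resolved, the two halves 
never get a chance to combine, and code_points comes out as U+FFFD 
followed by U+DC00.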

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'