[whatwg] Surrogate pairs and character references

Tue Sep 15 14:14:40 PDT 2009

On 15 Sep 2009, at 03:03, Ian Hickson wrote:

> [...] [R]egardless of the environment, &#xD800;&#xDC00; and &#x10000;
> are not the same -- the first has two invalid characters U+D800 and
> U+DC00, the second has one character U+10000.

That works well as long as the concepts are couched in abstract terms,
but how is one expected to prevent adjacent surrogates from coalescing
in a UTF-16 environment, where the code units D800 DC00 are exactly the
encoding of U+10000?
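
To make the coalescence concrete, here is a minimal sketch in Python
that models a UTF-16 environment as a list of 16-bit code units.  The
function names are mine and purely illustrative; nothing here is taken
from the spec or from any browser.

```python
def to_utf16_units(cp):
    """Encode one code point as a list of UTF-16 code units."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def decode_utf16_units(units):
    """Decode code units to code points, pairing adjacent surrogates."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # High surrogate followed by low surrogate: one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            # BMP code unit or unpaired surrogate: passed through unchanged.
            out.append(u)
            i += 1
    return out


# &#xD800;&#xDC00; naively appended to a UTF-16 buffer, one unit per reference.
from_two_refs = [0xD800, 0xDC00]
# &#x10000; appended to the same kind of buffer.
from_one_ref = to_utf16_units(0x10000)

# The two buffers are identical, so the distinction is lost.
assert from_two_refs == from_one_ref
assert decode_utf16_units(from_two_refs) == [0x10000]
```

Once both inputs have been written to the buffer, no later stage can
tell them apart unless the parser does something special at the moment
it expands the references.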

Firefox circumvents the problem by substituting U+FFFD for the  
surrogates; other browsers make no attempt to prevent coalescence.

> I'm not really sure how to make that clearer in the spec.

(Let us first determine what should happen.)

> I suppose we could just change the spec and say that surrogate
> characters (whether literal characters, e.g. in UTF-8, or from
> character references) all get converted to U+FFFD?

That seems to be the only reasonable option if handling  
&#xD800;&#xDC00; as U+FFFD U+FFFD is deemed desirable and sufficiently  
compatible with existing documents.  It would simplify things a bit in  
non-UTF-16 environments (as compared to my interpretation of the  
current text) without much added complexity in UTF-16 environments.
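
As a rough illustration of that option, the following Python sketch
expands hexadecimal character references and maps every surrogate to
U+FFFD before anything is stored.  It is an assumption-laden toy, not
the spec's tokeniser: the helper names are hypothetical, only the
&#x...; form is recognised, and no other error handling is attempted.

```python
import re

REPLACEMENT = 0xFFFD

# Hypothetical, simplified pattern: hexadecimal references only.
HEX_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF


def expand_hex_references(text):
    """Return the code points of text with &#x...; references expanded.

    A reference to a surrogate code point yields U+FFFD instead.
    """
    code_points = []
    pos = 0
    for m in HEX_REF.finditer(text):
        # Literal characters between references are copied as-is.
        code_points.extend(ord(c) for c in text[pos:m.start()])
        cp = int(m.group(1), 16)
        code_points.append(REPLACEMENT if is_surrogate(cp) else cp)
        pos = m.end()
    code_points.extend(ord(c) for c in text[pos:])
    return code_points


# Two surrogate references become two replacement characters ...
assert expand_hex_references("&#xD800;&#xDC00;") == [0xFFFD, 0xFFFD]
# ... whereas the supplementary reference is left alone.
assert expand_hex_references("&#x10000;") == [0x10000]
```

Because the substitution happens per reference, the result is the same
whatever the internal string representation is, so nothing is left to
coalesce in a UTF-16 environment.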

> [\xD800&#xDC00;] should give U+FFFD U+DC00. It's not clear to me why
> that is not clear. :-)
> Could you walk me through the spec interpreting it in such a way that
> you get any other result?

See below.

> The spec says "Bytes or sequences of bytes in the original byte stream
> that could not be converted to Unicode characters must be converted to
> U+FFFD REPLACEMENT CHARACTER code points".

I take it you mean that \xD800&#xDC00; should turn into \xFFFD&#xDC00;  
at this point, which is only supported by the quoted text if "bytes or  
sequences of bytes" representing surrogates "[cannot] be converted to  
Unicode characters" or, to put it differently, if surrogates are not  
"Unicode characters".

Unfortunately for this reading, the term "Unicode character" does not  
seem to be defined in HTML5 or in Unicode, and the following paragraph  
(which appears shortly after the one you quoted) clearly includes  
surrogate code points within the concept of "Unicode character":

"Any occurrences of any characters in the ranges [...] U+D800 to
U+DFFF, [...] are parse errors. (These are all control characters or
permanently undefined Unicode characters.)"

Moreover, this paragraph would be pointless if the characters  
mentioned therein could never occur at all.

***

The use of "Unicode character" without a definition is fine in other  
parts of HTML5, but clearly not sufficiently precise in this instance.  
If you want to exclude (unpaired) surrogate code points only, the  
appropriate term to use would probably be "Unicode scalar value".
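
For reference, Unicode defines a scalar value as any code point except
the high-surrogate and low-surrogate code points, i.e. the ranges 0 to
D7FF and E000 to 10FFFF.  A small Python predicate (the function name is
my own) captures the distinction:

```python
def is_unicode_scalar_value(cp):
    """True for code points in 0..0xD7FF or 0xE000..0x10FFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# U+10000 is a scalar value; U+D800 and U+DC00 are code points but not
# scalar values.
assert is_unicode_scalar_value(0x10000)
assert not is_unicode_scalar_value(0xD800)
assert not is_unicode_scalar_value(0xDC00)
```

Under this wording, a byte sequence standing for a surrogate would
unambiguously "not be convertible", whereas U+FFFD itself and the
noncharacters remain convertible.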

-- 
Øistein E. Andersen