[whatwg] Surrogate pairs and character references

Wed Sep 16 02:40:42 PDT 2009

On Tue, 15 Sep 2009, Øistein E. Andersen wrote:
> > 
> > I suppose we could just change the spec and say that surrogate 
> > characters (whether literal characters, e.g. in UTF-8, or from 
> > character references) all get converted to U+FFFD.
> 
> That seems to be the only reasonable option if handling &#xD800;&#xDC00; 
> as U+FFFD U+FFFD is deemed desirable and sufficiently compatible with 
> existing documents.  It would simplify things a bit in non-UTF-16 
> environments (as compared to my interpretation of the current text) 
> without much added complexity in UTF-16 environments.

Ok, done.
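
As a rough illustration of the rule agreed on above, here is a minimal Python sketch in which any numeric character reference that names a surrogate code point becomes U+FFFD. The function names are hypothetical, and this is not the full HTML5 algorithm: other special cases in the spec (such as U+0000 or the Windows-1252 remapping) are deliberately left out.

```python
import re

REPLACEMENT = "\uFFFD"

# Matches &#xHHHH; (hex) or &#DDDD; (decimal); names below are illustrative only.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def resolve_numeric_reference(code_point: int) -> str:
    """Map a referenced code point to a character, replacing surrogates."""
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are not Unicode scalar values.
        return REPLACEMENT
    if code_point > 0x10FFFF:
        # Outside the Unicode code space altogether.
        return REPLACEMENT
    return chr(code_point)


def expand_numeric_references(text: str) -> str:
    """Expand every numeric character reference in text."""
    def _sub(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return resolve_numeric_reference(code_point)

    return _NUMERIC_REF.sub(_sub, text)
```

Under this sketch, `&#xD800;&#xDC00;` expands to two U+FFFD characters rather than being combined into U+10000, which is the behaviour discussed in the quoted text.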

> > The spec says "Bytes or sequences of bytes in the original byte stream 
> > that could not be converted to Unicode characters must be converted to 
> > U+FFFD REPLACEMENT CHARACTER code points".
> 
> I take it you mean that \xD800&#xDC00; should turn into \xFFFD&#xDC00; at this
> point, which is only supported by the quoted text if "bytes or sequences of
> bytes" representing surrogates "[cannot] be converted to Unicode characters"
> or, to put it differently, if surrogates are not "Unicode characters".

Correct. Surrogates aren't Unicode characters.
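
For the byte-stream side of this, a small Python sketch may help. It assumes the input is labelled UTF-8 and contains the three bytes that the UTF-8 bit layout would produce for U+D800. Python's built-in decoder treats those bytes as undecodable, so lenient decoding yields only U+FFFD. How many U+FFFD characters come out per bad sequence is a decoder detail and not something the quoted spec text fixes; the helper names here are hypothetical.

```python
# U+D800 pushed through the UTF-8 bit layout; not valid UTF-8.
ENCODED_SURROGATE = b"\xED\xA0\x80"


def decode_strict(data: bytes) -> str:
    """Decode as UTF-8, raising UnicodeDecodeError on an encoded surrogate."""
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return data.decode("utf-8", errors="replace")
```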

> Unfortunately for this reading, the term "Unicode character" does not 
> seem to be defined in HTML5 or in Unicode,

I've added a definition to HTML5. The proper Unicode term is "Unicode 
scalar value", apparently.
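
The distinction can be stated as a one-line predicate. The following Python sketch reflects Unicode's definition of a scalar value (any code point except the surrogate range); the function name is just an illustrative choice.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """True for code points U+0000..U+10FFFF excluding surrogates U+D800..U+DFFF."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)
```

So U+D800 and U+DFFF are code points but not scalar values, while U+D7FF, U+E000 and U+1047E are both.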

> and the following paragraph (which appears shortly after the one you 
> quoted) clearly includes surrogate code points within the concept of 
> "Unicode character":
> 
> "Any occurrences of any characters in the ranges [...] U+D800 to U+DFFF, 
> [...] are parse errors. (These are all control characters or permanently 
> undefined Unicode characters.)"
> 
> Moreover, this paragraph would be pointless if the characters mentioned 
> therein could never occur at all.

I've changed the text to refer to "code points" when it talks about 
surrogate code points.

> The use of "Unicode character" without a definition is fine in other 
> parts of HTML5, but clearly not sufficiently precise in this instance. 
> If you want to exclude (unpaired) surrogate code points only, the 
> appropriate term to use would probably be "Unicode scalar value".

Yeah. Fixed.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'