[whatwg] Surrogate pairs and character references
Ian Hickson
ian at hixie.ch
Wed Sep 16 02:40:42 PDT 2009
On Tue, 15 Sep 2009, Øistein E. Andersen wrote:
> >
> > I suppose we could just change the spec and say that surrogate
> > characters (whether literal characters, e.g. in UTF-8, or from
> > character references) all get converted to U+FFFD?
>
> That seems to be the only reasonable option if handling a pair of
> surrogate character references
> as U+FFFD U+FFFD is deemed desirable and sufficiently compatible with
> existing documents. It would simplify things a bit in non-UTF-16
> environments (as compared to my interpretation of the current text)
> without much added complexity in UTF-16 environments.
Ok, done.
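
As a rough illustration of that change (a sketch only, not the spec's algorithm; the function names are hypothetical, and the other special cases for numeric character references are deliberately ignored), surrogates from either source end up as U+FFFD:

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def is_surrogate(code_point: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def replace_surrogates(text: str) -> str:
    """Replace every literal surrogate code point in text with U+FFFD."""
    return "".join(REPLACEMENT if is_surrogate(ord(ch)) else ch for ch in text)


def resolve_numeric_reference(code_point: int) -> str:
    """Resolve a numeric character reference, mapping surrogates to U+FFFD.

    Assumption: only the surrogate case is handled here; a real parser
    has further special cases (e.g. out-of-range values).
    """
    if is_surrogate(code_point):
        return REPLACEMENT
    return chr(code_point)
```

Under this sketch, a high and a low surrogate given as two separate references each become U+FFFD independently, which yields the U+FFFD U+FFFD result discussed above.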
> > The spec says "Bytes or sequences of bytes in the original byte stream
> > that could not be converted to Unicode characters must be converted to
> > U+FFFD REPLACEMENT CHARACTER code points".
>
> I take it you mean that \xD800� should turn into \xFFFD� at this
> point, which is only supported by the quoted text if "bytes or sequences of
> bytes" representing surrogates "[cannot] be converted to Unicode characters"
> or, to put it differently, if surrogates are not "Unicode characters".
Correct. Surrogates aren't Unicode characters.
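
For what it's worth, a strict UTF-8 decoder already behaves this way for literal bytes. The sketch below uses Python's built-in decoder purely as an illustration (it is not the decoder any browser uses, and the helper name is hypothetical): the byte sequence ED A0 80, which would encode U+D800 if surrogates were allowed in UTF-8, cannot be converted and comes out as replacement characters.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode UTF-8, turning undecodable bytes into U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# ED A0 80 is what U+D800 would look like if UTF-8 permitted surrogates.
encoded_surrogate = b"\xed\xa0\x80"
decoded = decode_lenient(b"a" + encoded_surrogate + b"b")

# No surrogate code point survives decoding; only U+FFFD appears in its place.
has_surrogate = any(0xD800 <= ord(ch) <= 0xDFFF for ch in decoded)
```

How many U+FFFD characters are emitted for one bad sequence depends on the decoder, so the sketch checks only that no surrogate survives.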
> Unfortunately for this reading, the term "Unicode character" does not
> seem to be defined in HTML5 or in Unicode,
I've added a definition to HTML5. The proper Unicode term is "Unicode
scalar value", apparently.
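
Concretely, a Unicode scalar value is any code point from U+0000 to U+10FFFF outside the surrogate range U+D800 to U+DFFF. A minimal sketch of that check (the function name is hypothetical):

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """True if code_point is a Unicode scalar value.

    Scalar values are U+0000..U+D7FF and U+E000..U+10FFFF, i.e. every
    code point except the surrogates U+D800..U+DFFF.
    """
    if not 0 <= code_point <= 0x10FFFF:
        return False
    return not (0xD800 <= code_point <= 0xDFFF)
```

So U+D7FF and U+E000 are scalar values, while everything between them is a code point but not a scalar value.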
> and the following paragraph (which appears shortly after the one you
> quoted) clearly includes surrogate code points within the concept of
> "Unicode character":
>
> "Any occurrences of any characters in the ranges [...] U+D800 to U+DFFF,
> [...] are parse errors. (These are all control characters or permanently
> undefined Unicode characters.)"
>
> Moreover, this paragraph would be pointless if the characters mentioned
> therein could never occur at all.
I've changed the text to refer to "code points" when it talks about
surrogate code points.
> The use of "Unicode character" without a definition is fine in other
> parts of HTML5, but clearly not sufficiently precise in this instance.
> If you want to exclude (unpaired) surrogate code points only, the
> appropriate term to use would probably be "Unicode scalar value".
Yeah. Fixed.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'