[whatwg] Surrogate pairs and character references
Øistein E. Andersen
liszt at coq.no
Tue Sep 8 15:39:03 PDT 2009
According to the spec, character references may cause surrogate
characters (0xD800 to 0xDFFF) to be inserted into the DOM. Assuming
that the DOM is a UTF-16BE environment, &#xD800;&#xDC00; and
&#x10000; will both result in \xD800\xDC00, i.e. U+10000. This should
probably be pointed out explicitly, since extra processing has to be
done to achieve the same result in a parser that is not built atop
UTF-16BE.
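
To illustrate the extra processing meant here, this is a minimal sketch
of what a parser that stores whole code points (say, UTF-32 or UTF-8
internally) might have to do after expanding character references. The
function name merge_surrogate_pairs is hypothetical and not taken from
the spec, and leaving unpaired surrogates untouched is an assumption
made only for this example.

```python
def is_high_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp: int) -> bool:
    return 0xDC00 <= cp <= 0xDFFF


def merge_surrogate_pairs(code_points: list[int]) -> list[int]:
    """Combine each high surrogate that is immediately followed by a
    low surrogate into the supplementary code point a UTF-16 DOM
    would see. Unpaired surrogates are passed through unchanged
    (an assumption of this sketch, not a rule from the spec)."""
    out = []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        has_next = i + 1 < len(code_points)
        if is_high_surrogate(cp) and has_next and is_low_surrogate(code_points[i + 1]):
            low = code_points[i + 1]
            out.append(0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            out.append(cp)
            i += 1
    return out


# &#xD800;&#xDC00; expands to two code points, &#x10000; to one.
from_pair = merge_surrogate_pairs([0xD800, 0xDC00])
from_single = merge_surrogate_pairs([0x10000])
```

With this step, both spellings end up as the single code point U+10000,
which is what a UTF-16-based DOM gives without any extra work.
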
Furthermore, it is not entirely clear whether a mixed form like
\xD800&#xDC00; encoded in UTF-16BE should give \xD800\xDC00 or \xFFFD
\xDC00. Not all browsers convert unpaired surrogates in UTF-16 to
U+FFFD, so the mixed form may be interpreted as U+10000.
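
The two possible outcomes for the mixed form can be modelled as below.
This is only a sketch under stated assumptions: the input is treated as
a list of 16-bit code units, the helper names (decode_utf16_units,
mixed_form) are hypothetical, and the three-step pipeline is a
simplification that does not describe any particular browser.

```python
def decode_utf16_units(units: list[int], replace_lone_surrogates: bool) -> list[int]:
    """Turn UTF-16 code units into code points. Valid pairs are always
    combined; an unpaired surrogate becomes U+FFFD only if requested."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
            continue
        if 0xD800 <= u <= 0xDFFF and replace_lone_surrogates:
            u = 0xFFFD
        out.append(u)
        i += 1
    return out


def mixed_form(replace_lone_surrogates: bool) -> list[int]:
    # Step 1: decode the byte stream, which holds only the literal \xD800.
    decoded = decode_utf16_units([0xD800], replace_lone_surrogates)
    # Step 2: the character reference &#xDC00; appends U+DC00 to the text.
    dom_units = decoded + [0xDC00]
    # Step 3: a UTF-16 DOM reads an adjacent high+low pair as one character.
    return decode_utf16_units(dom_units, replace_lone_surrogates=False)


lenient = mixed_form(replace_lone_surrogates=False)
strict = mixed_form(replace_lone_surrogates=True)
```

Here lenient is [0x10000], while strict is [0xFFFD, 0xDC00], so the
result depends entirely on whether the decoder replaces the unpaired
\xD800 before the character reference is expanded.
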
--
Øistein E. Andersen