[whatwg] Surrogate pairs and character references
Øistein E. Andersen
liszt at coq.no
Wed Sep 16 17:38:05 PDT 2009
It is much clearer now. Thanks. Just a few minor issues:
> "Bytes or sequences of bytes in the original byte stream that could
> not be converted to Unicode characters must be converted to U+FFFD
> REPLACEMENT CHARACTER code points."
With the new definition of Unicode characters as Unicode scalar
values, this wording excludes surrogate code points, which are also
handled separately (and cause a parse error) in the step quoted below.
You may want to say "Unicode code points" rather than "Unicode characters".
Also, "U+FFFD REPLACEMENT CHARACTERs" is sufficient, is used elsewhere, and
probably reads better than "U+FFFD REPLACEMENT CHARACTER code points".
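To make the terminology concrete, here is a minimal sketch in Python of the
distinction between code points and scalar values. The function names are
made up for this illustration and do not come from the spec.

```python
def is_code_point(n: int) -> bool:
    # Unicode code points span U+0000 to U+10FFFF inclusive.
    return 0 <= n <= 0x10FFFF


def is_surrogate(n: int) -> bool:
    # Surrogate code points occupy U+D800 to U+DFFF inclusive.
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    # A Unicode scalar value is any code point that is not a surrogate.
    return is_code_point(n) and not is_surrogate(n)
```

Under these definitions U+D800 is a code point but not a scalar value, which
is why wording based on "Unicode characters" no longer covers it.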
> All U+0000 NULL characters and code points in the range U+D800 to
> U+DFFF in the input must be replaced by U+FFFD REPLACEMENT
> CHARACTERs. Any occurrences of such characters and code points are
> parse errors.
>
The phrase "characters and code points" (in the second sentence) is
awkward given that all characters are in fact code points.
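The replacement step quoted above could be sketched roughly as follows in
Python. This is only an illustration under stated assumptions: the name
preprocess and the idea of returning a parse-error count are hypothetical,
and the input is assumed to be already decoded into a string that may
contain lone surrogates.

```python
from typing import Tuple

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(text: str) -> Tuple[str, int]:
    """Replace U+0000 and surrogates with U+FFFD; count parse errors."""
    out = []
    parse_errors = 0
    for ch in text:
        cp = ord(ch)
        # U+0000 NULL and any code point in U+D800..U+DFFF are replaced,
        # and each occurrence counts as one parse error.
        if cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)
            parse_errors += 1
        else:
            out.append(ch)
    return "".join(out), parse_errors
```

Since the single range test covers both cases, "such code points" would be
enough to describe everything the second sentence refers to.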
--
Øistein E. Andersen