[whatwg] Surrogate pairs and character references

Wed Sep 16 17:38:05 PDT 2009

It is much clearer now.  Thanks.  Just a few minor issues:

> "Bytes or sequences of bytes in the original byte stream that could  
> not be converted to Unicode characters must be converted to U+FFFD  
> REPLACEMENT CHARACTER code points."

With the new definition of Unicode characters as Unicode scalar
values, this sentence excludes surrogate code points, which are also
handled separately (and cause a parse error) in the step quoted below.
You may want to say "Unicode code points" rather than "Unicode characters".
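
To make the distinction concrete, here is a minimal Python sketch of the
two sets. The function names are hypothetical and only for illustration;
the ranges are those given by Unicode.

```python
def is_code_point(n: int) -> bool:
    # Every integer in the Unicode codespace is a code point.
    return 0x0000 <= n <= 0x10FFFF


def is_surrogate(n: int) -> bool:
    # Surrogate code points: U+D800 to U+DFFF.
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    # A scalar value is any code point that is not a surrogate.
    return is_code_point(n) and not is_surrogate(n)
```

Under these definitions U+D800 is a code point but not a scalar value,
which is why "Unicode characters" (scalar values) leaves surrogates out.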

"U+FFFD REPLACEMENT CHARACTERs" is sufficient, is used elsewhere, and
probably reads better than "U+FFFD REPLACEMENT CHARACTER code points".

> All U+0000 NULL characters and code points in the range U+D800 to
> U+DFFF in the input must be replaced by U+FFFD REPLACEMENT
> CHARACTERs. Any occurrences of such characters and code points are
> parse errors.
>
The phrase "characters and code points" (in the second sentence) is  
awkward given that all characters are in fact code points.
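
For reference, the replacement step quoted above could be sketched in
Python roughly as follows. This is only an illustration under the
assumption that the input is already a sequence of code points; the
names `preprocess` and `REPLACEMENT` are made up here, and counting
parse errors as an integer is a simplification.

```python
from typing import Tuple

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(text: str) -> Tuple[str, int]:
    """Replace U+0000 and surrogate code points with U+FFFD.

    Returns the cleaned text and the number of parse errors,
    one per replaced code point.
    """
    out = []
    parse_errors = 0
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)
            parse_errors += 1
        else:
            out.append(ch)
    return "".join(out), parse_errors
```

Note that the single test on `cp` covers both cases, which is the sense
in which U+0000 and the surrogates are all just code points.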

-- 
Øistein E. Andersen