[whatwg] Encoding: lone surrogates and utf-8, utf-16be, and utf-16le encoders

Wed Sep 4 11:38:44 PDT 2013

On Wed, Sep 4, 2013 at 4:36 AM, Anne van Kesteren <annevk at annevk.nl> wrote:
> The way the utf-8, utf-16be, and utf-16le encoders are written is that
> they accept code points (not code units). If the code points are in
> the surrogate range, they raise an error.
>
> That seems problematic. The utf-8, utf-16be, and utf-16le encoders
> are assumed to be safe, because callers typically forget about lone
> surrogates.
>
> The API deals with this by having the [EnsureUTF16] flag which
> converts lone surrogates into U+FFFD. So by the time code points hit
> the encoder they're no longer in the lone surrogate range.
>
> Gecko, however, has not implemented this for utf-16be and utf-16le, but
> has for utf-8. (Or maybe the utf-8 encoder is better.) For now I'll
> assume this is a bug in Gecko.
>
>
> I can see several options for potentially improving this setup, but I
> need some feedback before going there:
>
> 1. Require Unicode scalar value input for encoders, and guarantee it
> as decoder output.
> 2. Change the utf-8, utf-16be, and utf-16le encoders to emit the byte
> sequence for U+FFFD rather than raise an error for input in the lone
> surrogate range. This would simplify the API and other callers to the
> utf-8, utf-16be, and utf-16le encoders as they no longer need to worry
> about them terminating with failure.
> 3. Move towards defining the entire platform in terms of 16-bit code
> units and forget about the nicer theoretical model of Unicode scalar
> values.

I prefer option 2. CSS is now defined to do the same thing when
parsed (nulls, lone surrogates, and out-of-range code points are all
converted to U+FFFD).
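
As a rough illustration of what option 2 would mean, here is a minimal
Python sketch. It is not the Encoding spec's algorithm or any browser's
implementation. Python strings are sequences of code points, and its
strict utf-8 and utf-16 codecs reject surrogate code points, which is
close to the "raise an error" behaviour described above. The helper
names (replace_surrogates, encode_lossy, strict_encode_fails) are
hypothetical and exist only for this example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_surrogates(text: str) -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) with U+FFFD."""
    return "".join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


def encode_lossy(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical option-2 style encoder: never fails on surrogates."""
    return replace_surrogates(text).encode(encoding)


def strict_encode_fails(text: str, encoding: str = "utf-8") -> bool:
    """Return True if Python's strict encoder rejects the input."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False
```

With this sketch, a string such as "a\ud800b" makes the strict encoder
fail, while encode_lossy emits the bytes for U+FFFD in place of the
surrogate (EF BF BD in utf-8, FF FD in utf-16be, FD FF in utf-16le), so
callers no longer have to handle an encoder failure.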

~TJ