[whatwg] StringEncoding open issues

Fri Aug 17 00:23:18 PDT 2012

On Tue, Aug 14, 2012 at 10:34 AM, Joshua Bell <jsbell at chromium.org> wrote:
> On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn at zewt.org> wrote:
>
>> I agree with Jonas that encoding should just use a replacement character
>> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
>> other modes (eg. exceptions and user-specified replacement characters)
>> until there's a clear need.
>>
>> My intuition is that encoding DOMString to UTF-16 should never have errors;
>> if there are dangling surrogates, pass them through unchanged.  There's no
>> point in using a placeholder that says "an error occurred here", when the
>> error can be passed through in exactly the same form (not possible with eg.
>> DOMString->SJIS).  I don't feel strongly about this only because outputting
>> UTF-16 is so rare to begin with.
>>
>> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell at chromium.org> wrote:
>>
>> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
>> > the byte order mark (the encoding-specific serialization of U+FEFF).
>>
>>
>> This rarely detects the wrong type, but that doesn't mean it's not the
>> wrong answer.  If my input is meant to be UTF-8, and someone hands me
>> BOM-marked UTF-16, I want it to fail in the same way it would if someone
>> passed in SJIS.  I don't want it silently translated.
>>
>> On the other hand, it probably does make sense for UTF-16 to switch to
>> UTF-16BE, since that's by definition the original purpose of the BOM.
>>
>> The convention iconv uses, which I think is a useful one, is decoding from
>> "UTF-16" means "try to figure out the encoding from the BOM, if any", and
>> "UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".
>
>
> Let me take a crack at making this into an algorithm:
>
> In the TextDecoder constructor:
>
>    - If encoding is not specified, set an internal useBOM flag
>    - If encoding is specified and is a case-insensitive match for "utf-16",
>    set an internal useBOM flag.
>
> NOTE: This means if "utf-8", "utf-16le" or "utf-16be" is explicitly
> specified the flag is not set.
>
> When decode() is called:
>
>    - If useBOM is set and the stream offset is 0, then
>       - If there are not enough bytes to test for a BOM then return without
>       emitting anything (NOTE: if not streaming an EOF byte would be present in
>       the stream which would be a negative match for a BOM)
>       - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
>       0xFF then set current encoding to "utf-16" or "utf-16be" respectively and
>       advance the stream past the BOM. The current encoding is used until the
>       stream is reset.
>       - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
>       0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
>       respectively and advance the stream past the BOM. The current encoding is
>       used until the stream is reset.

This doesn't sound right. The effect of the rules so far would be that
if you create a decoder and specify "utf-16" as encoding, and the
first bytes in the stream are 0xEF 0xBB 0xBF you'd silently switch to
"utf-8" decoding.
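
To make the concern concrete, here is a rough sketch of the quoted rules as
I read them, in Python for brevity. The names (use_bom_flag, sniff, the
literal parameter) are made up for illustration and are not part of any
proposed API. The "not enough bytes yet" streaming step is left out, and the
literal=False branch is only one guess at what a stricter rule could look
like, not something the proposal states.

```python
from typing import Optional, Tuple

# BOM tables taken from the quoted proposal. "utf-16" is the proposal's own
# name for the encoding selected by the 0xFF 0xFE mark.
UTF16_BOMS = [(b"\xff\xfe", "utf-16"), (b"\xfe\xff", "utf-16be")]
ALL_BOMS = UTF16_BOMS + [(b"\xef\xbb\xbf", "utf-8")]


def use_bom_flag(label: Optional[str]) -> bool:
    """Constructor step: set the flag if no encoding is given or it is "utf-16"."""
    return label is None or label.lower() == "utf-16"


def sniff(label: Optional[str], head: bytes,
          literal: bool = True) -> Tuple[Optional[str], int]:
    """Return (encoding to use, number of BOM bytes to skip) at stream offset 0.

    literal=True follows the quoted wording, where the "Otherwise" bullet is
    reached whenever the first bullet did not match.
    literal=False is a hypothetical stricter reading: an explicit "utf-16"
    only ever honours the two UTF-16 BOMs.
    """
    if not use_bom_flag(label):
        return label, 0  # "utf-8", "utf-16le", "utf-16be": never sniff
    if label is not None:  # flag set and a label given, so it is "utf-16"
        for bom, encoding in UTF16_BOMS:
            if head.startswith(bom):
                return encoding, len(bom)
        if not literal:
            return label, 0
    # The "Otherwise" bullet: any of the three BOMs switches the encoding.
    for bom, encoding in ALL_BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return label, 0  # no BOM: keep whatever was configured


# Literal reading: explicit "utf-16" plus a UTF-8 BOM silently becomes UTF-8.
print(sniff("utf-16", b"\xef\xbb\xbfhi"))
# Stricter reading (assumed): the UTF-8 BOM is not honoured.
print(sniff("utf-16", b"\xef\xbb\xbfhi", literal=False))
```

With the literal reading the first call reports "utf-8", which is the silent
switch described above; with the stricter reading the decoder stays on
"utf-16" and the bytes are left for the normal error handling to deal with.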

/ Jonas