[whatwg] StringEncoding open issues
jonas at sicking.cc
Fri Aug 17 00:23:18 PDT 2012
On Tue, Aug 14, 2012 at 10:34 AM, Joshua Bell <jsbell at chromium.org> wrote:
> On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn at zewt.org> wrote:
>> I agree with Jonas that encoding should just use a replacement character
>> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
>> other modes (eg. exceptions and user-specified replacement characters)
>> until there's a clear need.
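The replacement-character default described above is what the API eventually shipped with on the decode side; a minimal sketch using TextDecoder (a global in modern browsers and Node.js):

```javascript
// By default, invalid byte sequences decode to U+FFFD rather than throwing.
const decoder = new TextDecoder('utf-8'); // non-fatal (replacement) mode
const bytes = new Uint8Array([0x48, 0x69, 0xFF]); // "Hi" plus a byte invalid in UTF-8
console.log(decoder.decode(bytes)); // "Hi\uFFFD"
```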
>> My intuition is that encoding DOMString to UTF-16 should never have errors;
>> if there are dangling surrogates, pass them through unchanged. There's no
>> point in using a placeholder that says "an error occurred here", when the
>> error can be passed through in exactly the same form (not possible with eg.
>> DOMString->SJIS). I don't feel strongly about this only because outputting
>> UTF-16 is so rare to begin with.
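For contrast with the pass-through behavior suggested above: as the API later shipped, TextEncoder only emits UTF-8, and a dangling surrogate in the DOMString is converted to U+FFFD rather than passed through:

```javascript
// A lone high surrogate in a DOMString becomes U+FFFD (EF BF BD) in UTF-8 output.
const encoder = new TextEncoder();
const bytes = encoder.encode('a\uD800b'); // '\uD800' is an unpaired surrogate
console.log(Array.from(bytes)); // [0x61, 0xEF, 0xBF, 0xBD, 0x62]
```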
>> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell at chromium.org> wrote:
>> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
>> > the byte order mark (the encoding-specific serialization of U+FEFF).
>> This rarely detects the wrong type, but that doesn't mean it's not the
>> wrong answer. If my input is meant to be UTF-8, and someone hands me
>> BOM-marked UTF-16, I want it to fail in the same way it would if someone
>> passed in SJIS. I don't want it silently translated.
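The "fail rather than silently translate" behavior asked for here corresponds to the fatal mode the API eventually gained (one of the "other modes" being deferred in this thread); a sketch, assuming the shipped `{ fatal: true }` option:

```javascript
// A strict UTF-8 decoder throws on bytes that are not valid UTF-8,
// such as a UTF-16BE byte order mark, instead of emitting U+FFFD.
const strict = new TextDecoder('utf-8', { fatal: true });
const utf16beInput = new Uint8Array([0xFE, 0xFF, 0x00, 0x41]); // UTF-16BE "A" with BOM
let failed = false;
try {
  strict.decode(utf16beInput);
} catch (e) {
  failed = true; // TypeError: the BOM bytes are invalid UTF-8
}
console.log(failed); // true
```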
>> On the other hand, it probably does make sense for UTF-16 to switch to
>> UTF-16BE, since that's by definition the original purpose of the BOM.
>> The convention iconv uses, which I think is a useful one, is decoding from
>> "UTF-16" means "try to figure out the encoding from the BOM, if any", and
>> "UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".
> Let me take a crack at making this into an algorithm:
> In the TextDecoder constructor:
> - If encoding is not specified, set an internal useBOM flag
> - If encoding is specified and is a case-insensitive match for "utf-16",
> set an internal useBOM flag.
> NOTE: This means if "utf-8", "utf-16le" or "utf-16be" is explicitly
> specified the flag is not set.
> When decode() is called
> - If useBOM is set and the stream offset is 0, then
> - If there are not enough bytes to test for a BOM, then return without
> emitting anything (NOTE: if not streaming, an EOF byte would be present in
> the stream, which would be a negative match for a BOM)
> - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
> 0xFF, then set the current encoding to "utf-16" or "utf-16be" respectively
> and advance the stream past the BOM. The current encoding is used until the
> stream is reset.
> - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
> 0xBB 0xBF, then set the current encoding to "utf-16", "utf-16be", or "utf-8"
> respectively and advance the stream past the BOM. The current encoding is
> used until the stream is reset.
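The quoted steps can be sketched as a small helper (hypothetical, not a real API); this reads the "Otherwise" branch as applying only when no encoding was specified, and assumes a UTF-8 default when nothing matches:

```javascript
// Sketch of the proposed BOM sniffing: returns the encoding to use and
// how many BOM bytes to skip. "utf-16" without a qualifier means little-endian.
function sniff(requested, bytes) {
  const bomLE = bytes[0] === 0xFF && bytes[1] === 0xFE;
  const bomBE = bytes[0] === 0xFE && bytes[1] === 0xFF;
  const bom8 = bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  if (requested && requested.toLowerCase() === 'utf-16') {
    if (bomLE) return { encoding: 'utf-16', skip: 2 };
    if (bomBE) return { encoding: 'utf-16be', skip: 2 };
    return { encoding: 'utf-16', skip: 0 };
  }
  if (requested) return { encoding: requested, skip: 0 }; // useBOM flag not set
  if (bomLE) return { encoding: 'utf-16', skip: 2 };
  if (bomBE) return { encoding: 'utf-16be', skip: 2 };
  if (bom8) return { encoding: 'utf-8', skip: 3 };
  return { encoding: 'utf-8', skip: 0 }; // assumed default when no BOM
}
console.log(sniff('utf-16', new Uint8Array([0xFE, 0xFF])).encoding); // "utf-16be"
```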
This doesn't sound right. The effect of the rules so far would be that
if you create a decoder and specify "utf-16" as encoding, and the
first bytes in the stream are 0xEF 0xBB 0xBF, you'd silently switch to
UTF-8 decoding.