[whatwg] StringEncoding open issues
Joshua Bell
jsbell at chromium.org
Tue Aug 14 10:34:51 PDT 2012
On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn at zewt.org> wrote:
> I agree with Jonas that encoding should just use a replacement character
> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
> other modes (eg. exceptions and user-specified replacement characters)
> until there's a clear need.
>
> My intuition is that encoding DOMString to UTF-16 should never have errors;
> if there are dangling surrogates, pass them through unchanged. There's no
> point in using a placeholder that says "an error occurred here", when the
> error can be passed through in exactly the same form (not possible with eg.
> DOMString->SJIS). I don't feel strongly about this only because outputting
> UTF-16 is so rare to begin with.
>
> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell at chromium.org> wrote:
>
> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
> > the byte order mark (the encoding-specific serialization of U+FEFF).
>
>
> This rarely detects the wrong type, but that doesn't mean it's not the
> wrong answer. If my input is meant to be UTF-8, and someone hands me
> BOM-marked UTF-16, I want it to fail in the same way it would if someone
> passed in SJIS. I don't want it silently translated.
>
> On the other hand, it probably does make sense for UTF-16 to switch to
> UTF-16BE, since that's by definition the original purpose of the BOM.
>
> The convention iconv uses, which I think is a useful one, is decoding from
> "UTF-16" means "try to figure out the encoding from the BOM, if any", and
> "UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".
Let me take a crack at making this into an algorithm:
In the TextDecoder constructor:
- If encoding is not specified, set an internal useBOM flag.
- If encoding is specified and is a case-insensitive match for "utf-16",
set an internal useBOM flag.
NOTE: This means that if "utf-8", "utf-16le" or "utf-16be" is explicitly
specified, the flag is not set.
When decode() is called:
- If useBOM is set and the stream offset is 0, then:
  - If there are not enough bytes to test for a BOM, then return without
emitting anything. (NOTE: if not streaming, an EOF byte would be present in
the stream, which would be a negative match for a BOM.)
  - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
0xFF, then set the current encoding to "utf-16" or "utf-16be" respectively
and advance the stream past the BOM. The current encoding is used until the
stream is reset.
  - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
0xBB 0xBF, then set the current encoding to "utf-16", "utf-16be" or "utf-8"
respectively and advance the stream past the BOM. The current encoding is
used until the stream is reset.
- Otherwise, if useBOM is not set and the stream offset is 0, then if the
encoding is "utf-8", "utf-16" or "utf-16be":
  - If the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF,
then let the detected encoding be "utf-16", "utf-16be" or "utf-8"
respectively. If the detected encoding matches the object's encoding,
advance the stream past the BOM. Otherwise, if the fatal flag is set, throw
an "EncodingError" DOMException. Otherwise, the decoding algorithm proceeds.
  - If there are not enough bytes to test for a BOM, then return without
emitting anything. (NOTE: if not streaming, an EOF byte would be inserted,
which would be a negative match for a BOM.)
Working the "current encoding" switcheroo into the spec will require some
refactoring, so trying to get consensus here first.
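The BOM-sniffing step above could be sketched roughly like this (a
hypothetical sketch in JavaScript; sniffBOM and the return shape are
illustrative names, not spec text, and the streaming "not enough bytes yet"
case is omitted for brevity):

```javascript
// Returns { encoding, bomLength } if the stream starts with a BOM,
// or null if the leading bytes are a negative match for a BOM.
function sniffBOM(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return { encoding: "utf-8", bomLength: 3 };
  }
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
      return { encoding: "utf-16", bomLength: 2 };   // little-endian
    }
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
      return { encoding: "utf-16be", bomLength: 2 }; // big-endian
    }
  }
  return null;
}
```

The decoder would then compare the sniffed encoding against its own (or
switch to it, if useBOM is set) and advance the stream past bomLength bytes.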
In English:
- Create a decoder with TextDecoder() and, if present, a BOM will be
respected (and consumed); otherwise default to UTF-8.
- Create a decoder with TextDecoder("utf-16") and either a UTF-16LE or
UTF-16BE BOM will be respected (and consumed); otherwise default to UTF-16LE
(which may decode garbage if a UTF-8 BOM or other non-UTF-16 data is present).
- Create a decoder with TextDecoder("utf-8",
{fatal:true}), TextDecoder("utf-16le", {fatal:true}), or
TextDecoder("utf-16be", {fatal:true}) and a matching BOM will be consumed;
a mismatching BOM will throw an EncodingError.
- Create a decoder with TextDecoder("utf-8"), TextDecoder("utf-16le"), or
TextDecoder("utf-16be") and a matching BOM will be consumed; a mismatching
BOM will be blithely decoded (probably giving you replacement characters),
but will not throw.
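The matching-BOM cases can be exercised directly, assuming an implementation
that consumes a matching BOM as proposed (the mismatching-BOM EncodingError
is the behavior under discussion and is not demonstrated here):

```javascript
// UTF-8 BOM followed by "A": the default decoder consumes the BOM.
const utf8Bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]);
const s1 = new TextDecoder().decode(utf8Bytes);

// UTF-16LE BOM followed by "A": a matching BOM is consumed, not
// emitted as a leading U+FEFF character.
const utf16Bytes = new Uint8Array([0xFF, 0xFE, 0x41, 0x00]);
const s2 = new TextDecoder("utf-16le").decode(utf16Bytes);
// Both s1 and s2 are just "A".
```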
> > * If one of the UTF encodings is specified AND the BOM matches then the
> > leading BOM character (U+FEFF) MUST NOT be emitted in the output
> character
> > sequence (i.e. it is silently consumed)
> >
>
> It's a little weird that
>
> data = readFile("user-supplied-file.txt"); // shortcutting for brevity
> var s = new TextDecoder("utf-16").decode(data); // or utf-8
> s = s.replace("a", "b");
> var data2 = new TextEncoder("utf-16").encode(s);
> writeFile("user-supplied-file.txt", data2);
>
> causes the BOM to be quietly stripped away. Normally if you're modifying a
> file, you want to pass through the BOM (or lack thereof) untouched.
>
> One way to deal with this could be:
>
> var decoder = new TextDecoder("utf-16");
> var s = decoder.decode(data);
> s = s.replace("a", "b");
> var data2 = new TextEncoder(decoder.encoding).encode(s);
>
> where decoder.encoding is eg. "UTF-16LE-BOM" if a BOM was present, thus
> preserving both the BOM and (for UTF-16) endianness. I don't actually like
> this, though, because I don't like the idea of decoder.encoding changing
> after the decoder has already been constructed.
>
> I think I agree with just stripping it, and people who want to preserve
> BOMs on write-through can jump the hoops manually (which aren't terribly
> hard).
>
This gets easier if we restrict to encoding UTF-8, which typically doesn't
include BOMs. But it's looking like there's enough desire to keep UTF-16
encoding for the moment. Agreed: just strip it for now.
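For the UTF-8 case, the manual hoop-jumping might look something like this
(a sketch; roundTripUTF8 is an illustrative name, not part of any API):

```javascript
// Remember whether the input began with a UTF-8 BOM, decode (which strips
// it), transform the string, then re-prepend the BOM on the way out.
function roundTripUTF8(data, transform) {
  const hadBOM = data.length >= 3 &&
      data[0] === 0xEF && data[1] === 0xBB && data[2] === 0xBF;
  let s = new TextDecoder("utf-8").decode(data); // BOM is stripped here
  s = transform(s);
  const body = new TextEncoder().encode(s);
  if (!hadBOM) return body;
  const out = new Uint8Array(body.length + 3);
  out.set([0xEF, 0xBB, 0xBF], 0); // restore the BOM
  out.set(body, 3);
  return out;
}
```

A file with no BOM passes through untouched, and a BOM-marked file keeps its
BOM across the decode/transform/encode round trip.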
> Another issue is "new TextDecoder('ascii').encoding" (and ISO-8859-1)
> giving .encoding = "windows-1252". That's strange, even when you know why
> it's happening.
>
> Is there any reason to expose the actual "primary" names? It's not clear
> that the "name" column in the Encoding spec is even intended to be exposed
> to APIs; they look more like labels for specs to refer to internally.
> (Anne?) If there's no pressing reason to expose this, I'd suggest that the
> .encoding attribute simply return the name that was passed to the
> constructor.
>
> It's still not ideal (it's weird that asking for ASCII gives you something
> other than ASCII in the first place), but it at least seems a bit less
> strange. The "nice" fix would be to implement actual ASCII, ISO-8859-1,
> ISO-8859-9, etc. charsets, but that just means extra implementation work
> (and some charset proliferation) without use cases.
>
Leaning towards simply dropping the attribute. Does anyone advocate for
keeping it?
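For reference, the strangeness in question, assuming the label mapping
described above where "ascii" and "iso-8859-1" are labels for windows-1252:

```javascript
// Asking for "ascii" yields a decoder whose canonical name leaks out
// as "windows-1252", since that is the encoding the label maps to.
const name = new TextDecoder("ascii").encoding;
```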