[whatwg] StringEncoding open issues
Joshua Bell
jsbell at chromium.org
Tue Aug 14 10:34:51 PDT 2012
On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn at zewt.org> wrote:
> I agree with Jonas that encoding should just use a replacement character
> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
> other modes (eg. exceptions and user-specified replacement characters)
> until there's a clear need.
>
> My intuition is that encoding DOMString to UTF-16 should never have errors;
> if there are dangling surrogates, pass them through unchanged. There's no
> point in using a placeholder that says "an error occurred here", when the
> error can be passed through in exactly the same form (not possible with eg.
> DOMString->SJIS). I don't feel strongly about this only because outputting
> UTF-16 is so rare to begin with.
>
> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell at chromium.org> wrote:
>
> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
> > the byte order mark (the encoding-specific serialization of U+FEFF).
>
>
> This rarely detects the wrong type, but that doesn't mean it's not the
> wrong answer. If my input is meant to be UTF-8, and someone hands me
> BOM-marked UTF-16, I want it to fail in the same way it would if someone
> passed in SJIS. I don't want it silently translated.
>
> On the other hand, it probably does make sense for UTF-16 to switch to
> UTF-16BE, since that's by definition the original purpose of the BOM.
>
> The convention iconv uses, which I think is a useful one, is decoding from
> "UTF-16" means "try to figure out the encoding from the BOM, if any", and
> "UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".
Let me take a crack at making this into an algorithm:
In the TextDecoder constructor:
- If encoding is not specified, set an internal useBOM flag.
- If encoding is specified and is a case-insensitive match for "utf-16",
set an internal useBOM flag.
NOTE: This means that if "utf-8", "utf-16le" or "utf-16be" is explicitly
specified, the flag is not set.
When decode() is called:
- If useBOM is set and the stream offset is 0, then:
  - If there are not enough bytes to test for a BOM, then return without
emitting anything. (NOTE: if not streaming, an EOF byte would be present in
the stream, which would be a negative match for a BOM.)
  - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
0xFF, then set the current encoding to "utf-16" or "utf-16be" respectively
and advance the stream past the BOM. The current encoding is used until the
stream is reset.
  - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
0xBB 0xBF, then set the current encoding to "utf-16", "utf-16be" or "utf-8"
respectively and advance the stream past the BOM. The current encoding is
used until the stream is reset.
- Otherwise, if useBOM is not set and the stream offset is 0, then if the
encoding is "utf-8", "utf-16" or "utf-16be":
  - If the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF,
then let the detected encoding be "utf-16", "utf-16be" or "utf-8"
respectively. If the detected encoding matches the object's encoding,
advance the stream past the BOM. Otherwise, if the fatal flag is set, throw
an "EncodingError" DOMException. Otherwise, the decoding algorithm proceeds.
  - If there are not enough bytes to test for a BOM, then return without
emitting anything. (NOTE: if not streaming, an EOF byte would be inserted,
which would be a negative match for a BOM.)
Working the "current encoding" switcheroo into the spec will require some
refactoring, so trying to get consensus here first.
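The BOM-sniffing step above could be sketched roughly like this (a
hypothetical sketch in JavaScript; sniffBOM and the return shape are
illustrative names, not spec text, and the streaming "not enough bytes yet"
case is omitted for brevity):

```javascript
// Returns { encoding, bomLength } if the stream starts with a BOM,
// or null if the leading bytes are a negative match for a BOM.
function sniffBOM(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return { encoding: "utf-8", bomLength: 3 };
  }
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
      return { encoding: "utf-16", bomLength: 2 };   // little-endian
    }
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
      return { encoding: "utf-16be", bomLength: 2 }; // big-endian
    }
  }
  return null;
}
```

The decoder would then compare the sniffed encoding against its own (or
switch to it, if useBOM is set) and advance the stream past bomLength bytes.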
In English:
- Create a decoder with TextDecoder() and, if present, a BOM will be
respected (and consumed); otherwise default to UTF-8.
- Create a decoder with TextDecoder("utf-16") and either a UTF-16LE or
UTF-16BE BOM will be respected (and consumed); otherwise default to UTF-16LE
(which may decode garbage if a UTF-8 BOM or other non-UTF-16 data is present).
- Create a decoder with TextDecoder("utf-8",
{fatal:true}), TextDecoder("utf-16le", {fatal:true}), or
TextDecoder("utf-16be", {fatal:true}) and a matching BOM will be consumed;
a mismatching BOM will throw an EncodingError.
- Create a decoder with TextDecoder("utf-8"), TextDecoder("utf-16le"), or
TextDecoder("utf-16be") and a matching BOM will be consumed; a mismatching
BOM will be blithely decoded (probably giving you replacement characters),
but will not throw.
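The matching-BOM cases can be exercised directly, assuming an implementation
that consumes a matching BOM as proposed (the mismatching-BOM EncodingError
is the behavior under discussion and is not demonstrated here):

```javascript
// UTF-8 BOM followed by "A": the default decoder consumes the BOM.
const utf8Bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]);
const s1 = new TextDecoder().decode(utf8Bytes);

// UTF-16LE BOM followed by "A": a matching BOM is consumed, not
// emitted as a leading U+FEFF character.
const utf16Bytes = new Uint8Array([0xFF, 0xFE, 0x41, 0x00]);
const s2 = new TextDecoder("utf-16le").decode(utf16Bytes);
// Both s1 and s2 are just "A".
```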
> > * If one of the UTF encodings is specified AND the BOM matches then the
> > leading BOM character (U+FEFF) MUST NOT be emitted in the output
> character
> > sequence (i.e. it is silently consumed)
> >
>
> It's a little weird that
>
> data = readFile("user-supplied-file.txt"); // shortcutting for brevity
> var s = new TextDecoder("utf-16").decode(data); // or utf-8
> s = s.replace("a", "b");
> var data2 = new TextEncoder("utf-16").encode(s);
> writeFile("user-supplied-file.txt", data2);
>
> causes the BOM to be quietly stripped away. Normally if you're modifying a
> file, you want to pass through the BOM (or lack thereof) untouched.
>
> One way to deal with this could be:
>
> var decoder = new TextDecoder("utf-16");
> var s = decoder.decode(data);
> s = s.replace("a", "b");
> var data2 = new TextEncoder(decoder.encoding).encode(s);
>
> where decoder.encoding is eg. "UTF-16LE-BOM" if a BOM was present, thus
> preserving both the BOM and (for UTF-16) endianness. I don't actually like
> this, though, because I don't like the idea of decoder.encoding changing
> after the decoder has already been constructed.
>
> I think I agree with just stripping it, and people who want to preserve
> BOMs on write-through can jump the hoops manually (which aren't terribly
> hard).
>
This gets easier if we restrict to encoding UTF-8, which typically doesn't
include BOMs. But it's looking like there's enough desire to keep UTF-16
encoding for the moment. Agreed: just strip it for now.
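For the UTF-8 case, the manual hoop-jumping might look something like this
(a sketch; roundTripUTF8 is an illustrative name, not part of any API):

```javascript
// Remember whether the input began with a UTF-8 BOM, decode (which strips
// it), transform the string, then re-prepend the BOM on the way out.
function roundTripUTF8(data, transform) {
  const hadBOM = data.length >= 3 &&
      data[0] === 0xEF && data[1] === 0xBB && data[2] === 0xBF;
  let s = new TextDecoder("utf-8").decode(data); // BOM is stripped here
  s = transform(s);
  const body = new TextEncoder().encode(s);
  if (!hadBOM) return body;
  const out = new Uint8Array(body.length + 3);
  out.set([0xEF, 0xBB, 0xBF], 0); // restore the BOM
  out.set(body, 3);
  return out;
}
```

A file with no BOM passes through untouched, and a BOM-marked file keeps its
BOM across the decode/transform/encode round trip.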
> Another issue is "new TextDecoder('ascii').encoding" (and ISO-8859-1)
> giving .encoding = "windows-1252". That's strange, even when you know why
> it's happening.
>
> Is there any reason to expose the actual "primary" names? It's not clear
> that the "name" column in the Encoding spec is even intended to be exposed
> to APIs; they look more like labels for specs to refer to internally.
> (Anne?) If there's no pressing reason to expose this, I'd suggest that the
> .encoding attribute simply return the name that was passed to the
> constructor.
>
> It's still not ideal (it's weird that asking for ASCII gives you something
> other than ASCII in the first place), but it at least seems a bit less
> strange. The "nice" fix would be to implement actual ASCII, ISO-8859-1,
> ISO-8859-9, etc. charsets, but that just means extra implementation work
> (and some charset proliferation) without use cases.
>
Leaning towards simply dropping the attribute. Does anyone advocate for
keeping it?
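For reference, the strangeness in question, assuming the label mapping
described above where "ascii" and "iso-8859-1" are labels for windows-1252:

```javascript
// Asking for "ascii" yields a decoder whose canonical name leaks out
// as "windows-1252", since that is the encoding the label maps to.
const name = new TextDecoder("ascii").encoding;
```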