[whatwg] StringEncoding open issues

Mon Aug 6 14:50:56 PDT 2012

On Mon, Aug 6, 2012 at 11:29 AM, Joshua Bell <jsbell at chromium.org> wrote:
> Regarding the API proposal at: http://wiki.whatwg.org/wiki/StringEncoding
>
> It looks like we've got some developer interest in implementing this, and
> need to nail down the open issues. I encourage folks to look over the
> "Resolved" issues in the wiki page and make sure the resolutions - gathered
> from loose consensus here and offline discussion - are truly resolved or if
> anything is not future-proof and should block implementations from
> proceeding. Also, look at the "Notes to Implementers" section; this should
> be non-controversial but may be non-obvious.
>
> This leaves two open issues: behavior on encoding error, and handling of
> Byte Order Marks (BOMs)
>
> == Encoding Errors ==
>
> The proposal builds on Anne's
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
> which defines when encodings should emit an encoder error. In that spec
> (which describes the existing behavior of Web browsers) encoders are used
> in a limited fashion, e.g. for encoding form results before submission via
> HTTP, and hence the cases are much more restricted than the errors
> encountered when browsers are asked to decode content from the wild. As
> noted, the encoding process could terminate when an error is emitted.
> Alternately (and as is necessary for forms, etc) there is a
> use-case-specific escaping mechanism for non-encodable code points.
>
> The proposed TextDecoder object takes a TextDecoderOptions options with a
> |fatal| flag that controls the decode behavior in case of error - if
> |fatal| is unset (default) a decode error produces a fallback character
> (U+FFFD); if |fatal| is set then a DOMException is raised instead.
>
> No such option is currently proposed for the TextEncoder object; the
> proposal dictates that a DOMException is thrown if the encoder emits an
> error. I believe this is sufficient for V1, but want feedback. For V2 (or
> now, if desired), the API could be extended to accept an options object
> allowing for some/all of these cases;

Not introducing options for the encoder for V1 sounds like a good idea
to me. However, I would definitely prefer that the default for encoding
match the default for decoding and use replacement characters rather
than throw an exception.
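
For illustration, here is a minimal sketch of the two decode behaviors
described in the quoted text. It assumes the TextDecoder that later
shipped in browsers and Node.js, which may differ from the 2012 proposal
in details such as the constructor form and the exception type. The
variable names are made up for the example.

```javascript
// Assumption: a runtime with the shipped TextDecoder (modern browser or Node.js).
const invalid = new Uint8Array([0x41, 0xFF, 0x42]); // 0xFF is never valid in UTF-8

// Default (fatal unset): the undecodable byte becomes U+FFFD.
const lenient = new TextDecoder('utf-8').decode(invalid);

// fatal set: the same input throws instead of producing a fallback character.
let threw = false;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(invalid);
} catch (e) {
  threw = true;
}
```

Here lenient holds "A", U+FFFD, "B", and threw ends up true.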

This also matches the WebSocket spec, which recently changed from
throwing to using replacement characters for encoding.

The reason WebSocket was changed is that it's relatively easy to make a
mistake that cuts a UTF-16 surrogate pair in two, which results in an
invalidly encoded DOMString. The problem is that this is very data
dependent, so it might not happen on the developer's computer, but only
in the wild when people write text that uses non-BMP characters. In
such cases throwing an exception will likely result in more breakage
than using a replacement character.
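
A small sketch of how such a split can happen, and what a
replacement-character default does with it. This assumes the TextEncoder
that later shipped in browsers and Node.js (UTF-8 output); the sample
string is made up.

```javascript
// Assumption: a runtime with the shipped TextEncoder (modern browser or Node.js).
const text = 'a\u{1F600}b';   // U+1F600 is outside the BMP: two UTF-16 code units
const cut = text.slice(0, 2); // 'a' plus a lone high surrogate (0xD83D)

// The lone surrogate cannot be encoded, so U+FFFD is encoded in its place.
const bytes = new TextEncoder().encode(cut);
// bytes: 0x61, then 0xEF 0xBF 0xBD (U+FFFD in UTF-8)
```

With pure-ASCII test data the slice is harmless, which is why the bug
tends to show up only with real user input.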

> * Don't throw, instead emit a standard/encoding-specific replacement
> character (e.g. '?')

Yes, using the replacement character sounds good to me.

> * Don't throw, instead emit a fixed placeholder character (byte?) sequence
> * Don't throw, instead call a user-defined callback and allow it to produce
> a replacement "escaped" character sequence, e.g. "&#xXXXX;"
>
> The latter seems the most flexible (superset of the rest) but is probably
> overkill for now. Since it can be added in easily later, can we defer until
> we have implementer and user feedback?

Indeed, we can explore these options if the need arises.

> == Byte Order Marks (BOMs) ==
>
> Once again, the proposal builds on Anne's
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
> which describes the existing behavior of Web browsers. In the wild,
> browsers deal with a variety of mechanisms for indicating the encoding of
> documents (server headers, meta tags, XML preludes, etc), many of which are
> blatantly incorrect or contradictory. One form is fortunately rarely wrong
> - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
> the byte order mark (the encoding-specific serialization of U+FEFF). This
> is built into the Encoding spec - given a byte sequence to decode and an
> encoding label, the label is ignored if the sequence starts with one of the
> three UTF BOMs, and the BOM-indicated encoding is used to decode the rest
> of the stream.
>
> The proposed API will have different uses, so it is unclear that this is
> necessary or desirable.
>
> At a minimum, it is clear that:
>
> * If one of the UTF encodings is specified AND the BOM matches then the
> leading BOM character (U+FEFF) MUST NOT be emitted in the output character
> sequence (i.e. it is silently consumed)

Agreed.

> Less clear is this behavior in these two cases.
>
> * If one of the UTF encodings is specified AND a different BOM is
> present (e.g. UTF-16LE but a UTF-16BE BOM)
> * If one of the non-UTF encodings is specified AND a UTF BOM is present
>
> Options include:
> * Nothing special - decoder does what it will with the bytes, possibly
> emitting garbage, possibly throwing
> * Raise a DOMException
> * Switch the decoder from the user-specified encoding to the BOM-specified
> encoding
>
> The latter seems the most helpful when the proposed API is used as follows:
>
> var s = TextDecoder().decode(bytes); // handles UTF-8 w/o BOM and any UTF w/ BOM
>
> ... but it does seem a little weird when used like this;
>
> var d = TextDecoder('euc-jp');
> assert(d.encoding === 'euc-jp');
> var s = d.decode(new Uint8Array([0xFE]), {stream: true});
> assert(d.encoding === 'euc-jp');
> assert(s.length === 0); // can't emit anything until BOM is definitely passed
> s += d.decode(new Uint8Array([0xFF]), {stream: true});
> assert(d.encoding === 'utf-16be'); // really?

I would add the case of "no encoding was specified and a BOM was
detected" to the "clear" category. I.e., I think we clearly should
honor the BOM here.

As for what to do if an encoding is specified (UTF or otherwise) and a
different BOM is detected, I agree that is much less clear. I don't
have a strong opinion on whether it should throw or attempt to decode
(possibly into garbage). But silently ignoring the specified encoding
feels strange. In Gecko we decided to throw, since that leaves the
most options for adjusting to spec changes.
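
To make the two positions above concrete, here is a rough sketch of
"honor the BOM when no encoding is given, throw on a mismatch". The
helpers sniffBOM and decodeStrict are hypothetical names invented for
this example; they are not part of the proposal or of any
implementation, and they sit on top of the TextDecoder that later
shipped in browsers and Node.js.

```javascript
// Hypothetical helper: returns the encoding a leading BOM indicates, or null.
function sniffBOM(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) return 'utf-16be';
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) return 'utf-16le';
  return null;
}

// Hypothetical wrapper: with no label, the BOM (if any) picks the encoding;
// with a label, a BOM for a different encoding is treated as an error.
function decodeStrict(bytes, label) {
  const bom = sniffBOM(bytes);
  if (label === undefined) {
    return new TextDecoder(bom || 'utf-8').decode(bytes);
  }
  const decoder = new TextDecoder(label);
  if (bom !== null && bom !== decoder.encoding) {
    throw new Error('BOM indicates ' + bom + ' but ' + decoder.encoding + ' was specified');
  }
  return decoder.decode(bytes); // a matching BOM is consumed by the decoder
}
```

For example, decodeStrict(new Uint8Array([0xFF, 0xFE, 0x41, 0x00]))
decodes as UTF-16LE based on the BOM alone, while passing 'utf-16le'
together with bytes that start with 0xFE 0xFF throws.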

/ Jonas