[whatwg] API for encoding/decoding ArrayBuffers into text

Tue Mar 13 17:19:24 PDT 2012

Using Views instead of specifying the offset and length sounds good.

On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson <ian at hixie.ch> wrote:

>  - What's the use case for supporting anything but UTF-8?
>

Other Unicode encodings may be useful, to decode existing file formats
containing (most likely at a minimum) UTF-16.  I don't feel strongly about
that, though; we're stuck with UTF-16 as an internal representation in the
platform, but that doesn't necessarily mean we need to support it as a
transfer encoding.

For non-Unicode legacy encodings, I think that even if use cases exist,
they should be given more than the usual amount of scrutiny before being
supported.

On Tue, Mar 13, 2012 at 6:38 PM, Tab Atkins Jr. <jackalmage at gmail.com>wrote:

> Python throws errors by default, but both functions have an additional
> argument specifying an alternate strategy.  In particular,
> bytes.decode can either drop the invalid bytes, replace them with a
> replacement char (which I agree should be U+FFFD), or replace them
> with XML entities; str.encode can choose to drop characters the
> encoding doesn't support.
>

Supporting throwing is okay if it's really wanted, but the default should
be replacement.  It reduces fatal errors to (usually) non-fatal
replacement, for obscure cases that people generally don't test.  It's a
much more sane default failure mode.

As another option, never throw, but allow returning the number of
conversion errors:

results = encode("abc\uD800def", outputView, "UTF-8");

where results.inputConsumed is the number of words consumed in myString,
results.outputWritten is the number of UTF-8 bytes written, and
results.errors is 1.

That also allows block-by-block conversion; for example, to convert as many
complete characters as possible into a fixed-size buffer for transmission,
then starting again at the next unencoded character.

One more idea, while I'm brainstorming: if outputView is null, allocate an
ArrayBuffer of the necessary size, storing it in results.output.  That
eliminates the need for a separate length pass, without bloating the API
with another overload.

On Tue, Mar 13, 2012 at 6:50 PM, Joshua Bell <jsbell at chromium.org> wrote:

> (Cue a strong "nooooooo!" from Anne.)
>

(Count me in on that, too.  Heuristics bad.)

 Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)
>

UTF-16 "sanitization" (replacing mismatched surrogates with U+FFFD) doesn't
change the size of the output, actually.

-- 
Glenn Maynard