[whatwg] API for encoding/decoding ArrayBuffers into text
glenn at zewt.org
Tue Mar 13 17:19:24 PDT 2012
Using Views instead of specifying the offset and length sounds good.
On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson <ian at hixie.ch> wrote:
> - What's the use case for supporting anything but UTF-8?
Other Unicode encodings may be useful, to decode existing file formats
containing (most likely at a minimum) UTF-16. I don't feel strongly about
that, though; we're stuck with UTF-16 as an internal representation in the
platform, but that doesn't necessarily mean we need to support it as a
For non-Unicode legacy encodings, I think that even if use cases exist,
they should be given more than the usual amount of scrutiny before being
On Tue, Mar 13, 2012 at 6:38 PM, Tab Atkins Jr. <jackalmage at gmail.com> wrote:
> Python throws errors by default, but both functions have an additional
> argument specifying an alternate strategy. In particular,
> bytes.decode can either drop the invalid bytes, replace them with a
> replacement char (which I agree should be U+FFFD), or replace them
> with XML entities; str.encode can choose to drop characters the
> encoding doesn't support.
Supporting throwing is okay if it's really wanted, but the default should
be replacement. It reduces fatal errors to (usually) non-fatal
replacement, for obscure cases that people generally don't test. It's a
much more sane default failure mode.
As another option, never throw, but allow returning the number of
results = encode("abc\uD800def", outputView, "UTF-8");
where results.inputConsumed is the number of words (UTF-16 code units)
consumed from the input string,
results.outputWritten is the number of UTF-8 bytes written, and
results.errors is 1.
That also allows block-by-block conversion; for example, to convert as many
complete characters as possible into a fixed-size buffer for transmission,
then starting again at the next unencoded character.
One more idea, while I'm brainstorming: if outputView is null, allocate an
ArrayBuffer of the necessary size, storing it in results.output. That
eliminates the need for a separate length pass, without bloating the API
with another overload.
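
To make the idea concrete, here is a rough sketch of how such a call could behave. Everything in it is an assumption for illustration, not a proposed spec: the encode() name and the results fields come from the example above, the encoding argument is dropped (UTF-8 only), outputView is assumed to be a Uint8Array, and "errors" counts mismatched surrogates replaced with U+FFFD.

```javascript
// Hypothetical sketch: UTF-8 only, never throws, reports progress in `results`.
function encode(input, outputView) {
  // If no view is given, allocate for the worst case (3 bytes per UTF-16
  // code unit) and trim afterwards.
  const out = outputView === null
    ? new Uint8Array(input.length * 3)
    : outputView;
  let consumed = 0; // UTF-16 code units read from input
  let written = 0;  // UTF-8 bytes written to out
  let errors = 0;   // mismatched surrogates replaced with U+FFFD

  while (consumed < input.length) {
    let cp = input.codePointAt(consumed);
    const units = cp > 0xFFFF ? 2 : 1;
    // codePointAt returns the bare surrogate when it is not part of a pair.
    const mismatched = cp >= 0xD800 && cp <= 0xDFFF;
    if (mismatched) cp = 0xFFFD;

    const size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (written + size > out.length) break; // stop at a character boundary

    if (size === 1) {
      out[written] = cp;
    } else if (size === 2) {
      out[written] = 0xC0 | (cp >> 6);
      out[written + 1] = 0x80 | (cp & 0x3F);
    } else if (size === 3) {
      out[written] = 0xE0 | (cp >> 12);
      out[written + 1] = 0x80 | ((cp >> 6) & 0x3F);
      out[written + 2] = 0x80 | (cp & 0x3F);
    } else {
      out[written] = 0xF0 | (cp >> 18);
      out[written + 1] = 0x80 | ((cp >> 12) & 0x3F);
      out[written + 2] = 0x80 | ((cp >> 6) & 0x3F);
      out[written + 3] = 0x80 | (cp & 0x3F);
    }
    if (mismatched) errors++;
    consumed += units;
    written += size;
  }

  const results = { inputConsumed: consumed, outputWritten: written, errors };
  if (outputView === null) results.output = out.buffer.slice(0, written);
  return results;
}

// Block-by-block conversion into a fixed-size buffer. The buffer must hold
// at least 4 bytes so that every call can make progress.
function encodeInBlocks(text, blockSize) {
  const block = new Uint8Array(blockSize);
  const blocks = [];
  let rest = text;
  while (rest.length > 0) {
    const r = encode(rest, block);
    blocks.push(block.slice(0, r.outputWritten)); // copy out the filled part
    rest = rest.slice(r.inputConsumed);           // resume at the next character
  }
  return blocks;
}
```

With this sketch, encode("abc\uD800def", null) consumes all 7 code units, writes 9 bytes and reports 1 error, and a caller with a 4-byte buffer never sees a character split across two blocks.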
On Tue, Mar 13, 2012 at 6:50 PM, Joshua Bell <jsbell at chromium.org> wrote:
> (Cue a strong "nooooooo!" from Anne.)
(Count me in on that, too. Heuristics bad.)
> Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)
UTF-16 "sanitization" (replacing mismatched surrogates with U+FFFD) doesn't
change the size of the output, actually.
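
As a quick illustration of why the size stays the same: each mismatched surrogate is one 16-bit code unit, and U+FFFD is also one 16-bit code unit, so the replacement is one-for-one. The sanitizeUtf16 name below is made up for this sketch.

```javascript
// Hypothetical helper: replace mismatched surrogates with U+FFFD.
function sanitizeUtf16(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += s[i] + s[i + 1]; // well-formed pair, copy both units
        i++;
        continue;
      }
    }
    // A surrogate reaching this point has no partner: one unit in, one unit out.
    out += (isHigh || isLow) ? "\uFFFD" : s[i];
  }
  return out;
}
```

For any input string, sanitizeUtf16(s).length equals s.length, so the UTF-16 output is 2 bytes per code unit whether or not sanitization is applied.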