[whatwg] API for encoding/decoding ArrayBuffers into text

Tue Mar 20 10:39:12 PDT 2012

On Tue, Mar 20, 2012 at 7:26 AM, Glenn Maynard <glenn at zewt.org> wrote:

> On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>
>> Why are encodings different than other parts of the API where you
>> indeed have to know what works and what doesn't.
>
> Do you memorize lists of encodings?  I certainly don't.  I look them up as
> needed.
>
>> UTF8 is stateful, so I disagree.
>
> No, UTF-8 doesn't require a stateful decoder to support streaming.  You
> decode up to the last codepoint that you can decode completely.  The return
> values are the output data, the number of bytes output, and the number of
> bytes consumed; that's all you need to restart decoding later.  That's the
> iconv(3) approach that we're probably all familiar with, which works with
> almost all encodings.
>
> ISO-2022 encodings are stateful: you have to persistently remember the
> character subsets activated by earlier escape sequences.  An iconv-like
> streaming API is impossible; to support streamed decoding, you'd need to
> have a decoder object that the user keeps around in order to store that
> state.  http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure
>

Which seems like it leaves us with these options:

1. Only support encodings with stateless coding (possibly down to a minimum
of UTF-8)
2. Only provide an API supporting non-streaming coding (i.e. whole
strings/whole buffers)
3. Expand the API to return encoder/decoder objects that capture state

Any others?
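
To make the stateless case behind option (1) concrete, here is a rough
sketch of the iconv-style restart Glenn describes, for UTF-8 only. The
function name decode_utf8_prefix and the chunk boundaries are made up for
illustration; this is not a proposed API, just a demonstration that the
caller needs nothing beyond the bytes-consumed count to resume.

```python
from typing import Tuple


def decode_utf8_prefix(data: bytes) -> Tuple[str, int]:
    """Decode the longest complete UTF-8 prefix of `data`.

    Returns (text, bytes_consumed). No decoder state is kept between
    calls; the caller re-submits data[bytes_consumed:] with the next chunk.
    Invalid (as opposed to merely truncated) input raises UnicodeDecodeError.
    """
    end = len(data)
    # A truncated sequence can only start within the last 3 bytes.
    for back in range(1, min(3, end) + 1):
        byte = data[end - back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for the lead byte
        if byte >= 0xF0:
            need = 4
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1
        if need > back:
            end -= back  # sequence is incomplete: leave it for the next call
        break
    return data[:end].decode("utf-8"), end


# Hypothetical input: one string split at arbitrary byte boundaries.
chunks = [b"h\xc3", b"\xa9llo \xe2\x82", b"\xac"]
pending = b""
out = []
for chunk in chunks:
    buf = pending + chunk
    text, used = decode_utf8_prefix(buf)
    out.append(text)
    pending = buf[used:]  # the only thing carried over is unconsumed bytes
result = "".join(out)
```

The caller carries over raw bytes only. That is exactly what stops working
for ISO-2022, where the meaning of a byte depends on an escape sequence that
may have been consumed many calls earlier.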

Trying to simplify the problem by taking on both (1) and (2) without (3)
would lead to an API that could not encompass (3) in the future, which
would be a mistake.
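
For comparison, option (3) looks roughly like the following. This sketch
uses Python's existing codecs.getincrementaldecoder purely as a stand-in for
"a decoder object that captures state"; the helper name decode_stream, the
sample text, and the chunk size are assumptions for illustration and say
nothing about what the eventual DOM API should look like.

```python
import codecs
from typing import Iterable


def decode_stream(chunks: Iterable[bytes], encoding: str) -> str:
    """Decode a sequence of byte chunks with one long-lived decoder object."""
    # The decoder instance remembers escape sequences and partial
    # characters seen in earlier chunks.
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # Signal end of input so that truncated data is reported, not dropped.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Hypothetical sample: ASCII, three kanji, then ASCII again.
original = "abc \u65e5\u672c\u8a9e xyz"
encoded = original.encode("iso2022_jp")
# Split into 3-byte chunks, cutting across escape sequences and characters.
chunks = [encoded[i:i + 3] for i in range(0, len(encoded), 3)]
decoded = decode_stream(chunks, "iso2022_jp")
```

The same object-based shape also covers the stateless encodings, which is
why an API designed around it would not rule out (1) or (2) as conveniences.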

I'll throw out that the in-progress design of a Globalization API for
ECMAScript -
http://norbertlindenberg.com/2012/02/ecmascript-internationalization-api/ -
is currently spec'd both to build on the existing locale-aware methods on
String/Number/Date prototypes as conveniences and to introduce the
Collator and *Format objects.

Should we start with UTF-8-only/non-streaming methods on
DOMString/ArrayBufferView, and avoid constraining a future API supporting
multiple, possibly stateful encodings and streaming?