[whatwg] API for encoding/decoding ArrayBuffers into text

Tue Mar 13 16:50:43 PDT 2012

On Tue, Mar 13, 2012 at 4:28 PM, Ian Hickson <ian at hixie.ch> wrote:

> On Tue, 13 Mar 2012, Joshua Bell wrote:
> > On Tue, Mar 13, 2012 at 4:10 PM, Jonas Sicking <jonas at sicking.cc> wrote:
> > > On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell <kbr at google.com>
> > > wrote:
> > > > Joshua Bell has been working on a string encoding and decoding API
> > > > that supports the needed encodings, and which is separable from the
> > > > core typed array API:
> > > >
> > > > http://wiki.whatwg.org/wiki/StringEncoding
> > > >
> > > > This is the direction I prefer. String encoding and decoding seems
> > > > to be a complex enough problem that it should be expressed
> > > > separately from the typed array spec itself.
>
> Some quick feedback:
>
>  - [OmitConstructor] doesn't seem to be WebIDL
>

Historically, the spec started off as an addition to the Typed Array spec
that splintered off; cleanup is definitely needed, thanks.

>  - please don't allow UAs to implement other encodings. You should list
>   the exact set of supported encodings and the exact labels that should
>   be recognised as meaning those encodings, and disallow all others.
>   Otherwise, we'll be in a never-ending game of reverse-engineering each
>   others' lists of supported encodings and it'll keep growing.
>
>  - What's the use case for supporting anything but UTF-8?
>

For both of the above: initially suggested use cases included parsing data
as esoteric as ID3 tags in MP3 files, where the encoding is unspecified, is
guessed at by decoders, and includes non-Unicode encodings. It was
suggested that the encoding sniffing capabilities of browsers be leveraged.
(Cue a strong "nooooooo!" from Anne.)

I completely agree that we should explicitly list the set of supported
encodings and should remove the "other encodings" allowance.

Whether we should restrict it as far as UTF-8 depends on whether we
envision this API only used for parsing/serializing newly defined data
formats, or whether there is consideration for interop with previously
existing data formats and code. For example, "BINARY" would be used
to bridge the existing atob()/btoa() methods with Typed Arrays (although
base64 directly in/out of Typed Arrays would be preferable).
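
To make the atob()/btoa() bridge concrete, here is a rough sketch of what a
"BINARY" encoding amounts to, assuming it means one byte per character with
code units limited to 0x00-0xFF. The helper names binaryStringToBytes and
bytesToBinaryString are made up for illustration and are not from the wiki
draft.

```javascript
// Hypothetical helpers, not part of the StringEncoding draft.
// Assumption: "BINARY" maps each code unit 0x00-0xFF to exactly one byte.
function binaryStringToBytes(binary) {
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    const code = binary.charCodeAt(i);
    if (code > 0xff) throw new RangeError("not a binary string");
    bytes[i] = code;
  }
  return bytes;
}

function bytesToBinaryString(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return binary;
}

// base64 text -> Typed Array, then back to base64 text.
const bytes = binaryStringToBytes(atob("aGVsbG8="));
const base64 = btoa(bytesToBinaryString(bytes));
```

The round trip goes through an intermediate JS string in both directions,
which is the overhead that direct base64 in/out of Typed Arrays would avoid.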

Jonas, since you started this thread - did your content authors mention
encodings?

>  - Having a mechanism that lets you encode the string and get a length
>   separate from the mechanism that lets you encode the string and get the
>   encoded string seems like it would encourage very inefficient code. Can
>   we instead have a mechanism that returns both at once? Or is the idea
>   that for some encodings getting the encoded length is much quicker than
>   getting the actual string?
>

The use case was to compute the size necessary to allocate a single buffer
into which may be encoded multiple strings and other data, rather than
allocating multiple small buffers and then copying strings into a larger
buffer.

Ignoring the issue of invalid code points, the length calculations for
non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
be sanitized, that case is trivially 2x the JS string length.)
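
As an illustration of the sizing step, the sketch below computes encoded
lengths without producing the encoded bytes, then allocates one buffer for
several length-prefixed strings. The function names are hypothetical, and
counting a lone surrogate as 3 bytes (as if it were replaced by U+FFFD) is an
assumption about how invalid code points would be handled.

```javascript
// Hypothetical helpers, not part of the StringEncoding draft.
// UTF-8 byte length of a JS string, computed without encoding it.
// Assumption: a lone surrogate counts as 3 bytes (the size of U+FFFD).
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1;
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair: one 4-byte sequence
        i++;        // skip the low surrogate
      } else {
        bytes += 3; // unpaired high surrogate
      }
    } else {
      bytes += 3;   // BMP character, or an unpaired surrogate
    }
  }
  return bytes;
}

// Unsanitized UTF-16 is always two bytes per JS code unit.
function utf16ByteLength(str) {
  return 2 * str.length;
}

// Size a single buffer for several strings, each with a 4-byte length prefix.
const strings = ["abc", "h\u00e9llo", "\ud83d\ude00"];
const prefix = Uint32Array.BYTES_PER_ELEMENT;
let total = 0;
for (const s of strings) total += prefix + utf8ByteLength(s);
const buffer = new ArrayBuffer(total);
```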

>  - Seems weird that integers and strings would have such different APIs
>   for doing the same thing. Why can't we handle them equivalently? As in:
>
>     len = view.setString(strings[i],
>                          offset + Uint32Array.BYTES_PER_ELEMENT,
>                          "UTF-8");
>     view.setUint32(offset, len);
>     offset += Uint32Array.BYTES_PER_ELEMENT + len;
>

Heh, that's where the discussion started, actually. We wanted to keep the
DataView interface simple, and potentially support encoding into plain JS
arrays and/or non-TypedArray support that appeared to be on the horizon for
JS.
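
For comparison, the same length-prefixed pattern can be written with the
encoder kept outside DataView. This is only a sketch: it assumes an
environment that provides a global TextEncoder with encodeInto(), which is
not the API in the wiki draft, and writeLengthPrefixedString is a made-up
helper name.

```javascript
// Hypothetical free function; DataView itself is left untouched.
// Assumption: a global TextEncoder with encodeInto() is available.
const encoder = new TextEncoder();

function writeLengthPrefixedString(view, offset, str) {
  const prefix = Uint32Array.BYTES_PER_ELEMENT;
  // Byte window over the same buffer, starting just after the length prefix.
  const target = new Uint8Array(
    view.buffer,
    view.byteOffset + offset + prefix,
    view.byteLength - offset - prefix
  );
  const { read, written } = encoder.encodeInto(str, target);
  if (read < str.length) throw new RangeError("buffer too small");
  view.setUint32(offset, written);
  return offset + prefix + written; // offset of the next record
}

const view = new DataView(new ArrayBuffer(64));
let offset = 0;
for (const s of ["abc", "h\u00e9llo"]) {
  offset = writeLengthPrefixedString(view, offset, s);
}
```

Because the function only needs a byte window to write into, the same
encoder could target any Uint8Array, not just one backed by a DataView.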

> HTH,
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>