[whatwg] API for encoding/decoding ArrayBuffers into text

Ian Hickson ian at hixie.ch
Tue Mar 13 17:01:42 PDT 2012

On Tue, 13 Mar 2012, Joshua Bell wrote:
> For both of the above: initially suggested use cases included parsing 
> data as esoteric as ID3 tags in MP3 files, where the encoding is 
> unspecified and is guessed at by decoders, and includes non-Unicode encodings. It 
> was suggested that the encoding sniffing capabilities of browsers be 
> leveraged. [...]
> Whether we should restrict it as far as UTF-8 depends on whether we 
> envision this API only used for parsing/serializing newly defined data 
> formats, or whether there is consideration for interop with previously 
> existing data formats and code.

Seems reasonable. If we have specific use cases for non-UTF-8 encodings, I 
agree we should support them; if that's the case, we should survey those 
use cases to work out what the set of encodings we need is, and add just 

> >  - Having a mechanism that lets you encode the string and get a length
> >   separate from the mechanism that lets you encode the string and get the
> >   encoded string seems like it would encourage very inefficient code. Can
> >   we instead have a mechanism that returns both at once? Or is the idea
> >   that for some encodings getting the encoded length is much quicker than
> >   getting the actual string?
> >
> The use case was to compute the size necessary to allocate a single buffer
> into which may be encoded multiple strings and other data, rather than
> allocating multiple small buffers and then copying strings into a larger
> buffer.
> Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)

Yeah, but surely we'll mainly be doing stuff with UTF-8...
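
For illustration, a minimal sketch of what a length-only pass for UTF-8
might look like. The name utf8ByteLength is hypothetical and not part of
any proposal in this thread, and the sketch assumes lone surrogates are
replaced by U+FFFD (3 bytes). It still has to walk the whole string, but
it allocates nothing:

```javascript
// Hypothetical helper: counts the bytes a JS string would occupy in
// UTF-8 without actually encoding it.
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xD800 && unit <= 0xDBFF &&
               i + 1 < str.length &&
               (str.charCodeAt(i + 1) & 0xFC00) === 0xDC00) {
      bytes += 4; // valid surrogate pair, one code point above U+FFFF
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // other BMP character, or a lone surrogate (assumed
                  // to be replaced by U+FFFD, which is 3 bytes)
    }
  }
  return bytes;
}
```

The caller could sum these lengths to size a single buffer, at the cost
of scanning each string twice (once to measure, once to encode).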

One option is to return an opaque object of the form:

   interface EncodedString {
     readonly attribute unsigned long length;
     // internally has a copy of the encoded string
   };

...and then have view.setString take this EncodedString object. At least 
then you get it down to an extraneous copy, rather than an extraneous 
encode. Still not ideal though.
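
A rough sketch of that idea in script, under some assumptions: the
EncodedString class below only mimics the interface above, copyInto is a
hypothetical stand-in for view.setString, and the encoding step uses
TextEncoder, which only produces UTF-8. Each string is encoded once, the
lengths are summed to size one buffer, and the bytes are then copied in:

```javascript
// Sketch only: holds the encoded bytes so the string is encoded once.
class EncodedString {
  constructor(str) {
    this.bytes = new TextEncoder().encode(str); // UTF-8 only
  }
  get length() {
    return this.bytes.length; // encoded length in bytes
  }
}

// Hypothetical stand-in for view.setString(encodedString, offset):
// copies the already-encoded bytes into the view, returns bytes written.
function copyInto(view, offset, encoded) {
  new Uint8Array(view.buffer, view.byteOffset + offset, encoded.length)
    .set(encoded.bytes);
  return encoded.length;
}

const encoded = ["h\u00E9llo", "w\u00F6rld"].map(s => new EncodedString(s));
const total = encoded.reduce((n, e) => n + e.length, 0);
const view = new DataView(new ArrayBuffer(total));
let offset = 0;
for (const e of encoded) {
  offset += copyInto(view, offset, e);
}
```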

> >  - Seems weird that integers and strings would have such different APIs
> >   for doing the same thing. Why can't we handle them equivalently? As in:
> >
> >     len = view.setString(strings[i],
> >                          offset + Uint32Array.BYTES_PER_ELEMENT,
> >                          "UTF-8");
> >     view.setUint32(offset, len);
> >     offset += Uint32Array.BYTES_PER_ELEMENT + len;
> >
> Heh, that's where the discussion started, actually. We wanted to keep 
> the DataView interface simple, and potentially support encoding into 
> plain JS arrays and/or non-TypedArray support that appeared to be on the 
> horizon for JS.

I see where you're coming from, but I think we should look at the platform 
as a whole, not just one API. It doesn't help the platform as a whole if 
we just have the same features split across two interfaces; the complexity 
is then even slightly higher than with one consistent API that does ints 
and strings equivalently.
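
To make the quoted pattern concrete, here is a hedged sketch of writing
length-prefixed strings next to integers. The free function setString is
hypothetical (DataView has no such method) and handles only UTF-8; it
uses TextEncoder's encodeInto to write straight into the destination
buffer and returns the number of bytes written. The buffer is sized for
the worst case rather than measured up front:

```javascript
// Hypothetical stand-in for the proposed view.setString; UTF-8 only.
// Encodes str directly into the view at offset, returns bytes written.
function setString(view, str, offset) {
  const dest = new Uint8Array(view.buffer, view.byteOffset + offset,
                              view.byteLength - offset);
  const { read, written } = new TextEncoder().encodeInto(str, dest);
  if (read !== str.length) {
    throw new RangeError("buffer too small for string");
  }
  return written;
}

const strings = ["abc", "\u20ACuro"];
// UTF-8 needs at most 3 bytes per UTF-16 code unit.
const capacity = strings.reduce(
  (n, s) => n + Uint32Array.BYTES_PER_ELEMENT + 3 * s.length, 0);
const view = new DataView(new ArrayBuffer(capacity));
let offset = 0;
for (const s of strings) {
  const len = setString(view, s, offset + Uint32Array.BYTES_PER_ELEMENT);
  view.setUint32(offset, len); // length prefix, big-endian by default
  offset += Uint32Array.BYTES_PER_ELEMENT + len;
}
```

The integer and the string are written through the same view and the
same offset bookkeeping, which is the consistency being argued for.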

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
