[whatwg] API for encoding/decoding ArrayBuffers into text

Tue Mar 13 17:01:42 PDT 2012

On Tue, 13 Mar 2012, Joshua Bell wrote:
> 
> For both of the above: initially suggested use cases included parsing 
> data as esoteric as ID3 tags in MP3 files, where the encoding is 
> unspecified, is guessed at by decoders, and may be non-Unicode. It 
> was suggested that the encoding sniffing capabilities of browsers be 
> leveraged. [...]
> 
> Whether we should restrict it as far as UTF-8 depends on whether we 
> envision this API only used for parsing/serializing newly defined data 
> formats, or whether there is consideration for interop with previously 
> existing data formats and code.

Seems reasonable. If we have specific use cases for non-UTF-8 encodings, I 
agree we should support them; if that's the case, we should survey those 
use cases to work out what the set of encodings we need is, and add just 
those.

> >  - Having a mechanism that lets you encode the string and get a length
> >   separate from the mechanism that lets you encode the string and get the
> >   encoded string seems like it would encourage very inefficient code. Can
> >   we instead have a mechanism that returns both at once? Or is the idea
> >   that for some encodings getting the encoded length is much quicker than
> >   getting the actual string?
> >
> 
> The use case was to compute the size necessary to allocate a single buffer
> into which may be encoded multiple strings and other data, rather than
> allocating multiple small buffers and then copying strings into a larger
> buffer.
> 
> Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)

Yeah, but surely we'll mainly be doing stuff with UTF-8...
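
To make the cost of the length calculation concrete, here is a rough sketch 
of counting encoded bytes without producing the encoded string. The helper 
names utf8ByteLength and utf16ByteLength are hypothetical, and the sketch 
assumes lone surrogates are counted as U+FFFD (3 bytes), which is only one 
possible way to handle invalid code points.

```javascript
// Hypothetical helper: number of bytes `str` would occupy in UTF-8.
// Assumption: a lone surrogate is counted as if replaced by U+FFFD (3 bytes).
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < str.length &&
      str.charCodeAt(i + 1) >= 0xdc00 && str.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // valid surrogate pair, one code point above U+FFFF
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // rest of the BMP, or a lone surrogate
    }
  }
  return bytes;
}

// Hypothetical helper: unsanitized UTF-16 is two bytes per JS code unit.
function utf16ByteLength(str) {
  return 2 * str.length;
}
```

The UTF-16 case needs no scan at all, while the UTF-8 case has to walk the 
whole string once before any encoding happens.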

One option is to return an opaque object of the form:

   interface EncodedString {
     readonly attribute unsigned long length;
     // internally has a copy of the encoded string
   };

...and then have view.setString take this EncodedString object. At least 
then you get it down to an extraneous copy, rather than an extraneous 
encode. Still not ideal though.
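
A minimal sketch of how that might look in script, to show where the one 
remaining copy happens. Everything here is illustrative: the EncodedString 
class, its copyInto method and the packStrings function are hypothetical 
names, and the sketch assumes UTF-8 only and a runtime that provides 
TextEncoder (from the later Encoding Standard), which is not part of this 
proposal.

```javascript
// Hypothetical stand-in for the opaque EncodedString object:
// the string is encoded exactly once and the bytes are kept.
class EncodedString {
  constructor(str) {
    // Assumption: UTF-8 only, via TextEncoder.
    this._bytes = new TextEncoder().encode(str);
  }

  // Encoded length in bytes, known without encoding again.
  get length() {
    return this._bytes.length;
  }

  // Copies the stored bytes into `view` at `offset`; returns bytes written.
  copyInto(view, offset) {
    const dest = new Uint8Array(
      view.buffer, view.byteOffset + offset, this._bytes.length);
    dest.set(this._bytes); // the one extraneous copy
    return this._bytes.length;
  }
}

// Packs strings as [uint32 length][bytes]... into a single buffer.
function packStrings(strings) {
  const encoded = strings.map((s) => new EncodedString(s));
  const prefix = Uint32Array.BYTES_PER_ELEMENT;
  const total = encoded.reduce((n, e) => n + prefix + e.length, 0);
  const view = new DataView(new ArrayBuffer(total));
  let offset = 0;
  for (const e of encoded) {
    view.setUint32(offset, e.length);
    offset += prefix + e.copyInto(view, offset + prefix);
  }
  return view;
}
```

Each string is encoded once, the single buffer is sized from the stored 
lengths, and the price is the intermediate byte array held by each object.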

> >  - Seems weird that integers and strings would have such different APIs
> >   for doing the same thing. Why can't we handle them equivalently? As in:
> >
> >     len = view.setString(strings[i],
> >                          offset + Uint32Array.BYTES_PER_ELEMENT,
> >                          "UTF-8");
> >     view.setUint32(offset, len);
> >     offset += Uint32Array.BYTES_PER_ELEMENT + len;
> 
> Heh, that's where the discussion started, actually. We wanted to keep 
> the DataView interface simple, and potentially support encoding into 
> plain JS arrays and/or non-TypedArray support that appeared to be on the 
> horizon for JS.

I see where you're coming from, but I think we should look at the platform 
as a whole, not just one API. It doesn't help the platform as a whole if 
we just have the same features split across two interfaces; the complexity 
is even slightly higher than with one consistent API that handles ints and 
strings equivalently.
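
For illustration, the "equivalent" shape can be approximated in script today 
with a free function. This is only a sketch: setString here is a 
hypothetical helper rather than a DataView method, it assumes UTF-8 only, 
and it relies on TextEncoder.encodeInto from the later Encoding Standard 
being available in the runtime.

```javascript
// Hypothetical helper mirroring view.setUint32(offset, value):
// encodes `str` as UTF-8 straight into `view` at `offset` and
// returns the number of bytes written.
function setString(view, str, offset) {
  const dest = new Uint8Array(
    view.buffer, view.byteOffset + offset, view.byteLength - offset);
  const { read, written } = new TextEncoder().encodeInto(str, dest);
  if (read < str.length) {
    throw new RangeError("string does not fit in the view");
  }
  return written;
}

// Same loop as the quoted example: a length prefix followed by the bytes.
function writeLengthPrefixed(view, strings) {
  let offset = 0;
  for (const s of strings) {
    const len = setString(view, s, offset + Uint32Array.BYTES_PER_ELEMENT);
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT + len;
  }
  return offset; // total bytes used
}
```

Note that the caller still has to supply a view that is large enough, which 
is exactly the sizing question raised earlier in the thread.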

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'