[whatwg] API for encoding/decoding ArrayBuffers into text
Ian Hickson
ian at hixie.ch
Tue Mar 13 17:01:42 PDT 2012
On Tue, 13 Mar 2012, Joshua Bell wrote:
>
> For both of the above: initially suggested use cases included parsing
> data as esoteric as ID3 tags in MP3 files, where the encoding is unspecified
> and is guessed at by decoders, and includes non-Unicode encodings. It
> was suggested that the encoding sniffing capabilities of browsers be
> leveraged. [...]
>
> Whether we should restrict it as far as UTF-8 depends on whether we
> envision this API only used for parsing/serializing newly defined data
> formats, or whether there is consideration for interop with previously
> existing data formats and code.
Seems reasonable. If we have specific use cases for non-UTF-8 encodings, I
agree we should support them; if that's the case, we should survey those
use cases to work out what the set of encodings we need is, and add just
those.
> > - Having a mechanism that lets you encode the string and get a length
> > separate from the mechanism that lets you encode the string and get the
> > encoded string seems like it would encourage very inefficient code. Can
> > we instead have a mechanism that returns both at once? Or is the idea
> > that for some encodings getting the encoded length is much quicker than
> > getting the actual string?
> >
>
> The use case was to compute the size necessary to allocate a single buffer
> into which may be encoded multiple strings and other data, rather than
> allocating multiple small buffers and then copying strings into a larger
> buffer.
>
> Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)
Yeah, but surely we'll mainly be doing stuff with UTF-8...
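
To make the asymmetry concrete, here is a minimal sketch of what a
length-only pass looks like for UTF-8. The helper names utf8ByteLength and
utf16ByteLength are hypothetical, not part of any proposal in this thread,
and the sketch assumes lone surrogates are replaced with U+FFFD (three
bytes), which is how the TextEncoder global behaves in current runtimes.
Unlike the UTF-16 case, it has to walk the whole string:

```javascript
// Hypothetical helper: count the bytes a JS string would occupy in UTF-8
// without producing the encoded output.
function utf8ByteLength(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xD800 && unit <= 0xDBFF &&
               i + 1 < str.length &&
               str.charCodeAt(i + 1) >= 0xDC00 &&
               str.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4; // valid surrogate pair: one 4-byte sequence
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // other BMP code point, or lone surrogate (assumed U+FFFD)
    }
  }
  return bytes;
}

// The unsanitized UTF-16 case mentioned above: two bytes per code unit.
function utf16ByteLength(str) {
  return str.length * 2;
}
```

The UTF-8 count is linear in the string length, so calling it and then
encoding means two passes over every string.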
One option is to return an opaque object of the form:
  interface EncodedString {
    readonly attribute unsigned long length;
    // internally has a copy of the encoded string
  };
...and then have view.setString take this EncodedString object. At least
then you get it down to an extraneous copy, rather than an extraneous
encode. Still not ideal though.
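
A rough sketch of how that might look in script, under several assumptions:
the class below only imitates the interface above, it is UTF-8 only, it uses
the TextEncoder global as the encoder, and the copyInto method is a
hypothetical stand-in for view.setString taking an EncodedString.

```javascript
// Hypothetical sketch: encode once, expose the length, and copy the
// retained bytes into the target buffer later.
class EncodedString {
  #bytes; // internal copy of the encoded string

  constructor(str) {
    // Assumption: UTF-8 only.
    this.#bytes = new TextEncoder().encode(str);
  }

  get length() {
    return this.#bytes.length;
  }

  // Stand-in for view.setString(encodedString, offset): copies the bytes
  // into the view's buffer and returns the number of bytes written.
  copyInto(view, byteOffset) {
    const target = new Uint8Array(view.buffer,
                                  view.byteOffset + byteOffset,
                                  this.#bytes.length);
    target.set(this.#bytes);
    return this.#bytes.length;
  }
}

// Size a single buffer up front, then fill it with length-prefixed strings.
const strings = ["h\u00e9llo", "w\u00f6rld"];
const encoded = strings.map((s) => new EncodedString(s));
const total = encoded.reduce(
  (n, e) => n + Uint32Array.BYTES_PER_ELEMENT + e.length, 0);

const view = new DataView(new ArrayBuffer(total));
let offset = 0;
for (const e of encoded) {
  view.setUint32(offset, e.length);
  offset += Uint32Array.BYTES_PER_ELEMENT +
            e.copyInto(view, offset + Uint32Array.BYTES_PER_ELEMENT);
}
```

Each string is encoded exactly once; the cost that remains is the copy from
the internal bytes into the final buffer.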
> > - Seems weird that integers and strings would have such different APIs
> > for doing the same thing. Why can't we handle them equivalently? As in:
> >
> >   len = view.setString(strings[i],
> >                        offset + Uint32Array.BYTES_PER_ELEMENT,
> >                        "UTF-8");
> >   view.setUint32(offset, len);
> >   offset += Uint32Array.BYTES_PER_ELEMENT + len;
>
> Heh, that's where the discussion started, actually. We wanted to keep
> the DataView interface simple, and potentially support encoding into
> plain JS arrays and/or non-TypedArray support that appeared to be on the
> horizon for JS.
I see where you're coming from, but I think we should look at the platform
as a whole, not just one API. It doesn't help the platform as a whole if
we just have the same features split across two interfaces; the complexity
is even slightly higher than with just one consistent API that does ints
and strings equivalently.
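
For illustration, the quoted loop can be run today with a free function
standing in for the proposed method. This is only a sketch: setString here
is a hypothetical helper, not an existing DataView method, it handles UTF-8
only, and it relies on TextEncoder's encodeInto to write directly into the
view's buffer.

```javascript
// Hypothetical stand-in for view.setString(str, byteOffset, "UTF-8").
// Encodes str directly into the view's buffer and returns the byte count.
function setString(view, str, byteOffset) {
  const target = new Uint8Array(view.buffer,
                                view.byteOffset + byteOffset,
                                view.byteLength - byteOffset);
  const { read, written } = new TextEncoder().encodeInto(str, target);
  if (read !== str.length) {
    throw new RangeError("string does not fit in the view");
  }
  return written;
}

// Same shape as the quoted example: write the string, then its length.
const records = ["a", "\u20ac"];
const out = new DataView(new ArrayBuffer(64));
let pos = 0;
for (const s of records) {
  const len = setString(out, s, pos + Uint32Array.BYTES_PER_ELEMENT);
  out.setUint32(pos, len);
  pos += Uint32Array.BYTES_PER_ELEMENT + len;
}
```

Because the encode writes straight into the destination and returns the
length, there is neither a second encode nor an intermediate copy; the
trade-off is that the buffer has to be large enough before the lengths are
known.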
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'