[whatwg] API for encoding/decoding ArrayBuffers into text

Fri Mar 16 09:19:44 PDT 2012

On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard <glenn at zewt.org> wrote:

> On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>
>> What's the use-case for the "stringLength" function? You can't decode
>> into an existing datastructure anyway, so you're ultimately forced to
>> call "decode" at which point the "stringLength" function hasn't helped
>> you.
>>
>
> stringLength doesn't return the length of the decoded string.  It returns
> the byte offset of the first \0 (or the length of the whole buffer, if
> none), for decoding null-terminated strings.  For multibyte encodings (eg.
> everything except UTF-16 and friends), it's just memchr(), so it's much
> faster than actually decoding the string.
>

And just to be clear, the use case is decoding data formats in which string
fields are variable-length and null-terminated.
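
To make that use case concrete, here is a minimal sketch of such a helper.
The name stringLength comes from the thread; the optional start offset, the
sample record layout, and the use of today's TextDecoder are assumptions
added for illustration, not part of the proposal. It only works for
encodings whose code units are single bytes.

```javascript
// Hypothetical helper: byte offset of the first 0x00 at or after `start`,
// or the length of the view if there is none. Valid only for encodings
// with single-byte code units (UTF-8, ASCII, ...), not UTF-16.
function stringLength(view, start = 0) {
  const bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  const end = bytes.indexOf(0, start);
  return end === -1 ? bytes.length : end;
}

// Assumed sample record holding two null-terminated fields: "id\0name\0".
const record = new Uint8Array([0x69, 0x64, 0x00, 0x6e, 0x61, 0x6d, 0x65, 0x00]);

const firstEnd = stringLength(record);                 // 2
const secondEnd = stringLength(record, firstEnd + 1);  // 7

const fieldDecoder = new TextDecoder("utf-8");
const firstField = fieldDecoder.decode(record.subarray(0, firstEnd));
const secondField = fieldDecoder.decode(record.subarray(firstEnd + 1, secondEnd));
```

The scan never decodes anything, which is why it can be much cheaper than
decoding the whole buffer just to find where a field ends.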

>> Currently the use-case of simply wanting to convert a string to a
>> binary buffer is a bit cumbersome. You first have to call the
>> "encodedLength" function, then allocate a buffer of the right size,
>> then call the "encode" function.
>
>
> I suggested eg.
>
> result = encode("string", "utf-8", null).output;
>
> which would create an ArrayBuffer of the required size.  Presumably the
> null ArrayBufferView argument would be optional, so you could just say
> encode("string", "utf-8").
>

I think we want both encoding and destination to be optional. That leads us
to an API like:

out_dict = stringEncoding.encode("string", opt_dict);

... where both out_dict and opt_dict are WebIDL Dictionaries:

opt_dict keys: view, encoding
out_dict keys: charactersWritten, byteWritten, output

... where output === view if view is supplied, otherwise a new Uint8Array
(or Uint8ClampedArray??)
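
A rough sketch of how that shape could behave, written as a free function
rather than a method on stringEncoding. It leans on today's TextEncoder,
which only produces UTF-8, so the encoding check is an assumption of this
sketch rather than part of the proposal. The dictionary keys are the ones
listed above; charactersWritten is counted in UTF-16 code units here.

```javascript
// Hypothetical stand-in for stringEncoding.encode(string, opt_dict).
// Assumption: only "utf-8" is handled, because TextEncoder offers nothing else.
function encode(string, { view, encoding = "utf-8" } = {}) {
  if (encoding.toLowerCase() !== "utf-8") {
    throw new RangeError("this sketch only handles utf-8");
  }
  const textEncoder = new TextEncoder();

  if (view == null) {
    // No destination supplied: allocate an exactly sized Uint8Array.
    const output = textEncoder.encode(string);
    return { charactersWritten: string.length, byteWritten: output.length, output };
  }

  // Destination supplied: write into its bytes and hand the same view back.
  const bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  const { read, written } = textEncoder.encodeInto(string, bytes);
  return { charactersWritten: read, byteWritten: written, output: view };
}

const fresh = encode("héllo");                     // allocates 6 bytes
const dest = new Uint8Array(16);
const reused = encode("héllo", { view: dest });    // reused.output === dest
```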

If this instead is attached to String, it would look like:

out_dict = my_string.encode(opt_dict);

If it were attached to ArrayBufferView, having a right-size buffer
allocated for the caller gets uglier unless we include a static version.

>> It doesn't seem possible to implement the 'encode' function without
>> doing multiple scans over the string. The implementation seems
>> required both to check that the data can be decoded using the
>> specified encoding, as well as check that the data will fit in the
>> passed in buffer. Only then can the implementation start decoding the
>> data. This seems problematic.
>>
>
> Only if it guarantees that it doesn't write anything to the output buffer
> unless the entire result will fit.  I don't think we need to do that; just
> guarantee that it'll be truncated on a whole codepoint.
>

Agreed. I resisted input/output dicts because they make the API harder to
learn: a caller has to read more documentation than a plain function
signature would require. Still, they do seem like the best approach.
Thanks for pushing, Glenn!
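
For illustration, truncating on a whole code point lets a caller encode in a
single pass and resume from where the output ran out. The sketch below uses
TextEncoder.encodeInto from today's Encoding Standard, which behaves this
way for UTF-8; it is a stand-in for the API under discussion, and the
sample string and buffer size are assumptions.

```javascript
const truncEncoder = new TextEncoder();
const small = new Uint8Array(3);            // deliberately too small

// "a" takes 1 byte. "€" needs 3 bytes but only 2 remain, so encoding stops
// before it instead of writing a partial sequence.
const { read, written } = truncEncoder.encodeInto("a€b", small);

// `read` counts UTF-16 code units consumed; the remainder can be encoded
// into another buffer later.
const remainder = "a€b".slice(read);
```

Here read and written are both 1, and remainder is "€b".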

In the create-a-buffer-on-the-fly case there will be some memory juggling
going on, either by initially over-allocating or by reallocating/moving.
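
As a sketch of the over-allocating option for UTF-8: each UTF-16 code unit
of a JavaScript string encodes to at most 3 bytes, so a scratch buffer of
3 * length bytes always suffices and can be copied down to size afterwards.
The helper name encodeToNewBuffer is hypothetical, and TextEncoder stands in
for the proposed API.

```javascript
// Hypothetical helper: encode to UTF-8 in one pass by over-allocating.
function encodeToNewBuffer(string) {
  // Worst case for UTF-8 is 3 bytes per UTF-16 code unit.
  const scratch = new Uint8Array(string.length * 3);
  const { written } = new TextEncoder().encodeInto(string, scratch);
  // slice() copies the used prefix into a new, right-sized buffer.
  return scratch.slice(0, written);
}

const packed = encodeToNewBuffer("héllo");   // 6 bytes backed by a 6-byte buffer
```

The trade-off is one extra copy plus a temporary allocation of up to three
times the final size, in exchange for scanning the string only once.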

>> I also don't think it's a good idea to throw an exception for encoding
>> errors. Better to convert characters to the unicode replacement
>> character. I believe we made a similar change to the WebSockets
>> specification recently.
>>
>
> Was that change made?  I filed
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
> to be undecided.
>

Settling on an options dict means that adding a flag to control this
behavior (throws: true ?) doesn't extend the API surface significantly.
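
A minimal sketch of what such a flag could look like on the decoding side.
The function name decode and the throws key are hypothetical; the sketch
maps the flag onto the fatal option of today's TextDecoder, which is an
assumption and not something the thread settled on.

```javascript
// Hypothetical decode(view, opt_dict) with a `throws` flag.
function decode(view, { encoding = "utf-8", throws = false } = {}) {
  // fatal: true makes TextDecoder throw a TypeError on malformed input;
  // otherwise malformed bytes become U+FFFD REPLACEMENT CHARACTER.
  const textDecoder = new TextDecoder(encoding, { fatal: throws });
  return textDecoder.decode(view);
}

const malformed = new Uint8Array([0x61, 0xff, 0x62]);  // 0xFF is never valid UTF-8

const lenient = decode(malformed);                     // "a\uFFFDb"

let threw = false;
try {
  decode(malformed, { throws: true });
} catch (e) {
  threw = e instanceof TypeError;
}
```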