[whatwg] API for encoding/decoding ArrayBuffers into text

Fri Mar 16 16:42:20 PDT 2012

On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell <jsbell at chromium.org> wrote:
> On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard <glenn at zewt.org> wrote:
>
>> On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>>
>>> What's the use-case for the "stringLength" function? You can't decode
>>> into an existing datastructure anyway, so you're ultimately forced to
>>> call "decode" at which point the "stringLength" function hasn't helped
>>> you.
>>>
>>
>> stringLength doesn't return the length of the decoded string.  It returns
>> the byte offset of the first \0 (or the length of the whole buffer, if
>> none), for decoding null-terminated strings.  For multibyte encodings (eg.
>> everything except UTF-16 and friends), it's just memchr(), so it's much
>> faster than actually decoding the string.
>>
>
> And just to be clear, the use case is decoding data formats where string
> fields are variable length null terminated.
>
>
>>> Currently the use-case of simply wanting to convert a string to a
>>> binary buffer is a bit cumbersome. You first have to call the
>>> "encodedLength" function, then allocate a buffer of the right size,
>>> then call the "encode" function.
>>
>>
>> I suggested eg.
>>
>> result = encode("string", "utf-8", null).output;
>>
>> which would create an ArrayBuffer of the required size.  Presumably the
>> null ArrayBufferView argument would be optional, so you could just say
>> encode("string", "utf-8").
>>
>
> I think we want both encoding and destination to be optional. That leads us
> to an API like:
>
> out_dict = stringEncoding.encode("string", opt_dict);
>
> .. where both out_dict and opt_dict are WebIDL Dictionaries:
>
> opt_dict keys: view, encoding
> out_dict keys: charactersWritten, byteWritten, output
>
> ... where output === view if view is supplied, otherwise a new Uint8Array
> (or Uint8ClampedArray??)
>
> If this instead is attached to String, it would look like:
>
> out_dict = my_string.encode(opt_dict);
>
> If it were attached to ArrayBufferView, having a right-size buffer
> allocated for the caller gets uglier unless we include a static version.

Using input and output dictionaries is definitely messy, but I can't
see a better way either. And I think ES6 is adding some syntax here
that will make developers' lives better (destructuring assignment).
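
To make the shape concrete, here is a minimal sketch of what calling a
dictionary-in, dictionary-out encode could look like with destructuring.
The key names (view, encoding, charactersWritten, byteWritten, output)
are taken from the proposal quoted above. The free function encode() is
only a hypothetical stand-in built on TextEncoder: it handles UTF-8
only, assumes any supplied view is a typed array, and counts
"characters" as UTF-16 code units. It is not the proposed API itself.

```javascript
// Hypothetical stand-in for the proposed stringEncoding.encode(); UTF-8 only.
function encode(string, { view, encoding = "utf-8" } = {}) {
  if (encoding.toLowerCase() !== "utf-8") {
    throw new RangeError("this sketch only handles utf-8");
  }
  const encoder = new TextEncoder();
  if (!view) {
    // No destination supplied: allocate a right-sized Uint8Array.
    const output = encoder.encode(string);
    return { charactersWritten: string.length, byteWritten: output.length, output };
  }
  // Destination supplied: write as much as fits, stopping on a whole code point.
  const bytes = new Uint8Array(view.buffer, view.byteOffset, view.byteLength);
  const { read, written } = encoder.encodeInto(string, bytes);
  return { charactersWritten: read, byteWritten: written, output: view };
}

// Destructuring pulls out just the keys the caller cares about.
const { output } = encode("string");
const { charactersWritten, byteWritten } = encode("héllo", { view: new Uint8Array(3) });
```

In the second call only "hé" fits in the three-byte view, so the caller
can tell from charactersWritten where to resume.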

>>> It doesn't seem possible to implement the 'encode' function without
>>> doing multiple scans over the string. The implementation seems
>>> required both to check that the data can be decoded using the
>>> specified encoding, as well as check that the data will fit in the
>>> passed in buffer. Only then can the implementation start decoding the
>>> data. This seems problematic.
>>>
>>
>> Only if it guarantees that it doesn't write anything to the output buffer
>> unless the entire result will fit.  I don't think we need to do that; just
>> guarantee that it'll be truncated on a whole codepoint.
>>
>
> Agreed. Input/output dicts mean the API documentation a caller needs to
> read to understand the usage is more complex than a function signature
> which is why I resisted them, but it does seem like the best approach.
> Thanks for pushing, Glenn!
>
> In the create-a-buffer-on-the-fly case there will be some memory juggling
> going on, either by initially over allocating or reallocating/moving.

The implementation can always figure out what strategy fits its own
requirements best with regard to memory allocation. I suspect that
right now in Firefox the fastest implementation would be to scan
through the string once to measure the desired buffer size, then
allocate and write into the allocated buffer.
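
As a rough sketch of that measure-then-allocate strategy (not a claim
about how any engine actually does it), the first pass below counts the
UTF-8 bytes a string needs and the second allocates once and writes. The
names utf8Length() and encodeExact() are made up for illustration, and
the write pass simply leans on TextEncoder.

```javascript
// Pass 1: count the UTF-8 bytes needed, without allocating the output.
function utf8Length(string) {
  let bytes = 0;
  for (const ch of string) {            // iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x80) bytes += 1;
    else if (cp < 0x800) bytes += 2;
    else if (cp < 0x10000) bytes += 3;  // lone surrogates become U+FFFD, also 3 bytes
    else bytes += 4;
  }
  return bytes;
}

// Pass 2: allocate exactly once, then write into the buffer.
function encodeExact(string) {
  const output = new Uint8Array(utf8Length(string));
  new TextEncoder().encodeInto(string, output);
  return output;
}
```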

The problem is that the way that the encoding function is defined
right now, you are not allowed to write any data if you are throwing
for whatever reason, which means that you have to do a scan first to
see if you need to throw, and then do a separate pass to actually
encode the data. I think we need to change that so that when an
exception is thrown, data is written up to the point that causes the
exception.
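
Something along these lines is what I have in mind, as a sketch only.
The function encodeStrict() and the properties it hangs on the error
object are hypothetical, the destination is assumed to be a Uint8Array,
and a lone surrogate stands in for "whatever reason" the encoder has to
throw. The point is that it is a single pass: bytes written before the
failure stay in the buffer.

```javascript
// Hypothetical strict encoder: throws on a lone surrogate, but keeps whatever
// was already written to `view` before the offending code unit.
function encodeStrict(string, view) {
  const encoder = new TextEncoder();
  let read = 0;     // UTF-16 code units consumed so far
  let written = 0;  // bytes written so far
  for (const ch of string) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xd800 && cp <= 0xdfff) {
      const err = new Error("lone surrogate at index " + read);
      err.charactersWritten = read;
      err.byteWritten = written;
      throw err;
    }
    const result = encoder.encodeInto(ch, view.subarray(written));
    if (result.read === 0) break;  // out of space: stop on a whole code point
    written += result.written;
    read += ch.length;
  }
  return { charactersWritten: read, byteWritten: written, output: view };
}
```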

>>> I also don't think it's a good idea to throw an exception for encoding
>>> errors. Better to convert characters to the unicode replacement
>>> character. I believe we made a similar change to the WebSockets
>>> specification recently.
>>>
>>
>> Was that change made?  I filed
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
>> to be undecided.
>>
>
> Settling on an options dict means adding a flag to control this behavior
> (throws: true ?) doesn't extend the API surface significantly.

Sounds good to me. Though I would still strongly prefer the default to
be non-throwing, so as to minimize the risk of website breakage in the
case of bugs. Especially since these bugs are so data dependent and
are unlikely to show up on a developer's computer.
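
A sketch of how a non-throwing default might read on the decoding side.
The decode() wrapper and its "throws" key are hypothetical (the key name
is the one floated above); underneath it uses the TextDecoder "fatal"
option as found in current engines, which is an assumption about the
platform rather than part of this proposal.

```javascript
// Hypothetical wrapper: `throws` defaults to false, so bad input is
// replaced with U+FFFD instead of raising.
function decode(view, { encoding = "utf-8", throws = false } = {}) {
  return new TextDecoder(encoding, { fatal: throws }).decode(view);
}

const bad = new Uint8Array([0x61, 0xff, 0x62]); // 0xFF is never valid in UTF-8
const lenient = decode(bad);                    // replacement character in the middle

let strictError = null;
try {
  decode(bad, { throws: true });                // opt-in: raises on the bad byte
} catch (e) {
  strictError = e;
}
```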

/ Jonas