[whatwg] API for encoding/decoding ArrayBuffers into text

Wed Mar 21 06:54:35 PDT 2012

On Wed, Mar 21, 2012 at 3:27 AM, Jonas Sicking <jonas at sicking.cc> wrote:

> 1) Create an API which forces consumers to do state handling. Probably
> leading to people creating wrappers which essentially implement option 3
>

It's not the same.  Please look at how ISO-2022 works: the stream has
*long-lived* state, with escape sequences that change the meaning of later
code sequences in the stream.  For example, you have to remember whether GR
is encoding G1, G2 or G3.  This can't be captured merely by remembering the
input offset at which to resume.
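
To make the point concrete, here is a rough Python sketch.  It uses
ISO-2022-JP, where escape sequences redesignate G0 rather than switching
what GR maps to, so treat it as an analogy and not the exact case above.
The helper name decode_after is made up for illustration; the codec and
codecs.getincrementaldecoder are from Python's standard library.

```python
import codecs

ESC_TO_JIS = b"\x1b$B"    # escape sequence: designate JIS X 0208 into G0
ESC_TO_ASCII = b"\x1b(B"  # escape sequence: designate ASCII back into G0
PAIR = b'$"'              # the two bytes 0x24 0x22


def decode_after(prefix: bytes, chunk: bytes) -> str:
    """Decode `chunk` with a decoder that has already seen `prefix`.

    Hypothetical helper: it only exists to show that the result depends
    on what came earlier in the stream, not just on `chunk` itself.
    """
    decoder = codecs.getincrementaldecoder("iso2022_jp")()
    decoder.decode(prefix)  # output discarded; we only want the state change
    return decoder.decode(chunk)


# Same two bytes, three different histories.
in_ascii = decode_after(b"", PAIR)
in_jis = decode_after(ESC_TO_JIS, PAIR)
back_in_ascii = decode_after(ESC_TO_JIS + PAIR + ESC_TO_ASCII, PAIR)
```

The same two bytes come out as the ASCII characters $ and " in the first
and third cases, and as the single character U+3042 in the second.
Knowing where to resume in the input tells you nothing about which of
those applies.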

As Yui said, the sort of state UTF-8 has isn't what people mean when they
talk about "stateful encodings".

On Wed, Mar 21, 2012 at 3:34 AM, NARUSE, Yui <naruse at airemix.jp> wrote:

> For streaming conversion, it needs state even if the encoding is stateless.
> When the given partial input is finished at the middle of a character
> like "\xE3\x81\x82\xC2", the conversion consumes 4 bytes, output one
> character "\u3042", and remember the partial bytes "\xC2". This bytes is
> the state.
>

You don't need to do that.  You can simply decode as many codepoints as
can be *completely* converted.  In this example, you'd consume 3 bytes and
output one codepoint.  You don't consume data that you can't immediately
convert, so the decoder doesn't have to buffer anything; the caller passes
the unconsumed bytes again along with the next chunk.

(We don't have to do it that way, of course; just pointing out that you
don't *need* special state for streaming encodings like UTF-8.)
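
A minimal sketch of that approach in Python, for illustration only.  The
function name decode_complete_utf8 and its (text, consumed) return shape
are assumptions of mine, not anything proposed for the API.

```python
from typing import Tuple


def decode_complete_utf8(data: bytes) -> Tuple[str, int]:
    """Decode only the complete UTF-8 sequences at the start of `data`.

    Returns the decoded text and the number of bytes consumed.  A
    sequence that is cut off at the end of `data` is left unconsumed, so
    the decoder itself keeps no state between calls.
    """
    end = len(data)
    # A truncated sequence can leave at most 3 bytes at the tail, so look
    # back that far for a lead byte that promises more than is present.
    for back in range(1, min(3, end) + 1):
        byte = data[end - back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for its lead byte
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1
        if needed > back:
            end -= back  # incomplete tail: do not consume it
        break
    # Invalid bytes inside the consumed prefix become U+FFFD.
    return data[:end].decode("utf-8", errors="replace"), end


# The example from the quoted message: 3 bytes consumed, 1 left over.
text, consumed = decode_complete_utf8(b"\xE3\x81\x82\xC2")
leftover = b"\xE3\x81\x82\xC2"[consumed:]

# The caller prepends the leftover bytes to the next chunk.
text2, consumed2 = decode_complete_utf8(leftover + b"\xA9")
```

The only thing carried between calls is the caller's own slice of
unconsumed input, which is exactly "the next input byte you have to start
at".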

> Anyway, they need error if the byte sequence is invalid for the encoding.
>

Errors were discussed previously: by default, errors produce U+FFFD when
decoding (or another replacement character when encoding characters that a
non-Unicode target encoding can't represent), and we may have an option to
throw an exception instead.

-- 
Glenn Maynard