[whatwg] Encoding: API

Wed Oct 10 23:36:36 PDT 2012

On Thu, Oct 11, 2012 at 6:09 AM, Anne van Kesteren <annevk at annevk.nl> wrote:
> On Wed, Oct 10, 2012 at 7:28 PM, Joshua Bell <jsbell at chromium.org> wrote:
>> Practically speaking, this would mean refactoring the combined spec so that
>> the current BOM handling is defined for parsing web content outside of the
>> API rather than requiring the API to hack around it.
>
> You would still get the hack because the API requires special
> treatment for "utf-16". Given that per Unicode "utf-16le" and
> "utf-16be" outlaw the BOM, maybe a good solution would be a flag to
> disable BOM handling as seen by the decode algorithm? So the decoder
> gets a disableBOM flag that defaults to false? That would only require
> a special case for BOM handling on top of what there is today, which
> seems a fair bit cleaner.

The main problem with this is that you would get a leading BOM in the
decoded output for utf-8 if the content includes one. That is an
unlikely scenario, but maybe we want to take care of it. Another
approach I thought about is to have an "API decode" algorithm, which
is very similar to

http://encoding.spec.whatwg.org/#decode

However, instead of setting the encoding, it checks whether the
leading bytes match a BOM and whether that BOM's encoding matches the
given encoding, and only then does it set the offset. So the BOM would
be skipped for utf-8/utf-16 if it was a valid BOM, but a BOM that does
not match the given encoding would never switch the encoding.
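
A rough sketch of that "API decode" idea, in Python rather than spec
prose. The names (BOMS, api_decode) are made up for illustration and
are not from the Encoding spec; the table only covers utf-8, utf-16le
and utf-16be, and how a plain "utf-16" label should be treated is left
out on purpose.

```python
# Hypothetical sketch of an "API decode" step; not the spec algorithm.
# Assumed BOM table: one BOM per encoding label.
BOMS = {
    "utf-8": b"\xef\xbb\xbf",
    "utf-16le": b"\xff\xfe",
    "utf-16be": b"\xfe\xff",
}


def api_decode(data: bytes, encoding: str) -> str:
    """Decode `data` as `encoding`, skipping a BOM only if it matches.

    The caller's encoding is never changed: a BOM that belongs to a
    different encoding is left in place and decoded like any other bytes.
    """
    bom = BOMS.get(encoding)
    # Set the offset only when the leading bytes are the BOM for the
    # encoding that was actually requested.
    offset = len(bom) if bom is not None and data.startswith(bom) else 0
    # errors="replace" stands in for the spec's replacement behavior.
    return data[offset:].decode(encoding, errors="replace")
```

With this, api_decode(b"\xef\xbb\xbfhi", "utf-8") gives "hi", while
the same call on bytes starting with FF FE still decodes as utf-8
instead of switching to utf-16le.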

The behavior of the normal decode algorithm does not need to be
exposed through the API, I think, unless a use case comes up at some
point.

-- 
http://annevankesteren.nl/