[whatwg] StringEncoding open issues

Thu Aug 16 16:54:43 PDT 2012

On Wed, Aug 15, 2012 at 5:30 PM, Glenn Maynard <glenn at zewt.org> wrote:

> On Tue, Aug 14, 2012 at 12:34 PM, Joshua Bell <jsbell at chromium.org> wrote:
>
>>    - Create a decoder with TextDecoder() and, if present, a BOM will be
>>      respected (and consumed); otherwise default to UTF-8
>>
>
> Let's not default to "autodetect Unicode formats".  It encourages people
> to support UTF-16 when they may not mean to.  If BOM detection for both
> UTF-8 and UTF-16 is wanted, I'd suggest something explicit, like "utf-*".
>
> If the argument to the ctor is optional, I think the default should be
> purely UTF-8.
>

Works for me. In the algorithm specified in the email, this simply removes
the clause "If encoding is not specified, set an internal useBOM flag".
That is, only "utf-16" gets the useBOM flag.
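
As a rough sketch of the resulting behavior (the function name and the
little-endian fallback when no BOM is present are my assumptions for
illustration, not spec text), label handling might look like this:

```typescript
// Hypothetical helper, not spec text. It only decides which encoding
// to use; it does not decode anything.
function resolveEncoding(label: string | undefined, bytes: Uint8Array): string {
  // No argument: purely UTF-8, with no Unicode autodetection.
  const name = label === undefined ? "utf-8" : label.toLowerCase();

  // Only "utf-16" carries the useBOM flag; every other label is
  // taken at face value.
  if (name !== "utf-16") {
    return name;
  }

  // useBOM: let a leading BOM pick the byte order.
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf-16le";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return "utf-16be";
  }

  // Assumption: with no BOM present, fall back to little-endian.
  return "utf-16le";
}
```

Note that with no label, input starting with FF FE is still treated as
UTF-8 in this sketch, which is the point of dropping the default
autodetection.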

I'll attempt to wedge this into the spec soon.

>> This gets easier if we restrict to encoding UTF-8, which typically doesn't
>> include BOMs. But it's looking like there's enough desire to keep UTF-16
>> encoding at the moment. Agree with just stripping it for now.
>>
>
> UTF-8 sometimes does have a BOM, especially in Windows where applications
> sometimes use it to distinguish UTF-8 from ACP text files (which are just
> as common as ever--Windows has made no motion away from legacy encodings
> whatsoever).
>

Good point. Ah, Notepad, my old friend...

> Stripping the BOM can cause those applications to misinterpret the files
> as ACP.
>
> Anyway, even if the encoding API gives a "helper" for this, figuring out
> how that works would probably be more effort for developers than just
> peeking at the ArrayBuffer for the BOM and adding it back in manually.
> (I'm pretty sure anybody who knows enough to pay attention to this in the
> first place will have no trouble doing that.)  So, yeah, let's not worry
> about this.
>
> --
> Glenn Maynard
>
>
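
For reference, the manual approach Glenn describes (peek at the bytes
for a UTF-8 BOM and add it back if it is missing) could be sketched
roughly as follows. The helper names are made up for illustration, and
the code works on a Uint8Array view, which can be obtained from an
ArrayBuffer with new Uint8Array(buffer):

```typescript
// The UTF-8 encoding of U+FEFF (the byte order mark).
const UTF8_BOM = [0xef, 0xbb, 0xbf];

// Hypothetical helper: true if the bytes start with a UTF-8 BOM.
function hasUtf8Bom(bytes: Uint8Array): boolean {
  return (
    bytes.length >= 3 &&
    bytes[0] === UTF8_BOM[0] &&
    bytes[1] === UTF8_BOM[1] &&
    bytes[2] === UTF8_BOM[2]
  );
}

// Hypothetical helper: returns the input unchanged if it already has a
// BOM, otherwise a new array with the BOM prepended.
function withUtf8Bom(bytes: Uint8Array): Uint8Array {
  if (hasUtf8Bom(bytes)) {
    return bytes;
  }
  const out = new Uint8Array(bytes.length + UTF8_BOM.length);
  out.set(UTF8_BOM, 0);
  out.set(bytes, UTF8_BOM.length);
  return out;
}
```

An application that relies on the BOM to tell UTF-8 from ACP text could
run encoded output through something like withUtf8Bom() before saving.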