[whatwg] StringEncoding: Allowed encodings for TextEncoder

Fri Aug 10 21:16:48 PDT 2012

On Thu, Aug 9, 2012 at 10:42 AM, Joshua Bell <jsbell at chromium.org> wrote:
> On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell <jsbell at chromium.org> wrote:
>
>>
>>
>> On Wed, Aug 8, 2012 at 2:48 AM, James Graham <jgraham at opera.com> wrote:
>>
>>> On 08/07/2012 07:51 PM, Jonas Sicking wrote:
>>>
>>>> I don't mind supporting *decoding* from basically any encoding that
>>>> Anne's spec enumerates. I don't see a downside with that since I
>>>> suspect most implementations will just call into a generic decoding
>>>> backend anyway, and so supporting the same set of encodings as for
>>>> other parts of the platform should be relatively easy.
>>>>
>>>
>>> [...]
>>>
>>>
>>>> However I think we should consider restricting support to a smaller
>>>> set of encodings for *encoding*. There should be little reason
>>>> for people today to produce text in non-utf formats. We might even be
>>>> able to get away with only supporting UTF8, though I wouldn't be
>>>> surprised if there are reasonably modern file formats which use utf16.
>>>>
>>>
>>> FWIW, I agree with the decode-from-all-platform-encodings
>>> encode-to-utf[8|16] position.
>>>
>>
>> Any disagreement on limiting the supported encodings to utf-8, utf-16, and
>> utf-16be, while permitting decoding of all encodings in the Encoding spec?
>>
>> (This eliminates the "what to do on encoding error" issue nicely, still
>> need to resolve the BOM issue though.)
>>
>
> http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
> supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.
>
> I'm tempted to take it further to just UTF-8 and see if anyone complains.
>
> Jury is still out on the decode-with-BOM issue - I need to reason through
> Glenn's suggestions on the "open issues" thread.
>
> I added a related open issue raised by Glenn, summarized as "... suggest
> that the .encoding attribute simply return the name that was passed to
> the constructor." - taking this further, perhaps the attribute should be
> eliminated as callers could apply it themselves.

The spec now contains the following text:

"NOTE: Because only UTF encodings are supported, and because of the
algorithm used to convert a DOMString to a sequence of Unicode
characters, no input can cause the encoding process to emit an encoder
error."

This is not correct. A DOMString is not a sequence of Unicode
characters; it is a sequence of 16-bit code units, nominally UTF-16
(this is per ECMAScript). It can therefore contain unpaired surrogates,
so the encoding process can result in encoder errors.

As I've suggested earlier, I think we should deal with this by simply
emitting the Unicode replacement character (U+FFFD) for each of these
encoder errors (i.e. for each unpaired surrogate).
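
To make the suggestion concrete, here is a rough sketch of what I have
in mind. It is not spec text; the names toScalarValues and encodeUtf8
are made up for illustration. The first function walks the 16-bit code
units of a string and substitutes U+FFFD for any unpaired surrogate; the
second encodes the resulting scalar values as UTF-8.

```typescript
// Hypothetical helper (not from the spec): convert a string's 16-bit code
// units to Unicode scalar values, replacing unpaired surrogates with U+FFFD.
function toScalarValues(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Lead surrogate: only valid when a trail surrogate follows.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
        i++; // consume the trail surrogate as well
      } else {
        out.push(0xfffd); // unpaired lead surrogate
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      out.push(0xfffd); // trail surrogate with no preceding lead
    } else {
      out.push(unit);
    }
  }
  return out;
}

// Hypothetical helper: encode scalar values as UTF-8 bytes. Because the
// input never contains surrogates, no encoder error can occur here.
function encodeUtf8(s: string): number[] {
  const bytes: number[] = [];
  const scalars = toScalarValues(s);
  for (let i = 0; i < scalars.length; i++) {
    const cp = scalars[i];
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}
```

With this, encodeUtf8("a\uD800b") yields the bytes 61 EF BF BD 62: the
lone lead surrogate becomes U+FFFD rather than an error, and a valid
surrogate pair still encodes as a single four-byte sequence.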

/ Jonas