[whatwg] [encoding] utf-16

Anne van Kesteren annevk at opera.com
Wed Dec 28 07:13:56 PST 2011

On Wed, 28 Dec 2011 12:30:49 +0100, Leif Halvard Silli  
<xn--mlform-iua at målform.no> wrote:
> I spotted a shortcoming in your testing:
>> I ran some utf-16 tests using 007A as input data, optionally preceded by
>> FFFE or FEFF, and with utf-16, utf-16le, and utf-16be declared in the
>> Content-Type header. For WebKit I tested both Safari 5.1.2 and Chrome
>> 17.0.963.12. Trident is Internet Explorer 9 on Windows 7. Presto is Opera
>> 11.60. Gecko is Nightly 12.0a1 (2011-12-26).
>> HTTP      BOM   Trident  WebKit  Gecko  Presto
>> utf-16    -     7A00     7A00    007A   007A
>> utf-16le  -     7A00     7A00    7A00   7A00
>> utf-16be  -     007A     007A    007A   007A
> The above test row is not complete. You should also run a BOM-less test
> using the UTF-16 label but where the 007A is represented in the
> big-endian way - a bit like I did here:
> <http://malform.no/testing/utf/#html-table-7>. Then you get as a result
> that Opera and Firefox do not take it as a given that files sent as
> 'utf-16' are big-endian:
>   utf-16    -     gibb*    gibb*   007A   007A
> *gibb = gibberish/mojibake.

I get U+7A00 as I indicated above. I would not qualify that as gibberish  
personally. (My table is somewhat confusing as input 007A was meant to  
describe octets, but the table describes code points.)
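
A minimal sketch of the octets-versus-code-points distinction, assuming
Python's built-in utf-16-le and utf-16-be codecs (the helper name
code_points is hypothetical, and this is not how the browser tests were run).
It decodes the same two octets, 00 7A, under each byte order:

```python
# The two octets from the BOM-less big-endian test: 0x00 followed by 0x7A.
octets = bytes([0x00, 0x7A])


def code_points(data, encoding):
    """Decode data with the given codec and list the resulting code points."""
    return ["U+%04X" % ord(ch) for ch in data.decode(encoding)]


# Read as little-endian, the code unit is 0x7A00.
little = code_points(octets, "utf-16-le")
# Read as big-endian, the code unit is 0x007A.
big = code_points(octets, "utf-16-be")

print(little, big)
```

Under these assumptions the little-endian reading gives U+7A00 and the
big-endian reading gives U+007A, which matches the two kinds of cells in
the table.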

Anyway, per  
Presto and Gecko do have some magic, but it seems better if they behaved the
same as Trident (and WebKit).

> That the BOM is removed from the output for utf-16be labelled files,
> means that the 'utf-16be' labelled file nevertheless is treated as
> UTF-16 (per UTF-16's specification). (Otherwise, if it had not been
> removed, the BOM character should have caused quirks mode.)
> Taking what you did not test for into account, it would make sense if
> 'utf-16' continues to be treated as a label under which both big-endian
> and little-endian can be expected. And thus, that WebKit and IE start to
> detect when UTF-16 is big-endian, but without a BOM.

I am not sure what you are trying to say here.

Anne van Kesteren
