[whatwg] [encoding] utf-16
Anne van Kesteren
annevk at opera.com
Wed Dec 28 07:13:56 PST 2011
On Wed, 28 Dec 2011 12:30:49 +0100, Leif Halvard Silli
<xn--mlform-iua at målform.no> wrote:
> I spotted a shortcoming in your testing:
>
>> I ran some utf-16 tests using 007A as input data, optionally preceded by
>> FFFE or FEFF, and with utf-16, utf-16le, and utf-16be declared in the
>> Content-Type header. For WebKit I tested both Safari 5.1.2 and Chrome
>> 17.0.963.12. Trident is Internet Explorer 9 on Windows 7. Presto is
>> Opera 11.60. Gecko is Nightly 12.0a1 (2011-12-26).
>>
>> HTTP      BOM  Trident  WebKit  Gecko  Presto
>> utf-16    -    7A00     7A00    007A   007A
>> utf-16le  -    7A00     7A00    7A00   7A00
>> utf-16be  -    007A     007A    007A   007A
>
> The above test row is not complete. You should also run a BOM-less test
> using the UTF-16 label but where the 007A is represented in the
> big-endian way - a bit like I did here:
> <http://malform.no/testing/utf/#html-table-7>. Then you get as a result
> that Opera and Firefox do not take it for a given that files sent as
> 'utf-16' are big-endian:
>
> utf-16 - gibb* gibb* 007A 007A
>
> *gibb = gibberish/mojibake.
I get U+7A00 as I indicated above. I would not qualify that as gibberish
personally. (My table is somewhat confusing as input 007A was meant to
describe octets, but the table describes code points.)
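
To make the octets-versus-code-points distinction concrete, here is a minimal sketch. It assumes Python's built-in utf-16-le and utf-16-be codecs as stand-ins for a decoder; it is not what any of the tested engines actually run, and the helper name code_points is made up for the illustration.

```python
# The two input octets from the test, with no BOM in front of them.
octets = bytes([0x00, 0x7A])

# Decode the same octets under each byte order.
little = octets.decode("utf-16-le")  # low octet first: 00 7A reads as 0x7A00
big = octets.decode("utf-16-be")     # high octet first: 00 7A reads as 0x007A

def code_points(text):
    """Hypothetical helper: format each character as U+XXXX."""
    return ["U+%04X" % ord(ch) for ch in text]

print(code_points(little))  # ['U+7A00']
print(code_points(big))     # ['U+007A']
```

Read this way, the 7A00 cells in the table are the octets 00 7A decoded as little-endian, and the 007A cells are the same octets decoded as big-endian.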
Anyway, per
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-July/021102.html
Presto and Gecko do have some magic, but it would seem better if they
behaved the same as Trident (and WebKit).
> That the BOM is removed from the output for utf-16be labelled files,
> means that the 'utf-16be' labelled file nevertheless is treated as
> UTF-16 (per UTF-16's specification). (Otherwise, if it had not been
> removed, the BOM character should have caused quirks mode.)
>
> Taking what you did not test for into account, it would make sense if
> 'utf-16' continues to be treated as a label under which both big-endian
> and little-endian can be expected. And thus, that WebKit and IE start to
> detect when UTF-16 is big-endian, but without a BOM.
I am not sure what you are trying to say here.
--
Anne van Kesteren
http://annevankesteren.nl/