[whatwg] [encoding] utf-16

Wed Dec 28 03:30:49 PST 2011

Anne van Kesteren Tue Dec 27 06:52:01 PST 2011:

I spotted a shortcoming in your testing:

> I ran some utf-16 tests using 007A as input data, optionally preceded by  
> FFFE or FEFF, and with utf-16, utf-16le, and utf-16be declared in the  
> Content-Type header. For WebKit I tested both Safari 5.1.2 and Chrome  
> 17.0.963.12. Trident is Internet Explorer 9 on Windows 7. Presto is Opera  
> 11.60. Gecko is Nightly 12.0a1 (2011-12-26).
> 
> HTTP      BOM   Trident  WebKit  Gecko  Presto
> utf-16    -     7A00     7A00    007A   007A
> utf-16le  -     7A00     7A00    7A00   7A00
> utf-16be  -     007A     007A    007A   007A

The test row above is incomplete. You should also run a BOM-less test 
using the UTF-16 label but where the 007A is represented in the 
big-endian way - a bit like I did here: 
<http://malform.no/testing/utf/#html-table-7>. Then you get as a result 
that Opera and Firefox do not take it as a given that files sent as 
'utf-16' are big-endian:

  utf-16    -     gibb*    gibb*   007A   007A

*gibb = gibberish/mojibake.
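
To make the byte-order point concrete, here is a minimal Python sketch (Python and the helper name code_units are my own choices for illustration, they are not part of either test suite). It decodes the same two bytes under each explicit byte order, which is roughly the difference between the 007A and the gibberish columns:

```python
# The two bytes of U+007A ("z") serialized big-endian, with no BOM.
big_endian_bytes = b"\x00\x7a"


def code_units(data: bytes, codec: str) -> str:
    """Decode with an explicit byte order and return the code points as hex."""
    return " ".join(f"{ord(ch):04X}" for ch in data.decode(codec))


# Read with the matching byte order, the bytes give the intended character.
print(code_units(big_endian_bytes, "utf-16-be"))  # 007A

# Read with the opposite byte order, the same bytes give U+7A00 (mojibake).
print(code_units(big_endian_bytes, "utf-16-le"))  # 7A00
```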

> utf-16    FFFE  7A00     7A00    7A00   7A00
> utf-16le  FFFE  7A00     7A00    7A00   7A00
> utf-16be  FFFE  7A00     7A00    FFFD*  FFFD*
> 
> utf-16    FEFF  007A     007A    007A   007A
> utf-16le  FEFF  007A     007A    FFFD** FFFD**
> utf-16be  FEFF  007A     007A    007A   007A
> 
> * Gecko decodes FFFE 007A as FFFD followed by FE00 presumably dropping the  
> 7A. Opera decodes it as FFFD 007A.
> ** Gecko decodes FEFF 007A as FFFD followed by 00FF presumably dropping the  
> 7A. Opera decodes it as FFFD 7A00.
> 
> It seems in Trident/WebKit utf-16 and utf-16le are labels for the same  
> encoding and the BOM is more important than the encoding. Gecko and Presto  
> match existing specifications around utf-16 with different error handling  
> (afaict).
> 
> I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should  
> follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in  
> absence of a BOM. utf-16le becomes a label for utf-16. A BOM overrides the  
> direction (of utf-16 / utf-16be) and is removed from the output.
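
As I read it, the proposal quoted above could be sketched roughly as follows. This is only my interpretation in Python; the function name decode_utf16_proposed and the choice of errors="replace" are assumptions of mine, not text from the draft:

```python
def decode_utf16_proposed(data: bytes, label: str) -> str:
    """Sketch of the proposed rule: a BOM wins over the label and is dropped;
    without a BOM, only an explicit utf-16be label selects big-endian."""
    label = label.lower()
    if label not in ("utf-16", "utf-16le", "utf-16be"):
        raise ValueError(f"not a UTF-16 label: {label!r}")

    # A BOM overrides the declared direction and is removed from the output.
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le", errors="replace")
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be", errors="replace")

    # No BOM: utf-16 and utf-16le are treated as the same (little-endian).
    codec = "utf-16-be" if label == "utf-16be" else "utf-16-le"
    return data.decode(codec, errors="replace")
```

With the bytes 00 7A, optionally preceded by a BOM, this reproduces the Trident/WebKit columns of the table: for example decode_utf16_proposed(b"\xff\xfe\x00\x7a", "utf-16be") gives U+7A00, and decode_utf16_proposed(b"\xfe\xff\x00\x7a", "utf-16le") gives U+007A.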

That the BOM is removed from the output for files labelled 'utf-16be' 
means that such a file is nevertheless treated as UTF-16 (per UTF-16's 
specification). (Otherwise, had it not been removed, the BOM character 
should have caused quirks mode.)

Taking what you did not test for into account, it would make sense if 
'utf-16' continues to be treated as a label under which both big-endian 
and little-endian can be expected, and thus if WebKit and IE start to 
detect when BOM-less UTF-16 is big-endian.
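
For illustration only, one way such detection could work is to look at where the zero bytes fall. This is a hypothetical heuristic of my own (the name guess_utf16_byte_order included); I am not claiming that it is what Gecko or Presto actually do:

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of BOM-less UTF-16 from the position of zero bytes.

    In mostly-ASCII markup the high byte of each code unit is zero, so
    big-endian data has its zeros at even offsets and little-endian data
    has them at odd offsets. Ties fall back to little-endian.
    """
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    return "utf-16-be" if even_zeros > odd_zeros else "utf-16-le"


# A BOM-less big-endian "<html>" is recognised and decodes cleanly.
sample = "<html>".encode("utf-16-be")
print(sample.decode(guess_utf16_byte_order(sample)))  # <html>
```

A guess like this only works well for text dominated by ASCII-range characters, so it could at most supplement, not replace, a BOM or an explicit utf-16be/utf-16le label.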
-- 
Leif H Silli