[whatwg] [encoding] utf-16

Fri Dec 30 22:34:14 PST 2011

Anne van Kesteren  Fri, 30 Dec 2011 11:54:34 +0100
> On Fri, 30 Dec 2011 05:51:16 +0100, Leif Halvard Silli:
>> The Trident cache behaviour is a symptom of its over all UTF-16
>> behaviour: Apart from reading the BOM, it doesn't do any UTF-16
>> sniffing. I suspect that you want Opera/Firefox to become "as bad" at
>> 'getting' the UTF-16 encoding as Webkit/IE are? (Note that Webkit is
>> worse than IE - just to, once again, emphasize how difficult it is to
>> replicate IE.)
> 
> How is WebKit worse than IE?

For HTML: If HTTP says 'WINDOWS-1252' but the page is little-endian 
UTF-16 without the BOM, then IE will render the page as WINDOWS-1252, 
and this will actually work - at least in some circumstances ... Check: 
<http://www.acsd.k12.sc.us/wwes/>. (There could be other pages that IE 
handles but that don't fall into this category.)
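
One possible reading of why this can 'work' (my assumption, not something 
the page above establishes): in BOM-less little-endian UTF-16, every ASCII 
character is followed by a 0x00 octet, so a WINDOWS-1252 decoder sees the 
markup interleaved with U+0000, and a parser that drops those gets the 
text back. A minimal Python sketch, with a made-up sample string:

```python
# Hypothetical sample markup; any ASCII-only text behaves the same way.
text = "<p>Hello</p>"

# Little-endian UTF-16 without a BOM: each ASCII char becomes "char, 0x00".
raw = text.encode("utf-16-le")

# Decode the octets as WINDOWS-1252, as the HTTP header asks for.
as_1252 = raw.decode("cp1252")

# Assumption: the parser ignores U+0000. Dropping them restores the text.
recovered = as_1252.replace("\x00", "")
print(recovered)
```

Non-ASCII characters would not survive this, which fits the 'at least in 
some circumstances' caveat.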

For XHTML: In the 'nude' tests 
<http://malform.no/testing/utf/#xml-table-1>, Webkit is worse than 
Trident <http://malform.no/testing/utf/#xml-table-1-results>. (Trident 
performs a variant of the sniffing described in XML 1.0, whereas Webkit 
does not sniff at all unless there is an XML prolog.)
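
For reference, the UTF-16 part of the autodetection described in XML 1.0 
(Appendix F) boils down to looking at the first octets for a BOM or for 
'<?' in either byte order. A simplified Python sketch (the function name 
is mine; the UCS-4 and EBCDIC cases are left out, and this is not a claim 
about what Trident's variant actually does):

```python
def sniff_xml_encoding(head: bytes):
    """Guess the encoding family from the first octets of an XML entity."""
    # With a byte order mark.
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # Without a BOM: look for "<?" encoded as two-octet code units.
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # "<?xm" in an ASCII-compatible encoding: read the encoding declaration.
    if head.startswith(b"<?xm"):
        return "ascii-compatible"
    # Nothing recognisable; the caller falls back to its default.
    return None


print(sniff_xml_encoding('<?xml version="1.0"?>'.encode("utf-16-be")))
```

Note that the two BOM-less UTF-16 branches only fire when the file starts 
with an XML prolog.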

> And why should there be UTF-16 sniffing?

FIRST: What is 'UTF-16 sniffing'? Reading the BOM is a form of sniffing. 
The HTML5 character encoding *sniffing* algorithm covers UTF-16 as well. 
Should we single out UTF-16 as the one encoding that should not be 
sniffed?

     What do browser vendors think?

Based on the tests at <http://malform.no/testing/utf/>, it seems 
that IE performs no UTF-16 detection/sniffing beyond using HTTP, using 
the BOM and - as a last resort - reading the META element (including the 
MS 'unicode' and MS 'unicodeFFFE' values, which Webkit also reads). 
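
Roughly, that amounts to the following. This is a Python sketch under my 
own assumptions: the function and parameter names are invented, the 
relative order of HTTP and the BOM is a guess (the tests only show that 
META comes last), and I take 'unicode' to mean little-endian and 
'unicodeFFFE' to mean big-endian:

```python
import codecs

# Assumed meaning of the two MS-specific labels.
MS_LABELS = {"unicode": "utf-16-le", "unicodefffe": "utf-16-be"}


def normalize(label):
    """Lower-case a charset label and resolve the MS aliases."""
    label = label.strip().lower()
    return MS_LABELS.get(label, label)


def pick_encoding(http_charset, head, meta_charset):
    """Return an encoding using HTTP, then the BOM, then META."""
    if http_charset:
        return normalize(http_charset)
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if meta_charset:
        return normalize(meta_charset)
    # No further sniffing: the caller applies its default encoding.
    return None


print(pick_encoding(None, b"<html>", "unicodeFFFE"))
```

There is no content-based guessing step anywhere in that chain, which is 
the point.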

But for HTML, Trident - unlike Webkit - does not make use of the 
XML encoding declaration for detecting encoding: 
<http://malform.no/testing/utf/#html-table-4>. And for HTML, 
Trident - unlike Webkit - does not make use of the XML prolog (no, 
not the encoding declaration) for sniffing the endianness of UTF-16 
files: <http://malform.no/testing/utf/#html-table-9>.

Aligning with IE would mean that Opera, Mozilla and Webkit must 
'degenerate' their heuristics. Why would a vendor want to become less 
compatible with the Web?

>> But is the little endian defaulting really important?
>> Over all, proper UTF-16 treatment (read: sniffing) on IE/WEbkit's part,
>> would probably improve the situation more.
> 
> You mean there are sites that only work in Gecko/Presto?

'Sites' is perhaps a big word - 'UTF-16' pages are often lone pages, it 
seems. But yes, obviously. E.g. big-endian UTF-16 labelled pages 
without a BOM. 

But, oops: It seems like Firefox does not use the META element anymore. 
It used to use the META element, in Firefox 3, but apparently stopped 
doing that - maybe they misread the HTML5 algorithm ... Nevertheless, 
I have come across pages that work in Firefox/Opera but not Trident.

MS Word, which often makes these pages, can save both big- and 
little-endian.

>> I know ... And it precisely therefore that it would have been an
>> advantage to, for the Web, focus on *requiring* the BOM for UTF-16.
> 
> It seems simpler to focus on promoting only UTF-8.

It seems simple enough to say that the BOM must be used. Saying something 
like that is no different from saying that a certain range of 
WINDOWS-1252 must not be used, is it?

>>> Yeah, I'm going to file a new bug so we can reconsider although the  
>>> octet sequence the various BOMs represent can have legitimate meanings  
>>> in certain encodings,
>> 
>> You mean: In addition to the BOM meaning, I suppose.
> 
> No. In e.g. windows-1258 there is no BOM and FF FE simply means U+00FF  
> U+20AB.

I think we have the same thing in mind. And btw, Google Search displays 
many such letters in UTF-16 encoded pages ... instead of displaying the 
content. Apparently, Google *fails* to consider the BOM octets magic ... 
Or maybe it is UTF-16-negative ...
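
To make the two readings of FF FE concrete, here is a small Python check 
(Python's 'cp1258' and 'cp1252' codecs stand in for windows-1258 and 
WINDOWS-1252; the 'Hi' page content is made up):

```python
bom_octets = b"\xff\xfe"

# Read as legacy single-byte text, the octets are two ordinary characters.
as_1258 = bom_octets.decode("cp1258")
as_1252 = bom_octets.decode("cp1252")

# Read as a UTF-16 BOM, they are consumed and only select the byte order.
page = bom_octets + "Hi".encode("utf-16-le")
as_utf16 = page.decode("utf-16")

print([f"U+{ord(c):04X}" for c in as_1258])
print([f"U+{ord(c):04X}" for c in as_1252])
print(repr(as_utf16))
```

The first list is the U+00FF U+20AB pair mentioned above; the second is 
what a consumer shows if it treats the page as WINDOWS-1252 instead of 
treating the BOM as magic.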

>>> it seems in practice people use them for Unicode.
>>> (Helped by the fact that Trident/WebKit behave this way of course.)
>> 
>> Don't forget the fact that Presto/Gecko do not move the BOM into the
>> <body> when you use UTF-16LE/BE, like they - per the spec of those
>> encodings - should do. See:
>> <http://bugzilla.validator.nu/show_bug.cgi?id=890>
> 
> Well yes, that's why I'm planning to define utf-16 more in line with  
> implementations (and render the current text obsolete I suppose).

You don't need, for that reason, to follow a strategy that nullifies 
UTF-16LE/UTF-16BE. I outlined another strategy: Say that all HTML pages 
are interpreted as being 'UTF-16', even if they are mislabelled with the 
BOM-less UTF-16LE/UTF-16BE labels.
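
A rough Python sketch of what I mean (the function name is invented, and 
the little-endian fallback for a BOM-less page with a plain 'UTF-16' label 
is only an assumption for the sake of the example):

```python
import codecs


def decode_as_utf16(raw, label):
    """Treat UTF-16LE/UTF-16BE labels like 'UTF-16': a BOM, if any, wins."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    # No BOM: only now does the label decide the byte order.
    # Assumption: anything other than an explicit BE label is little-endian.
    fallback = "utf-16-be" if label.strip().lower() == "utf-16be" else "utf-16-le"
    return raw.decode(fallback)


page = codecs.BOM_UTF16_LE + "<p>".encode("utf-16-le")

# A strict UTF-16LE decoder keeps the BOM as a leading U+FEFF character.
strict = page.decode("utf-16-le")
# The strategy above consumes it instead.
lenient = decode_as_utf16(page, "UTF-16LE")

print(repr(strict), repr(lenient))
```

With this, a page labelled UTF-16LE but saved with a BOM never gets the 
BOM moved into the <body>, and the LE/BE labels still mean something for 
BOM-less pages.
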
-- 
Leif H Silli