[whatwg] [encoding] utf-16
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Fri Dec 30 22:34:14 PST 2011
Anne van Kesteren Fri, 30 Dec 2011 11:54:34 +0100
> On Fri, 30 Dec 2011 05:51:16 +0100, Leif Halvard Silli:
>> The Trident cache behaviour is a symptom of its overall UTF-16
>> behaviour: Apart from reading the BOM, it doesn't do any UTF-16
>> sniffing. I suspect that you want Opera/Firefox to become "as bad" at
>> 'getting' the UTF-16 encoding as Webkit/IE are? (Note that Webkit is
>> worse than IE - just to, once again, emphasize how difficult it is to
>> replicate IE.)
>
> How is WebKit worse than IE?
For HTML: If HTTP says 'WINDOWS-1252' but the page is little-endian
UTF-16 without the BOM, then IE will render the page as WINDOWS-1252,
and this will actually work - at least in some circumstances ... Check:
<http://www.acsd.k12.sc.us/wwes/>. (There could be other pages that IE
handles, but which don't fall into this category.)
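To make the 'actually work' part concrete: when the content is plain
ASCII, a BOM-less little-endian UTF-16 stream is just the ASCII bytes
with a NUL after each of them, so a parser that drops NULs recovers the
markup. A minimal Python sketch (hypothetical markup, not that page's
content):

    # Hypothetical ASCII-only markup, encoded as BOM-less UTF-16LE.
    markup = "<p>Hello</p>"
    raw = markup.encode("utf-16-le")         # b'<\x00p\x00>\x00H\x00...'
    as_cp1252 = raw.decode("windows-1252")   # every byte maps to a character
    # A parser that ignores U+0000 ends up with the original ASCII markup:
    print(as_cp1252.replace("\x00", ""))     # <p>Hello</p>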
For XHTML: For 'nude' tests
<http://malform.no/testing/utf/#xml-table-1>, Webkit is worse than
Trident <http://malform.no/testing/utf/#xml-table-1-results>. (Trident
performs a variant of the sniffing described in XML 1.0, whereas Webkit
does not sniff at all unless there is an XML prolog.)
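For reference, the kind of detection XML 1.0 (Appendix F) describes can
be sketched roughly like this - a simplification of the spec's table,
not Trident's actual behaviour, and the function name is mine:

    def sniff_xml_encoding(first4: bytes) -> str:
        # Simplified sketch of XML 1.0 Appendix F style detection.
        if first4[:2] == b"\xfe\xff":
            return "UTF-16BE (BOM)"
        if first4[:2] == b"\xff\xfe":
            return "UTF-16LE (BOM)"
        if first4[:3] == b"\xef\xbb\xbf":
            return "UTF-8 (BOM)"
        if first4 == b"\x00\x3c\x00\x3f":   # '<?' in BOM-less UTF-16BE
            return "UTF-16BE"
        if first4 == b"\x3c\x00\x3f\x00":   # '<?' in BOM-less UTF-16LE
            return "UTF-16LE"
        if first4 == b"<?xm":               # ASCII-compatible family
            return "UTF-8 or other ASCII-compatible"
        return "unknown"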
> And why should there be UTF-16 sniffing?
FIRST: What is 'UTF-16 sniffing'? The BOM is itself a form of sniffing.
And the HTML5 character encoding *sniffing* algorithm covers UTF-16 as
well. Should we single out UTF-16 as something that should not be
sniffed?
What do browser vendors think?
Based on the tests at <http://malform.no/testing/utf/>, it seems that
IE performs no UTF-16 detection/sniffing beyond using HTTP, using the
BOM and - as a last resort - reading the META element (including the MS
'unicode' and MS 'unicodeFFFE' values, which Webkit also reads).
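Roughly, that order - HTTP, then BOM, then META as last resort -
amounts to something like the sketch below. The function name is mine,
and the mapping of the two MS labels is my assumption about what
Trident/Webkit do with them:

    # Assumed mapping for the non-standard MS labels mentioned above.
    MS_META_ALIASES = {"unicode": "utf-16le", "unicodefffe": "utf-16be"}

    def ie_like_utf16_detection(http_charset, first_bytes, meta_charset):
        if http_charset:
            return http_charset              # 1. HTTP wins
        if first_bytes[:2] == b"\xff\xfe":
            return "utf-16le"                # 2. BOM
        if first_bytes[:2] == b"\xfe\xff":
            return "utf-16be"
        if meta_charset:                     # 3. META, as a last resort
            label = meta_charset.strip().lower()
            return MS_META_ALIASES.get(label, label)
        return None                          # no further UTF-16 sniffing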
But for HTML, Trident - unlike Webkit - does not make use of the XML
encoding declaration for detecting the encoding:
<http://malform.no/testing/utf/#html-table-4>. And for HTML, Trident -
unlike Webkit - does not make use of the XML prolog (no, not the
encoding declaration) for sniffing the endianness of UTF-16 files:
<http://malform.no/testing/utf/#html-table-9>.
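The endianness sniff that Webkit apparently performs amounts to little
more than checking which side of the first '<' the zero byte falls on.
A sketch (my own illustration), assuming the file starts with an XML
prolog:

    def sniff_utf16_endianness(first2: bytes):
        # '<' of '<?xml ...' is 3C 00 in UTF-16LE, 00 3C in UTF-16BE.
        if first2 == b"\x3c\x00":
            return "little-endian"
        if first2 == b"\x00\x3c":
            return "big-endian"
        return None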
Aligning with IE would mean that Opera, Mozilla and Webkit must
'degenerate' their heuristics. Why would a vendor want to become less
compatible with the Web?
>> But is the little endian defaulting really important?
>> Overall, proper UTF-16 treatment (read: sniffing) on IE/Webkit's part,
>> would probably improve the situation more.
>
> You mean there are sites that only work in Gecko/Presto?
'Sites' is perhaps a big word - 'UTF-16' pages are often lone pages, it
seems. But yes, obviously. E.g. big-endian UTF-16 labelled pages
without a BOM.
But, oops: It seems like Firefox does not use the META element anymore.
It used to use the META element, in Firefox 3, but apparently stopped
doing that - maybe they misread the HTML5 algorithm ... Nevertheless,
I have come across pages that work in Firefox/Opera but not in Trident.
MS Word, which often makes these pages, can save both big- and
little-endian.
>> I know ... And it is precisely therefore that it would have been an
>> advantage, for the Web, to focus on *requiring* the BOM for UTF-16.
>
> It seems simpler to focus on promoting only UTF-8.
It seems simple enough to say that the BOM must be used. Saying something
like that is no different from saying that a certain range of
WINDOWS-1252 must not be used, is it?
>>> Yeah, I'm going to file a new bug so we can reconsider although the
>>> octet sequence the various BOMs represent can have legitimate meanings
>>> in certain encodings,
>>
>> You mean: In addition to the BOM meaning, I suppose.
>
> No. In e.g. windows-1258 there is no BOM and FF FE simply means U+00FF
> U+20AB.
I think we have the same thing in mind. And btw, Google Search displays
many such letters in UTF-16 encoded pages ... instead of displaying the
content. Apparently, Google *fails* to consider the BOM octets magic ...
Or maybe it is UTF-16-negative ...
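Concretely - assuming Python's cp1258 codec matches windows-1258 - here
are the two readings of those octets:

    octets = b"\xff\xfe"
    print(octets.decode("windows-1258"))  # 'ÿ₫' - U+00FF U+20AB, content
    print(octets.decode("utf-16"))        # ''   - read as a BOM, nothing left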
>>> it seems in practice people use them for Unicode.
>>> (Helped by the fact that Trident/WebKit behave this way of course.)
>>
>> Don't forget the fact that Presto/Gecko do not move the BOM into the
>> <body> when you use UTF-16LE/BE, like they - per the spec of those
>> encodings - should do. See:
>> <http://bugzilla.validator.nu/show_bug.cgi?id=890>
>
> Well yes, that's why I'm planning to define utf-16 more in line with
> implementations (and render the current text obsolete I suppose).
You don't need, for that reason, to follow a strategy that nullifies
UTF-16LE/UTF-16BE. I outlined another strategy: say that all HTML pages
are interpreted as being 'UTF-16', even if they are mis-labelled with
the BOM-less UTF-16LE/UTF-16BE labels.
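Concretely, the difference is this (a sketch; Python's codecs happen to
treat the labels the way the encoding specs describe):

    data = "\ufeff<p>hi</p>".encode("utf-16-le")  # FF FE 3C 00 70 00 ...
    print(repr(data.decode("utf-16")))     # '<p>hi</p>' - FF FE is a BOM
    print(repr(data.decode("utf-16-le")))  # '\ufeff<p>hi</p>' - FF FE is
                                           # content (ZWNBSP), belongs in <body>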
--
Leif H Silli