[whatwg] [encoding] utf-16

Fri Dec 30 02:54:34 PST 2011

On Fri, 30 Dec 2011 05:51:16 +0100, Leif Halvard Silli  
<xn--mlform-iua at målform.no> wrote:
> The Trident cache behaviour is a symptom of its overall UTF-16
> behaviour: Apart from reading the BOM, it doesn't do any UTF-16
> sniffing. I suspect that you want Opera/Firefox to become "as bad" at
> 'getting' the UTF-16 encoding as WebKit/IE are? (Note that WebKit is
> worse than IE - just to, once again, emphasize how difficult it is to
> replicate IE.)

How is WebKit worse than IE? And why should there be UTF-16 sniffing?

> But is the little endian defaulting really important?
> Overall, proper UTF-16 treatment (read: sniffing) on IE/WebKit's part,
> would probably improve the situation more.

You mean there are sites that only work in Gecko/Presto?

> I know ... And it is precisely therefore that it would have been an
> advantage to, for the Web, focus on *requiring* the BOM for UTF-16.

It seems simpler to focus on promoting only UTF-8.

>> Yeah, I'm going to file a new bug so we can reconsider although the
>> octet sequence the various BOMs represent can have legitimate meanings
>> in certain encodings,
>
> You mean: In addition to the BOM meaning, I suppose.

No. In e.g. windows-1258 there is no BOM and FF FE simply means U+00FF  
U+20AB.
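
To make that concrete, here is a minimal sketch, assuming Python's built-in
"cp1258" codec is an acceptable stand-in for windows-1258 (the variable
names are purely illustrative). It decodes the two octets both ways:

```python
# FF FE is the UTF-16LE byte order mark, but it is also valid
# windows-1258 text (assumption: Python's "cp1258" codec is used
# here as a stand-in for windows-1258).
data = bytes([0xFF, 0xFE])

# Interpreted as windows-1258: two ordinary characters.
as_1258 = data.decode("cp1258")
code_points = [f"U+{ord(ch):04X}" for ch in as_1258]
print(code_points)  # ['U+00FF', 'U+20AB']

# Interpreted as UTF-16: Python's "utf-16" codec consumes the BOM,
# so no characters are left.
as_utf16 = data.decode("utf-16")
print(repr(as_utf16))
```

So the same two octets are either "no content, little-endian UTF-16" or
two visible characters, depending on which interpretation wins.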

>> it seems in practice people use them for Unicode.
>> (Helped by the fact that Trident/WebKit behave this way of course.)
>
> Don't forget the fact that Presto/Gecko do not move the BOM into the
> <body> when you use UTF-16LE/BE, like they - per the spec of those
> encodings - should do. See:
> <http://bugzilla.validator.nu/show_bug.cgi?id=890>

Well yes, that's why I'm planning to define utf-16 more in line with  
implementations (and render the current text obsolete I suppose).
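
For illustration only, BOM-only detection of the kind attributed to Trident
earlier in this thread (read the BOM, no further sniffing) could be
sketched roughly as below. The function name sniff_bom, the returned labels,
and the check order are my assumptions, not text from any spec:

```python
import codecs

def sniff_bom(data: bytes):
    """Hypothetical helper: return (encoding label, BOM length),
    or (None, 0) when the input does not start with a known BOM."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16le", 2
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16be", 2
    return None, 0

# Example: little-endian UTF-16 input with a BOM.
raw = b"\xff\xfeh\x00i\x00"
label, bom_len = sniff_bom(raw)
if label is not None:
    # Strip the BOM so it does not end up in the decoded text.
    text = raw[bom_len:].decode(label)
    print(label, repr(text))
```

Note that this strips the BOM before decoding, rather than passing it
through as U+FEFF.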

-- 
Anne van Kesteren
http://annevankesteren.nl/