[whatwg] [encoding] utf-16

Thu Dec 29 20:51:16 PST 2011

Anne van Kesteren - Thu Dec 29 04:07:14 PST 2011
> On Thu, 29 Dec 2011 11:37:25 +0100, Leif Halvard Silli wrote:
>> Anne van Kesteren Wed Dec 28 08:11:01 PST 2011:
>>> On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli wrote:
>>>> As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared
>>>> to handle BOM-less little-endian as well as bom-less big-endian.
>>>> Whereas if you send 'utf-16le' via HTTP, then it only accepts
>>>> 'utf-16le'. The same also goes for Opera. But not for Webkit and IE.
>>>
>>> Right. I think we should do it like Trident.
>>
>> To behave like Trident is quite difficult unless one applies the logic
>> that Trident does. First and foremost, the BOM must be treated the same
>> way that Trident and Webkit treat them. Secondly: It might not be
>> desirable to behave exactly like Trident because Trident doesn't really
>> handle UTF-16 *at all* unless the file starts with the BOM - [...]
> 
> Yeah I noticed the weird thing with caching too. Anyway, I meant  
> WebKit/Trident.

The Trident cache behaviour is a symptom of its overall UTF-16 
behaviour: Apart from reading the BOM, it doesn't do any UTF-16 
sniffing. I suspect that you want Opera/Firefox to become "as bad" at 
'getting' the UTF-16 encoding as Webkit/IE are? (Note that Webkit is 
worse than IE - just to, once again, emphasize how difficult it is to 
replicate IE.) But is the little endian defaulting really important? 
Overall, proper UTF-16 treatment (read: sniffing) on IE/Webkit's part 
would probably improve the situation more.

Note as well that Trident does not have the same endian problems when 
it comes to XML - for XML it tends to handle any endianness, with or 
without the BOM.
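
A minimal Python sketch of what "BOM first, then sniff" could look
like. It is not what any of the browsers above actually do. The
function name sniff_encoding is made up for illustration, and the
BOM-less guess assumes that mark-up starts with "<", similar in spirit
to the autodetection appendix of XML 1.0:

```python
import codecs

# BOM signatures, checked before any other guess.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_encoding(data):
    """Return (encoding, bom_length), or (None, 0) if nothing is recognised."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # Assumption: BOM-less mark-up begins with "<" (U+003C), so the
    # position of the zero octet reveals the byte order.
    if data[:2] == b"<\x00":
        return "utf-16le", 0
    if data[:2] == b"\x00<":
        return "utf-16be", 0
    return None, 0
```

A real sniffer would have to look at more than the first two octets,
but even this much goes beyond only reading the BOM.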

>>> I personally think everything but UTF-8 should be non-conforming,  
>>> because of the large number of gotchas embedded in the platform if you  
>>> don't use
>>> UTF-8. Anyway, it's not logical because I suggested to follow Trident
>>> which has different behavior for utf-16 and utf-16be.
>>
>> We simplify - remove a gotcha - if we say that BOM-less UTF-16 should
>> be non-conforming. From every angle, BOM-less UTF-16 as well as
>> "BOM-full" UTF-16LE and UTF-16BE, makes no sense.
> 
> That's only one. Form submission will use UTF-8 if you use UTF-16,  
> XMLHttpRequest is heavily tied to UTF-8, URLs are tied to UTF-8. Various  
> new formats such as Workers, cache manifests, WebVTT, are tied to UTF-8.  
> Using anything but UTF-8 is going to hurt and will end up confusing you  
> unless you know a shitload about encodings and the overall platform, which  
> most people don't.

I know ... And it is precisely for that reason that it would have been
an advantage, for the Web, to focus on *requiring* the BOM for UTF-16
and to make UTF-16LE/BE non-conforming. Because it is only with
reliable UTF-16 detection that the necessary 'conversion' (inside the
UA) to UTF-8 (to the encoding of those formats you mentioned) is
reliable. Anything with a BOM, whether UTF-16 or UTF-8, seems to go
well together. (E.g. when I made my tests, several of the browsers did
not apply the UTF-8 encoded CSS file until I made sure that it included
the BOM. Only then were the UAs able to get the CSS file to work with
the UTF-16 encoded HTML files.)

I'm not a 'fan' of UTF-16. But I guess you could call me a fan - a
devoted one - of the BOM.

>> You perhaps would like to see this bug, which focuses on how many
>> implementations, including XML-implementations, give precedence to the
>> BOM over other encoding declarations:
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
>>
>> *Before* paying attention to the actual encoding, you say. More
>> correct: Before deciding whether to pay attention to the 'actual'
>> encoding, they look for a BOM.
> 
> Yeah, I'm going to file a new bug so we can reconsider although the octet  
> sequence the various BOMs represent can have legitimate meanings in  
> certain encodings,

You mean: In addition to the BOM meaning, I suppose.

> it seems in practice people use them for Unicode.  
> (Helped by the fact that Trident/WebKit behave this way of course.)

Don't forget the fact that Presto/Gecko do not move the BOM into the 
<body> when you use UTF-16LE/BE, like they - per the spec of those 
encodings - should do. See: 
<http://bugzilla.validator.nu/show_bug.cgi?id=890>
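
The difference between the labels can be seen outside browsers too. As
an illustration only, this is how Python's standard codecs treat the
same octets (Python says nothing about what UAs should do, and the
variable names are made up):

```python
import codecs

# A UTF-16 BOM (FF FE) followed by "<p>" in little-endian UTF-16.
data = codecs.BOM_UTF16_LE + "<p>".encode("utf-16-le")

# Labelled plain "utf-16": the codec consumes the BOM as a signature.
as_utf16 = data.decode("utf-16")

# Labelled "utf-16-le": the same two octets survive as U+FEFF in the text.
as_utf16le = data.decode("utf-16-le")

print(repr(as_utf16))    # '<p>'
print(repr(as_utf16le))  # '\ufeff<p>'
```

In an HTML parser that followed the second reading, that leading
U+FEFF is the character that would end up in the <body>.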

More helping facts:

0 While other readings of those octets are theoretically legitimate,
  HTML (per HTML5) is geared towards UTF-8, and HTML clients are not
  required to support more than UTF-8. For that reason it seems
  legitimate to read octets that are identical with the UTF-8 BOM as
  the BOM. And, when UAs support the UTF-16 encoding, to treat the
  UTF-16 BOMs the same way.

1 Webkit/IE "lock" the encoding if the page contains a BOM. Thus, unlike
  Opera and Firefox, they don't allow users to override the encoding if
  the page contains the BOM. (So, instead of snowmen 
  <http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen>,
  I guess one could simply set the page's encoding via the BOM.)

2 Even for text/plain, Webkit/Presto/Trident/Gecko see the BOM as a BOM.
  (And text browsers, like Elinks - which is used by Vim when
  you open a file via a URL - tend to do the same, it seems.)

3 Per the UTF-16 spec, it is pretty clear that it makes little sense
  to use UTF-16LE/BE together with complete XHTML or HTML documents. 
  Because: One of the features of UTF-16LE/BE is to allow clients to
  treat the BOM in the beginning of a file *not* as a BOM but as a 
  "normal" character. However, in the case of mark-up (at least HTML
  and XHTML), the file should begin with mark-up and not with
  a character, regardless of how legitimate that character in and by
  itself could be. XML 1.0 indirectly confirms this:
   """If the replacement text of an external entity is to begin with
  the character U+FEFF, and no text declaration is present, then a 
  Byte Order Mark MUST be present, whether the entity is encoded in
  UTF-8 or UTF-16."""

4 If you chose UTF-16 in the first place, then, unlike for UTF-8, I 
  don't think you can blame your avoidance of the BOM on e.g. PHP.
  Nevertheless, if your UTF-16 file lacks the BOM, then the only thing
  that could legitimize sending as UTF-16LE/BE is this conflict we have
  between UAs that default to little endian versus those that default
  to big endian. (For that reason, the private MS 'unicode' and MS
  'unicodeFFFE' labels (which Trident/Webkit support) actually *could*
  make some sense, if they are treated like endianness *hints*, when
  the BOM is lacking. Because unlike UTF-16LE/BE, they can include
  the BOM.)

5 The same thinking can be applied to UTF-8: characters have no
  business occurring before the DOCTYPE, unless they indeed are
  the BOM.
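
Points 1 and 4 can be put together in a short Python sketch. This is a
sketch only: pick_encoding and ENDIAN_HINTS are hypothetical names, and
the mapping of 'unicode' to little endian and 'unicodeFFFE' to big
endian is my assumption about how those private labels are meant:

```python
import codecs

# Assumed meaning of the labels; used only when no BOM is present.
ENDIAN_HINTS = {
    "unicode": "utf-16le",
    "unicodefffe": "utf-16be",
    "utf-16le": "utf-16le",
    "utf-16be": "utf-16be",
}

BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
)


def pick_encoding(data, label=None):
    """Return (encoding, payload); a leading BOM is removed from payload."""
    for bom, name in BOMS:
        if data.startswith(bom):
            # The BOM "locks" the encoding, whatever the label says.
            return name, data[len(bom):]
    if label is None:
        return None, data
    # No BOM: the label is only a hint about the byte order.
    return ENDIAN_HINTS.get(label.lower(), label.lower()), data
```

With this, pick_encoding(b"\xff\xfe<\x00", "utf-16be") gives little
endian, because the BOM wins over the label, while the label only
decides the byte order for BOM-less input.
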
-- 
Leif H Silli