[whatwg] [encoding] utf-16

Wed Dec 28 08:11:01 PST 2011

On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli  
<xn--mlform-iua at målform.no> wrote:
> Anne van Kesteren Wed Dec 28 01:05:48 PST 2011:
>> On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli wrote:
>>> By "default" you supposedly mean "default, before error
>>> handling/heuristic detection". Relevance: On the "real" Web, no browser
>>> fails to display utf-16 as often as WebKit - its defaulting behavior
>>> notwithstanding - it can't be a goal to replicate that, for instance.
>>
>> Do you mean heuristics when it comes to the decoding layer? Or before
>> that? I do think any heuristics ought to be defined.
>
> Meant: While UAs may prepare for little-endian when seeing the 'utf-16'
> label, they should also be prepared for detecting it as big-endian.
>
> As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared
> to handle BOM-less little-endian as well as BOM-less big-endian.
> Whereas if you send 'utf-16le' via HTTP, then it only accepts
> 'utf-16le'. The same also goes for Opera. But not for WebKit and IE.

Right. I think we should do it like Trident.

>>>> utf-16le becomes a label for utf-16.
>>>
>>> * Logically, utf-16be should become a label for utf-16 then, as well.
>>
>> That's not logical.
>
> Care to elaborate?
>
> Not making 'utf-16be' a de facto label for 'utf-16' only makes sense
> if you plan to make it non-conforming to send files with the 'utf-16'
> label unless they are little-endian encoded.

I personally think everything but UTF-8 should be non-conforming, because
of the large number of gotchas embedded in the platform if you don't use
UTF-8. Anyway, it's not logical because I suggested following Trident,
which has different behavior for utf-16 and utf-16be.

> Meaning: The "BOM" should not, for UTF-16be/le, be removed. Thus, if
> the ZWNBSP character at the beginning of a 'utf-16be' labelled file is
> treated as the BOM, then we are not talking about the 'utf-16be'
> encoding, but about a mislabelled 'utf-16' file.

I never spoke of any existing standard. The Unicode standard is wrong here  
for all implementations.

>> the first four bytes have special meaning.
>> That does not at all suggest we should do the same for numerous other
>> encodings unrelated to utf-16.
>
> Why not? I see absolutely no difference here. When would you like to
> render a page with a BOM as anything other than what the BOM specifies?

Interesting, it does seem like Trident/WebKit look at the specific byte  
sequences the BOM has in utf-8 and utf-16 before paying attention to the  
"actual" encoding.
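
A rough sketch of that kind of BOM-first check, in Python. The function
names are made up for illustration, and this is only my reading of the
observed behavior, not a description of what Trident or WebKit actually
implement. It covers only the utf-8 and utf-16 BOM byte sequences and
falls back to the labelled encoding when none is present.

```python
import codecs

# BOM byte sequences checked before the label is consulted.
# Only utf-8 and utf-16 are covered, matching the observation above.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"), # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"), # FF FE
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM,
    otherwise (None, 0). Hypothetical helper, for illustration only."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_bom_first(data: bytes, labelled: str) -> str:
    """Decode data, letting a leading BOM override the labelled encoding.
    The BOM bytes themselves are dropped from the output."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        # No BOM found: trust the label (e.g. from HTTP Content-Type).
        encoding = labelled
    return data[skip:].decode(encoding, errors="replace")
```

For example, decode_bom_first(b"\xff\xfeh\x00i\x00", "utf-8") decodes the
bytes as little-endian utf-16 even though the label says utf-8.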

-- 
Anne van Kesteren
http://annevankesteren.nl/