[whatwg] [encoding] utf-16
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Thu Dec 29 02:37:25 PST 2011
Anne van Kesteren Wed Dec 28 08:11:01 PST 2011:
> On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli wrote:
>> Anne van Kesteren Wed Dec 28 01:05:48 PST 2011:
>>> On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli wrote:
>>>> By "default" you supposedly mean "default, before error
>>>> handling/heuristic detection". Relevance: On the "real" Web, no browser
>>>> fails to display utf-16 as often as Webkit - its defaulting behavior
>>>> not withstanding - it can't be a goal to replicate that, for instance.
>>> Do you mean heuristics when it comes to the decoding layer? Or before
>>> that? I do think any heuristics ought to be defined.
>> Meant: While UAs may prepare for little-endian when seeing the 'utf-16'
>> label, they should also be prepared for detecting it as big-endian.
>> As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared
>> to handle BOM-less little-endian as well as bom-less big-endian.
>> Whereas if you send 'utf-16le' via HTTP, then it only accepts
>> 'utf-16le'. The same also goes for Opera. But not for Webkit and IE.
> Right. I think we should do it like Trident.
To behave like Trident is quite difficult unless one applies the same
logic that Trident does. First and foremost, the BOM must be treated the
same way that Trident and Webkit treat it. Secondly, it might not be
desirable to behave exactly like Trident, because Trident doesn't really
handle UTF-16 *at all* unless the file starts with the BOM - just run
this test to verify:
1) visit this test suite with IE:
2) Click through the 7 pages in the test, until you reach the
last, 'UTF-16' labelled, big-endian, BOM-less page
(which causes mojibake in IE).
3) Now, use the Back (or Forward) button to go backward
(or forward) page by page. (You will even be able to
see the last, mojibake-ish page, if you use the
Forward button to visit it.)
RESULT: 4 of the 7 files in the test - namely, the UTF-16 files without
a BOM - fail when IE pulls them from cache. When loaded from cache, the
non-ASCII letters become garbled. Note especially that it doesn't
matter whether the file is big-endian or little-endian!
Surely, this is not something that we would like UAs to replicate.
Conclusions: a) BOM-less UTF-16 should simply be considered
non-conforming on the Web, if Trident is the standard. b) there is no
need to consider what Trident does with BOM-less files as conforming,
irrespective of whether the page is big endian or little endian. (That
it handles little-endian BOM-less files a little better than big-endian
BOM-less files, is just a marginal advantage.)
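For reference, the kind of guess that a UA has to make if it does accept
BOM-less files under the 'utf-16' label could look roughly like the
sketch below. This is a hypothetical heuristic of my own for
illustration (the function name and the fallback are assumptions), not
the algorithm of Gecko, Opera, Trident or Webkit. It relies on the fact
that, for mostly-Latin text, the zero bytes sit at even offsets in
big-endian data and at odd offsets in little-endian data:

```python
def guess_utf16_endianness(data, default="utf-16-le"):
    """Guess the byte order of BOM-less UTF-16 data (illustrative only)."""
    # Count zero bytes at even and at odd offsets.
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    if even_zeros > odd_zeros:
        return "utf-16-be"   # high (zero) byte comes first
    if odd_zeros > even_zeros:
        return "utf-16-le"   # low byte comes first
    return default           # no evidence either way: assumed fallback


sample = "<p>blåbær</p>"
print(guess_utf16_endianness(sample.encode("utf-16-be")))
print(guess_utf16_endianness(sample.encode("utf-16-le")))
```

A guess like this can of course be wrong for text with few Latin-range
characters, which is one more reason to prefer a BOM.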
>>>>> utf-16le becomes a label for utf-16.
>>>> * Logically, utf-16be should become a label for utf-16 then, as well.
>>> That's not logical.
>> Care to elaborate?
>> To not make 'utf-16be' a de-facto label for 'utf-16', only makes sense
>> if you plan to make it non-conforming to send files with the 'utf-16'
>> label unless they are little-endian encoded.
> I personally think everything but UTF-8 should be non-conforming, because
> of the large number of gotchas embedded in the platform if you don't use
> UTF-8. Anyway, it's not logical because I suggested to follow Trident
> which has different behavior for utf-16 and utf-16be.
We simplify - remove a gotcha - if we say that BOM-less UTF-16 should
be non-conforming. From every angle, BOM-less UTF-16, as well as
"BOM-full" UTF-16LE and UTF-16BE, makes no sense.
>> Meaning: The "BOM" should not, for UTF-16be/le, be removed. Thus, if
>> the ZWNBSP character at the beginning of a 'utf-16be' labelled file is
>> treated as the BOM, then we do not speak about the 'utf-16be' encoding,
>> but about a mislabelled 'utf-16' file.
> I never spoke of any existing standard. The Unicode standard is wrong here
> for all implementations.
Here, at least, you do speak about an existing standard ... It is
exactly my point that the browsers don't interpret UTF-16be/le as
UTF-16be/le but more like UTF-16. But in which way, exactly, do you
mean that UTF-16 is not specified correctly?
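To make the distinction concrete: Python's standard codecs happen to
follow the Unicode definitions on this point, so they can serve as a
small illustration (this says nothing about what any browser does).
Under 'utf-16-be' a leading U+FEFF is kept as a character, whereas under
'utf-16' it is consumed as the BOM:

```python
# The same bytes in both cases: FE FF followed by big-endian "A".
data = b"\xfe\xff\x00A"

# Labelled 'utf-16be': the U+FEFF is content (ZWNBSP), not a BOM.
as_utf16be = data.decode("utf-16-be")

# Labelled 'utf-16': the U+FEFF is a BOM; it selects big-endian
# and is removed from the decoded text.
as_utf16 = data.decode("utf-16")

print(repr(as_utf16be))  # '\ufeffA'
print(repr(as_utf16))    # 'A'
```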
>>> the first four bytes have special meaning.
>>> That does not all suggest we should do the same for numerous other
>>> encodings unrelated to utf-16.
>> Why not? I see absolutely no difference here. When would you like to
>> render a page with a BOM as anything other than what the BOM specifies?
> Interesting, it does seem like Trident/WebKit look at the specific byte
> sequences the BOM has in utf-8 and utf-16 before paying attention to the
> "actual" encoding.
You perhaps would like to see this bug, which focuses on how many
implementations, including XML-implementations, give precedence to the
BOM over other encoding declarations:
*Before* paying attention to the actual encoding, you say. More
correctly: before deciding whether to pay attention to the 'actual'
encoding, they look for a BOM.
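As a minimal sketch of that order of operations (the function names are
my own, and this is an illustration of "BOM first, label second", not
the actual code of Trident or Webkit):

```python
import codecs

# BOM byte sequences, checked before any declared label is consulted.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom_precedence(data, declared):
    """Let a BOM override the declared label; otherwise trust the label."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        encoding = declared
    return data[skip:].decode(encoding, errors="replace")


# A UTF-8 BOM wins over a conflicting declared label.
page = codecs.BOM_UTF8 + "blåbær".encode("utf-8")
print(decode_with_bom_precedence(page, "windows-1252"))
```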
Leif Halvard Silli