[whatwg] [encoding] utf-16
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Tue Dec 27 18:20:26 PST 2011
Hi Anne. Overall, your findings correspond with mine, which are based
on <http://malform.no/testing/utf/>. I also agree with the direction of
the conclusions, but I would like the encodings document to make some
distinctions that it currently doesn't make - and which you have not
proposed either. See below.
Anne van Kesteren wrote:
[ snip ]
> I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should
> follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in
> absence of a BOM.
By "default" you supposedly mean "default, before error
handling/heuristic detection". Relevance: On the "real" Web, no browser
fails to display utf-16 as often as Webkit - its defaulting behavior
not withstanding - it can't be a goal to replicate that, for instance.
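To make the two readings concrete, here is a tiny Python snippet of my
own - purely illustrative, not something from the draft - showing what
the choice of default means when the byte order is not marked:

    # The same two bytes, read with each byte order:
    b'\x41\x00'.decode('utf-16-le')   # 'A'
    b'\x41\x00'.decode('utf-16-be')   # U+4100, a CJK ideograph
    # "utf-16 defaults to utf-16le in absence of a BOM" means that a
    # document labelled 'utf-16' with no BOM gets the first reading.
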
> utf-16le becomes a label for utf-16.
* Logically, utf-16be should then become a label for utf-16 as well.
Is that what you suggest? Because, if the BOM can change the meaning of
utf-16be, then it makes sense to treat the utf-16be label, as well as
the utf-16le label, as synonymous with utf-16. (Thus, effectively,
utf-16le and utf-16be become defunct/unreliable on the Web.)
* W.r.t. making utf-16le and utf-16be into 'label[s] for utf-16': OK,
when it comes to how UAs should *treat* them. But I suppose it should
not be considered conforming to *send* the UTF-16LE/UTF-16BE labels
with text/html, due to their ambiguous status on the Web. Rather, it
should only be conforming to send 'utf-16'. (See the sketch right
after this list.)
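A small sketch, in Python, of how that split between UA treatment and
author conformance might look - the table and function names here are
mine and purely hypothetical:

    # Illustrative label table: for UA *treatment*, both byte-order
    # specific labels resolve to plain 'utf-16'.
    LABEL_TO_ENCODING = {
        'utf-16': 'utf-16',
        'utf-16le': 'utf-16',
        'utf-16be': 'utf-16',  # assuming utf-16be is folded in as well
    }

    # Labels that would stay conforming for authors to *send* with
    # text/html under the suggestion above.
    CONFORMING_LABELS = {'utf-16'}

    def resolve_label(label):
        return LABEL_TO_ENCODING.get(label.strip().lower())

    def label_is_conforming(label):
        return label.strip().lower() in CONFORMING_LABELS
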
> A BOM overrides the direction (of utf-16 / utf-16be) and is removed
> from the output.
FIRSTLY: Another way to see this is to say that IE and WebKit do not
support 'utf-16le' or 'utf-16be' - they only support 'utf-16', but
default to little endian rather than big endian when the BOM is
omitted. When the BOM is "omitted" for utf-16be, they default to big
endian, making 'utf-16le' an alias of Microsoft's private 'unicode'
label and 'utf-16be' an alias of Microsoft's private 'unicodeFFFE'
label (each of which uses the BOM). In other words: on the Web,
'utf-16le' and 'utf-16be' become de-facto synonyms for MS 'unicode'
and MS 'unicodeFFFE'.
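The IE/WebKit behavior I am describing can be sketched like this in
Python (my own reading of the tests above, so treat the details as
assumptions rather than as what the spec should literally say):

    import codecs

    def decode_utf16_family(data, label):
        # The BOM, when present, overrides whatever the label says.
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode('utf-16-le')
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode('utf-16-be')
        # No BOM: 'utf-16be' keeps big endian; everything else in the
        # utf-16 family falls back to little endian.
        if label.strip().lower() == 'utf-16be':
            return data.decode('utf-16-be')
        return data.decode('utf-16-le')
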
SECONDLY: You effectively say that, for UTF-16, the BOM should
override the HTTP-level charset info. OK. But then you should go the
whole way and give the UTF-8 BOM the same overriding authority. For
instance, if the HTTP server's Content-Type header specifies
ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file itself contains a
BOM that contradicts the HTTP info, then the BOM "wins" - in IE and
WebKit. (And, by the way, w.r.t. IE, the X-Content-Type-Options header
has no effect on treating the HTTP charset info as authoritative - the
BOM wins even then.)
Summary: It makes no sense to treat the BOM as winning over the HTTP
level charset parameter *only* for UTF-16 - the UTF-8 BOM must have
the same overriding effect as well. Thus: any encoding info from the
header would be overridden by the BOM. Of course, a document where the
BOM contradicts the HTTP charset should not be considered conforming.
But the UA treatment of it should still be uniform.
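As a sketch of the uniform rule I am arguing for - again in Python and
purely illustrative; the final fallback encoding is an assumption of
mine, not something the tests above establish:

    import codecs

    def pick_encoding(body, http_charset):
        # Any BOM - UTF-8 or UTF-16 - overrides the charset parameter
        # from the HTTP Content-Type header.
        if body.startswith(codecs.BOM_UTF8):
            return 'utf-8'
        if body.startswith(codecs.BOM_UTF16_LE):
            return 'utf-16-le'
        if body.startswith(codecs.BOM_UTF16_BE):
            return 'utf-16-be'
        # No BOM: the HTTP charset, if any, applies.
        return http_charset or 'windows-1252'  # fallback assumed here
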
(PS: If you insert the BOM before <!DOCTYPE html>, then IE will use
UTF-8 when it loads the page from cache. Just sayin'.)
--
Leif H Silli