[whatwg] Unicode as alias for UTF-16 (was Re: Default encoding to UTF-8?)
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Thu Dec 22 00:59:43 PST 2011
Henri Sivonen on Tue Dec 20 01:13:45 PST 2011:
> On Mon, Dec 19, 2011 at 9:44 PM, L. David Baron wrote:
>>> > I discovered that "UNICODE" is
>>> > used as alias for "UTF-16" in IE and Webkit.
>>> ...
>>> > Seemingly, this has not affected Firefox users too much.
>>>
>>> It surprises me greatly that Gecko doesn't treat "unicode" as an alias
>>> for "utf-16".
>>
>> Why?
>
> From playing with IE, I thought it was known that "unicode" is an
> alias for "utf-16" and it had never occurred to me to check if that
> was true in Gecko.
MS 'unicode' is only "half" an alias for 'utf-16': it covers only the
*little-endian* variant of *UTF-16*. (It is not the same as UTF-16LE
either, since MS 'unicode' usually includes the BOM.) There is also
MS 'unicodeFFFE', which represents big-endian UTF-16. See:
http://mail.apps.ietf.org/ietf/charsets/msg02030.html
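
To make the byte-level difference concrete, here is a minimal Python
sketch. The helper names are made up for illustration and are not taken
from IE or any other browser; the sketch only assumes what is said
above, namely that MS 'unicode' is little-endian UTF-16 with a BOM and
MS 'unicodeFFFE' is its big-endian counterpart.

```python
import codecs


def encode_ms_unicode(text: str) -> bytes:
    """Hypothetical helper: little-endian UTF-16 preceded by a BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def encode_ms_unicode_fffe(text: str) -> bytes:
    """Hypothetical helper: big-endian UTF-16 preceded by a BOM."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def encode_utf16le(text: str) -> bytes:
    """Plain UTF-16LE: same code units, but no BOM."""
    return text.encode("utf-16-le")


for name, encode in [
    ("MS unicode", encode_ms_unicode),
    ("MS unicodeFFFE", encode_ms_unicode_fffe),
    ("UTF-16LE", encode_utf16le),
]:
    # Print the bytes for the single letter "A" as hex.
    print(name, encode("A").hex())
```

The only difference between the first and the third result is the
leading FF FE, which is why the two labels are not interchangeable.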
>> If it's not needed, why shouldn't WebKit and IE drop it?
Actually, UTF-16 fails in Webkit much, much more often than in any
other browser. E.g. this page (not that it is related, though) is
labelled as MS 'unicode': http://sacredheartbayhead.com/. Firefox,
Opera and IE all display it. But Chrome and Safari fail to detect the
encoding.
So although Webkit aligns with IE by understanding MS 'unicode' and
MS 'unicodeFFFE', it does other things wrong when it comes to UTF-16.
So you should only look at Webkit if you want to see how well a
browser can do in the market when it has below-average UTF-16 support
... (Chrome is maybe a bit better than Safari, though - Chrome at least
allows me to *select* UTF-16, whereas Safari does not offer UTF-16 in
its encoding menu. Chrome also uses character set detection more
actively.)
> Needed is relative. So far, I haven't seen data about how much
> existing content there is out there that depends on this. It could be
> that some users somewhere have rejected Firefox or Opera for this and
> there just isn't enough of a feedback loop.
Feedback loop for you: look at UTF-16LE or UTF-16BE pages without any
other encoding info. (The HTML5 encoding sniffing tells UAs to *do* read
the meta @charset *if* all other tests fail.) And, voilà, I just now
found one such page: <http://www.hughesrenier.be/actualites.html>. This page
works fine in IE - and IE only. (That it fails in Webkit is because of
some bug in its encoding sniffing - see below.) Offline, on my
computer, when I switched the value of the meta @charset for that page
to 'UTF-16', then Firefox and Opera would also pick up the encoding.
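
The effect of the alias on such a page can be sketched roughly as
follows. This is a simplified illustration in Python, not the HTML5
sniffing algorithm and not any browser's real label table: the two
dictionaries, the function name and the flag are assumptions made for
the example. It only models the point that a UA either knows the label
'unicode' or does not.

```python
from typing import Optional

# Labels that every table in this sketch accepts (hypothetical subset).
BASE_LABELS = {
    "utf-8": "utf-8",
    "utf-16le": "utf-16-le",
    "utf-16be": "utf-16-be",
    "iso-8859-1": "iso-8859-1",
}

# Extra aliases as described in this thread for IE and Webkit.
MS_ALIASES = {
    "unicode": "utf-16-le",
    "unicodefffe": "utf-16-be",
}


def resolve_label(label: str, accept_ms_aliases: bool) -> Optional[str]:
    """Map a declared charset label to a Python codec name, or None."""
    key = label.strip().lower()
    if key in BASE_LABELS:
        return BASE_LABELS[key]
    if accept_ms_aliases:
        return MS_ALIASES.get(key)
    return None


# A BOM-less little-endian page whose only label is 'unicode'.
page = "<p>caf\u00e9</p>".encode("utf-16-le")
for accept in (True, False):
    codec = resolve_label("unicode", accept)
    shown = page.decode(codec) if codec else "label not recognised"
    print(accept, codec, shown)
```

With the alias table enabled the label resolves and the page decodes;
without it the label is unknown and the UA has to fall back on
something else.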
Other pages of the same kind:
<http://www.sunsetridgebusinesspark.com/BusinessListing.html>
<http://www.rpmcmillen.com/taxes.html>
<http://www.hughesrenier.be/illustration.html>
<http://memphismitchellathletics.com/pages/2010football.html>
There are also pages like these, which work fine in IE but which, if
I manually select UTF-16 in Firefox, display broken-character signs
(I don't know if the UTF-16 code is buggy):
<http://www.casamobile.org/BoardMembersStaff.html>
<http://comfortablerentals.com/Our%20Services.html>
<http://lergp.cce.cornell.edu/IPM/Home.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/regina.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/macchie.htm>
<http://web.tiscali.it/marcokiller/Mappa_del_sito.htm>
<http://familienlundorff.dk/familienLundorff.dk/genealogi/Andreas_1769/Niels_1813_Johanne_1854.html>
<http://www.prcflow.com/orifice_meter_runs_plates.htm>
<http://healthactioncenter.com/aboutus.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/mago.htm>
<http://www.trascaucristian.3x.ro/> (shows BOM sign)
<http://www.casamobile.org/history.html>
<http://www.hawkpages.com/> (See 'embedded' code on right page side)
I found them via Google, which for certain UTF-16 pages renders the
source code as the search result (which makes Google Search very
similar to how Webkit handles UTF-16, btw):
<http://www.google.com/search?q=%22%3Cmeta+content%3D%27text/html%3B+charset%3Dunicode%27%22>
Not the same thing, but speaking about necessity: This page declares
"UTF-8" 3 times, and it also includes the BOM. However, the HTTP
charset says ISO-8859-1, and hence ... the page fails in Firefox and
Opera, but not in Webkit and IE: <http://www.bozze.1.vg/>.
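
The two outcomes for that page can be reproduced with a small Python
sketch. The function name and the bom_beats_http flag are hypothetical,
and the assumption that the difference comes down to whether the BOM or
the HTTP charset wins is inferred only from the behaviour reported
above, not from any browser's source.

```python
import codecs


def choose_codec(body: bytes, http_charset: str, bom_beats_http: bool) -> str:
    """Pick a Python codec name for body (UTF-8 BOM only, for brevity)."""
    has_utf8_bom = body.startswith(codecs.BOM_UTF8)
    if bom_beats_http and has_utf8_bom:
        return "utf-8-sig"  # decodes UTF-8 and drops the leading BOM
    return http_charset


# A UTF-8 body with a BOM, served with an ISO-8859-1 HTTP charset.
body = codecs.BOM_UTF8 + "perch\u00e9".encode("utf-8")
for bom_beats_http in (True, False):
    codec = choose_codec(body, "iso-8859-1", bom_beats_http)
    print(bom_beats_http, codec, repr(body.decode(codec)))
```

When the HTTP charset wins, the BOM and every non-ASCII character
decode as two or three Latin-1 characters each, which is the kind of
breakage described for Firefox and Opera.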
> Maybe it isn't needed, but it seems that from the WebKit or IE point
> of view, the potential upside from dropping this alias is about
> non-existent while there could be a downside. I'd expect it to be hard
> to get IE and WebKit to drop the alias.
Btw, one thing: A big source of Google findings for the search string
"<meta content='text/html; charset=unicode'" seems to be HTML
attachments (from MS Word users) in e-mail messages to mailing lists.
Example:
http://stsk.no/pipermail/drill-aspiranter_stsk.no/attachments/20101230/8335fbe4/attachment-0001.html
--
Leif Halvard Silli