[whatwg] Default encoding to UTF-8?
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Thu Dec 8 14:33:54 PST 2011
Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
> On Mon, Dec 5, 2011 at 7:42 PM, Leif Halvard Silli wrote:
> Mozilla grants localizers a lot of latitude here. The defaults you see
> are not carefully chosen by a committee of encoding strategists doing
> whole-Web optimization at Mozilla.
We could use such a committee for the Web!
> They are chosen by individual
> localizers. Looking at which locales default to UTF-8, I think the
> most probable explanation is that the localizers mistakenly tried to
> pick an encoding that fits the language of the localization instead of
> picking an encoding that's the most successful at decoding unlabeled
> pages most likely read by users of the localization
These localizations are nevertheless live tests. If we want to move
more firmly in the direction of UTF-8, one could ask users of those
'live tests' about their experience.
> (which means
> *other-language* pages when the language of the localization doesn't
> have a pre-UTF-8 legacy).
Do you have any concrete examples? And are there user complaints?
The Serbian localization uses UTF-8. The Croatian one uses
Windows-1252, but only on Windows and Mac: on Linux it appears to use
UTF-8, if I read the HG repository correctly. As for Croatian and
Windows-1252: that encoding does not even support the Croatian alphabet
in full - I am thinking of the digraphs. But I'm not sure about the
pre-UTF-8 legacy for Croatian.
Some language communities in Russia are in a minority situation similar
to that of Serbian Cyrillic, except that their minority script is
Latin: they use Cyrillic, but they may also use Latin. In Russia,
however, Cyrillic dominates. Hence it seems to be the case - according
to my earlier findings - that the few letters which, per language, do
not occur in Windows-1251 are inserted as NCRs (that is: when UTF-8 is
not used). That way, Windows-1251 can be used even for Latin text with
non-ASCII letters in it. And given that Croatian defaults to
Windows-1252, authors there could in theory just use NCRs too ...
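To illustrate the NCR trick (a sketch in Python, not taken from any
actual site - the sample text is my own invention; Python's
xmlcharrefreplace error handler happens to implement exactly this
fallback):

    # Sketch: serialize text to a legacy encoding, letting letters
    # outside its repertoire fall back to NCRs (&#NNNN;).
    text = "Latin \u0259 and \u011f mixed with Cyrillic Привет"

    # xmlcharrefreplace substitutes a numeric character reference
    # for every character windows-1251 cannot represent.
    encoded = text.encode("windows-1251", errors="xmlcharrefreplace")

    print(encoded.decode("windows-1251"))
    # -> Latin &#601; and &#287; mixed with Cyrillic Привет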
Btw, for Safari on Mac, I'm unable to see any effect of switching the
locale: it is always Windows-1252 (Latin) - switching used to have an
effect before ... But maybe there is a parameter I'm unaware of - such
as Apple's knowledge of where in the world I live ...
> I think that defaulting to UTF-8 is always a bug, because at the time
> these localizations were launched, there should have been no unlabeled
> UTF-8 legacy, because up until these locales were launched, no
> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
> UTF-8 is harmful, because it makes it possible for locale-siloed
> unlabeled UTF-8 content come to existence
The current legacy encoding defaults nevertheless create siloed pages
already. I'm also not sure that such a UTF-8 silo would be a problem:
UTF-8 is possible for browsers to detect - Chrome seems to perform
more such detection than other browsers do.
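Why UTF-8 is detectable (a rough sketch in Python - real browser
detectors are more sophisticated, but the principle is the same):

    # UTF-8's multi-byte sequences follow strict bit patterns, so
    # non-ASCII text in a legacy encoding almost never validates
    # as UTF-8 by accident.
    def looks_like_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8", errors="strict")
            return True
        except UnicodeDecodeError:
            return False

    word = "blåbærsyltetøy"
    print(looks_like_utf8(word.encode("utf-8")))         # True
    print(looks_like_utf8(word.encode("windows-1252")))  # False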
Today, perhaps especially for English users, it happens all the time
that a page is - without notice - left to the default encoding, and so
the browser - when used as an authoring tool - also defaults to
Windows-1252: http://twitter.com/#!/komputist/status/144834229610614784
(I suppose he used that browser-based spec authoring tool that is in
development.)
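That is the classic mojibake recipe, by the way: UTF-8 bytes read back
under the Windows-1252 fallback (a sketch):

    # An unlabeled page saved as UTF-8 but decoded with the
    # browser's Windows-1252 fallback:
    utf8_bytes = "naïve café".encode("utf-8")
    print(utf8_bytes.decode("windows-1252"))
    # -> naÃ¯ve cafÃ©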
In another message you suggested I 'lobby' the authoring tools. OK.
But the browser is also an authoring tool. So how can we get authors to
output UTF-8 by default, without changing the parsing default?
> (instead of guiding all Web
> authors always to declare their use of UTF-8 so that the content works
> with all browser locale configurations).
One must guide authors to do this regardless.
> I have tried to lobby internally at Mozilla for stricter localizer
> oversight here but have failed. (I'm particularly worried about
> localizers turning the heuristic detector on by default for their
> locale when it's not absolutely needed, because that's actually
> performance-sensitive and less likely to be corrected by the user.
> Therefore, turning the heuristic detector on may do performance
> reputation damage. )
W.r.t. the heuristic detector: testing Firefox's default encoding
behaviour was difficult. In the end I understood that I had to delete
the cached version of the profile folder - only then would the
encodings 'fall back' properly. But before I got that far, I tried e.g.
the Russian version of Firefox and discovered that it enables the
encoding heuristics: thus it worked! Had it not done that, it would
instead have used Windows-1252 as the default ... So you perhaps need
to be careful before telling localizers to disable the heuristics ...
Btw: in one sense, it is impossible to disable "automatic" character
detection in Firefox: an encoding override there only lasts until the
next reload. However, I just discovered that this is not the case in
Opera: if you select Windows-1252 in Opera, then it will always - but
only in the current tab - be Windows-1252, even if there is a BOM and
everything. In a way, Opera's behaviour makes you want to avoid setting
the encoding manually in Opera.
Browsers are surprisingly different in these details ...
> (Note that zh-TW seems to be an exception to general observation that
> the locale's language has no browser-supported legacy encoding.
> However, zh-TW enables the universal heuristic encoding detector by
> default, so the fallback encoding matters less.)
Serbian has - or rather, there exists - a browser-supported legacy
encoding: Windows-1251. I have not evaluated Croatian properly in that
regard.
Test page for checking the encoding default:
http://malform.no/testing/encodingdefault
Leif H Silli