[whatwg] Default encoding to UTF-8?

Leif Halvard Silli xn--mlform-iua at xn--mlform-iua.no
Thu Dec 8 14:33:54 PST 2011


Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
> On Mon, Dec 5, 2011 at 7:42 PM, Leif Halvard Silli wrote:

> Mozilla grants localizers a lot of latitude here. The defaults you see
> are not carefully chosen by a committee of encoding strategists doing
> whole-Web optimization at Mozilla.

We could use such a committee for the Web!

> They are chosen by individual
> localizers. Looking at which locales default to UTF-8, I think the
> most probable explanation is that the localizers mistakenly tried to
> pick an encoding that fits the language of the localization instead of
> picking an encoding that's the most successful at decoding unlabeled
> pages most likely read by users of the localization

These localizations are nevertheless live tests. If we want to move 
more firmly in the direction of UTF-8, one could ask users of those 
'live tests' about their experience.

> (which means
> *other-language* pages when the language of the localization doesn't
> have a pre-UTF-8 legacy).

Do you have any concrete examples? And are there user complaints?

The Serbian localization uses UTF-8. The Croatian one uses Windows-1252, 
but only on Windows and Mac: on Linux it appears to use UTF-8, if I read 
the HG repository correctly. As for Croatian and Windows-1252: that 
encoding does not even support the Croatian alphabet in full - I am 
thinking of the digraphs. But I'm not sure about the pre-UTF-8 legacy 
for Croatian.
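
For illustration, here is a quick check - a sketch of my own in Python, 
not anything taken from the localizations - of which Croatian letters 
Windows-1252 can represent (including the single-code-point digraph 
ǆ, U+01C6):

    # Which Croatian letters survive encoding to Windows-1252?
    for ch in 'čćđšžǆ':
        try:
            ch.encode('windows-1252')
            print(ch, 'representable')
        except UnicodeEncodeError:
            print(ch, 'NOT in Windows-1252')
    # Only š and ž pass; č, ć, đ and ǆ do not.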

Some language communities in Russia have a minority-script situation 
similar to that of Serbian Cyrillic, except that their minority script 
is Latin: they use Cyrillic, but they may also use Latin. In Russia, 
however, Cyrillic dominates. Hence it seems to be the case - according 
to my earlier findings - that those few letters which, per language, do 
not occur in Windows-1251 are inserted as NCRs (numeric character 
references) when UTF-8 is not used. That way, Windows-1251 can be used 
for Latin text with non-ASCII letters inside. But given that Croatian 
defaults to Windows-1252, they could in theory just use NCRs too ...
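
A minimal sketch of that NCR fallback, assuming Python 3 and its 
standard xmlcharrefreplace error handler (the name 'Đorđe' is just my 
example word, not taken from the pages I surveyed):

    # Letters with no slot in Windows-1251 become numeric character
    # references (NCRs); everything else is encoded directly.
    text = 'Đorđe'  # Serbian Latin; Đ and đ are not in Windows-1251
    print(text.encode('windows-1251', errors='xmlcharrefreplace'))
    # -> b'&#272;or&#273;e'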

Btw, for Safari on Mac, I'm unable to see any effect of switching 
locale: it is always Windows-1252 (Latin) - switching used to have an 
effect before ... But maybe there is a parameter I'm unaware of - like 
Apple's knowledge of where in the world I live ...

> I think that defaulting to UTF-8 is always a bug, because at the time
> these localizations were launched, there should have been no unlabeled
> UTF-8 legacy, because up until these locales were launched, no
> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
> UTF-8 is harmful, because it makes it possible for locale-siloed
> unlabeled UTF-8 content come to existence

The current legacy encodings nevertheless create siloed pages already. 
I'm also not sure that such a UTF-8 silo would be a problem: browsers 
can detect UTF-8 - Chrome seems to perform more such detection than 
other browsers.
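
A rough sketch of why UTF-8 is detectable - my own illustration, not 
any browser's actual detector: UTF-8's byte-sequence rules are strict, 
so non-ASCII text in a legacy encoding almost never validates as UTF-8.

    def looks_like_utf8(data: bytes) -> bool:
        try:
            data.decode('utf-8')  # strict decoding by default
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8('æøå'.encode('utf-8')))         # True
    print(looks_like_utf8('æøå'.encode('windows-1252')))  # False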

Today - perhaps especially for English-language users - it happens all 
the time that a page, without the author noticing, is left to the 
default encoding, and so the browser, when used as an authoring tool, 
defaults to Windows-1252: 
http://twitter.com/#!/komputist/status/144834229610614784 
(I suppose he used that browser-based spec authoring tool that is in 
development.)

In another message you suggested I 'lobby' the authoring tools. OK. 
But the browser is also an authoring tool. So how can we get authors to 
output UTF-8 by default without changing the parsing default?

> (instead of guiding all Web
> authors always to declare their use of UTF-8 so that the content works
> with all browser locale configurations).

One must guide authors to do this regardless - that is, to declare 
UTF-8 explicitly, e.g. with <meta charset=utf-8>.

> I have tried to lobby internally at Mozilla for stricter localizer
> oversight here but have failed. (I'm particularly worried about
> localizers turning the heuristic detector on by default for their
> locale when it's not absolutely needed, because that's actually
> performance-sensitive and less likely to be corrected by the user.
> Therefore, turning the heuristic detector on may do performance
> reputation damage. )

W.r.t. the heuristic detector: testing Firefox's default encoding 
behaviour was difficult. In the end I understood that I had to delete 
the cached version of the profile folder - only then would the 
encodings 'fall back' properly. But before I got that far, I tried with 
e.g. the Russian version of Firefox and discovered that it enabled the 
encoding heuristics - thus it worked! Had it not done that, it would 
instead have used Windows-1252 as the default ... So you perhaps need 
to be careful before telling localizers to disable heuristics ...

Btw: in one sense it is impossible to disable "automatic" character 
detection in Firefox, since an encoding override there only lasts until 
the next reload. However, I just discovered that in Opera this is not 
the case: if you select Windows-1252 in Opera, then it will always be 
Windows-1252 - but only in the current tab - even if there is a BOM and 
everything. In a way, Opera's behaviour makes you want to avoid setting 
the encoding manually.

Browsers are surprisingly different in these details ...

> (Note that zh-TW seems to be an exception to general observation that
> the locale's language has no browser-supported legacy encoding.
> However, zh-TW enables the universal heuristic encoding detector by
> default, so the fallback encoding matters less.)

Serbian has - or rather: there exists - a browser-supported legacy 
encoding: Windows-1251. I have not evaluated Croatian properly in that 
regard. 

Test page for checking the encoding default: 
http://malform.no/testing/encodingdefault
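
For local experiments, here is a minimal sketch of the same idea - my 
own, assuming Python 3, and not the code behind that page: serve a page 
with no charset in the HTTP header and no <meta charset>, and observe 
how the bytes 0xE6 0xF8 0xE5 render - æøå under a Windows-1252 
fallback, жше under Windows-1251, replacement characters under UTF-8.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    BODY = b'<!DOCTYPE html><title>fallback test</title><p>\xe6\xf8\xe5'

    class UnlabeledHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')  # no charset!
            self.end_headers()
            self.wfile.write(BODY)

    HTTPServer(('localhost', 8000), UnlabeledHandler).serve_forever()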

Leif H Silli


