[whatwg] Default encoding to UTF-8?

Henri Sivonen hsivonen at iki.fi
Tue Dec 6 23:45:11 PST 2011


On Mon, Dec 5, 2011 at 7:42 PM, Leif Halvard Silli
<xn--mlform-iua at xn--mlform-iua.no> wrote:
> Last I checked, some of those locales defaulted to UTF-8. (And HTML5
> defines it the same.) So how is that possible? Don't users of those
> locales travel as much as you do? Or do we consider the English locale
> user's as more important? Something is broken in the logics here!

Mozilla grants localizers a lot of latitude here. The defaults you see
are not carefully chosen by a committee of encoding strategists doing
whole-Web optimization at Mozilla. They are chosen by individual
localizers. Looking at which locales default to UTF-8, I think the
most probable explanation is that the localizers mistakenly tried to
pick an encoding that fits the language of the localization instead of
picking an encoding that's the most successful at decoding unlabeled
pages most likely read by users of the localization (which means
*other-language* pages when the language of the localization doesn't
have a pre-UTF-8 legacy).

I think that defaulting to UTF-8 is always a bug, because at the time
these localizations were launched, there should have been no unlabeled
UTF-8 legacy, because up until these locales were launched, no
browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
UTF-8 is harmful, because it makes it possible for locale-siloed
unlabeled UTF-8 content to come into existence (instead of guiding all Web
authors always to declare their use of UTF-8 so that the content works
with all browser locale configurations).
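
To make the silo effect concrete, here is a minimal Python sketch. It is
not browser code: decode_page is a hypothetical helper, and the sample
text and the two fallback values are assumptions chosen for
illustration. It shows the same UTF-8 bytes read under two different
fallback encodings, first unlabeled and then with a declared encoding.

```python
from typing import Optional


def decode_page(body: bytes, declared: Optional[str], fallback: str) -> str:
    """Decode a page body the way a simplified browser might.

    A declared encoding always wins; the locale-dependent fallback is
    used only when the page is unlabeled. (Hypothetical helper: real
    browsers also look at BOMs, <meta> prescans and so on.)
    """
    encoding = declared if declared is not None else fallback
    return body.decode(encoding, errors="replace")


text = "naïve café"
page = text.encode("utf-8")  # the author saved the file as UTF-8

# Unlabeled page in a locale whose fallback is UTF-8: looks fine.
in_utf8_locale = decode_page(page, None, "utf-8")

# The same unlabeled page in a locale whose fallback is windows-1252:
# every non-ASCII character turns into two wrong characters.
in_1252_locale = decode_page(page, None, "windows-1252")

# With the encoding declared, the fallback no longer matters.
labeled = decode_page(page, "utf-8", "windows-1252")

print(in_utf8_locale)
print(in_1252_locale)
print(labeled)
```

An author who only ever tests in a UTF-8-default locale never sees the
second result, which is why the missing declaration goes unnoticed.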

I have tried to lobby internally at Mozilla for stricter localizer
oversight here but have failed. (I'm particularly worried about
localizers turning the heuristic detector on by default for their
locale when it's not absolutely needed, because that's actually
performance-sensitive and less likely to be corrected by the user.
Therefore, turning the heuristic detector on may damage the browser's
reputation for performance.)

(Note that zh-TW seems to be an exception to the general observation that
the locale's language has no browser-supported legacy encoding.
However, zh-TW enables the universal heuristic encoding detector by
default, so the fallback encoding matters less.)

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

