[whatwg] Default encoding to UTF-8?
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Mon Dec 5 09:42:39 PST 2011
L. David Baron on Wed Nov 30 18:29:31 PST 2011:
> On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote:
>> My understanding is that all browsers* default to Western Latin
>> (ISO-8859-1) encoding by default (for Western-world
>> downloads/OSes) due to legacy content on the web. But how relevant
>> is that still today? Has any browser done any recent research into
>> the need for this?
>
> The default varies by localization (and within that potentially by
> platform), and unfortunately that variation does matter. You can
> see Firefox's defaults here:
> http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default
> (The localization and platform are part of the filename.)
Last I checked, some of those locales defaulted to UTF-8. (And HTML5
defines it the same.) So how is that possible? Don't users of those
locales travel as much as you do? Or do we consider the English locale
users as more important? Something is broken in the logic here!
> I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago
> (by changing the "intl.charset.default" preference), and I do see a
> decent amount of broken content as a result (maybe I encounter a new
> broken page once a week? -- though substantially more often if I'm
> looking at non-English pages because of travel).
What kind of trouble are you actually describing here? You are
describing a problem with using UTF-8 for *your locale*. What is your
locale? It is probably English. Or do you consider your locale to be
'the Western world locale'? It sounds like *that* is what Anne has in
mind when he brings in Dutch:
http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as
if some see Latin-1 - or Windows-1252 as we now should say - as a
'super default' rather than a locale default. If that is the case, that
it is a super default, then we should also spec it like that! Until
further notice, I'll treat Latin-1 as it is specced: as a default for
certain locales.)
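
To make the 'locale default' point concrete, here is a minimal sketch of a fallback keyed on the browser's locale. The table entries, the function name default_encoding and the sample bytes are all assumptions made up for illustration. They are not Firefox's actual values, nor the list in HTML5.

```python
# Illustrative table only: these entries are assumptions made for this
# sketch, not the values any particular browser actually ships.
LOCALE_DEFAULTS = {
    "en": "windows-1252",
    "nl": "windows-1252",
    "ru": "windows-1251",
}

def default_encoding(ui_locale, table=LOCALE_DEFAULTS, fallback="utf-8"):
    """Return the fallback encoding for an unlabelled page, keyed on the
    browser's UI locale (not on anything in the page itself)."""
    primary = ui_locale.split("-")[0].lower()
    return table.get(primary, fallback)

# The same unlabelled bytes, as seen by two users with different locales.
page = b"caf\xe9"
for loc in ("en-US", "ru"):
    print(loc, page.decode(default_encoding(loc)))
```

With this table the English-locale user reads the byte 0xE9 as 'é' and the Russian-locale user reads it as 'й'. The page did not change, only the reader's default did.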
Since it is a locale problem, we need to understand which locale you
have - and/or which locale you and the other debaters think you have.
Faruk probably uses a Spanish locale (right?), so the two of you are
not speaking from the same context.
However, you also say that your problem is not so much related to pages
written for *your* locale as it is related to pages written for users
of *other* locales. So how many times per year do Dutch, Spanish or
Norwegian pages - and other non-English pages - create trouble for
you, as an English locale user? I am making an assumption: almost never.
You don't read those languages, do you?
This is also an expectation thing: If you visit a Russian page in a
legacy Cyrillic encoding, and get mojibake because your browser
defaults to Latin-1, then what does it matter to you whether your
browser defaults to Latin-1 or UTF-8? Answer: Nothing.
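
A small sketch of that claim, assuming an unlabelled page whose bytes are in windows-1251. The sample word and the variable names are mine, chosen only for illustration.

```python
# Hypothetical sample text; any Russian string would do.
text = "Привет"

# The bytes as a legacy Cyrillic page without a charset label would send them.
raw = text.encode("windows-1251")

# What a browser falling back to Latin-1 (decoded as windows-1252) shows.
as_latin1 = raw.decode("windows-1252", errors="replace")

# What a browser falling back to UTF-8 shows.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

Neither result is readable Russian: the first is a run of accented Latin letters, the second a run of replacement characters. Either way the reader has to pick the encoding by hand.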
>> I'm wondering if it might not be good to start encouraging
>> defaulting to UTF-8, and only fallback to Western Latin if it is
>> detected that the content is very old / served by old
>> infrastructure or servers, etc. And of course if the content is
>> served with an explicit encoding of Western Latin.
>
> The more complex the rules, the harder they are for authors to
> understand / debug. I wouldn't want to create rules like those.
Agree that that particular idea is probably not the best.
> I would, however, like to see movement towards defaulting to UTF-8:
> the current situation makes the Web less world-wide because pages
> that work for one user don't work for another.
>
> I'm just not quite sure how to get from here to there, though, since
> such changes are likely to make users experience broken content.
I think we should 'attack' the dominating locale first: the English
locale, in its different incarnations (Australian, American, UK). Thus,
we should turn things on their head: English users should start to
expect UTF-8 to be used. Because, as English users, you are more used to
'mojibake' than the rest of us are: Whenever you see it, you 'know'
that it is because it is a foreign language you are reading. It is we,
the users of non-English locales, who need the default-to-legacy
encoding behavior the most. Or, please, explain to us when and where it
is important that English language users living in their own, native
lands so to speak, need their browser to default to Latin-1 so that
they can correctly read English language pages?
If the English locales start defaulting to UTF-8, then little by
little, the same expectation will start spreading to the other
locales as well, not least because the 'geeks' of each locale will tend
to see the English locale as a super default - and they might also use
the US English locale of their OS and/or browser. We should not
consider the needs of geeks - they will follow (read: lead) the way, so
the fact that *they* may see mojibake, should not be a concern.
See? We would have a plan. Or what do you think? Of course, we - or
rather: the browser vendors - would need to market this as an important
change. The HTML5 spec already justifies the use of UTF-8 in several
places - it says that pages might not work as expected, e.g. w.r.t.
URLs, unless UTF-8 is used. So there are enough arguments that can
be used.
There are other technical ideas I have, such as treating the BOM the
way WebKit and IE treat it - that would increase the number of pages
treated as UTF-8 by all browsers a little bit [1]. However, that can
wait: The most important thing is to *initiate* the default
encoding change.
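
As a rough sketch of the BOM idea: I assume here that the WebKit/IE treatment means a leading byte order mark wins over any other encoding information. The function names sniff_bom and choose_encoding and the parameter names are hypothetical, not taken from any browser or from the spec.

```python
import codecs

def sniff_bom(raw):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

def choose_encoding(raw, declared=None, locale_default="windows-1252"):
    """Pick an encoding: BOM first (the assumption of this sketch), then
    whatever was declared, then the locale default."""
    return sniff_bom(raw) or declared or locale_default

# A page that starts with a UTF-8 BOM but is labelled as windows-1252.
page = codecs.BOM_UTF8 + "blåbærsyltetøy".encode("utf-8")
print(choose_encoding(page, declared="windows-1252"))
```

Under this ordering a mislabelled page that starts with a BOM is still read as UTF-8, which is where the small increase in UTF-8 pages would come from.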
[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
Leif Halvard Silli