[whatwg] Default encoding to UTF-8?

Boris Zbarsky bzbarsky at MIT.EDU
Mon Dec 5 17:31:29 PST 2011


On 12/5/11 6:14 PM, Leif Halvard Silli wrote:
> It is more likely that there is another reason, IMHO: They may have
> tried it, and found that it worked OK

Where by "it" you mean "open a text editor, type some text, and save". 
So they get whatever encoding their OS and editor defaults to.

And yes, then they find that it works ok, so they don't worry about 
encodings.

>> No.  He's describing a problem using UTF-8 to view pages that are not
>> written in English.
>
> And why is that a problem in those cases when it is a problem?

Because the characters are wrong?
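
To make "wrong" concrete, here is a minimal Python sketch. The sample word and the variable names are purely illustrative and not taken from any real page; it assumes a page saved as windows-1252 with no encoding declaration.

```python
# Hypothetical page content: a German word saved as windows-1252 bytes,
# with no encoding declared anywhere.
raw = "Straße".encode("windows-1252")

# With the Western European locale fallback, the text decodes as intended.
as_fallback = raw.decode("windows-1252")

# With UTF-8 as the default, the byte 0xDF is not a valid sequence here,
# so a lenient decoder substitutes U+FFFD (the replacement character).
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_fallback)
print(as_utf8)
```

The reader can still guess the word, but every non-ASCII letter on the page gets the same treatment.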

> Does he read those languages, anyway?

Do you read English?  Seriously, what are you asking there, exactly?

(For the record, reading a particular page in a language is a much 
simpler task than reading the language; I can't "read German", but I can 
certainly read a German subway map.)

> The solution I proposed was that English locale browsers should default
> to UTF-8.

I know the solution you proposed.  That solution tries to avoid the
issues David was describing by only breaking things for people in
English browser locales; I understand that.

>> Why does it matter?  David's default locale is almost certainly en-US,
>> which defaults to ISO-8859-1 (or whatever Windows-??? encoding that
>> actually means on the web) in his browser.  But again, he's changed the
>> default encoding from the locale default, so the locale is irrelevant.
>
> The locale is meant to predominantly be used within a physical locale.

Yes, so?

> If he is at another physical locale or a virtually other locale, he
> should not be expecting that it works out of the box unless a common
> encoding is used.

He was responding to a suggestion that the default encoding be changed 
to UTF-8 for all locales.  Are you _really_ sure you understood the 
point of his mail?

> Even today, if he visits Japan, he has to either
> change his browser settings *or* to rely on the pages declaring their
> encodings. So nothing would change, for him, when visiting Japan — with
> his browser or with his computer.

He wasn't saying it's a problem for him per se.  He's a somewhat 
sophisticated browser user who knows how to change the encoding for a 
particular page.

What he was saying is that there are lots of pages out there that aren't 
encoded in UTF-8 and rely on locale fallbacks to particular encodings, 
and that he's run into them a bunch while traveling in particular, so 
they were not pages in English.  So far, you and he seem to agree.

> Yes, there would be a change, w.r.t. English quotation marks (see
> below) and w.r.t. visiting Western European language pages: For those,
> a number of pages which don't fail with Win-1252 as the default
> would start to fail. But relatively speaking, it is less important that
> non-English pages fail for the English locale.

No one is worried about that, particularly.

> There is a very good chance, also, that only very few of the Web pages
> for such professional institutions would fail to declare their encoding.

You'd be surprised.

>> Modulo smart quotes (and recently unicode ellipsis characters).  These
>> are actually pretty common in English text on the web nowadays, and have
>> a tendency to be in "ISO-8859-1".
>
> If we change the default, they will start to tend to be in UTF-8.

Not unless we change the authoring tools.  Half the time these things 
are just directly exported from a word processor.
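
As a rough illustration of what such an export looks like on the wire, here is a small Python sketch. The byte string is a made-up example, assuming the word processor wrote windows-1252 and the page is labeled "ISO-8859-1" or not labeled at all.

```python
# Hypothetical export: curly quotes around a word, written as windows-1252.
raw = b"\x93Hello\x94"

# What browsers actually do with the "ISO-8859-1" label: treat it as
# windows-1252, which maps 0x93 and 0x94 to curly quotes.
as_win1252 = raw.decode("windows-1252")

# Strict ISO-8859-1 would map the same bytes to C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

# A strict UTF-8 decoder rejects the bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_win1252), repr(as_latin1), utf8_ok)
```

Changing the browser default does nothing to those bytes; only the tool that wrote them can.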

> OK: Quotation marks. However, in 'old web pages', then you also find
> much more use of HTML entities (such as &ldquo;) than you find today.
> We should take advantage of that, no?

I have no idea what you're trying to say.

> When you mention quotation marks, then you mention a real locale
> related issue. And maybe the Euro sign too?

Not an issue for me personally, but it could be for some, yes.

> Nevertheless, the problem is smallest for languages that primarily
> limit their alphabet to those letters that are present in the American
> Standard Code for Information Interchange format.

Sure.  It may still be too big.

> It would be logical, thus, to start the switch to
> UTF-8 for those locales

If we start at all.

> Perhaps we need to have a project to measure these problems, instead of
> all these anecdotes?

Sure.  More data is always better than anecdotes.

-Boris


