[whatwg] Default encoding to UTF-8?

Mon Dec 5 18:55:00 PST 2011

On 12/5/11 6:14 PM, Leif Halvard Silli wrote:
>> It is more likely that there is another reason, IMHO: They may have
>> tried it, and found that it worked OK
> 
> Where by "it" you mean "open a text editor, type some text, and save". 
> So they get whatever encoding their OS and editor defaults to.

If that is all they tested, then I'd say they did not test enough.

> And yes, then they find that it works ok, so they don't worry about 
> encodings.

Ditto.

>>> No.  He's describing a problem using UTF-8 to view pages that are not
>>> written in English.
>>
>> And why is that a problem in those cases when it is a problem?
> 
> Because the characters are wrong?

But the characters will be wrong many more times than exactly those 
times when he tries to read a Web page in a Western European 
language that is not declared as WIN-1252. Do English locale users 
have particular expectations with regard to exactly those Web pages? 
What about Polish Web pages etc.? English locale users are a very 
multiethnic lot.

>> Does he read those languages, anyway?
> 
> Do you read English?  Seriously, what are you asking there, exactly?

Because if it is an issue, then it is an issue about expectations for 
exactly those pages. (Plus the quote problem, of course.)

> (For the record, reading a particular page in a language is a much 
> simpler task than reading the language; I can't "read German", but I can 
> certainly read a German subway map.)

Or a Polish subway map - which doesn't default to said encoding.

>> The solution I proposed was that English locale browsers should default
>> to UTF-8.
> 
> I know the solution you proposed.  That solution tries to avoid the 
> issues David was describing by only breaking things for people in 
> English browser locales, I understand that.

That characterization is only true with regard to the quote problem. 
That German pages "break" would not be any more important than the 
fact that Polish pages would. For that matter: It happens that UTF-8 
pages break as well.

I only suggest it as a first step, so to speak. Or rather - since some 
locales apparently already default to UTF-8 - as a next step. 
Thereafter, more locales would be expected to follow suit - as the 
development of each locale permits.

>>> Why does it matter?  David's default locale is almost certainly en-US,
>>> which defaults to ISO-8859-1 (or whatever Windows-??? encoding that
>>> actually means on the web) in his browser.  But again, he's changed the
>>> default encoding from the locale default, so the locale is irrelevant.
>>
>> The locale is meant to predominantly be used within a physical locale.
> 
> Yes, so?

So then we have a set of expectations for the language of that locale. 
If we look at how the locale settings handle other languages, then we 
are outside the issue that the locale-specific encodings are supposed 
to handle.

>> If he is at another physical locale or a virtually other locale, he
>> should not be expecting that it works out of the box unless a common
>> encoding is used.
> 
> He was responding to a suggestion that the default encoding be changed 
> to UTF-8 for all locales.  Are you _really_ sure you understood the 
> point of his mail?

I said I agreed with him that Faruk's solution was not good. However, I 
would not be against treating <!DOCTYPE html> as a 'default to UTF-8' 
declaration, as suggested by some - if it were possible to agree about 
that. Then we could keep things as they are, except for the HTML5 
DOCTYPE. I guess the HTML5 doctype would become 'the default before the 
default': If everything else fails, then UTF-8 if the DOCTYPE is 
<!DOCTYPE html>, or else the locale default.

It sounded like Darin Adler thinks it possible. How about you?
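
To make the 'default before the default' idea concrete, here is a minimal 
sketch in Python. It is not how any browser actually implements encoding 
detection: the function name, the locale table and its values are 
hypothetical, and the check for the doctype is deliberately simplistic.

```python
import re
from typing import Optional

# Hypothetical locale fallbacks, for illustration only; real browsers
# ship their own per-locale tables.
LOCALE_DEFAULTS = {
    "en-US": "windows-1252",
    "pl": "iso-8859-2",
    "ja": "shift_jis",
}

# Matches a leading HTML5 doctype, ignoring case and leading whitespace.
HTML5_DOCTYPE = re.compile(rb"^\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def fallback_encoding(head: bytes, declared: Optional[str], locale: str) -> str:
    """Pick an encoding: declaration first, then doctype, then locale."""
    if declared:
        # An explicit declaration (HTTP header, <meta>) always wins.
        return declared
    if HTML5_DOCTYPE.match(head):
        # Undeclared page with the HTML5 doctype: treat as UTF-8.
        return "utf-8"
    # Everything else keeps today's behaviour: the locale default.
    return LOCALE_DEFAULTS.get(locale, "windows-1252")
```

With this ordering, legacy pages without the HTML5 doctype are handled 
exactly as before, which is the point of the suggestion.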

>> Even today, if he visits Japan, he has to either
>> change his browser settings *or* to rely on the pages declaring their
>> encodings. So nothing would change, for him, when visiting Japan — with
>> his browser or with his computer.
> 
> He wasn't saying it's a problem for him per se.  He's a somewhat 
> sophisticated browser user who knows how to change the encoding for a 
> particular page.

If we are talking about an English locale user visiting Japan, then I 
doubt a change in the default encoding would matter - Win-1252 as 
default would be wrong anyway.
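
A small Python sketch of that point, assuming undeclared pages in 
ISO-8859-2 (Polish) and Shift_JIS (Japanese); the sample strings and the 
helper name are made up for illustration. Neither a Win-1252 default nor 
a UTF-8 default recovers the text:

```python
def misread(text: str, real_encoding: str, assumed_default: str) -> str:
    """Encode text as the page author did, then decode with the wrong default."""
    raw = text.encode(real_encoding)
    # errors="replace" mimics a browser showing replacement characters
    # instead of refusing to render the page.
    return raw.decode(assumed_default, errors="replace")


# Hypothetical undeclared pages: (text, encoding the author actually used).
samples = [("Łódź", "iso-8859-2"), ("日本語", "shift_jis")]

for text, real in samples:
    for default in ("windows-1252", "utf-8"):
        shown = misread(text, real, default)
        # Both defaults produce something other than the original text.
        print(f"{real} read as {default}: {shown!r} (correct: {shown == text})")
```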

> What he was saying is that there are lots of pages out there that aren't 
> encoded in UTF-8 and rely on locale fallbacks to particular encodings, 
> and that he's run into them a bunch while traveling in particular, so 
> they were not pages in English.  So far, you and he seem to agree.

So far we agree, yes.

>> Yes, there would be a change, w.r.t. English quotation marks (see
>> below) and w.r.t. visiting Western European language pages: For those,
>> a number of pages which don't fail with Win-1252 as the default
>> would start to fail. But relatively speaking, it is less important that
>> non-English pages fail for the English locale.
> 
> No one is worried about that, particularly.

You spoke about visiting German pages above - it sounded like you were 
worried, but maybe I misunderstood. Also, remember that Anne mentioned 
Dutch in the blog post. (Per my proposal, the Dutch locale should not be 
affected to begin with - no locale should switch until it is 'ready 
enough'.)

>> There is a very good chance, also, that only very few of the Web pages
>> for such professional institutions would fail to declare their encoding.
> 
> You'd be surprised.

I probably would. 

>>> Modulo smart quotes (and recently unicode ellipsis characters).  These
>>> are actually pretty common in English text on the web nowadays, and have
>>> a tendency to be in "ISO-8859-1".
>>
>> If we change the default, they will start to tend to be in UTF-8.
> 
> Not unless we change the authoring tools.  Half the time these things 
> are just directly exported from a word processor.

Please educate me. I'm perhaps 'handicapped' in that regard: I haven't 
used MS Word on a regular basis since MS Word 5.1 for Mac. Also, if 
"export" means "copy and paste", then on the Mac, everything gets 
converted via the clipboard: You can paste from a Latin-1 page to a 
UTF-8 page and vice-versa - it used to be like that, I think, even in 
Mac OS 9. I also thought that people are moving more and more to 'the 
cloud'. 

>> OK: Quotation marks. However, in 'old web pages', then you also find
>> much more use of HTML entities (such as &ldquo;) than you find today.
>> We should take advantage of that, no?
> 
> I have no idea what you're trying to say,

Sorry. What I meant was that character entities are encoding 
independent. And that lots of people - and authoring tools - have 
inserted non-ASCII letters and characters as character entities, 
especially in the olden days - it was considered the HTML way to do it. 
People thought character entities and lots of encodings were great 
things! (So did I - w.r.t. encoding.) At any rate: A page which uses 
character entities for non-ASCII characters would render the same 
regardless of encoding, hence a switch to UTF-8 would not matter for those.
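
A quick way to check this claim, sketched in Python (the sample markup is 
made up): a page that spells its quotes and Euro sign as entities is pure 
ASCII on the wire, so both defaults give the same text, while raw 
Win-1252 smart quotes do not survive a UTF-8 default.

```python
import html

# Page using character entities: only ASCII bytes on the wire.
entity_page = b"<p>&ldquo;Quoted&rdquo; text costs &euro;5</p>"

as_1252 = html.unescape(entity_page.decode("windows-1252"))
as_utf8 = html.unescape(entity_page.decode("utf-8"))
same_with_entities = as_1252 == as_utf8  # identical under both defaults

# Page with raw smart quotes saved as Win-1252 (bytes 0x93 and 0x94).
raw_page = "\u201cQuoted\u201d".encode("windows-1252")

raw_as_1252 = raw_page.decode("windows-1252")
# Under a UTF-8 default the quote bytes are invalid and get replaced.
raw_as_utf8 = raw_page.decode("utf-8", errors="replace")
same_with_raw = raw_as_1252 == raw_as_utf8

print(same_with_entities, same_with_raw)
```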

>> When you mention quotation marks, then you mention a real locale
>> related issue. And maybe the Euro sign too?
> 
> Not an issue for me personally, but it could be for some, yes.
> 
>> Nevertheless, the problem is smallest for languages that primarily
>> limit their alphabet to those letters that are present in the
>> American Standard Code for Information Interchange format.
> 
> Sure.  It may still be too big.
> 
>> It would be logical, thus, to start the switch to
>> UTF-8 for those locales
> 
> If we start at all.

Of course. 

>> Perhaps we need to have a project to measure these problems, instead of
>> all these anecdotes?
> 
> Sure.  More data is always better than anecdotes.

One should have some tool like the ones they have for observing TV 
habits ... where the user agrees to be observed etc ...
-- 
Leif H Silli