[whatwg] Default encoding to UTF-8?

Tue Dec 6 16:36:26 PST 2011

Jukka K. Korpela Tue Dec 6 13:27:11 PST 2011
> 2011-12-06 22:58, Leif Halvard Silli wrote:
> 
>> There is now a bug, and the editor says the outcome depends on "a
>> browser vendor to ship it":
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=15076
>>
>> Jukka K. Korpela Tue Dec 6 00:39:45 PST 2011
>>
>>> what is this proposed change to defaults supposed to achieve. […]
>>
>> I'd say the same as in XML: UTF-8 as a reliable, common default.
> 
> The "bug" was created so that the argument given was:
> "It would be nice to minimize number of declarations a page needs to 
> include."

I just wanted to cite Kornel's original statement. But just because 
Kornel cited an authoring use case does not mean that there are no 
other use cases. This entire thread started with a user problem. Also, 
all of HTML5 argues in favour of UTF-8, so further justification did 
not seem that important.

> That is, author convenience - so that authors could work sloppily and 
> produce documents that could fail on user agents that haven't 
> implemented this change.

There already are locales where UTF-8 is the default, and the fact that 
this could benefit some sloppy authors within those locales is not a 
relevant argument against it. In the Western European locales, one can 
make documents that fail on UAs that don't operate within our 
locales. Thus, either way, some sloppy authors will "benefit" ... But 
with the proposed change, even users *outside* the locales that 
share the default encoding of the sloppy author's locale would benefit.

> This sounds more absurd than I can describe.
> 
> XML was created as a new data format; it was an entirely different issue.

HTML5 includes some features that are meant to make it easier to "jump" 
back and forth between HTML and XML, and this change would add one more 
such feature.

>>> If there's something that should be added to or modified in the
>>> algorithm for determining character encoding, then I'd say it's error
>>> processing. I mean user agent behavior when it detects, [...]
>>
>> There is already an (optional) detection step in the algorithm - but UAs
>> treat that step differently, it seems.
> 
> I'm afraid I can't find it - I mean the treatment of a document for 
> which some encoding has been deduced (say, directly from HTTP headers) 
> and which then turns out to violate the rules of the encoding.

Sorry, I thought you meant a document for which no metadata about the 
encoding was available (as described in step 7 - 'attempt to 
auto-detect' etc).
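
To keep the two cases apart, here is a rough Python sketch. It is not the 
HTML5 algorithm: the function name, the labels it returns, the strict UTF-8 
check standing in for the optional auto-detect step, and windows-1252 
standing in for a Western European locale default are all assumptions made 
for illustration only.

```python
from typing import Optional, Tuple


def decode_document(
    data: bytes,
    declared: Optional[str],
    locale_default: str = "windows-1252",  # assumed stand-in for a locale default
) -> Tuple[str, str]:
    """Decode `data` and report which of the cases discussed here applied.

    Hypothetical helper for illustration; not taken from any specification.
    """
    if declared is not None:
        # An encoding was deduced, e.g. from the HTTP Content-Type header.
        try:
            return data.decode(declared, errors="strict"), "declared-ok"
        except UnicodeDecodeError:
            # The bytes violate the rules of the declared encoding.
            # This sketch merely substitutes U+FFFD and flags the problem.
            return data.decode(declared, errors="replace"), "declared-violated"

    # No metadata about the encoding: try a simple form of detection.
    try:
        return data.decode("utf-8", errors="strict"), "detected-utf-8"
    except UnicodeDecodeError:
        # Detection failed, so fall back to the locale-dependent default.
        return data.decode(locale_default, errors="replace"), "locale-default"
```

For example, `decode_document(b"\xe9", "utf-8")` is the case of a declared 
encoding that the bytes violate, while `decode_document(b"\xe9", None)` is 
the case with no metadata, where the result depends on the assumed locale 
default. An unknown encoding label would raise `LookupError`, which the 
sketch does not handle.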

Leif H Silli