[whatwg] Default encoding to UTF-8?

Tue Dec 6 05:59:27 PST 2011

(2011/12/06 17:39), Jukka K. Korpela wrote:
> 2011-12-06 6:54, Leif Halvard Silli wrote:
> 
>> Yeah, it would be a pity if it had already become a widespread
>> cargo-cult to - all at once - use HTML5 doctype without using UTF-8
>> *and* without using some encoding declaration *and* thus effectively
>> relying on the default locale encoding ... Who does have a data corpus?

I found one: http://rink77.web.fc2.com/html/metatagu.html
It uses the HTML5 doctype, does not declare an encoding, and is encoded in
Shift_JIS, the default encoding of the Japanese locale.

> Since <!doctype html> is the simplest way to put browsers to "standards mode", this would punish authors who have realized that their page works better in "standards mode" but are unaware of a completely different and fairly complex problem. (Basic character encoding issues are of course not that complex to you and me or most people around here; but most authors are more or less confused with them, and I don't think we should add to the confusion.)

I don't think there is any page that works better in "standards mode" than in *current* loose mode.

> There's little point in changing the specs to say something very different from what previous HTML specs have said and from actual browser behavior. If the purpose is to make things more exactly defined (a fixed encoding vs. implementation-defined), then I think such exactness is a luxury we cannot afford. Things would be all different if we were designing a document format from scratch, with no existing implementations and no existing usage. If the purpose is UTF-8 evangelism, then it would be just the kind of evangelism that produces angry people, not converts.

Agreed, if we were designing a new spec, there would be no reason to choose anything other than UTF-8.
But HTML has a long history and a lot of existing content.
We already have HTML*5* pages which don't have an encoding declaration.

> If there's something that should be added to or modified in the algorithm for determining character encoding, then I'd say it's error processing. I mean user agent behavior when it detects, after running the algorithm, when processing the document data, that there is a mismatch between them. That is, that the data contains octets or octet sequences that are not allowed in the encoding or that denote noncharacters. Such errors are naturally detected when the user agent processes the octets; the question is what the browser should do then.

Current implementations replace such an invalid octet with a replacement character.
Some implementations instead scan most of the page and use an encoding
in which all octets in the page are valid.
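
To illustrate the two behaviors, here is a minimal Python sketch. It is not any browser's actual code; the function names, the sample text, and the candidate list are assumptions made up for this example.

```python
from typing import Optional, Sequence

# Illustrative sketch only: names and the candidate list are assumptions,
# not taken from any real browser implementation.


def decode_with_replacement(data: bytes, encoding: str) -> str:
    """Decode, turning each invalid octet sequence into U+FFFD."""
    return data.decode(encoding, errors="replace")


def first_valid_encoding(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate under which every octet decodes cleanly."""
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on any invalid octet
        except UnicodeDecodeError:
            continue
        return name
    return None


# Shift_JIS octets with no encoding declaration, as in the page cited above.
sjis_bytes = "文字コード".encode("shift_jis")

# Behavior 1: assume UTF-8 and replace whatever is invalid.
print(decode_with_replacement(sjis_bytes, "utf-8"))

# Behavior 2: pick an encoding in which all octets are valid.
print(first_valid_encoding(sjis_bytes, ["utf-8", "shift_jis", "euc_jp"]))  # shift_jis
```

Note that this kind of scan only helps with encodings that can reject input. Python's ISO-8859-1 decoder accepts every octet, so it would always "validate".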

> When data that is actually in ISO-8859-1 or some similar encoding has been mislabeled as UTF-8 encoded, then, if the data contains octets outside the ASCII range, character-level errors are likely to occur. Many ISO-8859-1 octets are just not possible in UTF-8 data. The converse error may also cause character-level errors. And these are not uncommon situations - they seem to occur increasingly often, partly due to cargo cult "use of UTF-8" (when it means declaring UTF-8 but not actually using it, or vice versa), partly due to increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 encoded data.

In such a case, the page should already fail to display correctly in the author's own environment.
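
As a small illustration of the mismatch in both directions, here is a Python sketch. The sample word and the helper name `is_valid_utf8` are assumptions chosen for this example.

```python
# Illustrative sketch only: the sample text and helper name are assumptions.

latin1_bytes = "café".encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")         # b'caf\xc3\xa9'


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the octets form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# ISO-8859-1 data mislabeled as UTF-8: the lone 0xE9 octet is not valid UTF-8.
print(is_valid_utf8(latin1_bytes))  # False

# UTF-8 data mislabeled as ISO-8859-1: no decoding error, but the text is wrong.
print(utf8_bytes.decode("iso-8859-1"))  # cafÃ©
```

The first direction is detectable at the octet level; the second decodes without error here and only shows up as wrong characters.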

> From the user's point of view, the character-level errors currently result in some gibberish (e.g., some odd box appearing instead of a character, in one place) or in a total mess (e.g. a large number of non-ASCII characters displayed all wrong). In either case, I think an error should be signalled to the user, together with
> a) automatically trying another encoding, such as the locale default encoding instead of UTF-8 or UTF-8 instead of anything else
> b) suggesting to the user that he should try to view the page using some other encoding, possibly with a menu of encodings offered as part of the error explanation
> c) a combination of the above.

This presumes that the user knows the correct encoding.
But do European people really know the correct encoding of ISO-8859-* pages?
As a Japanese person, I imagine it is hard to distinguish an ISO-8859-1 page from an ISO-8859-2 page.

> Although there are good reasons why browsers usually don't give error messages, this would be a special case. It's about the primary interpretation of the data in the document and about a situation where some data has no interpretation in the assumed encoding - but usually has an interpretation in some other encoding.

Some browsers alert users to scripting issues.
Why can't they alert users to an encoding issue as well?

> The current "Character encoding overrides" rules are questionable because they often mask out data errors that would have helped to detect problems that can be solved constructively. For example, if data labeled as ISO-8859-1 contains an octet in the 80...9F range, then it may well be the case that the data is actually windows-1252 encoded and the "override" helps everyone. But it may also be the case that the data is in a different encoding and that the "override" therefore results in gibberish shown to the user, with no hint of the cause of the problem.

I don't think such a case exists.
In the character encoding overrides, a superset encoding overrides the standard encoding it extends.
So I can't imagine such a case.
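
To make the superset relationship concrete, here is a Python sketch using Python's own `iso-8859-1` and `cp1252` codecs as stand-ins for the two labels. The sample octets and variable names are assumptions for this example, and Python's `cp1252` codec leaves a few octets (such as 0x81) undefined, so it is only an approximation of what browsers do.

```python
# Illustrative sketch only: sample octets and variable names are assumptions.

data = b"\x93quoted\x94"

# Under ISO-8859-1, 0x93 and 0x94 are invisible C1 control characters.
as_latin1 = data.decode("iso-8859-1")

# Under windows-1252, the same octets are curly quotation marks.
as_cp1252 = data.decode("cp1252")

print(repr(as_latin1))
print(as_cp1252)

# Outside the 0x80..0x9F range, the two encodings decode every octet identically,
# so the override changes nothing for those octets.
outside_c1 = list(range(0x00, 0x80)) + list(range(0xA0, 0x100))
same_elsewhere = all(
    bytes([b]).decode("iso-8859-1") == bytes([b]).decode("cp1252")
    for b in outside_c1
)
print(same_elsewhere)  # True
```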

> It would therefore be better to signal a problem to the user, display the page using the windows-1252 encoding but with some instruction or hint on changing the encoding. And a browser should in this process really analyze whether the data can be windows-1252 encoded data that contains only characters permitted in HTML.

Such verification should be done by developer tools, not by production browsers,
which are widely used by real users.

-- 
NARUSE, Yui  <naruse at airemix.jp>