[whatwg] Default encoding to UTF-8?

Fri Dec 9 05:34:08 PST 2011

On Fri, Dec 9, 2011 at 12:33 AM, Leif Halvard Silli
<xn--mlform-iua at xn--mlform-iua.no> wrote:
> Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
> These localizations are nevertheless live tests. If we want to move
> more firmly in the direction of UTF-8, one could ask users of those
> 'live tests' about their experience.

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=708995

>> (which means
>> *other-language* pages when the language of the localization doesn't
>> have a pre-UTF-8 legacy).
>
> Do you have any concrete examples?

The example I had in mind was Welsh.

> And are there user complaints?

Not that I know of, but I'm not part of any feedback loop, if there
even is one here.

> The Serb localization uses UTF-8. The Croat uses Win-1252, but only on
> Windows and Mac: On Linux it appears to use UTF-8, if I read the HG
> repository correctly.

OS-dependent differences are *very* suspicious. :-(

>> I think that defaulting to UTF-8 is always a bug, because at the time
>> these localizations were launched, there should have been no unlabeled
>> UTF-8 legacy, because up until these locales were launched, no
>> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
>> UTF-8 is harmful, because it makes it possible for locale-siloed
>> unlabeled UTF-8 content to come into existence
>
> The current legacy encodings nevertheless create siloed pages already.
> I'm also not sure that it would be a problem with such a UTF-8 silo:
> UTF-8 is possible to detect, for browsers - Chrome seems to perform
> more such detection than other browsers.

While UTF-8 is possible to detect, I really don't want to take Firefox
down the road where users who currently don't have to suffer page load
restarts from heuristic detection have to start suffering them. (I
think making incremental rendering any less incremental for locales
that currently don't use a detector is not an acceptable solution for
avoiding restarts. With English-language pages, the UTF-8ness might
not be apparent from the first 1024 bytes.)
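
To illustrate that last point, here is a minimal sketch, not what any
browser actually does. It assumes a detector that only looks at a fixed
prefix of the byte stream; the function name `prefix_verdict` and the
sample page are made up for illustration. When the prefix is pure
ASCII, the check cannot tell UTF-8 from windows-1252, so a decision
made at that point may have to be revised later.

```python
import codecs


def prefix_verdict(data: bytes, limit: int = 1024) -> str:
    """Classify only the first `limit` bytes of a page (hypothetical helper)."""
    prefix = data[:limit]
    if all(b < 0x80 for b in prefix):
        # Pure ASCII: valid in UTF-8 and in windows-1252 alike,
        # so the prefix carries no evidence either way.
        return "ascii-only"
    try:
        # final=False tolerates a multi-byte sequence cut off by the limit.
        codecs.getincrementaldecoder("utf-8")().decode(prefix, final=False)
    except UnicodeDecodeError:
        return "not-utf-8"
    return "looks-utf-8"


# An English-language page whose only non-ASCII character comes late.
page = (
    b"<!DOCTYPE html><title>News</title>"
    + b"<p>plain English text</p>" * 60
    + "<p>caf\u00e9</p>".encode("utf-8")
)

early = prefix_verdict(page)                  # looks at 1024 bytes only
full = prefix_verdict(page, limit=len(page))  # looks at the whole page
legacy = prefix_verdict("caf\u00e9</p>".encode("windows-1252"))
```

Here `early` is inconclusive while `full` finds the UTF-8 sequence,
which is exactly the case where a detector would either have to wait
for more bytes or restart the load.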

> In another message you suggested I 'lobby' against authoring tools. OK.
> But the browser is also an authoring tool.

In what sense?

> So how can we have authors
> output UTF-8, by default, without changing the parsing default?

Changing the default is an XML-like solution: creating breakage for
users (who view legacy pages) in order to change author behavior.

To the extent a browser is a tool Web authors use to test stuff, it's
possible to add various whining to the console without breaking legacy
sites for users. See
https://bugzilla.mozilla.org/show_bug.cgi?id=672453
https://bugzilla.mozilla.org/show_bug.cgi?id=708620

> Btw: In Firefox, in one sense, it is impossible to disable
> "automatic" character detection: In Firefox, overriding of the encoding
> only lasts until the next reload.

A persistent setting for changing the fallback default is in the
"Advanced" subdialog of the font prefs in the "Content" preference
pane. It's rather counterintuitive that the persistent autodetection
setting is in the same menu as the one-off override.

As for heuristic detection based on the bytes of the page, the only
heuristic that can't be disabled is the heuristic for detecting
BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was
believed to have been giving that sort of file to its customers, and
it "worked" in pre-HTML5 browsers that silently discarded all zero
bytes prior to tokenization.) The Cyrillic and CJK detection
heuristics can be turned on and off by the user.
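
As a rough sketch of why such files are recognizable and why they
"worked" before: BOMless UTF-16 that encodes only Basic Latin has a
zero byte in every other position, and dropping the zero bytes leaves
plain ASCII. The function below is a hypothetical illustration of that
idea, not Firefox's actual heuristic; its name and return values are
assumptions.

```python
from typing import Optional


def guess_bomless_utf16_basic_latin(data: bytes) -> Optional[str]:
    """Guess the byte order of BOMless UTF-16 holding only Basic Latin.

    Returns "utf-16le", "utf-16be", or None when the pattern does not match.
    """
    if len(data) < 2 or len(data) % 2:
        return None
    even, odd = data[0::2], data[1::2]

    def all_zero(bs: bytes) -> bool:
        return not any(bs)

    def nonzero_ascii(bs: bytes) -> bool:
        return all(0 < b < 0x80 for b in bs)

    if all_zero(odd) and nonzero_ascii(even):
        return "utf-16le"  # low byte first, high byte always zero
    if all_zero(even) and nonzero_ascii(odd):
        return "utf-16be"  # high byte (zero) first
    return None


sample = "<p>Balance</p>".encode("utf-16-le")  # this codec writes no BOM

# What the old zero-discarding behavior effectively fed the tokenizer:
stripped = sample.replace(b"\x00", b"")
```

With this input, `stripped` is the same byte string as the ASCII
encoding of the markup, which is why the old behavior masked the
mislabeling.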

Within an origin, Firefox considers the parent frame and the previous
document in the navigation history as sources of encoding guesses.
That behavior is not user-configurable to my knowledge.

Firefox also remembers the encoding from previous visits as long as
Firefox otherwise has the page in cache. So for testing, it's
necessary to make Firefox forget about previous visits to the test
case.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/