[whatwg] Default encoding to UTF-8?
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Sun Dec 11 03:21:40 PST 2011
Henri Sivonen Fri Dec 9 05:34:08 PST 2011:
> On Fri, Dec 9, 2011 at 12:33 AM, Leif Halvard Silli:
>> Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
>> These localizations are nevertheless live tests. If we want to move
>> more firmly in the direction of UTF-8, one could ask users of those
>> 'live tests' about their experience.
>
> Filed https://bugzilla.mozilla.org/show_bug.cgi?id=708995
This is brilliant. Looking forward to the results!
>>> (which means
>>> *other-language* pages when the language of the localization doesn't
>>> have a pre-UTF-8 legacy).
>>
>> Do you have any concrete examples?
>
> The example I had in mind was Welsh.
Logical candidate. What do you know about the Farsi and Arabic locales?
HTML5 specifies UTF-8 for them - due to the way Firefox behaves, I
think. IE seems to be dominant for these locales - at least in Iran.
Firefox was number two in Iran, but still only at around 5 percent in
the stats I saw.
Btw, as I looked into Iran a bit ... I discovered that "UNICODE" is
used as an alias for "UTF-16" in IE and Webkit. And, for XML, Webkit,
Firefox and Opera see it as a non-fatal error (though Opera just treats
all illegal encoding names that way), while IE9 seems to treat it as
fatal. I filed an HTML5 bug:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142
Seemingly, this has not affected Firefox users too much. Which must
EITHER mean that many of these pages *are* UTF-16 encoded, OR that
their content is predominantly US-ASCII, so that the artefacts of
parsing UTF-8 pages ("UTF-16" should be treated as "UTF-8" when the
content isn't actually UTF-16) as WINDOWS-1252 do not affect users too
much.
I mention it here for 3 reasons:
(1) charset=Unicode inside <meta> is produced by MSHTML, including by
Word. And Boris mentioned Word's behaviour as a reason for keeping the
legacy defaulting. However, when MSHTML saves with charset=UNICODE,
then falling back to the legacy default is not the correct behaviour
for browsers. (I don't know exactly when MSHTML spits out
charset=UNICODE, though - or whether it is locale-dependent whether
MSHTML spits out charset=UNICODE - or what.)
(2) For the user tests you suggested in Mozilla bug 708995 (above),
the presence of <meta charset=UNICODE> would force Firefox users to
manually select UTF-8 - unless the locale already defaults to UTF-8;
(3) That HTML5 bug 15142 (see above) has apparently been unknown till
now, despite affecting Firefox and Opera, hints that, for the
"WINDOWS-1252 languages", users survive when pages are served as UTF-8
but parsed as WINDOWS-1252 (by Firefox and Opera). (Of course, some of
these pages will be "picked up" by an Apache Content-Type header
declaring the encoding, or perhaps by chardet.)
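To make the "UNICODE" quirk concrete, here is a minimal sketch in
Python (not browser code; the alias set and the WINDOWS-1252 fallback
are my assumptions) of the rule I have in mind: honour a
"UNICODE"/"UTF-16" label only when a BOM is present, and otherwise
treat the bytes as UTF-8 when they decode as such:

    UTF16_ALIASES = {"unicode", "utf-16", "utf-16le", "utf-16be"}  # assumed alias set

    def effective_encoding(label: str, payload: bytes) -> str:
        label = label.strip().lower()
        if label in UTF16_ALIASES:
            if payload.startswith((b"\xff\xfe", b"\xfe\xff")):
                return "utf-16"          # a real BOM: honour the UTF-16 label
            try:
                payload.decode("utf-8")  # no BOM: is the content valid UTF-8?
                return "utf-8"
            except UnicodeDecodeError:
                return "windows-1252"    # assumed last-resort legacy fallback
        return label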
>> And are there user complaints?
>
> Not that I know of, but I'm not part of a feedback loop if there even
> is a feedback loop here.
>
>> The Serb localization uses UTF-8. The Croat uses Win-1252, but only on
>> Windows and Mac: On Linux it appears to use UTF-8, if I read the HG
>> repository correctly.
>
> OS-dependent differences are *very* suspicious. :-(
Mmm, yes.
>>> I think that defaulting to UTF-8 is always a bug, because at the time
>>> these localizations were launched, there should have been no unlabeled
>>> UTF-8 legacy, because up until these locales were launched, no
>>> browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
>>> UTF-8 is harmful, because it makes it possible for locale-siloed
>>> unlabeled UTF-8 content come to existence
>>
>> The current legacy encodings nevertheless creates siloed pages already.
>> I'm also not sure that it would be a problem with such a UTF-8 silo:
>> UTF-8 is possible to detect, for browsers - Chrome seems to perform
>> more such detection than other browsers.
>
> While UTF-8 is possible to detect, I really don't want to take Firefox
> down the road where users who currently don't have to suffer page load
> restarts from heuristic detection have to start suffering them. (I
> think making incremental rendering any less incremental for locales
> that currently don't use a detector is not an acceptable solution for
> avoiding restarts. With English-language pages, the UTF-8ness might
> not be apparent from the first 1024 bytes.)
FIRSTLY, HTML5:
]] 8.2.2.4 Changing the encoding while parsing
[...] This might happen if the encoding sniffing algorithm described
above failed to find an encoding, or if it found an encoding that was
not the actual encoding of the file. [[
Thus, trying to detect UTF-8 is the second-to-last step of the sniffing
algorithm. If it correctly detects UTF-8, then, while the detection
probably affects performance, it should not lead to a need to re-parse
the page?
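To illustrate what I mean, a rough sketch (my own assumption of how
such a prescan could work, not Gecko code): if the first 1024 bytes
contain non-ASCII and are valid - possibly truncated - UTF-8, guess
UTF-8 before parsing starts, so no restart is needed:

    import codecs

    def prescan_guesses_utf8(prefix: bytes) -> bool:
        # Pure ASCII tells us nothing; leave the decision to the fallback.
        if not any(b >= 0x80 for b in prefix):
            return False
        decoder = codecs.getincrementaldecoder("utf-8")()
        try:
            # final=False tolerates a multi-byte sequence cut off at byte 1024.
            decoder.decode(prefix, final=False)
            return True
        except UnicodeDecodeError:
            return False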
SECONDLY: If there is a UTF-8 silo that leads to undeclared UTF-8
pages, then it is the browsers *outside* that silo which eventually
suffer (browsers that do default to UTF-8 do not need to perform UTF-8
detection, I suppose - or what?). So then it is partly a matter of how
large the silo is.
Regardless, we must consider: The alternative to undeclared UTF-8 pages
would be undeclared legacy-encoded pages, roughly speaking. Which the
browsers outside the silo then would have to detect. And which would be
more *demanding* to detect than simply detecting UTF-8.
However, what you had in mind was changing the default encoding for a
particular silo from a legacy encoding to UTF-8. This, I agree, would
lead to some pages being treated as UTF-8 - to begin with. But when the
browser detects that this is wrong, it would have to switch to -
probably - the "old" default, the legacy encoding.
However, why would it switch *from* UTF-8 if UTF-8 is the default? We
must keep the problem in mind: For the siloed browser, UTF-8 will be
its fall-back encoding.
>> In another message you suggested I 'lobby' against authoring tools. OK.
>> But the browser is also an authoring tool.
>
> In what sense?
The problem with defaults is when they take effect without one's
knowledge. Or one may think everything is OK, until one sees that it
isn't.
The ReSpec.js script runs in your browser, and saving the output is one
of the problems it has:
http://dev.w3.org/2009/dap/ReSpec.js/documentation.html#saving-the-generated-specification
Quote: "And sadly enough browsers aren't very good at saving HTML
that's been modified by script."
The docs do not discuss the encoding problem. But I believe that is
exactly one of the problems it has.
* A browser will save the page using the page's encoding.
* A browser will not add a META element if the page doesn't have
  one. Thus, if it is HTTP that specifies the encoding, then saving
  the page to the computer will mean that the next time it is opened
  - from the hard disk - the page will default to the locale
  default, meaning that one must select UTF-8 to make the page
  readable. (MSHTML - aka IE - will add the encoding - such as
  charset=UNICODE ... - if you switch the encoding during saving -
  I'm not exactly sure about the requirements.)
This probably needs more thought and more ideas, but what can be done
to make this better? One reason for a browser not to add <meta
charset="something" /> if the page doesn't have it already is, perhaps,
that it could be incorrect - maybe because the user changed the
encoding manually. But if we consider how text editors - e.g. on the
Mac - have been working for a while now, then you have to take steps if
you *don't* want to save the page as UTF-8. So perhaps browsers could
start to behave the same way? That is: Regardless of the original
encoding, save it as UTF-8, unless the user overrides it?
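A minimal sketch of that "save as UTF-8 unless the user overrides it"
idea, assuming a hypothetical save routine (none of these names are
real browser APIs): re-encode to UTF-8 and write a <meta charset> into
the file when the encoding only came from HTTP:

    import re

    def save_page(markup: str, target_path: str, encoding: str = "utf-8") -> None:
        # If the page relied on HTTP (or the locale default) for its
        # encoding, write the chosen encoding into the file so that it
        # survives the round trip to the hard disk.
        if not re.search(r"<meta[^>]+charset", markup, re.IGNORECASE):
            markup = re.sub(r"(<head[^>]*>)",
                            r'\1<meta charset="%s">' % encoding,
                            markup, count=1, flags=re.IGNORECASE)
        with open(target_path, "w", encoding=encoding) as f:
            f.write(markup)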
* Another idea: Perform heuristics more extensively when the file is
on the hard disk than when it is online? No, this could lead users to
think a page works online because it works offline?
>> So how can we have authors
>> output UTF-8, by default, without changing the parsing default?
>
> Changing the default is an XML-like solution: creating breakage for
> users (who view legacy pages) in order to change author behavior.
That reasoning doesn't consider that everyone who saves an HTML page
from the Web to their hard disk is an author. We would be avoiding
making the round-trip behaviour more reliable because there exists an
ever-diminishing amount of legacy-encoded pages.
> To the extent a browser is a tool Web authors use to test stuff, it's
> possible to add various whining to console without breaking legacy
> sites for users. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=672453
> https://bugzilla.mozilla.org/show_bug.cgi?id=708620
Good stuff!
>> Btw: In Firefox, then in one sense, it is impossible to disable
>> "automatic" character detection: In Firefox, overriding of the encoding
>> only lasts until the next reload.
>
> A persistent setting for changing the fallback default is in the
> "Advanced" subdialog of the font prefs in the "Content" preference
> pane.
I know. I was not commenting, here, on the "global" default encoding,
but instead on a subtle difference between the effect of a manual
override in Firefox (and IE) compared to especially Safari. In Safari -
if you have e.g. a UTF-8 page that is otherwise correctly made, e.g.
with <meta charset="UTF-8"> - then a manual switch to e.g. KOI8-R will
have a lasting effect in the current tab: You can reload the page as
many times as you wish: Each time it will be treated as KOI8-R. Whereas
in Firefox and IE, the manual switch to KOI8-R only lasts for one
reload. The next time you reload, the browser will listen to the
encoding signals from the page and from the server again.
Opera, instead, remembers your manual switch of the encoding even if
you open the page in a new tab or window, and even after a browser
restart - Opera is alone in doing this, which I think is against HTML5:
HTML5 only allows the browser to override what the page says *provided*
that the page doesn't say anything ... (As such, even the Safari
behaviour is dubious, I'd say. FWIW, iCab allows you to tell it to
"please start listening to the signals from the page and server
again".)
SO: What I meant by "impossible to disable", thus, was that Firefox
and IE, from the user's perspective, behave "automatically" even if
auto-detect is disabled: They listen to the signals from the page and
server rather than, like Safari and Opera, listening to the "last
signal from the user".
> It's rather counterintuitive that the persistent autodetection
> setting is in the same menu as the one-off override.
You mean View->Character Encoding->Auto-Detect->Off? Anyway: I agree
that the encoding menus could be simpler/clearer.
I think the most counter-intuitive thing is to use the word
"auto-detect" about the heuristic detection - see what I said above
about behaving "automatically" even when auto-detect is disabled.
Opera's default setting is called "Automatic selection". So it is "all
automatic" ...
> As for heuristic detection based on the bytes of the page, the only
> heuristic that can't be disabled is the heuristic for detecting
> BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was
> believed to have been giving that sort of files to their customers and
> it "worked" in pre-HTML5 browsers that silently discarded all zero
> bytes prior to tokenization.) The Cyrillic and CJK detection
> heuristics can be turned on and off by the user.
I always wondered what the "Universal" detection meant. Is it simply
UTF-8 detection? Or does it also detect other encodings? Unicode is
sometimes referred to as the "Universal" encoding/character set. If
that is what it means, then "Unicode" would have been clearer than
"Universal".
Hm, it seems that "Universal" and not "Unicode" is meant, in which case
"All" or similar would have been better ...
So it seems to me that it is not possible to *enable* only *UTF-8
detection*: The only option for getting UTF-8 detection is to use the
Universal detection - which enables everything.
It seems to me that if you offered *only* UTF-8 detection, then you
would have something useful up your sleeve if you want to tempt the
localizers *away* from UTF-8 as the default. Because as I said above:
If the browser *defaults* to UTF-8, then UTF-8 detection isn't so
useful (it would then only be useful for detecting that the page is
*not* Unicode).
So let's say that you tell your Welsh localizer: "Please switch to
WINDOWS-1252 as the default, and instead I'll allow you to enable this
brand new UTF-8 detection." Would that make sense?
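For concreteness, a sketch of what such a UTF-8-only detection could
amount to (a hypothetical helper, not an existing Firefox preference):
the fallback stays WINDOWS-1252, and UTF-8 is only chosen when the
bytes actually prove it:

    def choose_encoding(prefix: bytes, locale_fallback: str = "windows-1252") -> str:
        # Only promote to UTF-8 when non-ASCII bytes are present *and*
        # they form valid UTF-8; everything else keeps the locale default.
        if not any(b >= 0x80 for b in prefix):
            return locale_fallback
        try:
            prefix.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return locale_fallback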
> Within an origin, Firefox considers the parent frame and the previous
> document in the navigation history as sources of encoding guesses.
> That behavior is not user-configurable to my knowledge.
W.r.t. iframes, the "big in Norway" newspaper Dagbladet.no is declared
as ISO-8859-1 encoded, and it includes at least one ad iframe that is
undeclared ISO-8859-1.
* If I change the default encoding of Firefox to UTF-8, then the main
page works but that ad fails, encoding wise.
* But if I enable the Universal encoding detector, the ad does not fail.
* Let's say that I *kept* ISO-8859-1 as default encoding, but instead
enabled the Universal detector. The frame then works.
* But if I make the frame page very short, with 10 * the letter "ø" as
its content, then the Universal detector fails - in a test on my own
computer, it guessed the page to be Cyrillic rather than Norwegian.
* What's the problem? The Universal detector is too greedy - it tries
to fix more problems than I have. I only want it to guess on "UTF-8".
And if it doesn't detect UTF-8, then it should fall back to the locale
default (including falling back to the encoding of the parent frame).
Wouldn't that be an idea?
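To make the "ø" case concrete (under my assumptions about that test
page): ten "ø" letters in ISO-8859-1 are ten 0xF8 bytes, which can
never be valid UTF-8, so a UTF-8-only check would reject them at once
and the parent frame's ISO-8859-1 could be used instead - no Cyrillic
guessing involved:

    frame_bytes = "ø".encode("iso-8859-1") * 10   # b'\xf8' * 10
    try:
        frame_bytes.decode("utf-8")
        guess = "utf-8"
    except UnicodeDecodeError:
        guess = "iso-8859-1"   # fall back to the parent frame / locale default
    print(guess)               # -> iso-8859-1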
> Firefox also remembers the encoding from previous visits as long as
> Firefox otherwise has the page in cache. So for testing, it's
> necessary to make Firefox forget about previous visits to the test
> case.
--
Leif H Silli