[whatwg] Default encoding to UTF-8?

Mon Dec 19 07:17:43 PST 2011

On Sun, Dec 11, 2011 at 1:21 PM, Leif Halvard Silli
<xn--mlform-iua at xn--mlform-iua.no> wrote:
>>>> (which means
>>>> *other-language* pages when the language of the localization doesn't
>>>> have a pre-UTF-8 legacy).
>>>
>>> Do you have any concrete examples?
>>
>> The example I had in mind was Welsh.
>
> Logical candidate. What do you know about the Farsi and Arabic locales?

Nothing, basically.

> I discovered that "UNICODE" is
> used as alias for "UTF-16" in IE and Webkit.
...
> Seemingly, this has not affected Firefox users too much.

It surprises me greatly that Gecko doesn't treat "unicode" as an alias
for "utf-16".

> Which must
> EITHER mean that many of these pages *are* UTF-16 encoded OR that their
> content is predominantly US-ASCII and thus the artefacts of parsing
> UTF-8 pages ("UTF-16" should be treated as "UTF-8" when it isn't
> "UTF-16") as WINDOWS-1252, do not affect users too much.

It's unclear to me if you are talking about HTTP-level charset=UNICODE
or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
BOMless?

>  (2) for the user tests you suggested in Mozilla bug 708995 (above),
> the presence of <meta charset=UNICODE> would trigger a need for Firefox
> users to select UTF-8 - unless the locale already defaults to UTF-8;

Hmm. The HTML spec isn't too clear about when alias resolution
happens, so I (incorrectly, I now think) mapped only "UTF-16",
"UTF-16BE" and "UTF-16LE" (ASCII-case-insensitive) to UTF-8 in meta
without considering aliases at that point. Hixie, was alias resolution
supposed to happen first? In Firefox, alias resolution happens after,
so <meta charset=iso-10646-ucs-2> is ignored per the non-ASCII
superset rule.
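
A rough Python sketch of the two orderings, to make the difference
concrete. The alias table and the function names are made up for
illustration; this is not Gecko's code and not the spec's label table,
and treating "iso-10646-ucs-2" as a UTF-16 alias is an assumption of
the sketch.

```python
# Hypothetical, tiny alias table for illustration only.
ALIASES = {
    "unicode": "utf-16",          # alias in IE and WebKit, per this thread
    "iso-10646-ucs-2": "utf-16",  # assumed alias for this sketch
}
UTF16_LABELS = {"utf-16", "utf-16be", "utf-16le"}


def normalize(label):
    # str.lower() approximates ASCII-case-insensitive matching here.
    return label.strip().lower()


def meta_alias_first(label):
    """Resolve aliases first, then map any UTF-16 flavor to UTF-8."""
    enc = ALIASES.get(normalize(label), normalize(label))
    return "utf-8" if enc in UTF16_LABELS else enc


def meta_alias_after(label):
    """Map only the literal UTF-16 labels to UTF-8, then resolve aliases.

    An alias that resolves to UTF-16 afterwards is not an ASCII
    superset, so the meta is ignored (None).
    """
    raw = normalize(label)
    if raw in UTF16_LABELS:
        return "utf-8"
    enc = ALIASES.get(raw, raw)
    return None if enc in UTF16_LABELS else enc
```

With this table, meta_alias_first("UNICODE") gives "utf-8", while
meta_alias_after("UNICODE") gives None, i.e. the meta is ignored and
the locale default applies.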

>> While UTF-8 is possible to detect, I really don't want to take Firefox
>> down the road where users who currently don't have to suffer page load
>> restarts from heuristic detection have to start suffering them. (I
>> think making incremental rendering any less incremental for locales
>> that currently don't use a detector is not an acceptable solution for
>> avoiding restarts. With English-language pages, the UTF-8ness might
>> not be apparent from the first 1024 bytes.)
>
> FIRSTLY, HTML5:
>
> ]] 8.2.2.4 Changing the encoding while parsing
> [...] This might happen if the encoding sniffing algorithm described
> above failed to find an encoding, or if it found an encoding that was
> not the actual encoding of the file. [[
>
> Thus, trying to detect UTF-8 is second last step of the sniffing
> algorithm. If it, correctly, detects UTF-8, then, while the detection
> probably affects performance, detecting UTF-8 should not lead to a need
> for re-parsing the page?

Let's consider, for simplicity, the locales for Western Europe and the
Americas that default to Windows-1252 today. If browsers in these
locales started doing UTF-8-only detection, they could either:
 1) Start the parse assuming Windows-1252 and reload if the detector says UTF-8.
 2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
detector says non-UTF-8.

(Buffering the whole page is not an option, since it would break
incremental loading.)

Option #1 would be bad, because we'd see more and more reloading over
time assuming that authors start using more and more UTF-8-enabled
tools over time but don't go through the trouble of declaring UTF-8,
since the pages already seem to "work" without declarations.

Option #2 would be bad, because pages that didn't reload before would
start reloading and possibly executing JS side effects twice.
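
A minimal Python sketch of where the reload would be triggered under
each option. The detection policy (declare UTF-8 as soon as one
complete non-ASCII sequence decodes, declare non-UTF-8 on the first
invalid sequence) and the function names are assumptions made for
illustration, not how any shipping detector works.

```python
import codecs


def first_decision(chunks):
    """Return (verdict, chunk_index) for the chunk that settles the guess."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    last = None
    for i, chunk in enumerate(chunks):
        last = i
        try:
            text = decoder.decode(chunk)
        except UnicodeDecodeError:
            return "non-utf-8", i
        # Only a completed multi-byte sequence yields a non-ASCII character.
        if any(ord(ch) >= 0x80 for ch in text):
            return "utf-8", i
    try:
        decoder.decode(b"", final=True)  # flush a dangling partial sequence
    except UnicodeDecodeError:
        return "non-utf-8", last
    return "ascii-so-far", None


def reload_point(chunks, initial):
    """Chunk index at which a parse started as `initial` must restart, or None."""
    verdict, i = first_decision(chunks)
    if verdict == "utf-8" and initial == "windows-1252":
        return i  # option 1: started as Windows-1252, detector says UTF-8
    if verdict == "non-utf-8" and initial == "utf-8":
        return i  # option 2: started as UTF-8, detector says non-UTF-8
    return None
```

For a page whose first chunk is pure ASCII and whose second chunk
contains b"caf\xc3\xa9", reload_point(page, "windows-1252") is 1: the
restart comes only after the first chunk has already been parsed and
rendered, which is the incremental-rendering cost described above.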

> SECONDLY: If there is a UTF-8 silo - that leads to undeclared UTF-8
> pages, then it is the browsers *outside* that silo which eventually
> suffer (browsers that do default to UTF-8 do not need to perform UTF-8
> detection, I suppose - or what?). So then it is partly a matter of how
> large the silo is.
>
> Regardless, we must consider: The alternative to undeclared UTF-8 pages
> would be undeclared legacy encoding pages, roughly speaking.
> Which the browsers outside the silo then would have to detect. And
> which would be more *demanding* to detect than simply detecting UTF-8.

Well, so far (except for sv-SE (but no longer) and zh-TW), Firefox has
not *by default* done cross-silo detection and has managed to get
non-trivial market share, so it's not a given that browsers from
outside a legacy silo *have to* detect.

> However, what you had in mind was the change of the default encoding for
> a particular silo from legacy encoding to UTF-8. This, I agree, would
> lead to some pages being treated as UTF-8 - to begin with. But when the
> browser detects that this is wrong, it would have to switch to -
> probably - the "old" default - the legacy encoding.
>
> However, why would it switch *from* UTF-8 if UTF-8 is the default? We
> must keep the problem in mind: For the siloed browser, UTF-8 will be
> its fall-back encoding.

Doesn't the first of these two paragraphs answer the question posed in
the second one?

>> It's rather counterintuitive that the persistent autodetection
>> setting is in the same menu as the one-off override.
>
> You talk about View->Character_Encoding->Auto-Detect->Off ? Anyway: I
> agree that the encoding menus could be simpler/clearer.
>
> I think the most counter-intuitive thing is to use the word
> "auto-detect" about the heuristic detection - see what I said above
> about "behaves automatic even when auto-detect is disabled". Opera's
> default setting is called "Automatic selection". So it is "all
> automatic" ...

Yeah, "automatic" means different things in different browsers.

>> As for heuristic detection based on the bytes of the page, the only
>> heuristic that can't be disabled is the heuristic for detecting
>> BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was
>> believed to have been giving that sort of files to their customers and
>> it "worked" in pre-HTML5 browsers that silently discarded all zero
>> bytes prior to tokenization.) The Cyrillic and CJK detection
>> heuristics can be turned on and off by the user.
>
> I always wondered what the "Universal" detection meant. Is that simply
> the UTF-8 detection?

Universal means that it runs all the detectors that Firefox supports
in parallel, so the possible guessing space isn't constrained by locale.
The other modes constrain the guessing space to a locale. For example,
the Japanese detector won't give a Chinese or Cyrillic encoding as its
guess.
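
Roughly, in Python terms. The candidate sets and names below are
invented for illustration and do not reflect the actual detector
tables in Firefox.

```python
# Hypothetical candidate sets per detector mode, for illustration only.
DETECTOR_CANDIDATES = {
    "japanese": {"shift_jis", "euc-jp", "iso-2022-jp"},
    "chinese": {"gb2312", "big5"},
    "cyrillic": {"koi8-r", "windows-1251"},
}


def candidates(mode):
    """Encodings a given detector mode is allowed to guess."""
    if mode == "universal":
        # Universal is the union of every detector's guessing space.
        return set().union(*DETECTOR_CANDIDATES.values())
    return set(DETECTOR_CANDIDATES[mode])


def constrained_guess(ranked_guesses, mode):
    """Pick the best-ranked guess that the mode permits, or None."""
    allowed = candidates(mode)
    for enc in ranked_guesses:
        if enc in allowed:
            return enc
    return None
```

Given the ranking ["windows-1251", "shift_jis"], the "japanese" mode
returns "shift_jis" while "universal" returns "windows-1251".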

> So let's say that you tell your Welsh localizer that: "Please switch to
> WINDOWS-1252 as the default, and then instead I'll allow you to enable
> this brand new UTF-8 detection." Would that make sense?

Not really. I think we shouldn't spread heuristic detection to any
locale that doesn't already have it.

>> Within an origin, Firefox considers the parent frame and the previous
>> document in the navigation history as sources of encoding guesses.
>> That behavior is not user-configurable to my knowledge.
>
> W.r.t. iframe, then the "big in Norway" newspaper Dagbladet.no is
> declared ISO-8859-1 encoded and it includes at least one ads-iframe that
> is undeclared ISO-8859-1 encoded.
>
> * If I change the default encoding of Firefox to UTF-8, then the main
> page works but that ad fails, encoding wise.

Yes, because the ad is different-origin, so it doesn't inherit the
encoding from the parent page.
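
As a simplified Python sketch of that rule for an undeclared framed
document. The function names and URLs are hypothetical, the origin
comparison ignores default-port normalization, and the "previous
document in the navigation history" source is left out.

```python
from urllib.parse import urlsplit


def origin(url):
    """(scheme, host, port) tuple; a simplification of the real origin."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


def fallback_encoding(frame_url, parent_url, parent_encoding, locale_default):
    """Encoding guess for an undeclared framed document."""
    if parent_url is not None and origin(frame_url) == origin(parent_url):
        return parent_encoding  # same origin: inherit from the parent frame
    return locale_default       # different origin: locale default applies
```

For example, with a UTF-8 default, an undeclared frame at
http://ads.example/ inside an ISO-8859-1 page at http://news.example/
gets "utf-8", while a frame at http://news.example/ad gets
"iso-8859-1".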

> * But if I enable the Universal encoding detector, the ad does not fail.
>
> * Let's say that I *kept* ISO-8859-1 as default encoding, but instead
> enabled the Universal detector. The frame then works.
> * But if I make the frame page very short, 10 * the letter "ø" as
> content, then the Universal detector fails - on a test on my own
> computer, it guesses the page to be Cyrillic rather than Norwegian.
> * What's the problem? The Universal detector is too greedy - it tries
> to fix more problems than I have. I only want it to guess on "UTF-8".
> And if it doesn't detect UTF-8, then it should fall back to the locale
> default (including fall back to the encoding of the parent frame).
>
> Wouldn't that be an idea?

No. The current configuration works for Norwegian users already. For
users from different silos, the ad might break, but ad breakage is
less bad than spreading heuristic detection to more locales.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/