[whatwg] Default encoding to UTF-8?

Leif Halvard Silli xn--mlform-iua at xn--mlform-iua.no
Thu Dec 22 02:36:16 PST 2011


Henri Sivonen, Mon Dec 19 07:17:43 PST 2011:
> On Sun, Dec 11, 2011 at 1:21 PM, Leif Halvard Silli wrote:

Sorry for my slow reply.

> It surprises me greatly that Gecko doesn't treat "unicode" as an alias
> for "utf-16".
> 
>> Which must
>> EITHER mean that many of these pages *are* UTF-16 encoded OR that their
>> content is predominantly US-ASCII, and thus the artefacts of parsing
>> UTF-8 pages ("UTF-16" should be treated as "UTF-8" when it isn't
>> UTF-16) as WINDOWS-1252 do not affect users too much.
> 
> It's unclear to me if you are talking about HTTP-level charset=UNICODE
> or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
> BOMless?

Charset=UNICODE in meta, as generated by MS tools (Office or IE, e.g.),
seems to usually be "BOM-full". But there are still enough occurrences
of pages without a BOM. I have found UTF-8 pages with the
charset=unicode label in meta. But the few pages I found contained
either a BOM or an HTTP-level charset=utf-8. I have too little
"research material" when it comes to UTF-8 pages with charset=unicode
inside.
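
For what it is worth, here is a minimal sketch (in Python, and not any
browser's actual code) of the kind of check I did by hand: look for a
BOM first, and only fall back to the charset=unicode label if none is
present. The function name, and the decision to read a BOM-less
"unicode" label as UTF-8, are my own assumptions, purely for
illustration:

def effective_encoding_for_meta_unicode(raw: bytes) -> str:
    # A BOM, when present, settles the question (the "BOM-full" case).
    if raw.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if raw.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # BOM-less: treat the label like an in-meta UTF-16 label, i.e.
    # parse as UTF-8 ("UTF-8 when it isn't UTF-16").
    return "utf-8"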

>>  (2) for the user tests you suggested in Mozilla bug 708995 (above),
>> the presence of <meta charset=UNICODE> would trigger a need for Firefox
>> users to select UTF-8 - unless the locale already defaults to UTF-8;
> 
> Hmm. The HTML spec isn't too clear about when alias resolution
> happens, so I (incorrectly, I now think) mapped only "UTF-16",
> "UTF-16BE" and "UTF-16LE" (ASCII-case-insensitive) to UTF-8 in meta
> without considering aliases at that point. Hixie, was alias resolution
> supposed to happen first? In Firefox, alias resolution happens after,
> so <meta charset=iso-10646-ucs-2> is ignored per the non-ASCII
> superset rule.

Waiting to hear what Hixie says ...
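
In the meantime, here is a rough sketch (Python, purely illustrative)
of the two orderings in question. The alias table is partial and the
function names are mine, not the spec's or Gecko's:

UTF16_ALIASES = {"unicode", "iso-10646-ucs-2", "csunicode"}  # partial, illustrative

def meta_charset_aliases_first(label):
    name = label.strip().lower()
    if name in UTF16_ALIASES:
        name = "utf-16"              # alias resolution happens first ...
    if name in ("utf-16", "utf-16le", "utf-16be"):
        return "utf-8"               # ... so the label gets mapped to UTF-8
    return name

def meta_charset_aliases_after(label):
    name = label.strip().lower()
    if name in ("utf-16", "utf-16le", "utf-16be"):
        return "utf-8"               # only the literal UTF-16 spellings are mapped
    if name in UTF16_ALIASES:
        return None                  # ignored per the non-ASCII superset rule
    return name

# With aliases resolved first, <meta charset=iso-10646-ucs-2> yields UTF-8;
# with resolution afterwards, the label is simply ignored, as described
# for Firefox above.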

>>> While UTF-8 is possible to detect, I really don't want to take Firefox
>>> down the road where users who currently don't have to suffer page load
>>> restarts from heuristic detection have to start suffering them. (I
>>> think making incremental rendering any less incremental for locales
>>> that currently don't use a detector is not an acceptable solution for
>>> avoiding restarts. With English-language pages, the UTF-8ness might
>>> not be apparent from the first 1024 bytes.)
>>
>> FIRSTLY, HTML5:
>>
>> ]] 8.2.2.4 Changing the encoding while parsing
>> [...] This might happen if the encoding sniffing algorithm described
>> above failed to find an encoding, or if it found an encoding that was
>> not the actual encoding of the file. [[
>>
>> Thus, trying to detect UTF-8 is the second-to-last step of the sniffing
>> algorithm. If it correctly detects UTF-8, then, while the detection
>> probably affects performance, detecting UTF-8 should not lead to a need
>> for re-parsing the page?
> 
> Let's consider, for simplicity, the locales for Western Europe and the
> Americas that default to Windows-1252 today. If browser in these
> locales started doing UTF-8-only detection, they could either:
>  1) Start the parse assuming Windows-1252 and reload if the detector 
> says UTF-8.

When the detector says UTF-8 - that is step 7 of the sniffing
algorithm, no?
http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
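
For reference, this is roughly how I read that order (a self-contained
paraphrase in Python, not a normative rendering of the spec; the exact
step numbering differs slightly, and the helpers are naive stand-ins of
my own):

import re

BOMS = {b"\xef\xbb\xbf": "utf-8",
        b"\xff\xfe": "utf-16-le",
        b"\xfe\xff": "utf-16-be"}

def prescan_meta(prefix):
    # Naive stand-in for the <=1024-byte meta prescan.
    m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_-]+)', prefix[:1024])
    return m.group(1).decode("ascii").lower() if m else None

def looks_like_utf8(prefix):
    # A real detector would tolerate a multi-byte sequence truncated
    # at the end of the buffer; this sketch does not.
    try:
        prefix.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def sniff_encoding(prefix, http_charset=None, user_override=None,
                   inherited=None, locale_default="windows-1252"):
    for bom, enc in BOMS.items():
        if prefix.startswith(bom):
            return enc                   # a BOM settles it
    if user_override:
        return user_override             # explicit user choice
    if http_charset:
        return http_charset              # transport-layer charset
    meta = prescan_meta(prefix)
    if meta:
        return meta                      # in-document declaration
    if inherited:
        return inherited                 # same-origin frame / history
    if looks_like_utf8(prefix):
        return "utf-8"                   # the detection step ("step 7")
    return locale_default                # last resort: the locale default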

>  2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
> detector says non-UTF-8.
> 
> (Buffering the whole page is not an option, since it would break
> incremental loading.)
> 
> Option #1 would be bad, because we'd see more and more reloading over
> time assuming that authors start using more and more UTF-8-enabled
> tools over time but don't go through the trouble of declaring UTF-8,
> since the pages already seem to "work" without declarations.

So the so-called badness is only a theory about what will happen - how
the web will develop. As is, there is nothing particularly bad about
starting out with UTF-8 as the assumption.

I think you are mistaken there: If parsers perform UTF-8 detection,
then unlabelled pages will be detected, and no reparsing will happen -
reparsing will not even increase. You at least need to explain this
negative-spiral theory better before I buy it ... Step 7 will *not*
lead to reparsing unless the default encoding is WINDOWS-1252. If the
default encoding is UTF-8 and step 7 detects UTF-8, then parsing can
continue uninterrupted.
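
To make that concrete, here is a toy comparison of the two options (the
numbers and the function are mine, nothing measured), counting restarts
over a handful of unlabelled pages:

def needs_reload(page_is_utf8, assumed_default):
    # True if parsing started under assumed_default must be restarted
    # once the detector has made up its mind.
    detected = "utf-8" if page_is_utf8 else "windows-1252"
    return detected != assumed_default

# Four unlabelled pages, three of them UTF-8 (the assumed future trend):
pages = [True, True, True, False]
for assumed in ("windows-1252", "utf-8"):
    reloads = sum(needs_reload(p, assumed) for p in pages)
    print(assumed, "->", reloads, "reload(s) out of", len(pages))
# windows-1252 -> 3 reload(s) out of 4
# utf-8 -> 1 reload(s) out of 4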

What we will instead see is that those using legacy encodings must be 
more clever in labelling their pages, or else they won't be detected. 

I am a bit baffled here: It sounds like you are saying that there will
be bad consequences if browsers become more reliable ...

> Option #2 would be bad, because pages that didn't reload before would
> start reloading and possibly executing JS side effects twice.

#1 sounds least bad, since the only badness you describe is a theory
about what this behaviour would lead to w.r.t. authors.

>> SECONDLY: If there is a UTF-8 silo that leads to undeclared UTF-8
>> pages, then it is the browsers *outside* that silo which eventually
>> suffer (browsers that default to UTF-8 do not need to perform UTF-8
>> detection, I suppose - or what?). So then it is partly a matter of how
>> large the silo is.
>>
>> Regardless, we must consider: The alternative to undeclared UTF-8 pages
>> would be undeclared legacy-encoding pages, roughly speaking. Which the
>> browsers outside the silo would then have to detect. And which would be
>> more *demanding* to detect than simply detecting UTF-8.
> 
> Well, so far (except for sv-SE (but no longer) and zh-TW), Firefox has
> not *by default* done cross-silo detection and has managed to get
> non-trivial market share, so it's not a given that browsers from
> outside a legacy silo *have to* detect.

Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
detection. So it might still be a competitive advantage.
 
>> However, what you had in mind was the change of the default encoding for
>> a particular silo from legacy encoding to UTF-8. This, I agree, would
>> lead to some pages being treated as UTF-8 - to begin with. But when the
>> browser detects that this is wrong, it would have to switch to -
>> probably - the "old" default - the legacy encoding.
>>
>> However, why would it switch *from* UTF-8 if UTF-8 is the default? We
>> must keep the problem in mind: For the siloed browser, UTF-8 will be
>> its fall-back encoding.
> 
> Doesn't the first of these two paragraphs answer the question posed in
> the second one?

Depends. HTML5 says about UTF-8 detection: "Documents that contain 
bytes with values greater than 0x7F which match the UTF-8 pattern are 
very likely to be UTF-8". 

In other words: HTML5's - nevertheless - legacy-encoding-biased
approach takes as its assumption that, without non-ASCII bytes, the
page can safely default to Windows-1252. In this scheme, when parsing
reaches step 7 of the encoding sniffing, the browser will just stay in
its default/legacy encoding mode, unless the author was smart enough
to add a snowman symbol to the otherwise, for the time being, US-ASCII
web page.

The approach I suggest, however, would assume UTF-8 unless the
non-ASCII bytes are incompatible with UTF-8. So, if the page is
US-ASCII-compatible, my approach would default the page to UTF-8.
(Whereas HTML5 specifies - or at least describes - that such pages
should default to the locale.) This is quite a relevant thing to do
when we consider that there are a few authoring tools/editors that spit
out US-ASCII and simply convert non-US-ASCII to character references:
Many US-ASCII-labelled pages contain characters from "the whole of
Unicode".

 [ Shaking my head towards Anne: A simple default to WINDOWS-1252 in
the face of the US-ASCII label therefore seems like a bad idea. ]

One can also simply observe that Firefox performs less detection than
Opera and Chrome, for instance. (This is my 'observatory':
http://www.malform.no/testing/utf/). Furthermore: There is more
detection going on in XML, it seems ... There is at least no general
move towards less and less detection ...

I cannot see that the approach I suggest above will lead to more
reloading. If the trend towards UTF-8 continues, then we will, no
doubt, see more unlabelled UTF-8. But this can only lead to reloading
if the default is *not* UTF-8.

>>> It's rather counterintuitive that the persistent autodetection
>>> setting is in the same menu as the one-off override.
>>
>> You talk about View->Character_Encoding->Auto-Detect->Off ? Anyway: I
>> agree that the encoding menus could be simpler/clearer.
>>
>> I think the most counter-intuitive thing is to use the word
>> "auto-detect" about the heuristic detection - see what I said above
>> about "behaves automatic even when auto-detect is disabled". Opera's
>> default setting is called "Automatic selection". So it is "all
>> automatic" ...
> 
> Yeah, "automatic" means different things in different browsers.

One of my hobbyhorses is that "automatic" and "default" often are synonyms ...

>>> As for heuristic detection based on the bytes of the page, the only
>>> heuristic that can't be disabled is the heuristic for detecting
>>> BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was

Just wanted to say that it seems like WebKit has no customers in that
bank ... See what I said in another mail about WebKit's bad UTF-16
detection.

>>> believed to have been giving that sort of files to their customers and
>>> it "worked" in pre-HTML5 browsers that silently discarded all zero
>>> bytes prior to tokenization.) The Cyrillic and CJK detection
>>> heuristics can be turned on and off by the user.
>>
>> I always wondered what the "Universal" detection meant. Is that simply
>> the UTF-8 detection?
> 
> Universal means that it runs all the detectors that Firefox supports
> in parallel, so possible guessing space isn't constrained by locale.
> The other modes constrain the guessing space to a locale. For example,
> the Japanese detector won't give a Chinese or Cyrillic encoding as its
> guess.

OK. I guess Chrome runs with a universal detector all the time ... The
Japanese detection is, I think, pretty "heavy": There are several
Japanese encodings, in addition to UTF-16 and UTF-8. And yet, you have
said that for the Asian locales, detection is often auto-enabled. Not
to speak of the Universal detection (which I think must be very little
used).

And then *I* am speaking about a rather light detection, which only
checks for UTF-8. And this, you say, is too much. I don't get it. Or
rather: I don't buy it. :-)
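
If I understand the "guessing space" point correctly, it amounts to
something like this (a toy sketch; the names, structure and numbers are
mine, not Mozilla's code):

def constrained_guess(candidate_scores, locale_encodings=None):
    # candidate_scores: {encoding: confidence} as produced by the
    # individual detectors running in parallel. A locale-constrained
    # mode only keeps that locale's encodings; "universal" keeps all.
    if locale_encodings is not None:
        candidate_scores = {enc: score for enc, score in candidate_scores.items()
                            if enc in locale_encodings}
    if not candidate_scores:
        return None
    return max(candidate_scores, key=candidate_scores.get)

scores = {"shift_jis": 0.6, "euc-jp": 0.2, "koi8-r": 0.7, "utf-8": 0.1}
print(constrained_guess(scores))                 # koi8-r (universal guessing space)
print(constrained_guess(scores,
                        {"shift_jis", "euc-jp", "iso-2022-jp"}))  # shift_jis (Japanese-only)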

>> So let's say that you tell your Welsh localizer that: "Please switch to
>> WINDOWS-1252 as the default, and then instead I'll allow you to enable
>> this brand new UTF-8 detection." Would that make sense?
> 
> Not really. I think we shouldn't spread heuristic detection to any
> locale that doesn't already have it.

"spread to locale": For me, Chrome will detect this page as Arabic: 
<http://www.malform.no/testing/utf/html/koi8/1>. *That* is too much 
'spread of heuristic detection', to my taste. But do you consider UTF-8 
detection as "heuristic detection" too, then? If so, then we clearly 
disagree. 

>>> Within an origin, Firefox considers the parent frame and the previous
>>> document in the navigation history as sources of encoding guesses.
>>> That behavior is not user-configurable to my knowledge.
>>
>> W.r.t. iframes: the "big in Norway" newspaper Dagbladet.no is
>> declared ISO-8859-1 encoded, and it includes at least one ads iframe
>> that is undeclared ISO-8859-1 encoded.
>>
>> * If I change the default encoding of Firefox to UTF-8, then the main
>> page works but that ad fails, encoding wise.
> 
> Yes, because the ad is different-origin, so it doesn't inherit the
> encoding from the parent page.
> 
>> * But if I enable the Universal encoding detector, the ad does not fail.
>>
>> * Let's say that I *kept* ISO-8859-1 as default encoding, but instead
>> enabled the Universal detector. The frame then works.
>> * But if I make the frame page very short, 10 * the letter "ø" as
>> content, then the Universal detector fails - in a test on my own
>> computer, it guesses the page to be Cyrillic rather than Norwegian.
>> * What's the problem? The Universal detector is too greedy - it tries
>> to fix more problems than I have. I only want it to guess on "UTF-8".
>> And if it doesn't detect UTF-8, then it should fall back to the locale
>> default (including fall back to the encoding of the parent frame).
>>
>> Wouldn't that be an idea?
> 
> No. The current configuration works for Norwegian users already. For
> users from different silos, the ad might break, but ad breakage is
> less bad than spreading heuristic detection to more locales.

Here I must disagree: Less bad for whom? Quite bad for the newspaper.
And bad for anyone with a web site that depends on iframe content from
another site. The current behaviour means that it becomes more
complicated to move to UTF-8. With a shift in mentality, we can move
more firmly to UTF-8.
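
As a sketch of the fall-back order I have in mind for such an
undeclared same-origin frame (Python; the names are mine, for
illustration only): guess UTF-8 only when the bytes allow it, otherwise
inherit from the parent frame, otherwise use the locale default.

def frame_fallback(frame_bytes, parent_encoding=None,
                   locale_default="windows-1252"):
    try:
        frame_bytes.decode("utf-8")
        return "utf-8"               # UTF-8-only detection succeeded
    except UnicodeDecodeError:
        pass
    if parent_encoding:
        return parent_encoding       # e.g. the ISO-8859-1 parent page
    return locale_default

# The ten-"ø" test page: ten 0xF8 bytes (ISO-8859-1 "ø") are not valid
# UTF-8, so this falls back to the parent's ISO-8859-1 instead of a
# Cyrillic guess.
print(frame_fallback(b"\xf8" * 10, parent_encoding="iso-8859-1"))  # iso-8859-1
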
-- 
Leif H Silli

