[whatwg] Encoding sniffing algorithm
Leif Halvard Silli
xn--mlform-iua at xn--mlform-iua.no
Sun Sep 9 22:31:25 PDT 2012
Ian Hickson ian at hixie.ch on Thu Sep 6 12:55:03 PDT 2012:
> On Fri, 27 Jul 2012, Leif Halvard Silli wrote:
>> Revised encoding sniffing algorithm proposal:
>> NEW! 0. document is XML format - opt out of the algorithm.
>> [This step is already implicit in the spec, but it would
>> make sense to explicitly include it to make sure that
>> one could e.g. write test cases to see that it is step
>> is implemented. Currently Safari, Chrome and Opera do
>> not 100% implement this step.]
> I don't understand the relevance of the algorithm to XML. Why would anyone
> even look at this algorithm if they were parsing XML?
In principle it should not be needed. Agreed.
But many of those who are parsing XML are also parsing HTML - for that
reason it should be natural for them to compare specs and requirements.
Currently, Webkit and Chromium in particular seem to be colored by
their HTML parsing when they parse XML. (See the table in my blog
post.) Also, the spec does in a few places include phrases similar to "if it
is XML, then abort these steps" (for example in '3.4.1 Opening the
input stream'),[*] so there is some precedent, I think.
>> NEW! #. Alternative: The BOM signature could go here instead of
>> in step 5. There is a bug to move the BOM hereto and make
>> it override anything else. What speaks against this are:
>> a) that Firefox, IE10 and Opera do not currently have
>> this behavior.
>> b) this revision of the sniffing algorithm, especially
>> the revision in step 6 (required UTF-8 detection),
>> might make the BOM-trumps-everything-else override
>> less necessary
>> What speaks for this override:
>> a) Safari, Chrome and legacy IE implement it.
>> b) some legacy content may depend on it
> Not sure what this means.
You will be dealing with it when you take care of Anne's bug: "Bug
15359 Make BOM trump HTTP". [*] Thus, you can just ignore it.
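For readers who want to see what "BOM trumps HTTP" amounts to, here is a minimal sketch. It is not the spec's algorithm and not any browser's code; the function names `bom_encoding` and `pick_encoding` are hypothetical, and only the UTF-8 and UTF-16 signatures are considered.

```python
import codecs

# BOM signatures checked here: UTF-8 first, then the two UTF-16 byte orders.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)

def bom_encoding(data):
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

def pick_encoding(data, http_charset=None):
    """Hypothetical helper: a BOM, when present, overrides the HTTP charset."""
    return bom_encoding(data) or http_charset
```

With this ordering, `pick_encoding(b"\xef\xbb\xbfabc", "koi8-r")` yields "utf-8" even though HTTP said KOI8-R. Swapping the two operands of `or` would give the behavior where HTTP wins.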
>> 1. user override.
>> (PS: The spec should clarify whether user override is
> This seems to be entirely a user interface issue.
But then, why do you go on to describe it in the new note? (See below.)
>> NEW! 2. iframe inherits user override from parent browsing context
>> [Currently not mentioned in the spec, despite that "all"
>> UAs do have this step for HTML docs.]
> That's a UI issue much like whether it's remembered or not. But I've added
> a non-normative note.
Your new note:
"""1. Typically, user agents remember such user requests
across sessions, and in some cases apply them to
documents in iframes as well."""
1: How does that differ from the "info on the likely encoding" step?
2: Could you define 'sessions' somewhere? It sounds to me as if the
'sessions' behavior that you describe resembles the Opera behavior,
which is bad when the Opera behavior is the least typical one. (And
most annoying from a page developer's point of view.) The typical thing
- which Opera breaks! - is to, in some way or another, limit the
encoding override to the current *tab* only. Thus, if you insist on
describing what UAs "typically" do, then you should instead of
describing the exception (Opera), say that browsers *differ*, but that
the typical thing is to limit the encoding override, some way or
another, to the current tab.
3: Browsers differ enough for you to evaluate how they behave and
pick the best behavior. However, I'd say Firefox is best as it offers a
compromise between IE and Webkit. (See below.)
Comments in more detail:
FIRSTLY: Regarding "across sessions", my assumption would be
that a "single session" is equal to the lifespan of a single tab (or a
single window, if there is no tab in the window). If so, then that is
how Safari/Chrome behave: Override lasts as long as one stays in the
SECONDLY: Does 'sessions' relate to a particular document - as in
"document during several sessions"? Or to a particular tab/window - as
in "session = tab"?
* Under FIRSTLY, I described how Safari/Chrome behave: They do not
give heed to the document. They *only* give heed to the current
tab/window: If you override a document to use the KOI8-R encoding then
the next document you load in the same tab will use the KOI8-R encoding
* Internet Explorer (version 8, at least) will, by contrast, give
heed to that particular document, it seems. Thus, it seems to not reuse
the overridden encoding in case it meets a new document, in the same
tab, whose encoding is not declared. *However*, just as in Safari/Chrome,
once you open the same document (whose encoding was overridden) in a
new tab, it doesn't remember the encoding override anymore. So the
encoding override is tied to the document as long as it is loaded in the
* Firefox behaves as Safari/Chrome but with one very important
difference: Let's say you first override the encoding. Now, if you load
a new document in the same tab, one whose encoding is correctly
declared, then the declared encoding will be used. (So in Safari/Chrome
it applies to all docs in the same tab - whereas in Firefox it only applies
to docs with undeclared encoding.)
* Opera is the aggressive one - it remembers the encoding even if I
load the page in another tab.
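The four behaviors can be summarized as a rough model. This is only a sketch of my reading of the observations above, not anything taken from browser source. The names `Load` and `override_applies` are hypothetical, and the Opera branch in particular assumes the override follows the page and nothing else.

```python
from dataclasses import dataclass

@dataclass
class Load:
    same_tab: bool       # loaded in the tab where the override was made
    same_document: bool  # the very document whose encoding was overridden
    declared: bool       # the document declares its own encoding

def override_applies(browser, load):
    """Return True if the earlier manual override is used for this load."""
    if browser == "safari_chrome":
        # Tied to the tab only: every document in that tab is affected.
        return load.same_tab
    if browser == "ie8":
        # Tied to the document, and only while it stays in the same tab.
        return load.same_tab and load.same_document
    if browser == "firefox":
        # Like Safari/Chrome, but a declared encoding on a new document wins.
        return load.same_tab and (load.same_document or not load.declared)
    if browser == "opera":
        # Assumption: remembered for the page even when opened in another tab.
        return load.same_document
    raise ValueError(browser)
```

For example, `override_applies("firefox", Load(True, False, True))` is False, while the same load under "safari_chrome" is True.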
>> NEW! 6. UTF-8 detection.
>> I think we should separate UTF-8 detection from other
>> detection in order to make this step obligatory.
>> The newness here is only the limitation to UTF-8
>> detection plus that it should be obligatory.
>> (Thus: If it is not detected as UTF-8, then
>> the parser proceeds to next step in the algorithm.)
>> This step would make browsers lean more strongly
>> towards UTF-8.
> Without a specific algorithm to detect UTF-8, this is meaningless.
Right ... How about the UTF-8 detector in Mozilla's chardet? I read
that it detects sequences that are unique to UTF-8. Just the class
that detects UTF-8. (I tried to find that class, but I was not sure
how to locate it ... can't read C code ...)
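To make the idea concrete, here is a minimal sketch of what a UTF-8-only detector could look like. It is not Mozilla's chardet code; it simply relies on strict decoding, and the function name `looks_like_utf8` is hypothetical. It treats pure ASCII as "not detected", since ASCII bytes carry no evidence for UTF-8 over a legacy encoding.

```python
def looks_like_utf8(data):
    """Return True if data is valid UTF-8 and contains at least one non-ASCII byte."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Some byte sequence is not well-formed UTF-8.
        return False
    # Valid UTF-8; only count it as detected if multi-byte sequences occur.
    return any(byte > 0x7F for byte in data)
```

A real detector would also have to cope with a buffer that ends in the middle of a multi-byte sequence, which this sketch would reject as invalid.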
>> NEW! 7. parent browsing context default.
>> The current spec does not mention this step at all,
>> despite that both Opera, IE, Safari, Chrome, Firefox
>> do implement it.
> Added. (Some comprehensive testing of this would be good, e.g. comparing
> it to each of the earlier and later steps, considering it with different
> ways of giving the encoding, differnet locales, etc.)
Indeed. Different domains is a very relevant point: Shortly after my
publication in July, I was made aware of the fact that I had not taken
account of that in the description. Namely: The parent browsing context
and the iframe document have got to be from the same domain. If they
are not from the same domain, then the iframe does not inherit the
encoding from the parent browsing context. I could not find a single
[current] browser that lets the parent browsing context win if the two
contexts are from different domains.
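A sketch of that rule, under a stated assumption: "same domain" is taken here to mean an identical hostname. Browsers may well compare more than that (scheme and port, for instance), which I have not tested. The function name `inherited_encoding` is hypothetical.

```python
from urllib.parse import urlsplit

def inherited_encoding(parent_url, parent_encoding, frame_url):
    """Return the parent's encoding if the frame shares its hostname, else None."""
    same_host = urlsplit(parent_url).hostname == urlsplit(frame_url).hostname
    return parent_encoding if same_host else None
```

A result of None means the iframe document gets nothing from its parent and the algorithm moves on to its next step.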
I have a test file here which tests some aspects of this, including (as of
today!) the different domain thing:
Otherwise, I have tried to test the other things you describe - the
earlier and later steps, different ways of giving the encoding etc.
However, I will probably take a look at it again to see if I find that
I have overlooked something etc - and maybe make some more tests. (It
is remarkable how fast one's mind is blurred on these things - one
thinks one remembers but ...)
>> Regarding 6. and 7., then the order is important. Chrome
>> does for instance perform UTF-8 detection, but it does it
>> only /after/ the parent browsing context. Whereas everyone
>> else (Opera 12 by default, Firefox for some locales - don't
>> know if there are others) let it happen before the 'parent
>> browsing context default'.
> Can you elaborate on this?
This is a tricky topic.
First: All browsers that perform locale encoding sniffing also
perform UTF-8 sniffing. (Exception: Firefox Cyrillic sniffer)
Second: For browsers which do *NOT* perform UTF-8 sniffing, then
the parent browsing context encoding is typically the
SECOND LAST step, before the browser/locale default.
Third: So when will encoding inheritance from the parent browsing
context take place in browsers that DO perform sniffing?
* Will it happen after sniffing - and thus overrule the sniffing?
* Or will the sniffing overrule the parent browsing context?
In order to find the answers to the third point, we must
investigate the resulting encoding of an, in principle sniffable, page
that gets served as the iframe of another page whose encoding is
In order to check browser behavior, please take a look at the 4th -
FOURTH - frame of this page:
WHAT TO LOOK FOR: The parent page is KOI8-R encoded. Whereas the fourth
frame is UTF-8 encoded, but without encoding declaration in any form.
Now, if the browser's resulting encoding for the fourth iframe is
KOI8-R, then we can be close to 100% certain that the parent encoding
default overrules the sniffing.
Chrome: parent browsing context WINS over sniffing.
Opera: sniffing wins over parent browsing context
Firefox: sniffing wins over parent browsing context
For Opera and Firefox, I tested both by using localized browsers as
well as by simply selecting the "automatic choice" option in the
respective browsers. (The point, right now, is to document which
function wins *if* the sniffing is enabled - and not to document
*when* sniffing is enabled.)
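The difference between the two orderings can be sketched as follows. This is an illustration of the observed outcomes only, not of how any browser is implemented; the function `frame_encoding` and its `parent_first` flag are hypothetical names.

```python
def frame_encoding(sniffed, parent, locale_default, parent_first):
    """Pick the encoding for an iframe document that declares no encoding.

    sniffed: encoding found by sniffing, or None if nothing was detected.
    parent: encoding inherited from the parent browsing context, or None.
    parent_first: True for the Chrome-like order, False for Opera/Firefox-like.
    """
    order = (parent, sniffed) if parent_first else (sniffed, parent)
    for candidate in order:
        if candidate:
            return candidate
    # Neither step produced an answer: fall back to the browser/locale default.
    return locale_default
```

For the fourth frame of the test page, the Chrome-like order gives KOI8-R and the Opera/Firefox-like order gives UTF-8.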
By the way, here is an overview of which browsers and localizations,
as far as I have been able to find out, come with sniffing enabled
(2) UTF-8 sniffing + sniffing of varying locale encodings:
* All locales of Chrome
* Russian/Ukrainian/Byelorussian locale of Opera
* Japanese/Chinese/Korean locale of Opera and Firefox
Firefox Russian (and Belorussian/Ukrainian ??) locale did offer
sniffing by default before. But I imagine that it was disabled
because Firefox's Cyrillic encoding sniffing does not offer UTF-8
sniffing. And a sniffer without UTF-8 sniffing is probably almost
useless as it will "sniff" UTF-8 to be Cyrillic. (At least, that is how
the Cyrillic sniffer currently behaves, when enabled.)
>> NEW! 8. info on “the likely encoding”
>> The main newness is that this step is placed _after_
>> the (revised) UTF-8 detection and after the (new) parent
>> browsing context default.
>> The name 'the likely encoding' is from the current spec
>> text. I am a bit uncertain about what it means in the
>> current spec, though. So I move here what I think make
>> sense. The steps under this point should perhaps be
>> a. detection of other charsets than UTF-8
>> (e.g the optional Cyrillic detection in
>> Firefox or legacy Asian encoding detection.
>> The actual detection might happen in step 6,
>> but it should only be made to count here.)
> I don't understand your reasoning on the desired ordering here.
It is related to the previous step 7. (But I got to think more about
>> b. markup label of the sister language
>> <?xml version="1.0" encoding="UTF-8"?>
>> (Opera/Webkit/Chrome currently have this directly
>> after the native encoding label step - step 5.
> No idea what this means.
Or you disagree? That's fair enough. Perhaps you should say that the XML
encoding is forbidden to *use*? Some browsers do use it. Are they, by
this, violating the "likely encoding step"? To me, if a browser sees
the above XML encoding declaration, and it has nothing more to go on,
then it could very well guess that the page is likely to be UTF-8, no?
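As a sketch of what "using" the XML encoding declaration as a hint could look like: the regular expression below is my own simplification, not the grammar from the XML spec, and the function name `xml_declared_encoding` is hypothetical. It only looks at a declaration at the very start of the byte stream.

```python
import re

# Simplified pattern: an XML declaration at offset 0 with an encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def xml_declared_encoding(head):
    """Return the lowercased encoding label from a leading XML declaration, or None."""
    match = _XML_DECL.match(head)
    return match.group(1).decode("ascii").lower() if match else None
```

Whether such a label should count at all, and at which step, is exactly the question under discussion; the sketch only shows that the label is cheap to extract.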
>> c. Other things? What does "likely encoding" current
>> refer to, exactly?
> The spec gives an example.
I don't understand the example. It sounds as if what you describe
here fits with Safari/Chrome's manual encoding override behavior (which
I described above and which takes place earlier in the algorithm, for
leif halvard silli