[whatwg] Encoding Sniffing

Mon Apr 23 01:19:38 PDT 2012

On Sat, Apr 21, 2012 at 1:21 PM, Anne van Kesteren <annevk at opera.com> wrote:
> This morning I looked into what it would take to define Encoding Sniffing.
> http://wiki.whatwg.org/wiki/Encoding#Sniffing has links as to what I looked
> at (minus Opera internal). As far as I can tell Gecko has the most
> comprehensive approach and should not be too hard to define (though writing
> it all out correctly and clear will be some work).

The Gecko notes aren't quite right:
 * The detector chosen from the UI is used for HTML and plain text
when loading those in a browsing context from HTTP GET or from a
non-http URL. (Not used for POST responses. Not used for XHR.)
 * The default for the UI setting depends on the locale. Most locales
default to know detector at all. Only zh-TW defaults to the Universal
detector. (I'm not sure why, but I think this is a bug of *some* kind.
Perhaps the localizer wanted to detect both Traditional and Simplified
Chinese encodings and we don't have a detector configuration for
Traditional&Simplified.) Other locales that default to having a
detector enabled default to a locale-specific detector (e.g. Japanese
or Ukranian).
 * The Universal detector is used regardless of UI setting or locale
when using the FileReader to read a local file as text. (I'm
personally very unhappy about this sort of use of heuristics in a new
feature.)
 * The Universal detector isn't really universal. In particular, it
misdetects Central European encodings like ISO-8859-2. (I'm personally
unhappy that we expose the Universal detector in the UI and thereby
bait people to enable it.)
 * Regardless of detector setting, when loading HTML or plain text in
a browsing context, Basic Latin encoded as UTF-16BE or UTF-16LE is
detected. This detection is not performed by FileReader.

> I have some questions though:
>
> 1) Is this something we want to define and eventually implement the same
> way?

I think yes in principle. In practice, it might be hard to get this
done. E.g. in the case of Gecko, we'd need someone who has no higher
priority work than rewriting chardet in compliance with the
hypothetical spec.

I don't want to enable heuristic detection for all HTML page loads.
Yet, it seems that we can't get rid of it for e.g. the Japanese
context. (It's so sad that the situation is the worst in places that
have multiple encodings and, therefore, logically should be more aware
of the need to declare which one is in use. Sigh.) I think it is bad
that the Web-exposed behavior of the browser depends on the UI locale
of the browser. I think it would be worthwhile research project to
find out if that were feasible to trigger language-specific heuristic
detection on a per TLD basis instead on a per UI locale basis (e.g.
enabling the Japanese detector for all pages loaded from .jp and the
Russian detector for all pages loaded from .ru regardless of UI locale
and requiring .com Japanese or Russian sites to get their charset act
together or maybe having a short list of popular special cases that
don't use a country TLD but don't declare the encoding, either).

> 2) Does this need to apply outside HTML? For JavaScript it forbidden per the
> HTML standard at the moment. CSS and XML do not allow it either. Is it used
> for decoding text/plain at the moment?

Detection is used for text/plain in Gecko when it would be used for text/html.

I think detection shouldn't be used for anything except plain text and
HTML being loaded into browsing context considering that we've managed
this far without it (well, except for FileReader).  (Note that when
not declaring the encoding on their own JavaScript and CSS inherit the
encoding of the HTML document that references them.)

> 3) Is there a limit to how many bytes we should look at?

In Gecko, the Basic Latin encoded as UTF-16BE or UTF-16LE check is run
on the first 1024 bytes.  For the other heuristic detections, there is
no limit and changing the encoding potentially causes renavigation to
the page.  During the Firefox for development cycle, there was a limit
of 1024 bytes (no renavigation!), but it was removed in order to
support the Japanese Planet Debian (site fixed since then) and other
unspecified but rumored Japanese sites.

On Sun, Apr 22, 2012 at 2:11 AM, Silvia Pfeiffer
<silviapfeiffer1 at gmail.com> wrote:
> We've had some discussion on the usefulness of this in WebVTT - mostly
> just in relation with HTML, though I am sure that stand-alone video
> players that decode WebVTT would find it useful, too.

WebVTT is a new format with no legacy. Instead of letting it become
infected with heuristic detection, we should go the other direction
and hardwire it as UTF-8 like we did with app cache manifests and
JSON-in-XHR.  No one should be creating new content in encodings other
than UTF-8. Those who can't be bothered to use The Encoding deserve
REPLACEMENT CHARACTERs. Heuristic detection is for unlabeled legacy
content.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/