[whatwg] Default encoding to UTF-8?

Henri Sivonen hsivonen at iki.fi
Tue Dec 6 23:58:26 PST 2011

On Mon, Dec 5, 2011 at 8:55 PM, Leif Halvard Silli
<xn--mlform-iua at xn--mlform-iua.no> wrote:
> When you say 'requires': Of course, HTML5 recommends that you declare
> the encoding (via HTTP/higher protocol, via the BOM 'sideshow' or via
> <meta charset=UTF-8>). I just now also discovered that Validator.nu
> issues an error message if it does not find any of those *and* the
> document contains non-ASCII. (I don't know, however, whether this error
> message is just something Henri added at his own discretion - it would
> be nice to have it literally in the spec too.)

I believe I was implementing exactly what the spec said at the time I
implemented that behavior of Validator.nu. I'm particularly convinced
that I was following the spec, because I think it's not the optimal
behavior. I think pages that don't declare their encoding should
always be non-conforming even if they only contain ASCII bytes,
because that way templates created by English-oriented (or
lorem-ipsum-oriented) authors would be caught as non-conforming before
non-ASCII text gets filled into them later. Hixie disagreed.
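
To make the two rules concrete, here is a rough sketch in Python. It is
not Validator.nu's actual code; the names has_declaration and
is_conforming are made up for illustration, and the <meta> detection is
a simplified regex rather than the spec's prescan algorithm. With
strict=False it models the behavior described above (undeclared
encoding is an error only when non-ASCII bytes are present); with
strict=True it models the stricter rule I would prefer.

```python
import re
from typing import Optional

# Byte order marks that count as an encoding declaration.
BOMS = (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")

# Simplified stand-in for the spec's <meta> prescan (assumption: a regex
# is good enough for illustration; a real validator parses the markup).
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=", re.IGNORECASE)


def has_declaration(body: bytes, http_charset: Optional[str] = None) -> bool:
    """True if the encoding is declared via HTTP, a BOM, or a <meta> tag."""
    if http_charset:
        return True
    if body.startswith(BOMS):
        return True
    # The <meta> declaration has to appear within the first 1024 bytes.
    return META_CHARSET.search(body[:1024]) is not None


def is_conforming(body: bytes, http_charset: Optional[str] = None,
                  strict: bool = False) -> bool:
    """Apply the encoding-declaration rule to a document's raw bytes."""
    if has_declaration(body, http_charset):
        return True
    if strict:
        # Stricter rule: an undeclared encoding is always an error.
        return False
    # Current rule: an undeclared encoding is tolerated for pure ASCII.
    return body.isascii()
```

An ASCII-only template with no declaration passes under strict=False
and fails under strict=True, which is exactly the case the stricter
rule is meant to catch early.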

> HTML5 says that validators *may* issue a warning if UTF-8 is *not* the
> encoding. But so far, validator.nu has not picked that up.

Maybe it should. However, non-UTF-8 pages that label their encoding,
that use one of the encodings that we won't be able to get rid of
anyway and that don't contain forms aren't actively harmful. (I'd
argue that they are *less* harmful than unlabeled UTF-8 pages.)
Non-UTF-8 is harmful in form submission. It would be more focused to
make the validator complain about labeled non-UTF-8 if the page
contains a form. Also, it could be useful to make Firefox whine to
console when a form is submitted in non-UTF-8 and when an HTML page
has no encoding label. (I'd much rather implement all these than
implement breaking changes to how Firefox processes legacy content.)
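
The form-focused validator check could look roughly like the Python
sketch below. The function name form_encoding_warning and the message
text are hypothetical, and the set of labels treated as UTF-8 is a
deliberate simplification; this is an illustration of the idea, not
something Validator.nu implements.

```python
from html.parser import HTMLParser
from typing import Optional

# Simplification: only the most common UTF-8 labels are recognized here.
UTF8_LABELS = {"utf-8", "utf8"}


class _FormFinder(HTMLParser):
    """Records whether any <form> start tag occurs in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.found_form = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "form":
            self.found_form = True


def form_encoding_warning(markup: str,
                          declared_label: Optional[str]) -> Optional[str]:
    """Return a warning for a labeled non-UTF-8 page that contains a form."""
    if declared_label is None:
        # Unlabeled pages are a separate problem, handled elsewhere.
        return None
    if declared_label.strip().lower() in UTF8_LABELS:
        return None
    finder = _FormFinder()
    finder.feed(markup)
    if finder.found_form:
        return ("Page is labeled %s and contains a form; "
                "form submission is safer in UTF-8." % declared_label)
    return None
```

A labeled windows-1252 page without a form produces no warning, which
matches the point that such pages are not actively harmful.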

>> We should also lobby for authoring tools (as recommended by HTML5) to
>> default their output to UTF-8 and make sure the encoding is declared.
> HTML5 already says: "Authoring tools should default to using UTF-8 for
> newly-created documents. [RFC3629]"
> http://dev.w3.org/html5/spec/semantics.html#charset

I think focusing your efforts on lobbying authoring tool vendors to
withhold the ability to save pages in non-UTF-8 encodings would be a
better way to promote UTF-8 than lobbying browser vendors to change
the defaults in ways that'd break locale-siloed Existing Content.

Henri Sivonen
hsivonen at iki.fi
