[whatwg] Internal character encoding declaration
hsivonen at iki.fi
Mon Aug 8 11:42:27 PDT 2005
Quoting from the WA1 draft, "Specifying and establishing the
document's character encoding":
> The meta element may also be used, in HTML only (not in XHTML) to
> provide UAs with character encoding information for the file. To do
> this, the meta element must be the first element in the head element,
To cater for implementations that consume the byte stream only once in
all cases and do not rewind the input and restart the parser upon
discovering the meta, I think it would be beneficial to additionally
stipulate the following:
1. The meta element-based character encoding information declaration is
expected to work only if the Basic Latin range of characters maps to
the same bytes as in the US-ASCII encoding.
2. If there is no external character encoding information nor a BOM
(see below), there MUST NOT be any non-ASCII bytes in the document byte
stream before the end of the meta element that declares the character
encoding. (In practice this would ban unescaped non-ASCII class names
on the html and body elements and non-ASCII comments at the beginning
of the document.)
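The proposed rule 2 could be checked mechanically. The sketch below is one illustrative interpretation, not wording from any spec; the function name and the heuristic of scanning up to the tag that carries the charset declaration are my own assumptions.

```python
# Sketch of proposed rule 2: with no external encoding information and
# no BOM, every byte up to the end of the meta element that declares the
# encoding must be ASCII (i.e. below 0x80).

def ascii_prefix_ok(stream: bytes) -> bool:
    """Return True if no non-ASCII byte occurs before the end of the
    meta element that declares the character encoding."""
    pos = stream.lower().find(b"charset=")
    if pos == -1:
        return True  # no internal declaration; the rule does not apply
    close = stream.find(b">", pos)  # the '>' closing the meta tag
    end = close + 1 if close != -1 else len(stream)
    return all(b < 0x80 for b in stream[:end])

# An unescaped non-ASCII comment before the meta violates the rule:
doc = ('<!-- sp\u00e4m -->'
       '<meta http-equiv=Content-Type content="text/html; charset=utf-8">'
       ).encode("utf-8")
print(ascii_prefix_ok(doc))  # False: the comment contains non-ASCII bytes
```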
> it must have the http-equiv attribute set to the literal value
I think case-insensitivity should be allowed in the string
"Content-Type", because there is legacy precedent for that and HTTP
defines header names as case-insensitive.
> and must have the content attribute set to the literal value
> text/html; charset=
That string should be case-insensitive as well, because HTTP defines it
case-insensitive. Also, should zero or more white space characters be
allowed before ';' and around '=' and should the space after ';' be one
or more white space characters? HTTP-wise yes, but would it lead to
real-world incompatibilities? (I have not tested.)
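One lenient reading of the above, with case-insensitive matching and zero or more white space characters around ';' and '=', could look like the following regular expression. This is only a sketch of the interpretation I am asking about, not spec text, and as noted I have not tested it against legacy content.

```python
import re

# Lenient, HTTP-style match for the meta content attribute value:
# case-insensitive "text/html" and "charset", optional white space
# before ';' and around '='.
CONTENT_RE = re.compile(
    r"^text/html\s*;\s*charset\s*=\s*(?P<enc>[^\s;]+)$",
    re.IGNORECASE,
)

for value in ("text/html; charset=UTF-8",
              "TEXT/HTML ;charset = windows-1252"):
    m = CONTENT_RE.match(value)
    print(m.group("enc") if m else None)
```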
> immediately followed by the character encoding, which must be a valid
> character encoding name. [IANACHARSET] When the meta element is used
> in this way, there must be no other attributes set on the element.
> Other than for giving the document's character encoding in this way,
> the http-equiv attribute must not be used.
> In XHTML, the XML declaration should be used for inline character
> encoding information.
> Authors should avoid including inline character encoding information.
> Character encoding information should instead be included at the
> transport level (e.g. using the HTTP Content-Type header).
With HTML in contemporary UAs, there is no real harm in including the
character encoding information both on the HTTP level and in the meta
as long as the information is not contradictory. On the contrary, the
author-provided internal information is actually useful when end users
save pages to disk using UAs that do not reserialize with internal
character encoding information.
With XML, there is a robust method for identifying the character
encoding internally. When the encoding is explicit, the sniffing is
also interoperably implemented. (Unfortunately, for the BOMless
implicit case, see http://bugzilla.opendarwin.org/show_bug.cgi?id=3809
. Gecko used to have the same bug.) RFC 3023's insistence on declaring
the encoding authoritatively outside the XML byte stream itself is, in
my opinion, as silly as insisting on declaring the compression method
of a zip archive authoritatively on the HTTP level instead of using the
information stored in the file.
The TAG has found: "Thus there is no ambiguity when the charset is
omitted, and the STRONGLY RECOMMENDED injunction [of RFC 3023] to use
the charset is misplaced for application/xml and for non-text "+xml"
media types."
> For HTML, user agents must use the following algorithm in determining
> the character encoding of a document:
> 1. If the transport layer specifies an encoding, use that.
Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only;
UTF-32 makes no practical sense for interchange on the Web.)
> 2. Otherwise, if the user agent can find a meta element that specifies
> character encoding information (as described above), then use that.
If a conformance checker has not determined the character encoding by
now, what should it do? Should it report the document as non-conforming
(my preferred choice)? Should it default to US-ASCII and report any
non-ASCII bytes as conformance errors? Should it continue to the
fuzzier steps like browsers would (hopefully not)?
> 3. Otherwise, if the user agent can autodetect the character encoding
> from applying frequency analysis or other algorithms to the data
> stream, then use that.
> 4. Otherwise, use an implementation-defined or user-specified default
> character encoding (ISO-8859-1, windows-1252, and UTF-8 are
> recommended as defaults, and can in many cases be identified by
> inspection as they have different ranges of valid bytes).
I think it does not make sense to recommend ISO-8859-1, because
windows-1252 is always a better guess in practice. In the context of
HTML, UTF-8 looks like a weird default considering years of precedent
with the de facto windows-1252 default. (Of course, if the UA is
willing to examine the entire byte stream before parsing, UTF-8 can be
detected very reliably.)
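To illustrate the "identified by inspection" point: UTF-8 has a strict byte structure, so a whole-stream validity check detects it very reliably, whereas every byte sequence is "valid" windows-1252. A sketch, using Python's built-in decoder as the validity test (a streaming UA would implement the state machine directly; the fallback choice here reflects my argument above, not the draft's wording):

```python
# Guess between UTF-8 and windows-1252 by inspecting the whole stream:
# if it decodes as UTF-8 without error, UTF-8 is almost certainly right;
# otherwise fall back to the de facto legacy default, windows-1252
# (never ISO-8859-1). Note that pure ASCII input reports UTF-8, which is
# harmless since ASCII bytes mean the same thing in both encodings.

def guess_by_inspection(stream: bytes) -> str:
    try:
        stream.decode("utf-8", errors="strict")
        return "UTF-8"
    except UnicodeDecodeError:
        return "windows-1252"

print(guess_by_inspection("na\u00efve".encode("utf-8")))        # UTF-8
print(guess_by_inspection("na\u00efve".encode("windows-1252"))) # windows-1252
```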