[whatwg] Internal character encoding declaration
Henri Sivonen
hsivonen at iki.fi
Sat Mar 11 07:10:31 PST 2006
On Mar 10, 2006, at 22:49, Ian Hickson wrote:
> I'm actually considering just requiring that UAs support rewinding (by
> defining the exact semantics of how to parse for the <meta> header).
> Is this something people would object to?
I think allowing in-place decoder change (when feasible) would be
good for performance.
>> I think it would be beneficial to additionally stipulate that
>> 1. The meta element-based character encoding information declaration
>> is expected to work only if the Basic Latin range of characters maps
>> to the same bytes as in the US-ASCII encoding.
>
> Is this realistic? I'm not really familiar enough with character
> encodings to say if this is what happens in general.
I suppose it is realistic. See below.
>> 2. If there is no external character encoding information nor a BOM
>> (see below), there MUST NOT be any non-ASCII bytes in the document
>> byte stream before the end of the meta element that declares the
>> character encoding. (In practice this would ban unescaped non-ASCII
>> class names on the html and [head] elements and non-ASCII comments
>> at the beginning of the document.)
>
> Again, can we realistically require this? I need to do some studies of
> non-latin pages, I guess.
As UA behavior, no. As a conformance requirement, maybe.
>>> Authors should avoid including inline character encoding
>>> information. Character encoding information should instead be
>>> included at the transport level (e.g. using the HTTP Content-Type
>>> header).
>>
>> I disagree.
>>
>> With HTML with contemporary UAs, there is no real harm in including
>> the character encoding information both on the HTTP level and in the
>> meta as long as the information is not contradictory. On the
>> contrary, the author-provided internal information is actually
>> useful when end users save pages to disk using UAs that do not
>> reserialize with internal character encoding information.
>
> ...and it breaks everything when you have a transcoding proxy, or
> similar.
Well, not until you save to disk, since HTTP takes precedence.
However, authors can escape this by using UTF-8. (Assuming here that
tampering with UTF-8 would be harmful, wrong and pointless.)
Interestingly, transcoding proxies tend to be brought up by residents
of Western Europe, North America or the Commonwealth. I have never
seen a Russian person living in Russia or a Japanese person living in
Japan talk about transcoding proxies in any online or offline
discussion. That's why I doubt the importance of transcoding proxies.
FWIW, I think Opera Mini is a distributed UA--not a proxy and a UA.
> Character encoding information shouldn't be duplicated, IMHO, that's
> just asking for trouble.
I suggest a mismatch be considered an easy parse error and,
therefore, reportable.
>>> For HTML, user agents must use the following algorithm in
>>> determining the character encoding of a document:
>>> 1. If the transport layer specifies an encoding, use that.
>>
>> Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only;
>> UTF-32 makes no practical sense for interchange on the Web.)
>
> I don't know, should there?
I believe there should.
>>> 2. Otherwise, if the user agent can find a meta element that
>>> specifies character encoding information (as described above), then
>>> use that.
>>
>> If a conformance checker has not determined the character encoding
>> by now, what should it do? Should it report the document as
>> non-conforming (my preferred choice)? Should it default to US-ASCII
>> and report any non-ASCII bytes as conformance errors? Should it
>> continue to the fuzzier steps like browsers would (hopefully not)?
>
> Again, I don't know.
I'll continue to treat such documents as non-conforming, then.
> Currently the behaviour is very underspecified here:
>
> http://whatwg.org/specs/web-apps/current-work/#documentEncoding
>
> I'd like to rewrite that bit. It will require a lot of research; of
> existing authoring practices, of current UAs, and of author needs.
> If anyone wants to step up and do the work, I'd be very happy to
> work with them and get something sorted out here.
Disclaimer: This is not based on reading the source of Gecko or
WebKit. Instead, this is based on quick research into character
encodings and on black-box testing of Firefox 1.5, Opera 9.0 preview
and Safari 2.0.3. Tests:
http://hsivonen.iki.fi/test/wa10/encoding-detection/
(c- means that I think it should be a conforming case and nc- means
that I think it should be a non-conforming case.)
It turns out that most character encodings have the property that in
the initial state of the decoder the bytes 0x20–0x7E (inclusive) as
well as 0x09, 0x0A and 0x0D decode to the Unicode code points of the
same (zero-extended) value. Character encodings that have this
property (hereafter "rough ASCII superset") include:
Big5
Big5-HKSCS
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM00858
IBM437
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM865
IBM866
IBM868
IBM869
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-10
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-16
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
KOI8-R
KOI8-U
MacRoman
Shift_JIS
TIS-620
US-ASCII
UTF-8
VISCII
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-ARMSCII
x-Big5-Solaris
x-EUC-TW
x-IBM1006
x-IBM1046
x-IBM1098
x-IBM1124
x-IBM1381
x-IBM1383
x-IBM737
x-IBM856
x-IBM874
x-IBM921
x-IBM922
x-IBM942C
x-IBM943C
x-IBM948
x-IBM949C
x-IBM950
x-IBM970
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-JISAutoDetect
x-Johab
x-MS950-HKSCS
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRomania
x-MacThai
x-MacTurkish
x-MacUkraine
x-PCK
x-euc-jp-linux
x-eucJP-Open
x-iso-8859-11
x-iso-8859-12
x-mswin-936
x-windows-874
x-windows-949
x-windows-950
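The "rough ASCII superset" property can be checked mechanically. A
minimal Python sketch (an illustration, not part of the original mail;
it relies on whatever codecs the runtime ships, so encoding names the
runtime lacks simply report False):

```python
# The bytes that must decode to their own (zero-extended) code points
# in the decoder's initial state: 0x20-0x7E plus tab, LF and CR.
ASCII_SAFE = bytes(range(0x20, 0x7F)) + b"\t\n\r"

def is_rough_ascii_superset(name: str) -> bool:
    """Return True if the named encoding decodes the ASCII-safe bytes
    to the same characters as US-ASCII does (from the initial state)."""
    try:
        decoded = ASCII_SAFE.decode(name)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec, or the safe bytes are not even decodable.
        return False
    return decoded == ASCII_SAFE.decode("ascii")
```

For example, UTF-16 fails the check because every character occupies
two bytes even in the ASCII range.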
Notable character encodings that I am aware of that do not have this
property are:
JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat
and x-MacSymbol, UTF-7, UTF-16 and UTF-32.
The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web
pages. After browsing the encoding menus of Firefox, Opera and
Safari, I'm pretty confident that the legacy IBM codepages are
irrelevant as well.
I suggest the following algorithm as a starting point. It does not
handle UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.
- -
Set the REWIND flag to unraised.
Read the first four bytes of the byte stream.
If the bytes constitute a big-endian UTF-32 BOM, set the character
encoding to big-endian UTF-32 and initialize the corresponding
decoder. The detection algorithm terminates.
If the bytes constitute a little-endian UTF-32 BOM, set the character
encoding to little-endian UTF-32 and initialize the corresponding
decoder. The detection algorithm terminates.
If the first two bytes constitute a big-endian UTF-16 BOM, set the
character encoding to big-endian UTF-16, unread the third and fourth
byte and initialize the corresponding decoder. The detection
algorithm terminates.
If the first two bytes constitute a little-endian UTF-16 BOM, set the
character encoding to little-endian UTF-16, unread the third and
fourth byte and initialize the corresponding decoder. The detection
algorithm terminates.
If the first three bytes constitute a UTF-8 BOM, set the character
encoding to UTF-8, unread the fourth byte and initialize the
corresponding decoder. The detection algorithm terminates.
If the bytes have the pattern 0x00, 0x00, 0x00, 0x00, emit a hard
parse error, unread the bytes and perform implementation-specific
heuristics. Set the character encoding to the output of the
heuristics. The detection algorithm terminates. (Note: need more
testing here.)
If the bytes have the pattern 0x00, 0x00, 0x00, NOT-0x00, set the
character encoding to UTF-32BE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection
algorithm terminates. (Note: need more testing here.)
If the bytes have the pattern NOT-0x00, 0x00, 0x00, 0x00, set the
character encoding to UTF-32LE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection
algorithm terminates. (Note: need more testing here.)
If the first two bytes have the pattern 0x00, NOT-0x00, set the
character encoding to UTF-16BE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection
algorithm terminates. (Note: need more testing here.)
If the first two bytes have the pattern NOT-0x00, 0x00, set the
character encoding to UTF-16LE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection
algorithm terminates. (Note: need more testing here.)
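The BOM checks above can be sketched as follows (a Python
illustration; the BOM byte values come from the Unicode standard, and
the UTF-32LE test must run before the UTF-16LE test because the
UTF-16LE BOM is a prefix of the UTF-32LE one):

```python
def sniff_bom(first4: bytes):
    """Return (encoding, bom_length) for the first four bytes of the
    stream, or (None, 0) if no BOM is present. The caller unreads the
    bytes past bom_length, as the algorithm above specifies."""
    if first4.startswith(b"\x00\x00\xfe\xff"):
        return ("UTF-32BE", 4)
    if first4.startswith(b"\xff\xfe\x00\x00"):
        return ("UTF-32LE", 4)          # must precede the UTF-16LE test
    if first4.startswith(b"\xfe\xff"):
        return ("UTF-16BE", 2)
    if first4.startswith(b"\xff\xfe"):
        return ("UTF-16LE", 2)
    if first4.startswith(b"\xef\xbb\xbf"):
        return ("UTF-8", 3)
    return (None, 0)
```

The 0x00-pattern heuristics for BOMless UTF-16/UTF-32 would follow the
(None, 0) case in the same order as in the prose above.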
Initialize a character decoder that decodes the bytes 0x20–0x7E
(inclusive) as well as 0x09, 0x0A and 0x0D to the Unicode code points
of the same (zero-extended) value, maps all other bytes to U+FFFD,
and raises the REWIND flag and emits an easy parse error when doing
so. If the UA supports in-place decoder switching (see below), the
decoder should not buffer and should only consume one byte of the
byte stream when one character is read from the decoder.
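A minimal sketch of such a pre-scan decoder (hypothetical class name;
one byte in, one character out, with the REWIND flag raised on the
first non-ASCII byte):

```python
# Bytes that pass through unchanged in the pre-scan decoder.
ASCII_SAFE = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

class PrescanDecoder:
    """Unbuffered pre-scan decoder: ASCII-safe bytes decode to
    themselves; any other byte becomes U+FFFD and raises REWIND."""

    def __init__(self):
        self.rewind = False

    def decode_byte(self, b: int) -> str:
        if b in ASCII_SAFE:
            return chr(b)
        # Non-ASCII byte before the encoding is known: easy parse
        # error in the algorithm above, and the stream may need to be
        # re-decoded from the start once the real encoding is found.
        self.rewind = True
        return "\ufffd"
```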
Start the HTML parser but do not execute scripts.
If a script start tag is seen and the UA supports scripting, raise
the REWIND flag and emit an easy parse error.
If a start tag other than html or head is seen, emit an easy parse
error.
If the end of the head element is seen, emit a hard parse error,
perform implementation-specific heuristics, tear down the DOM, rewind
the byte stream and restart the parser. The detection algorithm
terminates.
If a meta element is seen whose http-equiv attribute has the value
"Content-Type" (compared case-insensitively) and whose content
attribute value begins with "text/html; charset=", take the string in
the content attribute following "text/html; charset=", strip white
space from both ends and treat it as the tentative encoding name.
(Note: Safari allows spaces, line breaks and tabs around the
attribute values. Firefox allows spaces. Opera does not allow
anything extra.)
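For illustration, the extraction rule could be sketched like this
(hypothetical function name; this follows the strict reading of the
rule and does not model the extra white space Safari and Firefox
tolerate around the attribute values):

```python
def tentative_encoding(http_equiv: str, content: str):
    """Extract the tentative encoding name from a meta element's
    http-equiv and content attributes, or return None if the element
    does not declare one."""
    if http_equiv.lower() != "content-type":
        return None
    prefix = "text/html; charset="
    if not content.startswith(prefix):
        return None
    # White space is removed from both sides of the remainder.
    return content[len(prefix):].strip()
```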
If the tentative encoding name does not identify a rough ASCII
superset supported by the UA, emit a hard parse error and perform
implementation-specific heuristics. Set the character encoding to the
output of the heuristics. If the REWIND flag has been raised, rewind
the byte stream and tear down the DOM. If the REWIND flag has not
been raised and the heuristics yield a rough ASCII superset, either
change the decoder in place or rewind the byte stream, tear down the
DOM and restart the parser. (Changing in place is recommended.) The
detection algorithm terminates.
If the tentative encoding name identifies a rough ASCII superset
supported by the UA, set the character encoding to the tentative
encoding. If the REWIND flag has been raised, rewind the byte stream
and tear down the DOM. If the REWIND flag has not been raised, either
change the decoder in place or rewind the byte stream, tear down the
DOM and restart the parser. (Changing in place is recommended.) The
detection algorithm terminates.
Where performing implementation-specific heuristics is called for,
the UA may analyze the byte spectrum using statistical methods.
However, at minimum the UA must fall back on a user-chosen encoding
that is a rough ASCII superset. This user choice should default to
Windows-1252.
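A minimal sketch of that fallback, assuming a UA that does nothing
smarter than trying strict UTF-8 before the user-chosen default (a
real UA would do statistical byte-spectrum analysis instead; the
function name is hypothetical):

```python
def heuristic_encoding(data: bytes, user_choice: str = "windows-1252") -> str:
    """Degenerate implementation-specific heuristic: accept UTF-8 if
    the bytes decode cleanly, otherwise fall back on the user-chosen
    rough ASCII superset (defaulting to Windows-1252)."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return user_choice
```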
- -
Requirements I'd like to see:
Documents must specify a character encoding and must use an
IANA-registered encoding, identifying it either by its preferred MIME
name or with a BOM (for UTF-8, UTF-16 or UTF-32). UAs must recognize
the preferred MIME name of every encoding they support that has a
preferred MIME name. UAs should recognize IANA-registered aliases.
Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE
(i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from
the EBCDIC family of encodings. Documents using the UTF-16 or UTF-32
encodings must have a BOM.
UAs must support the UTF-8 encoding.
Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)
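For example, Python's "replace" error handler implements exactly this
U+FFFD substitution:

```python
# Per the rule above, an encoding error is an easy parse error and the
# decoder emits U+FFFD for the bogus data: each invalid byte in the
# stream becomes one replacement character.
bogus = b"abc\xff\xfe!"
decoded = bogus.decode("utf-8", errors="replace")
print(decoded)  # "abc\ufffd\ufffd!"
```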
Authors are advised to use the UTF-8 encoding. Authors are advised
not to use the UTF-32 encoding or legacy encodings. (Note: I think
UTF-32 on the Web is harmful and utterly pointless, but Firefox and
Opera support it. Also, I'd like to have some text in the spec that
justifies whining about legacy encodings. On the XML side, I give
warnings if the encoding is not UTF-8, UTF-16, US-ASCII or
ISO-8859-1. I also warn about aliases and potential trouble with RFC
3023 rules. However, I have no spec backing for treating dangerous
RFC 3023 stuff as errors.)
- -
Also, the spec should probably give guidance on what encodings need
to be supported. That set should include at least UTF-8, US-ASCII,
ISO-8859-1 and Windows-1252. It should probably not be larger than
the intersection of the sets of encodings supported by Firefox,
Opera, Safari and IE6. (It might even be useful to intersect that set
with the encodings supported by JDK and Python by default.)
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/