[whatwg] Internal character encoding declaration, Drop UTF-32, and UTF and BOM terminology
Ian Hickson
ian at hixie.ch
Sat Jun 23 02:35:51 PDT 2007
On Sat, 11 Mar 2006, Henri Sivonen wrote:
>
> I think allowing in-place decoder change (when feasible) would be good
> for performance.
Done.
> > > I think it would be beneficial to additionally stipulate that
> > >
> > > 1. The meta element-based character encoding information declaration
> > > is expected to work only if the Basic Latin range of characters maps
> > > to the same bytes as in the US-ASCII encoding.
> >
> > Is this realistic? I'm not really familiar enough with character
> > encodings to say if this is what happens in general.
>
> I suppose it is realistic. See below.
That was already there, turns out.
> > > 2. If there is no external character encoding information nor a BOM
> > > (see below), there MUST NOT be any non-ASCII bytes in the document
> > > byte stream before the end of the meta element that declares the
> > > character encoding. (In practice this would ban unescaped non-ASCII
> > > class names on the html and [head] elements and non-ASCII comments
> > > at the beginning of the document.)
> >
> > Again, can we realistically require this? I need to do some studies of
> > non-latin pages, I guess.
>
> As UA behavior, no. As a conformance requirement, maybe.
I don't think we should require this, given the preparse step. I can add it
if people think we should, though.
> > > > Authors should avoid including inline character encoding
> > > > information. Character encoding information should instead be
> > > > included at the transport level (e.g. using the HTTP Content-Type
> > > > header).
> > >
> > > I disagree.
> > >
> > > With HTML with contemporary UAs, there is no real harm in including
> > > the character encoding information both on the HTTP level and in the
> > > meta as long as the information is not contradictory. On the
> > > contrary, the author-provided internal information is actually
> > > useful when end users save pages to disk using UAs that do not
> > > reserialize with internal character encoding information.
> >
> > ...and it breaks everything when you have a transcoding proxy, or
> > similar.
>
> Well, not until you save to disk, since HTTP takes precedence. However,
> authors can escape this by using UTF-8. (Assuming here that tampering
> with UTF-8 would be harmful, wrong and pointless.)
>
> Interestingly, transcoding proxies tend to be brought up by residents of
> Western Europe, North America or the Commonwealth. I have never seen a
> Russian person living in Russia or a Japanese person living in Japan
> talk about transcoding proxies in any online or offline discussion.
> That's why I doubt the importance of transcoding proxies.
I think this discouragement has been removed now. Let me know if it lives
on somewhere.
> > Character encoding information shouldn't be duplicated, IMHO, that's
> > just asking for trouble.
>
> I suggest a mismatch be considered an easy parse error and, therefore,
> reportable.
I believe this is required in the spec.
> > > > For HTML, user agents must use the following algorithm in
> > > > determining the character encoding of a document:
> > > > 1. If the transport layer specifies an encoding, use that.
> > >
> > > Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only;
> > > UTF-32 makes no practical sense for interchange on the Web.)
> >
> > I don't know, should there?
>
> I believe there should.
There's a BOM step in the spec; let me know if you think it's in the wrong
place.
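For illustration, here is a minimal sketch (in Python) of a BOM-sniffing
step covering just UTF-8 and UTF-16, per the suggestion above; the
function name and return convention are made up for this example, not
taken from the spec:

    def sniff_bom(data: bytes):
        """Return (encoding, BOM length) if the byte stream starts with
        a recognised BOM, otherwise (None, 0)."""
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8", 3
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be", 2
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le", 2
        return None, 0

    assert sniff_bom(b"\xef\xbb\xbfhello") == ("utf-8", 3)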
> > > > 2. Otherwise, if the user agent can find a meta element that
> > > > specifies character encoding information (as described above),
> > > > then use that.
> > >
> > > If a conformance checker has not determined the character encoding
> > > by now, what should it do? Should it report the document as
> > > non-conforming (my preferred choice)? Should it default to US-ASCII
> > > and report any non-ASCII bytes as conformance errors? Should it
> > > continue to the fuzzier steps like browsers would (hopefully not)?
> >
> > Again, I don't know.
>
> I'll continue to treat such documents as non-conforming, then.
I've made it non-conforming to not use ASCII if you've got no encoding
information and no BOM.
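In conformance-checker terms that rule is basically a one-liner; a sketch,
assuming the checker has already established that there is no
transport-level encoding information, no BOM and no meta declaration:

    def conforms_without_declaration(data: bytes) -> bool:
        # With no encoding information at all, only a pure-ASCII byte
        # stream is conforming.
        return all(b < 0x80 for b in data)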
> Notably, character encodings that I am aware of and [aren't
> ASCII-compatible] are:
>
> JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat
> and x-MacSymbol, UTF-7, UTF-16 and UTF-32.
>
> The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages.
> After browsing the encoding menus of Firefox, Opera and Safari, I'm
> pretty confident that the legacy IBM codepages are irrelevant as well.
>
> I suggest the following algorithm as a starting point. It does not handle
> UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.
I've made those either MUST NOTs or SHOULD NOTs, amongst others.
> Set the REWIND flag to unraised.
The REWIND idea sadly doesn't work very well given that you can actually
have things like javascript: URIs and event handlers that execute on
content in the <head>, in pathological cases.
However, I did something similar in the spec as it stands now.
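What the spec does instead is, roughly, a prescan of an initial window of
bytes. A rough sketch of that idea follows; the 1024-byte window and the
deliberately naive regex are assumptions for this example, not the spec's
actual algorithm:

    import re

    META_CHARSET = re.compile(
        rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.I)

    def prescan(data: bytes, window: int = 1024):
        """Look for a meta-declared encoding in the first `window` bytes."""
        m = META_CHARSET.search(data[:window])
        return m.group(1).decode("ascii").lower() if m else None

    def decode_document(data: bytes, fallback: str = "windows-1252") -> str:
        encoding = prescan(data) or fallback
        try:
            return data.decode(encoding, errors="replace")
        except LookupError:       # unrecognised label: ignore it
            return data.decode(fallback, errors="replace")

If a declaration only turns up after the window, the idea is to change the
decoder in place where that is safe, or start over, rather than rewinding
a parse whose scripts may already have run.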
> Requirements I'd like to see:
>
> Documents must specify a character encoding and must use an
> IANA-registered encoding and must identify it using its preferred MIME
> name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize the
> preferred MIME name of every encoding they support that has a preferred
> MIME name. UAs should recognize IANA-registered aliases.
Done.
> Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e.
> BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC
> family of encodings. Documents using the UTF-16 or UTF-32 encodings must
> have a BOM.
Done except for UTF-16BE and UTF-16LE, though you might want to check
that the spec says exactly what you want.
> UAs must support the UTF-8 encoding.
Done.
> Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)
Encoding errors are covered by the encoding specs.
> Authors are advised to use the UTF-8 encoding. Authors are advised not
> to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on
> the Web is harmful and utterly pointless, but Firefox and Opera support
> it. Also, I'd like to have some text in the spec that justifies whining
> about legacy encodings. On the XML side, I give warnings if the encoding
> is not UTF-8, UTF-16, US-ASCII or ISO-8859-1. I also warn about aliases
> and potential trouble with RFC 3023 rules. However, I have no spec
> backing for treating dangerous RFC 3023 stuff as errors.)
Done, except for the RFC 3023 stuff. Could you elaborate on that? I don't
really have anything about encodings and XML in the spec.
> Also, the spec should probably give guidance on what encodings need to
> be supported. That set should include at least UTF-8, US-ASCII,
> ISO-8859-1 and Windows-1252. It should probably not be larger than the
> intersection of the sets of encodings supported by Firefox, Opera,
> Safari and IE6. (It might even be useful to intersect that set with the
> encodings supported by JDK and Python by default.)
Made it just UTF-8 and Win1252.
On Sat, 11 Mar 2006, Henri Sivonen wrote:
> On Mar 11, 2006, at 17:10, Henri Sivonen wrote:
>
> > Where performing implementation-specific heuristics is called for, the
> > UA may analyze the byte spectrum using statistical methods. However,
> > at minimum the UA must fall back on a user-chosen encoding that is
> a rough ASCII superset. This user choice should default to Windows-1252.
>
> This will violate Charmod, but what can you do?
Indeed. (The HTML5 spec says the above.)
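Putting the steps discussed in this thread together, the priority order is
roughly as follows (a sketch only, reusing the sniff_bom and prescan
helpers from the earlier examples; not the spec's exact algorithm):

    def determine_encoding(transport_charset, data: bytes,
                           user_default: str = "windows-1252") -> str:
        if transport_charset:               # 1. transport layer wins
            return transport_charset
        bom_encoding, _ = sniff_bom(data)   # 2. then the BOM
        if bom_encoding:
            return bom_encoding
        declared = prescan(data)            # 3. then a meta declaration
        if declared:
            return declared
        # 4. implementation-specific heuristics could go here
        return user_default                 # 5. user-chosen default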
On Sun, 12 Mar 2006, Henri Sivonen wrote:
>
> On further reflection, it occurred to me that emitting the Windows-1252
> characters instead of U+FFFD would be a good optimization for the common
> case where the encoding later turns out to be Windows-1252 or
> ISO-8859-1. This would require more than one bookkeeping flag, though.
Required, always.
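A sketch of that optimization; the single bookkeeping value shown here
(whether any non-ASCII byte was seen) is the bare minimum, and as noted a
real implementation would need more:

    def tentative_decode(data: bytes):
        # Decode optimistically as Windows-1252 while the encoding is
        # still undetermined; if non-ASCII bytes were seen and the
        # encoding later turns out to be something else, the consumer
        # must re-decode from the start.
        saw_non_ascii = any(b >= 0x80 for b in data)
        return data.decode("windows-1252", errors="replace"), saw_non_ascii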
On Sun, 12 Mar 2006, Henri Sivonen wrote:
>
> For ISO-8859-* family encodings that have a corresponding Windows-*
> family superset (e.g. Windows-1252 for ISO-8859-1) the UA must use the
> Windows-* family superset decoder instead of the ISO-8859-* family
> decoder. However, any bytes in the 0x80–0x9F (inclusive) are easy
> parse errors.
That isn't what the spec says, but I have other outstanding comments on
this to deal with still.
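For reference, a sketch of the superset substitution being proposed; only
the ISO-8859-1 to Windows-1252 pairing comes from this thread, and the
second entry is a comparable pairing added here purely as an assumption:

    SUPERSET = {
        "iso-8859-1": "windows-1252",
        "iso-8859-9": "windows-1254",  # assumption: the analogous Turkish pair
    }

    def effective_decoder(label: str) -> str:
        return SUPERSET.get(label.lower(), label)

    def c1_byte_offsets(data: bytes):
        # Offsets of 0x80-0x9F bytes, which the proposal would report as
        # easy parse errors even though the Windows-* decoder maps them.
        return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9f]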
> I would like the spec to say that if the page has forms, using an
> encoding other than UTF-8 is trouble. And even for pages that don't have
> forms, using an encoding that is not known to be extremely well
> supported introduces incompatibility for no good reason.
Does the current text (which doesn't mention forms) satisfy you?
On Tue, 14 Mar 2006, Lachlan Hunt wrote:
>
> This will need to handle common mistakes such as the following:
>
> <meta ... content="application/xhtml+xml;charset=X">
> <meta ... content="foo/bar;charset=X">
> <meta ... content="foo/bar;charset='X'">
> <meta ... content="charset=X">
> <meta ... charset="X">
The ones that matter are now in the spec, as far as I can tell.
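For illustration, a tolerant extractor covering the mistakes listed above;
this regex is only an approximation for the example, since the spec
defines its own extraction algorithm rather than a pattern match:

    import re

    CHARSET_IN_CONTENT = re.compile(
        r"charset\s*=\s*['\"]?\s*([A-Za-z0-9_.:-]+)", re.I)

    def charset_from_content(content: str):
        m = CHARSET_IN_CONTENT.search(content)
        return m.group(1) if m else None

    assert charset_from_content("application/xhtml+xml;charset=X") == "X"
    assert charset_from_content("foo/bar;charset='X'") == "X"
    assert charset_from_content("charset=X") == "X"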
On Tue, 14 Mar 2006, Peter Karlsson wrote:
> >
> > Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE
> > (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the
> > EBCDIC family of encodings. Documents using the UTF-16 or UTF-32
> > encodings must have a BOM.
>
> I don't think forbidding BOCU-1 is a good idea. If there is ever a
> proper specification written for it, it could be very useful as a
> compression format for documents.
BOCU-1 has been used for security attacks. It's on the "no fly" list.
On Tue, 15 May 2007, Michael Day wrote:
>
> Suggestion: drop UTF-32 from the character encoding detection section of
> HTML5, and even better, discourage or forbid user agents from
> implementing support for UTF-32.
Done.
On Wed, 16 May 2007, Geoffrey Sneddon wrote:
>
> Including it in a few encoding detection algorithms is no big deal for us
> implementers: as the spec stands we aren't required to support it
> anyway. All the spec requires is that we include it within our encoding
> detections (so, if we don't support it, we can then reject it).
Right now it's not even being detected by the spec.
On Mon, 4 Jun 2007, Henri Sivonen wrote:
>
> What's the right thing for an implementation to do when UTF-32 is not
> supported? Decode as Windows-1252? Does that make sense?
That's basically what the spec requires now.
On Mon, 4 Jun 2007, Alexey Feldgendler wrote:
>
> Seems like a general question: what's the right thing to do when the
> document's encoding is not supported? There isn't a reasonable fallback
> for every encoding.
The spec right now requires UAs to ignore <meta charset=""> declarations
they don't understand.
On Mon, 4 Jun 2007, Henri Sivonen wrote:
>
> I think it is perfectly reasonable to make support for UTF-8 and
> Windows-1252 part of UA conformance requirements. After all, a piece of
> software that doesn't support those two really has no business
> pretending to be a UA for the World Wide Web. Not supporting
> Windows-1252 based on "local market" arguments is serious
> walled-gardenism.
Indeed.
On Mon, 4 Jun 2007, Alexey Feldgendler wrote:
>
> On the other hand, declaring Windows-1252 as the default encoding is
> monoculturalism. For example, in Russia, whenever Windows-1252 is
> chosen, it is definitely a wrong choice. It's never used in Russia
> because it doesn't contain Cyrillic letters. A default of Windows-1251
> or KOI8-R might be reasonable in Russia, though none of them is a 100%
> safe guess.
The spec allows any guess.
On Sun, 27 May 2007, Henri Sivonen wrote:
>
> "If the encoding is one of UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or
> UTF-32LE, then authors can use a BOM at the start of the file to
> indicate the character encoding."
>
> That sentence should read:
That sentence is now gone. The "writing HTML" section generically allows
leading BOMs regardless of character encoding.
> The encoding labels with LE or BE in them mean BOMless variants where
> the encoding label on the transfer protocol level gives the endianness.
> See http://www.ietf.org/rfc/rfc2781.txt When the spec refers to UTF-16
> with BOM in a particular endianness, I think the spec should use
> "big-endian UTF-16" and "little-endian UTF-16".
>
> Since declaring endianness on the transfer protocol level has no benefit
> over using the BOM when the label is right and there's a chance to get
> the label wrong, the encoding labels with explicit endianness are
> harmful for interchange. In my opinion, the spec should avoid giving
> authors any bad ideas by reinforcing these labels by repetition.
If you know the encoding before going in (e.g. it's in Content-Type
metadata) then if the BOM is correctly encoded you just ignore it, and if
it's incorrectly encoded then you won't see it as a BOM and you'll
probably treat it as U+FFFD. From an authoring standpoint, especially
given that tools now tend to output BOMs silently (e.g. Notepad), and
*especially* considering that a BOM is invisible, it would just be a pain
to have to take out the first character in certain cases. No?
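In code, the behaviour I'm describing is about this simple (a sketch, with
errors="replace" standing in for whatever error handling the UA does):

    def decode_with_known_encoding(data: bytes, encoding: str) -> str:
        # A correctly encoded BOM just decodes to U+FEFF and gets
        # dropped; a BOM in the wrong encoding decodes as ordinary
        # (probably garbage) characters and is left alone.
        text = data.decode(encoding, errors="replace")
        return text[1:] if text.startswith("\ufeff") else text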
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'