[whatwg] Internal character encoding declaration, Drop UTF-32, and UTF and BOM terminology
Ian Hickson
ian at hixie.ch
Sat Jun 23 02:35:51 PDT 2007
On Sat, 11 Mar 2006, Henri Sivonen wrote:
>
> I think allowing in-place decoder change (when feasible) would be good
> for performance.
Done.
> > > I think it would be beneficial to additionally stipulate that
> > >
> > > 1. The meta element-based character encoding information declaration
> > > is expected to work only if the Basic Latin range of characters maps
> > > to the same bytes as in the US-ASCII encoding.
> >
> > Is this realistic? I'm not really familiar enough with character
> > encodings to say if this is what happens in general.
>
> I suppose it is realistic. See below.
That was already there, turns out.
> > > 2. If there is no external character encoding information nor a BOM
> > > (see below), there MUST NOT be any non-ASCII bytes in the document
> > > byte stream before the end of the meta element that declares the
> > > character encoding. (In practice this would ban unescaped non-ASCII
> > > class names on the html and [head] elements and non-ASCII comments
> > > at the beginning of the document.)
> >
> > Again, can we realistically require this? I need to do some studies of
> > non-latin pages, I guess.
>
> As UA behavior, no. As a conformance requirement, maybe.
I don't think we should require this, given the preparse step. I can add it
if people think we should, though.
> > > > Authors should avoid including inline character encoding
> > > > information. Character encoding information should instead be
> > > > included at the transport level (e.g. using the HTTP Content-Type
> > > > header).
> > >
> > > I disagree.
> > >
> > > With HTML with contemporary UAs, there is no real harm in including
> > > the character encoding information both on the HTTP level and in the
> > > meta as long as the information is not contradictory. On the
> > > contrary, the author-provided internal information is actually
> > > useful when end users save pages to disk using UAs that do not
> > > reserialize with internal character encoding information.
> >
> > ...and it breaks everything when you have a transcoding proxy, or
> > similar.
>
> Well, not until you save to disk, since HTTP takes precedence. However,
> authors can escape this by using UTF-8. (Assuming here that tampering
> with UTF-8 would be harmful, wrong and pointless.)
>
> Interestingly, transcoding proxies tend to be brought up by residents of
> Western Europe, North America or the Commonwealth. I have never seen a
> Russian person living in Russia or a Japanese person living in Japan
> talk about transcoding proxies in any online or offline discussion.
> That's why I doubt the importance of transcoding proxies.
I think this discouragement has been removed now. Let me know if it lives
on somewhere.
> > Character encoding information shouldn't be duplicated, IMHO, that's
> > just asking for trouble.
>
> I suggest a mismatch be considered an easy parse error and, therefore,
> reportable.
I believe this is required in the spec.
> > > > For HTML, user agents must use the following algorithm in
> > > > determining the character encoding of a document:
> > > > 1. If the transport layer specifies an encoding, use that.
> > >
> > > Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only;
> > > UTF-32 makes no practical sense for interchange on the Web.)
> >
> > I don't know, should there?
>
> I believe there should.
There's a BOM step in the spec; let me know if you think it's in the wrong
place.
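For illustration, here is a minimal sketch (in Python) of a BOM-sniffing
step covering just UTF-8 and UTF-16, per the suggestion above; the
function name and return convention are made up for this example, not
taken from the spec:

    def sniff_bom(data: bytes):
        """Return (encoding, BOM length) if the byte stream starts with
        a recognised BOM, otherwise (None, 0)."""
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8", 3
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be", 2
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le", 2
        return None, 0

    assert sniff_bom(b"\xef\xbb\xbfhello") == ("utf-8", 3)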
> > > > 2. Otherwise, if the user agent can find a meta element that
> > > > specifies character encoding information (as described above),
> > > > then use that.
> > >
> > > If a conformance checker has not determined the character encoding
> > > by now, what should it do? Should it report the document as
> > > non-conforming (my preferred choice)? Should it default to US-ASCII
> > > and report any non-ASCII bytes as conformance errors? Should it
> > > continue to the fuzzier steps like browsers would (hopefully not)?
> >
> > Again, I don't know.
>
> I'll continue to treat such documents as non-conforming, then.
I've made it non-conforming to not use ASCII if you've got no encoding
information and no BOM.
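In conformance-checker terms that rule is basically a one-liner; a sketch,
assuming the checker has already established that there is no
transport-level encoding information, no BOM and no meta declaration:

    def conforms_without_declaration(data: bytes) -> bool:
        # With no encoding information at all, only a pure-ASCII byte
        # stream is conforming.
        return all(b < 0x80 for b in data)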
> Notably, character encodings that I am aware of and [aren't
> ASCII-compatible] are:
>
> JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat
> and x-MacSymbol, UTF-7, UTF-16 and UTF-32.
>
> The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages.
> After browsing the encoding menus of Firefox, Opera and Safari, I'm
> pretty confident that the legacy IBM codepages are irrelevant as well.
>
> I suggest the following algorithm as a starting point. It does not handle
> UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.
I've made those either MUST NOTs or SHOULD NOTs, amongst others.
> Set the REWIND flag to unraised.
The REWIND idea sadly doesn't work very well given that you can actually
have things like javascript: URIs and event handlers that execute on
content in the <head>, in pathological cases.
However, I did something similar in the spec as it stands now.
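What the spec does instead is, roughly, a prescan of an initial window of
bytes. A rough sketch of that idea follows; the 1024-byte window and the
deliberately naive regex are assumptions for this example, not the spec's
actual algorithm:

    import re

    META_CHARSET = re.compile(
        rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.I)

    def prescan(data: bytes, window: int = 1024):
        """Look for a meta-declared encoding in the first `window` bytes."""
        m = META_CHARSET.search(data[:window])
        return m.group(1).decode("ascii").lower() if m else None

    def decode_document(data: bytes, fallback: str = "windows-1252") -> str:
        encoding = prescan(data) or fallback
        try:
            return data.decode(encoding, errors="replace")
        except LookupError:       # unrecognised label: ignore it
            return data.decode(fallback, errors="replace")

If a declaration only turns up after the window, the idea is to change the
decoder in place where that is safe, or start over, rather than rewinding
a parse whose scripts may already have run.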
> Requirements I'd like to see:
>
> Documents must specify a character encoding and must use an
> IANA-registered encoding and must identify it using its preferred MIME
> name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize the
> preferred MIME name of every encoding they support that has a preferred
> MIME name. UAs should recognize IANA-registered aliases.
Done.
> Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e.
> BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC
> family of encodings. Documents using the UTF-16 or UTF-32 encodings must
> have a BOM.
Done except for UTF-16BE and UTF-16LE, though you might want to check
that the spec says exactly what you want.
> UAs must support the UTF-8 encoding.
Done.
> Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)
Encoding errors are covered by the encoding specs.
> Authors are advised to use the UTF-8 encoding. Authors are advised not
> to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on
> the Web is harmful and utterly pointless, but Firefox and Opera support
> it. Also, I'd like to have some text in the spec that justifies whining
> about legacy encodings. On the XML side, I give warnings if the encoding
> is not UTF-8, UTF-16, US-ASCII or ISO-8859-1. I also warn about aliases
> and potential trouble with RFC 3023 rules. However, I have no spec
> backing for treating dangerous RFC 3023 stuff as errors.)
Done, except for the RFC 3023 stuff. Could you elaborate on that? I don't
really have anything about encodings and XML in the spec.
> Also, the spec should probably give guidance on what encodings need to
> be supported. That set should include at least UTF-8, US-ASCII,
> ISO-8859-1 and Windows-1252. It should probably not be larger than the
> intersection of the sets of encodings supported by Firefox, Opera,
> Safari and IE6. (It might even be useful to intersect that set with the
> encodings supported by JDK and Python by default.)
Made it just UTF-8 and Win1252.
On Sat, 11 Mar 2006, Henri Sivonen wrote:
> On Mar 11, 2006, at 17:10, Henri Sivonen wrote:
>
> > Where performing implementation-specific heuristics is called for, the
> > UA may analyze the byte spectrum using statistical methods. However,
> > at minimum the UA must fall back on a user-chosen encoding that is
> a rough ASCII superset. This user choice should default to Windows-1252.
>
> This will violate Charmod, but what can you do?
Indeed. (The HTML5 spec says the above.)
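Putting the steps discussed in this thread together, the priority order is
roughly as follows (a sketch only, reusing the sniff_bom and prescan
helpers from the earlier examples; not the spec's exact algorithm):

    def determine_encoding(transport_charset, data: bytes,
                           user_default: str = "windows-1252") -> str:
        if transport_charset:               # 1. transport layer wins
            return transport_charset
        bom_encoding, _ = sniff_bom(data)   # 2. then the BOM
        if bom_encoding:
            return bom_encoding
        declared = prescan(data)            # 3. then a meta declaration
        if declared:
            return declared
        # 4. implementation-specific heuristics could go here
        return user_default                 # 5. user-chosen default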
On Sun, 12 Mar 2006, Henri Sivonen wrote:
>
> On further reflection, it occurred to me that emitting the Windows-1252
> characters instead of U+FFFD would be a good optimization for the common
> case where the encoding later turns out to be Windows-1252 or
> ISO-8859-1. This would require more than one bookkeeping flag, though.
Required, always.
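A sketch of that optimization; the single bookkeeping value shown here
(whether any non-ASCII byte was seen) is the bare minimum, and as noted a
real implementation would need more:

    def tentative_decode(data: bytes):
        # Decode optimistically as Windows-1252 while the encoding is
        # still undetermined; if non-ASCII bytes were seen and the
        # encoding later turns out to be something else, the consumer
        # must re-decode from the start.
        saw_non_ascii = any(b >= 0x80 for b in data)
        return data.decode("windows-1252", errors="replace"), saw_non_ascii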
On Sun, 12 Mar 2006, Henri Sivonen wrote:
>
> For ISO-8859-* family encodings that have a corresponding Windows-*
> family superset (e.g. Windows-1252 for ISO-8859-1) the UA must use the
> Windows-* family superset decoder instead of the ISO-8859-* family
> decoder. However, any bytes in the 0x80–0x9F (inclusive) are easy
> parse errors.
That isn't what the spec says, but I have other outstanding comments on
this to deal with still.
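For reference, a sketch of the superset substitution being proposed; only
the ISO-8859-1 to Windows-1252 pairing comes from this thread, and the
second entry is a comparable pairing added here purely as an assumption:

    SUPERSET = {
        "iso-8859-1": "windows-1252",
        "iso-8859-9": "windows-1254",  # assumption: the analogous Turkish pair
    }

    def effective_decoder(label: str) -> str:
        return SUPERSET.get(label.lower(), label)

    def c1_byte_offsets(data: bytes):
        # Offsets of 0x80-0x9F bytes, which the proposal would report as
        # easy parse errors even though the Windows-* decoder maps them.
        return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9f]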
> I would like the spec to say that if the page has forms, using an
> encoding other than UTF-8 is trouble. And even for pages that don't have
> forms, using an encoding that is not known to be extremely well
> supported introduces incompatibility for no good reason.
Does the current text (which doesn't mention forms) satisfy you?
On Tue, 14 Mar 2006, Lachlan Hunt wrote:
>
> This will need to handle common mistakes such as the following:
>
> <meta ... content="application/xhtml+xml;charset=X">
> <meta ... content="foo/bar;charset=X">
> <meta ... content="foo/bar;charset='X'">
> <meta ... content="charset=X">
> <meta ... charset="X">
The ones that matter are now in the spec, as far as I can tell.
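For illustration, a tolerant extractor covering the mistakes listed above;
this regex is only an approximation for the example, since the spec
defines its own extraction algorithm rather than a pattern match:

    import re

    CHARSET_IN_CONTENT = re.compile(
        r"charset\s*=\s*['\"]?\s*([A-Za-z0-9_.:-]+)", re.I)

    def charset_from_content(content: str):
        m = CHARSET_IN_CONTENT.search(content)
        return m.group(1) if m else None

    assert charset_from_content("application/xhtml+xml;charset=X") == "X"
    assert charset_from_content("foo/bar;charset='X'") == "X"
    assert charset_from_content("charset=X") == "X"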
On Tue, 14 Mar 2006, Peter Karlsson wrote:
> >
> > Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE
> > (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the
> > EBCDIC family of encodings. Documents using the UTF-16 or UTF-32
> > encodings must have a BOM.
>
> I don't think forbidding BOCU-1 is a good idea. If there is ever a
> proper specification written for it, it could be very useful as a
> compression format for documents.
BOCU-1 has been used for security attacks. It's on the "no fly" list.
On Tue, 15 May 2007, Michael Day wrote:
>
> Suggestion: drop UTF-32 from the character encoding detection section of
> HTML5, and even better, discourage or forbid user agents from
> implementing support for UTF-32.
Done.
On Wed, 16 May 2007, Geoffrey Sneddon wrote:
>
> Including it in a few encoding detection algorithms is no big deal for us
> implementers: as the spec stands we aren't required to support it
> anyway. All the spec requires is that we include it within our encoding
> detections (so, if we don't support it, we can then reject it).
Right now it's not even being detected by the spec.
On Mon, 4 Jun 2007, Henri Sivonen wrote:
>
> What's the right thing for an implementation to do when UTF-32 is not
> supported? Decode as Windows-1252? Does that make sense?
That's basically what the spec requires now.
On Mon, 4 Jun 2007, Alexey Feldgendler wrote:
>
> Seems like a general question: what's the right thing to do when the
> document's encoding is not supported? There isn't a reasonable fallback
> for every encoding.
The spec right now requires UAs to ignore <meta charset=""> declarations
they don't understand.
On Mon, 4 Jun 2007, Henri Sivonen wrote:
>
> I think it is perfectly reasonable to make support for UTF-8 and
> Windows-1252 part of UA conformance requirements. After all, a piece of
> software that doesn't support those two really has no business
> pretending to be a UA for the World Wide Web. Not supporting
> Windows-1252 based on "local market" arguments is serious
> walled-gardenism.
Indeed.
On Mon, 4 Jun 2007, Alexey Feldgendler wrote:
>
> On the other hand, declaring Windows-1252 as the default encoding is
> monoculturalism. For example, in Russia, whenever Windows-1252 is
> chosen, it is definitely a wrong choice. It's never used in Russia
> because it doesn't contain Cyrillic letters. A default of Windows-1251
> or KOI8-R might be reasonable in Russia, though none of them is a 100%
> safe guess.
The spec allows any guess.
On Sun, 27 May 2007, Henri Sivonen wrote:
>
> "If the encoding is one of UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or
> UTF-32LE, then authors can use a BOM at the start of the file to
> indicate the character encoding."
>
> That sentence should read:
That sentence is now gone. The "writing HTML" section generically allows
leading BOMs regardless of character encoding.
> The encoding labels with LE or BE in them mean BOMless variants where
> the encoding label on the transfer protocol level gives the endianness.
> See http://www.ietf.org/rfc/rfc2781.txt When the spec refers to UTF-16
> with BOM in a particular endianness, I think the spec should use
> "big-endian UTF-16" and "little-endian UTF-16".
>
> Since declaring endianness on the transfer protocol level has no benefit
> over using the BOM when the label is right and there's a chance to get
> the label wrong, the encoding labels with explicit endianness are
> harmful for interchange. In my opinion, the spec should avoid giving
> authors any bad ideas by reinforcing these labels by repetition.
If you know the encoding before going in (e.g. it's in Content-Type
metadata) then if the BOM is correctly encoded you just ignore it, and if
it's incorrectly encoded then you won't see it as a BOM and you'll
probably treat it as U+FFFD. From an authoring standpoint, especially
given that tools now tend to output BOMs silently (e.g. Notepad), and
*especially* considering that a BOM is invisible, it would just be a pain
to have to take out the first character in certain cases. No?
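In code, the behaviour I'm describing is about this simple (a sketch, with
errors="replace" standing in for whatever error handling the UA does):

    def decode_with_known_encoding(data: bytes, encoding: str) -> str:
        # A correctly encoded BOM just decodes to U+FEFF and gets
        # dropped; a BOM in the wrong encoding decodes as ordinary
        # (probably garbage) characters and is left alone.
        text = data.decode(encoding, errors="replace")
        return text[1:] if text.startswith("\ufeff") else text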
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'