[whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

Wed Jul 31 13:33:13 PDT 2013

On Mon, 1 Jul 2013, Glenn Maynard wrote:
> On Mon, Jul 1, 2013 at 6:16 PM, Ian Hickson <ian at hixie.ch> wrote:
> >
> > It seems bad, and maybe rather full of hubris, to make it conforming 
> > to use a label that we know will be interpreted in a manner that is a 
> > willful violation of its spec (that is, the ISO spec).
> 
> It's hard enough to get people to label their encodings in the first 
> place.  It doesn't seem like a good idea to spend people's limited 
> attention on encodings with "you should change your encoding label, even 
> though what you already have will always work", especially given how 
> widespread the ISO-8859-1 label is.

Fair enough.

> (FWIW, I wouldn't change a server to say windows-1252.  The ISO spec is 
> so far out of touch with reality that it's hard to consider it 
> authoritative; in reality, ISO-8859-1 is 1252.)

It certainly seems that that is how most software interprets it.
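
As a small illustration of the gap between the two, here is a sketch using 
Python's codecs. It assumes that Python's "latin-1" codec follows the ISO 
definition strictly and that "cp1252" is its windows-1252 codec; the 
sample bytes are made up.

```python
# Byte 0x80 is a C1 control in ISO 8859-1 but the euro sign in windows-1252.
data = b"price: \x80 5"

as_iso = data.decode("latin-1")   # strict ISO 8859-1 reading
as_1252 = data.decode("cp1252")   # windows-1252 reading

# The same byte yields two different code points.
print(hex(ord(as_iso[7])))    # code point under the ISO reading
print(hex(ord(as_1252[7])))   # code point under the windows-1252 reading
```

Bytes outside 0x80...0x9F decode identically under both codecs, which is 
why the label works for most existing content.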

On Tue, 2 Jul 2013, Jukka K. Korpela wrote:

> 2013-07-02 2:16, Ian Hickson wrote:
> > 
> > The reason that ISO-8859-1 is currently non-conforming is that the 
> > label no longer means "ISO-8859-1", as defined by the ISO. It actually 
> > means "Windows-1252".
> 
> Declaring ISO-8859-1 has no problems when the document does not contain 
> bytes in the range 0x80...0x9F, as it should not. There is a huge number 
> of existing pages to which this applies, and they are valid by HTML 4.01 
> (or, as the case may be, XHTML 1.0) rules. Declaring all of them as 
> non-conforming and issuing an error message about them does not seem to 
> be useful.

Right. I note that you omitted to quote the following from my original 
e-mail: "Previously, this was also somewhat the case, but it was only an 
error to use ISO-8859-1 in a manner that was not equivalent across both 
encodings (there was the concept of "misinterpreted for compatibility"). 
This was removed with the move to the Encoding spec."

This kind of error handling is what I would personally prefer.
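
Roughly, a checker following that older approach might look like the 
sketch below. The function names and messages are hypothetical, not taken 
from any validator; the only rule assumed is that bytes 0x80...0x9F are 
where the two encodings disagree.

```python
def find_c1_range_bytes(data: bytes) -> list[int]:
    """Return offsets of bytes in 0x80..0x9F, where the encodings differ."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


def check_iso_8859_1_label(data: bytes) -> str:
    """Complain only when the ISO-8859-1 label is actually ambiguous."""
    offsets = find_c1_range_bytes(data)
    if not offsets:
        # Both interpretations agree, so the label is harmless here.
        return "ok"
    return f"error: {len(offsets)} byte(s) will be read as windows-1252"


print(check_iso_8859_1_label(b"caf\xe9"))       # no bytes in the range
print(check_iso_8859_1_label(b"5 \x80 only"))   # one byte in the range
```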

> You might say that such pages are risky and the risk should be announced,
> because if the page is later changed so that it contains a byte in that range, it
> will not be interpreted by ISO-8859-1 but by windows-1252.

Honestly, merely not using UTF-8 is far more risky than the difference 
between 8859-1 and 1252. The encoding of the page is also the encoding 
used in a bunch of outgoing (encoding) features, and users aren't going to 
conveniently limit themselves to the character set of the encoding of the 
page when e.g. submitting forms.
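
To make the outgoing case concrete, here is a sketch of what happens when 
submitted text falls outside the page's encoding. The helper name and the 
sample string are made up. The "xmlcharrefreplace" error handler is 
Python's, used here only as an approximation of the numeric character 
reference fallback that browsers apply when encoding form data.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


submitted = "caf\u00e9 \u263a"   # U+263A is not in windows-1252

print(can_encode(submitted, "cp1252"))   # legacy page encoding
print(can_encode(submitted, "utf-8"))    # UTF-8 covers all of Unicode

# Lossy fallback: the unencodable character becomes a character reference,
# which the server cannot tell apart from text the user typed literally.
fallback = submitted.encode("cp1252", errors="xmlcharrefreplace")
print(fallback)
```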

> I think the simplest approach would be to declare U+0080...U+009F as 
> forbidden in both serializations.

I don't see any point in making them non-conforming in actual Win1252 
content. That's not harmful.

> > It seems bad, and maybe rather full of hubris, to make it conforming 
> > to use a label that we know will be interpreted in a manner that is a 
> > willful violation of its spec (that is, the ISO spec).
> 
> In most cases, there is no violation of the ISO standard. Or, to put it 
> in another way, taking ISO-8859-1 as a synonym for windows-1252 is fully 
> compatible with the ISO 8859-1 standard as long as the document does not 
> contain data that would be interpreted by ISO 8859-1 as C1 Controls 
> (U+0080...U+009F), which it should not contain.

It's still a violation.

I'm not saying we shouldn't violate it; it's clearly the right thing to 
do. But despite having many willful violations of other standards in the 
HTML standard, I wouldn't want us to ever get to a stage where we were 
casual in our violations, or where we minimised or dismissed the issue.

> > I would rather go back to having the conflicts be caught by validators 
> > than just throw the ISO spec under the bus, but it's really up to you 
> > (Henri, and whoever else is implementing a validator).
> 
> Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has 
> done for years, and remains happy, until he tries to validate his page 
> as HTML5. Is it useful that he gets an error message (and gets 
> confused), even though his data is all ISO-8859-1 (without C1 Controls)? 

No, it's not. Like I said, I would rather go back to having the conflicts 
be caught by validators.

> Suppose then that he accidentally enters, say, the euro sign “€” because 
> his text editor or other authoring tool lets him do so – and stores it as 
> windows-1252 encoded. Even then, no practical problem arises, due to the 
> common error handling behavior, but at this point, it might be useful to 
> give some diagnostic if the document is being validated.

Right.

Unfortunately it seems you and I are alone in thinking this.

> I would say that even then a warning about the problem would be sufficient,
> but it could be treated as an error

There's not really a difference, in a validator.

In any case, I've changed the spec to allow any label to be used for an 
encoding.
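
For reference, label resolution can be sketched as below. The table is a 
small hand-picked excerpt, not the full list from the Encoding spec, and 
the function name is hypothetical. The matching rule assumed is the 
spec's: strip leading and trailing ASCII whitespace, then compare ASCII 
case-insensitively.

```python
# Partial excerpt of label -> encoding name; the real table is much longer.
LABELS = {
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
    "cp1252": "windows-1252",
    "windows-1252": "windows-1252",
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
}

ASCII_WHITESPACE = "\t\n\x0c\r "


def ascii_lower(s: str) -> str:
    """Lowercase A-Z only, leaving non-ASCII characters untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def resolve_label(label: str):
    """Return the encoding name for a label, or None if it is unknown."""
    return LABELS.get(ascii_lower(label.strip(ASCII_WHITESPACE)))


print(resolve_label(" ISO-8859-1 "))   # resolves like windows-1252
print(resolve_label("Windows-1252"))
print(resolve_label("no-such-label"))
```

With the change, either of the first two labels is acceptable for a 
document in that encoding.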

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'