[whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
Jukka K. Korpela
jkorpela at cs.tut.fi
Tue Jul 2 00:05:11 PDT 2013
2013-07-02 2:16, Ian Hickson wrote:
> The reason that ISO-8859-1 is currently non-conforming is that the label
> no longer means "ISO-8859-1", as defined by the ISO. It actually means
> "Windows-1252".
Declaring ISO-8859-1 has no problems when the document does not contain
bytes in the range 0x80...0x9F, as it should not. There is a huge number
of existing pages to which this applies, and they are valid by HTML 4.01
(or, as the case may be, XHTML 1.0) rules. Declaring all of them as
non-conforming and issuing an error message about them does not seem to
be useful.
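
A minimal Python sketch of that claim, assuming Python's built-in "iso-8859-1" and "cp1252" codecs as stand-ins for the two encodings (the helper name has_c1_bytes is hypothetical, made up for this illustration):

```python
# Bytes 0x80..0x9F are the only positions where the two encodings differ.
C1_RANGE = range(0x80, 0xA0)


def has_c1_bytes(data: bytes) -> bool:
    """Return True if data contains any byte in the range 0x80..0x9F."""
    return any(b in C1_RANGE for b in data)


# "Caf\u00e9" is "Cafe" with an acute accent on the final letter.
page = "<p>Caf\u00e9</p>".encode("iso-8859-1")

assert not has_c1_bytes(page)
# Without such bytes, both decoders produce exactly the same text.
assert page.decode("iso-8859-1") == page.decode("cp1252")
```

For a page like this, the choice between the two labels makes no observable difference.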
You might say that such pages are risky and the risk should be
announced, because if the page is later changed so that it contains a byte
in that range, it will be interpreted not by ISO-8859-1 but by
windows-1252. From the perspective of tradition and practice, this is
just about error handling. By HTML 4.01, those bytes should be
interpreted as control characters according to ISO-8859-1, and this
would make the document invalid, since those control characters are
disallowed in HTML 4.01. Thus, whatever browsers do with the document
then is error processing, and nowadays probably all browsers have chosen
to interpret them by windows-1252.
Admittedly, in XHTML syntax it’s different since those control
characters are not forbidden but (mostly) “just” discouraged.
I think the simplest approach would be to declare U+0080...U+009F as
forbidden in both serializations. Then the issue could be defined purely
in terms of error handling. If you declare ISO-8859-1 and do not have
bytes 0x80...0x9F, fine. If you do have such a byte, we should still
treat the encoding declaration as conforming as such, but validators
should report the characters as errors and browsers should handle this
error by interpreting the document as if the declared encoding were
windows-1252.
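
The proposed handling could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name decode_declared_latin1 is hypothetical, and Python's "cp1252" codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the sketch falls back to the same-numbered code point for those.

```python
from typing import List, Tuple


def decode_declared_latin1(data: bytes) -> Tuple[str, List[int]]:
    """Decode data that was declared as ISO-8859-1.

    Returns the decoded text together with the offsets of all bytes in
    0x80..0x9F, which a validator would report as errors.
    """
    errors = [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]
    chars = []
    for b in data:
        one = bytes([b])
        try:
            # Error handling: interpret the byte as windows-1252.
            chars.append(one.decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in Python's cp1252 table; keep the same-numbered
            # code point (an assumption of this sketch).
            chars.append(one.decode("iso-8859-1"))
    return "".join(chars), errors


text, errors = decode_declared_latin1(b"price: \x80 5")
assert text == "price: \u20ac 5"  # U+20AC is the euro sign
assert errors == [7]
```

The declaration itself is never flagged; only the offending byte positions are.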
> It seems bad, and maybe rather full of hubris, to make it conforming to
> use a label that we know will be interpreted in a manner that is a willful
> violation of its spec (that is, the ISO spec).
In most cases, there is no violation of the ISO standard. Or, to put it
another way, taking ISO-8859-1 as a synonym for windows-1252 is fully
compatible with the ISO 8859-1 standard as long as the document does not
contain data that would be interpreted by ISO 8859-1 as C1 Controls
(U+0080...U+009F), which it should not contain.
> I would rather go back to having the conflicts be caught by validators
> than just throw the ISO spec under the bus, but it's really up to you
> (Henri, and whoever else is implementing a validator).
Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has
done for years, and remains happy, until he tries to validate his page
as HTML5. Is it useful that he gets an error message (and gets
confused), even though his data is all ISO-8859-1 (without C1 Controls)?
Suppose then that he accidentally enters, say, the euro sign “€” because
his text editor or other authoring tool lets him do so, and the tool stores
it windows-1252 encoded. Even then, no practical problem arises, due to the
common error handling behavior, but at this point, it might be useful to
give some diagnostic if the document is being validated.
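
The euro sign case can be checked in a few lines of Python, again assuming the built-in "cp1252" and "iso-8859-1" codecs as stand-ins (the variable names are arbitrary):

```python
euro = "\u20ac"  # EURO SIGN

# An authoring tool that saves as windows-1252 writes a single byte 0x80.
stored = euro.encode("cp1252")
assert stored == b"\x80"

# Read strictly as ISO-8859-1, that byte is the C1 control U+0080.
as_iso = stored.decode("iso-8859-1")
assert as_iso == "\x80"

# Read as windows-1252, which is what browsers do, it is the euro sign.
as_win = stored.decode("cp1252")
assert as_win == euro

# The euro sign has no representation in ISO-8859-1 at all.
try:
    euro.encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable
```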
I would say that even then a warning about the problem would be
sufficient, but it could be treated as an error – as a data error, with
defined error handling. The occurrences of the offending bytes should be
reported (which is what now happens when validating as HTML 4.01, even
though the error messages are cryptic, like “non SGML character number
128”). The author might then decide to declare the encoding as windows-1252.
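
A more readable report of the offending bytes might look like the sketch below. The function name report_c1_bytes and the message wording are hypothetical; this is not how any existing validator is implemented.

```python
from typing import Iterator, Tuple


def report_c1_bytes(data: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (line, column, byte value) for each byte in 0x80..0x9F.

    Lines and columns are 1-based; lines are split on LF (0x0A).
    """
    line, col = 1, 0
    for b in data:
        if b == 0x0A:
            line, col = line + 1, 0
            continue
        col += 1
        if 0x80 <= b <= 0x9F:
            yield line, col, b


for line, col, b in report_c1_bytes(b"ok\nprice \x80 5\n"):
    print(f"line {line}, column {col}: byte 0x{b:02X} is in the range 0x80..0x9F")
```

Since both encodings are single-byte, the byte column is also the character column.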
But even though the most common cause of such a situation is an attempt
to use (mostly due to ignorance) certain characters without realizing
that they do not exist in ISO-8859-1, it might be a symptom of some
different problem, like malformed data unintentionally appearing in a
document. It is thus useful to draw the author’s attention to specific
problems, that is, to incorrect data where it appears, rather than
blindly taking ISO-8859-1 as windows-1252.
Yucca