[whatwg] Relationship to Charmod and Charmod Norm
hsivonen at iki.fi
Sat Nov 11 05:35:56 PST 2006
On Nov 11, 2006, at 01:13, François Yergeau wrote:
> Henri Sivonen a écrit :
>> Does C003 in Charmod outlaw bdo?
> Nope. bdo is simply an assertion by the author that the
> presentation order is not the usual one for the script. The text
> is still stored, interchanged and processed in logical order.
>> I think C073 shouldn't render a document non-conforming.
> Disagree. C073 is a SHOULD NOT and it should carry over to HTML
> conformance stricto sensu (i.e. as per RFC 2119).
I agree that, in general, PUA characters aren't suitable for public
interchange. However, I don't think it is necessarily a good idea to
make a conformance checker proclaim documents that contain them non-
conforming. I do think that a warning is called for. See also C040.
There are cases when PUA characters are the best available way to
I have tried hard to avoid marketing the would-be conformance
checking service the same way fanboys market the W3C Validator. I
intend the conformance checking service to be a tool that helps
authors--not a graven image that needs to be satisfied at all cost.
Regardless, I need to consider what kind of behavior the conformance
checking service could induce among those who don't see the big
picture but want their documents to have zero errors reported.
If the use of PUA characters were an error, the people who want zero
errors from a conformance checker at all cost could move from
violating C073 to violating C076, which would be much worse but not
detectable by a conformance checker. (I'm not suggesting that Everson
& Cowan would do this, but, you know, others. :-)
>> Would it be too annoying to emit a warning? Perhaps one warning
>> per document rather than per character?
> No more than one per doc, please!
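To make the "one warning per document" idea concrete, here is a
minimal Python sketch. The function names and the message wording are
made up for illustration and are not taken from any existing checker;
the three ranges are the Unicode private use areas.

```python
# Hypothetical helper, not part of any existing conformance checker.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch):
    """Return True if the single character ch is a private use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def pua_warnings(text):
    """Return at most one warning for the whole document text."""
    count = sum(1 for ch in text if is_private_use(ch))
    if count == 0:
        return []
    return [
        "Warning: document contains %d private use character(s); "
        "see Charmod C073." % count
    ]
```

The count is folded into a single message, so a document full of PUA
characters still produces exactly one warning.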
>> I think authors wouldn't like warnings on C047 and C048.
> Perhaps, perhaps not. Some authors want their apps to keep them as
> close to spec as possible. Authoring tools should certainly abide
> by C047 and C048 when generating escapes on behalf of the author.
>> Moreover, I think it should be concluded that Charmod SHOULD
>> violations don't make an (X)HTML5 document non-conforming. Correct?
> Totally incorrect, IMHO. RFC2119 SHOULD's are real conformance
> requirements that a spec admits can be disobeyed in some cases,
> given good enough reasons. Absent such good reasons, they are
> requirements, period.
C047 does not have a hard machine-checkable definition. It does not
cite testing particular Unicode character properties, for example.
Moreover, numeric escapes of characters of any kind are expanded by
the parser and are, therefore, totally harmless in the parsed
document tree, because you can't even detect them there.
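As a rough illustration of the point about the parsed tree, the
following Python sketch uses the standard library's ElementTree as a
stand-in XML parser (an assumption made for the example; it is not the
checker's parser). The helper name parsed_text is made up.

```python
import xml.etree.ElementTree as ET


def parsed_text(markup):
    """Parse a one-element XML document and return its text content."""
    return ET.fromstring(markup).text


literal = parsed_text("<p>\u00e9</p>")    # character written directly
decimal = parsed_text("<p>&#233;</p>")    # decimal numeric escape
hexadecimal = parsed_text("<p>&#xE9;</p>")  # hexadecimal numeric escape

# After parsing, all three are the same one-character string, so the
# tree carries no trace of which form the author typed.
assert literal == decimal == hexadecimal == "\u00e9"
```

Anything that checks C047 or C048 would therefore have to look at the
source text rather than at the document tree.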
C048, as far as text/html goes, is even bad advice in terms of
backwards compatibility with really old software. In the case of XML, both decimal
and hexadecimal have been supported from day 1. However, both decimal
and hexadecimal are equally right as far as the XML 1.0 spec is
concerned and neither causes any technical trouble over the other in
conforming XML processors. Making the Charmod SHOULD an error would
mean proclaiming documents non-conforming over an issue that causes
absolutely no technical trouble in processing with conforming parsers
but is about the view source convenience preference of Charmod
authors! (Besides, there are lookup interfaces that support decimal:
I think it would be unwise to make an (X)HTML5 conformance checking
service cry wolf on C047 and C048. It would only undermine the
usefulness of a conformance checking service for authors and would
dilute the perceived seriousness of errors.
But let's look at all the [C] SHOULDs (quoting from Charmod):
> C022 [S] [I] [C] Character encodings that are not in the IANA
> registry SHOULD NOT be used, except by private agreement.
I guess I could make that an error.
> C049 [I] [C] The character encoding of content SHOULD be chosen so
> that it maximizes the opportunity to directly represent characters
> (ie. minimizes the need to represent characters by markup means
> such as character escapes) while avoiding obscure encodings that
> are unlikely to be understood by recipients.
First, Charmod doesn't define a conclusive list of non-obscure
The XML side warns if the encoding is not US-ASCII, ISO-8859-1, UTF-8
or UTF-16. (The XML spec only requires UTF-8 and UTF-16 to be supported,
so it follows that anything else is optional and, therefore, unsafe.
However, I don't warn on US-ASCII or ISO-8859-1, because I don't want
to cry wolf and I've never seen evidence of XML parsers that didn't
also support US-ASCII and ISO-8859-1. I do have evidence of a popular
parser that only supports those four by default: expat. And there's a
lot of ASCII-only XML out there that is declared ISO-8859-1, which is
harmless in practice.)
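The policy just described can be sketched in a few lines of Python.
The function name, the constant and the message text are illustrative
assumptions, not the checker's actual code; the four labels are the
ones listed above.

```python
# Encodings the XML-side policy does not warn about (illustrative).
SAFE_XML_ENCODINGS = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16"}


def xml_encoding_warning(declared):
    """Return a warning string for a declared encoding, or None.

    Encoding labels are compared case-insensitively.
    """
    if declared.strip().upper() in SAFE_XML_ENCODINGS:
        return None
    return (
        "Warning: encoding %s is not among US-ASCII, ISO-8859-1, "
        "UTF-8 and UTF-16; XML processors are only required to "
        "support UTF-8 and UTF-16." % declared
    )
```

Returning None rather than an error for the four labels mirrors the
"don't cry wolf" reasoning: only encodings outside that set produce a
message, and even then it is a warning.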
As much as I'd like to be able to force everyone to use UTF-8, I am
uncomfortable about making the use of an optionally-supported
encoding an error, since the XML 1.0 spec intentionally leaves
encoding support open-ended. Of course, I could deviously disable a
host of decoders and claim implementation limitations. :-)
On the text/html side, it wouldn't be useful, considering the
practical backwards-compatibility goals of the WHAT WG, to complain
about encodings that "everyone" supports. A passable practical
definition could be the intersection of the IANA-registered encodings
supported by IE6, Opera 9, Firefox 2.0, Safari 2.0.x, Sun JDK 1.4.2
and Python 2.4. (Make that Python 2.3 if you want to take a point
against the CJK encoding soup.)
Also, when an encoding is de facto supported, it is rather useless,
in my opinion, to analyze if it is optimal in terms of byte count and
to proclaim the document non-conforming if it isn't.
> C024 [I] [C] Content and software that label text data MUST use
> one of the names required by the appropriate specification (e.g.
> the XML specification when editing XML text) and SHOULD use the
> MIME preferred name of a character encoding to label data in that
> character encoding.
I already warn if the preferred name isn't used, but I guess I could
make it an error.
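A preferred-name check of this kind could look roughly like the
Python sketch below. The table is a tiny hand-picked excerpt of IANA
aliases and the function name is made up; a real checker would load
the full registry rather than hard-code a few entries.

```python
# Lowercased label -> preferred MIME name. Hand-picked excerpt only.
PREFERRED_MIME_NAME = {
    "iso-8859-1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "l1": "ISO-8859-1",
    "us-ascii": "US-ASCII",
    "ascii": "US-ASCII",
    "ansi_x3.4-1968": "US-ASCII",
    "utf-8": "UTF-8",
}


def preferred_name_warning(label):
    """Return a warning if label is a known alias, else None.

    Labels missing from the excerpt also yield None; deciding whether
    they are registered at all is a separate check (C022).
    """
    key = label.strip().lower()
    preferred = PREFERRED_MIME_NAME.get(key)
    if preferred is None or preferred.lower() == key:
        return None
    return 'Warning: "%s" is not the preferred MIME name; use "%s".' % (
        label,
        preferred,
    )
```

Turning the warning into an error would only mean changing how the
returned message is reported, not how the alias is detected.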
> C073 [C] Publicly interchanged content SHOULD NOT use codepoints
> in the private use area.
> C047 [I] [C] Escapes SHOULD only be used when the characters to be
> expressed are not directly representable in the format or the
> character encoding of the document, or when the visual
> representation of the character is unclear.
> C048 [I] [C] Content SHOULD use the hexadecimal form of character
> escapes rather than the decimal form when there are both.
Already discussed above.
> C054 [I] [C] Users of specifications (software developers, content
> developers) SHOULD whenever possible prefer ways other than string
> indexing to identify substrings or point within a string.
hsivonen at iki.fi