[whatwg] Relationship to Charmod and Charmod Norm

Sat Nov 11 05:35:56 PST 2006

On Nov 11, 2006, at 01:13, François Yergeau wrote:

> Henri Sivonen wrote:
>> Does C003 in Charmod outlaw bdo?
>
> Nope.  bdo is simply an assertion by the author that the  
> presentation order is not the usual one for the script.  The text  
> is still stored, interchanged and processed in logical order.

OK.

>> I think C073 shouldn't render a document non-conforming.
>
> Disagree.  C073 is a SHOULD NOT and it should carry over to HTML  
> conformance stricto sensu (i.e. as per RFC 2119).

I agree that, in general, PUA characters aren't suitable for public  
interchange. However, I don't think it is necessarily a good idea to  
make a conformance checker proclaim documents that contain them non- 
conforming. I do think that a warning is called for. See also C040.

There are cases when PUA characters are the best available way to  
communicate something:
http://www.evertype.com/standards/csur/

I have tried hard to avoid marketing the would-be conformance  
checking service the same way fanboys market the W3C Validator. I  
intend the conformance checking service to be a tool that helps  
authors--not a graven image that needs to be satisfied at all cost.  
Regardless, I need to consider what kind of behavior the conformance  
checking service could induce among those who don't see the big  
picture but want their documents to have zero errors reported.

If the use of PUA characters were an error, the people who want zero  
errors from a conformance checker at all cost could move from  
violating C073 to violating C076, which would be much worse but not  
detectable by a conformance checker. (I'm not suggesting that Everson  
& Cowan would do this, but, you know, others. :-)

>> Would it be too annoying to emit a warning? Perhaps one warning  
>> per document rather than per character?
>
> No more than one per doc, please!

OK.
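
A minimal sketch of the "one warning per document" behavior, assuming the checker sees the document as a decoded string. The function names and the message wording are hypothetical; only the three Private Use ranges come from Unicode itself.

```python
# Unicode Private Use ranges: one in the BMP and one in each of planes 15 and 16.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def is_private_use(ch):
    """Return True if the single character ch is a private use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def check_pua(text):
    """Return at most one warning, however many PUA characters occur."""
    count = sum(1 for ch in text if is_private_use(ch))
    if count == 0:
        return []
    # Hypothetical message format; reported as a warning, not an error.
    return ["warning: document contains %d private use character(s) "
            "(Charmod C073)" % count]
```

Counting first and reporting once keeps the report short even for a document written entirely in a PUA script.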

>> I think authors wouldn't like warnings on C047 and C048.
>
> Perhaps, perhaps not.  Some authors want their apps to keep them as  
> close to spec as possible.  Authoring tools should certainly abide  
> by C047 and C048 when generating escapes on behalf of the author.
>
>> Moreover, I think it should be concluded that Charmod SHOULD  
>> violations don't make an (X)HTML5 document non-conforming. Correct?
>
> Totally incorrect, IMHO.  RFC2119 SHOULD's are real conformance  
> requirements that a spec admits can be disobeyed in some cases,  
> given good enough reasons.  Absent such good reasons, they are  
> requirements, period.

C047 does not have a hard machine-checkable definition. It does not  
cite testing particular Unicode character properties, for example.  
Moreover, numeric escapes of characters of any kind are expanded by  
the parser and are, therefore, totally harmless in the parsed  
document tree, because you can't even detect them there.

C048, as far as text/html goes, is even bad advice in terms of  
backwards compatibility with really old software. In the case of XML, both decimal  
and hexadecimal have been supported from day 1. However, both decimal  
and hexadecimal are equally right as far as the XML 1.0 spec is  
concerned and neither causes any technical trouble over the other in  
conforming XML processors. Making the Charmod SHOULD an error would  
mean proclaiming documents non-conforming over an issue that causes  
absolutely no technical trouble in processing with conforming parsers  
but is about the view source convenience preference of Charmod  
authors! (Besides, there are lookup interfaces that support decimal:  
http://www.eki.ee/letter/ )
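
To illustrate both points (escapes disappear in the parsed tree, and decimal and hexadecimal forms are equivalent to a conforming XML parser), here is a small sketch using Python's standard xml.etree.ElementTree. The sample element is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# The same character (U+00E9) written three ways in the source.
literal = ET.fromstring("<p>\u00e9</p>")
decimal = ET.fromstring("<p>&#233;</p>")
hexadecimal = ET.fromstring("<p>&#xE9;</p>")

texts = [literal.text, decimal.text, hexadecimal.text]

# After parsing, nothing in the tree records which form the author typed.
print(texts[0] == texts[1] == texts[2])
```

A checker that wanted to enforce C047 or C048 would therefore have to work below the tree level, on the raw source text.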

I think it would be unwise to make an (X)HTML5 conformance checking  
service cry wolf on C047 and C048. It would only undermine the  
usefulness of a conformance checking service for authors and would  
dilute the perceived seriousness of errors.

But let's look at all the [C] SHOULDs (quoting from Charmod):
> C022	[S] [I] [C] 	Character encodings that are not in the IANA  
> registry SHOULD NOT be used, except by private agreement.

I guess I could make that an error.

> C049	[I] [C] 	The character encoding of content SHOULD be chosen so  
> that it maximizes the opportunity to directly represent characters  
> (ie. minimizes the need to represent characters by markup means  
> such as character escapes) while avoiding obscure encodings that  
> are unlikely to be understood by recipients.

First, Charmod doesn't define a conclusive list of non-obscure  
encodings.

On the XML side, I warn if the encoding is not US-ASCII, ISO-8859-1, UTF-8  
or UTF-16. (The XML spec only requires UTF-8 and UTF-16 to be supported,  
so it follows that anything else is optional and, therefore, unsafe.  
However, I don't warn on US-ASCII or ISO-8859-1, because I don't want  
to cry wolf and I've never seen evidence of XML parsers that didn't  
also support US-ASCII and ISO-8859-1. I do have evidence of a popular  
parser that only supports those four by default: expat. And there's a  
lot of ASCII-only XML out there that is declared ISO-8859-1, which is  
harmless in practice.)
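
The XML-side policy described above could be sketched like this. The function name and message text are hypothetical, and the sketch assumes the declared encoding label has already been extracted from the XML declaration or the transport layer.

```python
# Encodings that do not trigger a warning under the policy described above.
# Encoding labels are compared case-insensitively.
NO_WARNING = {"us-ascii", "iso-8859-1", "utf-8", "utf-16"}


def xml_encoding_warnings(declared):
    """Return a list of warnings (possibly empty) for a declared encoding."""
    if declared.strip().lower() in NO_WARNING:
        return []
    # A warning rather than an error: support for other encodings is
    # optional in XML 1.0, not forbidden.
    return ["warning: support for encoding %r is optional for XML "
            "processors" % declared]
```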

As much as I'd like to be able to force everyone to use UTF-8, I am  
uncomfortable about making the use of an optionally-supported  
encoding an error, since the XML 1.0 spec intentionally leaves  
encoding support open-ended. Of course, I could deviously disable a  
host of decoders and claim implementation limitations. :-)

On the text/html side, it wouldn't be useful, considering the  
practical backwards-compatibility goals of the WHAT WG, to complain  
about encodings that "everyone" supports. A passable practical  
definition could be the intersection of the IANA-registered encodings  
supported by IE6, Opera 9, Firefox 2.0, Safari 2.0.x, Sun JDK 1.4.2  
and Python 2.4. (Make that Python 2.3 if you want to take a point  
against the CJK encoding soup.)
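
Computing such an intersection is straightforward once per-implementation lists exist. In the sketch below the implementation names and their encoding sets are made-up placeholders, not measurements of any real browser or runtime.

```python
# Placeholder data only: these sets do NOT describe real implementations.
supported = {
    "implementation A": {"utf-8", "iso-8859-1", "windows-1252", "shift_jis"},
    "implementation B": {"utf-8", "iso-8859-1", "windows-1252", "euc-jp"},
    "implementation C": {"utf-8", "iso-8859-1", "shift_jis", "euc-jp"},
}


def commonly_supported(tables):
    """Return the encodings present in every implementation's set."""
    sets = list(tables.values())
    return set.intersection(*sets) if sets else set()


common = commonly_supported(supported)
print(sorted(common))  # prints ['iso-8859-1', 'utf-8']
```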

Also, when an encoding is de facto supported, it is rather useless,  
in my opinion, to analyze if it is optimal in terms of byte count and  
to proclaim the document non-conforming if it isn't.

> C024	[I] [C] 	Content and software that label text data MUST use  
> one of the names required by the appropriate specification (e.g.  
> the XML specification when editing XML text) and SHOULD use the  
> MIME preferred name of a character encoding to label data in that  
> character encoding.

I already warn if the preferred name isn't used, but I guess I could  
make it an error.
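
A sketch of how the C022 and C024 checks might fit together. The table is a tiny hand-written excerpt of aliases from the IANA registry, and the function name, message text and choice of severities are assumptions made for illustration.

```python
# Lowercased label -> preferred MIME name. Hand-written excerpt only; a real
# checker would load the full IANA character set registry.
PREFERRED = {
    "us-ascii": "US-ASCII",
    "ascii": "US-ASCII",
    "ansi_x3.4-1968": "US-ASCII",
    "iso-8859-1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "utf-8": "UTF-8",
}


def check_label(label):
    """Return a message for a problematic encoding label, or None if fine."""
    key = label.strip().lower()
    if key not in PREFERRED:
        # C022: not a registered name (as far as this excerpt knows).
        return "error: %r is not in the registry excerpt (C022)" % label
    preferred = PREFERRED[key]
    if key != preferred.lower():
        # C024: registered alias, but not the preferred MIME name.
        return "warning: use %r instead of %r (C024)" % (preferred, label)
    return None
```

Turning the C024 case into an error would only require changing the message prefix; the lookup itself stays the same.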

> C073	[C] 	Publicly interchanged content SHOULD NOT use codepoints  
> in the private use area.
> C047	[I] [C] 	Escapes SHOULD only be used when the characters to be  
> expressed are not directly representable in the format or the  
> character encoding of the document, or when the visual  
> representation of the character is unclear.
> C048	[I] [C] 	Content SHOULD use the hexadecimal form of character  
> escapes rather than the decimal form when there are both.

Already discussed above.

> C054	[I] [C] 	Users of specifications (software developers, content  
> developers) SHOULD whenever possible prefer ways other than string  
> indexing to identify substrings or point within a string.

Not machine-checkable.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/