[whatwg] Low-level conformance issues

Fri Mar 10 13:00:21 PST 2006

On Sun, 14 Aug 2005, Henri Sivonen wrote:
> 
> & must start an NCR or an entity reference as in XML. (Rationale: Lone & 
> likely a mistake anyway.)

Agreed.

> &apos; is not considered conforming. (Rationale: Did not exist in HTML4 and is
> not supported by IE)

Disagreed. Consistency with XML seems like a very good thing here. I've also 
added AMP, COPY, LT, GT, QUOT and REG for compatibility, and made them 
conformant. It seems like those would be useful in all-caps text.

> Entity references and NCRs have to be terminated explicitly with a 
> semicolon. (Rationale: Implicit termination is likely a mistake unless 
> the person who wrote the reference is an SGML pedant. Requiring the 
> semicolon makes things unambiguous for sure. Also, having an explicit 
> delimiter helps in avoiding lookahead/pushback in the parser.)

Agreed.
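
A minimal sketch, in Python, of how a checker might enforce the two rules 
agreed so far (every & starts a reference, and every reference ends in a 
semicolon). The function name and the regular expression are illustrative 
assumptions, not spec text, and entity names are not checked against the 
actual entity list.

```python
import re

# Assumed shape of a well-formed reference: &name; or &#123; or &#x1F;
# (illustrative only; the name is not validated against the entity list).
REFERENCE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_bad_ampersands(text):
    """Return the offset of every '&' that does not start a terminated reference."""
    bad = []
    pos = text.find("&")
    while pos != -1:
        match = REFERENCE.match(text, pos)
        if match is None:
            # Lone '&', or a reference with no closing semicolon.
            bad.append(pos)
            pos = text.find("&", pos + 1)
        else:
            pos = text.find("&", match.end())
    return bad
```

Because the semicolon is mandatory, the check never has to look past the 
end of a candidate reference to decide whether it is valid.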

> Astral non-characters are not banned. (They are not banned in XML 1.0, 
> either.)

The only characters that get dropped in the spec are U+0000 and U+000D 
(the latter having special processing converting some of them to U+000A). 
So I agree, I guess, unless I misunderstood your comment.
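
Roughly, as a Python sketch. The function name is made up, and the exact 
U+000D handling shown here (CR LF pairs and lone CRs each become one LF) 
is my assumption about what "special processing" amounts to, so treat it 
as an illustration rather than the spec's algorithm.

```python
def preprocess(stream):
    """Drop U+0000 and normalise U+000D; leave every other character alone."""
    # Assumption: a CR LF pair collapses to LF, and a lone CR becomes LF.
    stream = stream.replace("\r\n", "\n").replace("\r", "\n")
    # U+0000 is simply dropped.
    return stream.replace("\x00", "")
```

Nothing else is filtered, so astral non-characters such as U+1FFFE pass 
through untouched.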

> Unescaped < and > in attributes are allowed without warning despite 
> folklore that warns about this breaking unspecified legacy UAs.

Agreed.

> Unquoted attribute values must be of the form [a-zA-Z][a-zA-Z0-9-]*, 
> which is slightly restrictive in a semi-arbitrary way for implementation 
> convenience.

Disagreed. Unquoted attribute value syntax is pretty lax in the spec... 
also for implementation convenience. :-)
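
For comparison, the stricter rule quoted above is easy to express as a 
check. This Python sketch implements only the quoted pattern 
[a-zA-Z][a-zA-Z0-9-]*; the function name is hypothetical, and the spec 
itself accepts considerably more than this.

```python
import re

# The pattern proposed in the quoted text, not the spec's (laxer) rule.
STRICT_UNQUOTED_VALUE = re.compile(r"[a-zA-Z][a-zA-Z0-9-]*")


def is_strict_unquoted_value(value):
    """True if value matches the quoted pattern in its entirety."""
    return STRICT_UNQUOTED_VALUE.fullmatch(value) is not None
```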

> The elements script and style are treated as CDATA. The string "</" may 
> only occur as part of the end tag. (Rationale: This approach is both 
> compatible with SGML and the way browsers work. Also, this avoids 
> lookahead/lookback.)

Agreed.

> PIs are banned. As are marked sections.

Agreed. They both end up forming bogus comments.

> Doctypes with the SYSTEM id only are banned.
> The internal subset is banned.
> The HTML5 doctype passes silently.
> The HTML 4.01 Strict and Transitional doctypes cause a warning about the
> HTML5-centric nature of the parser.
> Doctypes whose public id starts with "-//W3C//DTD XHTML " are banned with a
> special message.
> Other doctypes are treated as errors as is the lack of a doctype.
> The lack of a system id in the HTML 4.01 Transitional doctype is treated as an
> error.
> The lack of a system id in the HTML 4.01 Strict doctype causes a warning even
> though the spec says "must" and gives a doctype with a system id.
> Failure to use the canonical system ids causes warnings even though the "must"
> in HTML 4.01 could be interpreted as banning these.

DOCTYPEs other than <!DOCTYPE HTML> (case-insensitive) all cause parse 
errors, and may trigger quirks mode.
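
As a Python sketch, under the assumption that whitespace is required 
between DOCTYPE and HTML and is allowed before the closing >. The names 
are illustrative, and the quirks-mode decision is not modelled.

```python
import re

# Assumed whitespace handling; the match itself is case-insensitive.
HTML5_DOCTYPE = re.compile(r"<!DOCTYPE\s+HTML\s*>", re.IGNORECASE)


def doctype_is_parse_error(doctype):
    """True for any DOCTYPE string other than <!DOCTYPE HTML>."""
    return HTML5_DOCTYPE.fullmatch(doctype.strip()) is None
```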

> The internal character encoding information is not passed to the 
> application as content for consistency with the XML declaration, which 
> is not exposed through the SAX2 ContentHandler.

Nothing special is done for this.

> The BOM is sniffed.
> The lack of character encoding information (including the BOM) is treated as a
> fatal error.

This part of the spec needs work.

> > But I haven't thought much about this yet. The way parsing is to be 
> > defined I expect to just say "parsers should do this, and if they hit 
> > this they should do this, and if they hit this it's an error and they 
> > should do this", with conformance checkers having to do the same but 
> > reporting the errors. If that makes sense.
> 
> My parser is (almost) Draconian, so I don't intend to implement the 
> elaborate error recovery that is needed for browsers. (I have no 
> interest in competing with John Cowan's TagSoup.)

The spec explains how to recover from parse errors, but doesn't require 
conformance checkers to implement that recovery.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'