[whatwg] Low-level conformance issues
Ian Hickson
ian at hixie.ch
Fri Mar 10 13:00:21 PST 2006
On Sun, 14 Aug 2005, Henri Sivonen wrote:
>
> & must start an NCR or an entity reference as in XML. (Rationale: Lone &
> likely a mistake anyway.)
Agreed.
> &apos; is not considered conforming. (Rationale: Did not exist in HTML4 and is
> not supported by IE)
Disagreed. Consistency with XML seems like a very good thing here. I've also
added AMP, COPY, LT, GT, QUOT and REG for compatibility, and made them
conformant. It seems like those would be useful in all-caps text.
> Entity references and NCRs have to be terminated explicitly with a
> semicolon. (Rationale: Implicit termination is likely a mistake unless
> the person who wrote the reference is an SGML pedant. Requiring the
> semicolon makes things unambiguous for sure. Also, having an explicit
> delimiter helps in avoiding lookahead/pushback in the parser.)
Agreed.
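As a rough illustration of the two ampersand rules agreed on above (every "&" starts a reference, and every reference ends with a semicolon), here is a minimal Python sketch. The function name and the regular expression are assumptions made for this example, not anything taken from the spec or from Henri's parser, and the sketch does not check whether a named reference is actually a defined entity.

```python
import re

# Assumed shape of a terminated reference: named, decimal NCR, or hex NCR,
# always closed by an explicit semicolon. Entity names are NOT validated.
REFERENCE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_bad_ampersands(text):
    """Return the offset of every '&' that does not start a terminated reference."""
    bad = []
    pos = 0
    while True:
        pos = text.find("&", pos)
        if pos == -1:
            return bad
        match = REFERENCE.match(text, pos)
        if match:
            # Well-formed reference: skip past its semicolon.
            pos = match.end()
        else:
            # Lone '&' or a reference with no terminating semicolon.
            bad.append(pos)
            pos += 1
```

Because the pattern is anchored at the "&" and demands the semicolon, the check never has to look past the end of a candidate reference, which is the lookahead point made in the rationale.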
> Astral non-characters are not banned. (They are not banned in XML 1.0,
> either.)
The only characters that get dropped in the spec are U+0000 and U+000D
(the latter having special processing converting some of them to U+000A).
So I agree, I guess, unless I misunderstood your comment.
> Unescaped < and > in attributes are allowed without warning despite
> folklore that warns about this breaking unspecified legacy UAs.
Agreed.
> Unquoted attribute values must be of the form [a-zA-Z][a-zA-Z0-9-]*,
> which is slightly restrictive in a semi-arbitrary way for implementation
> convenience.
Disagreed. Unquoted attribute value syntax is pretty lax in the spec...
also for implementation convenience. :-)
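For reference, the stricter rule from the quoted text (not the laxer syntax in the spec) can be written down directly. This is a minimal Python sketch; the function name is made up for the example.

```python
import re

# The pattern quoted above: a letter, then letters, digits, or hyphens.
UNQUOTED_VALUE = re.compile(r"[a-zA-Z][a-zA-Z0-9-]*")


def is_allowed_unquoted(value):
    """True if the whole value fits the restrictive unquoted-value pattern."""
    return UNQUOTED_VALUE.fullmatch(value) is not None
```

Under this rule, values such as "2col" or "a b" would have to be quoted, while "col-2" would not.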
> The elements script and style are treated as CDATA. The string "</" may
> only occur as part of the end tag. (Rationale: This approach is both
> compatible with SGML and the way browsers work. Also, this avoids
> lookahead/lookback.)
Agreed.
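One way to picture the CDATA rule above: scan forward for the first "</" and insist that it is the element's own end tag. The Python sketch below works under that assumption; the function name, its arguments, and the error handling are hypothetical and only meant to show why no lookback is needed.

```python
def read_cdata_content(source, start, name):
    """Return the raw content of a script/style element.

    `start` is the offset just after the start tag. The first "</" must
    begin the matching end tag; anything else is reported as an error.
    """
    pos = source.find("</", start)
    if pos == -1:
        raise ValueError("no end tag for <%s>" % name)
    name_end = pos + 2 + len(name)
    tag_name = source[pos + 2:name_end]
    following = source[name_end:name_end + 1]
    # The tag name must match (case-insensitively) and be followed by
    # ">" or whitespace, so that "</scriptx>" is not accepted.
    if tag_name.lower() != name.lower() or not (following == ">" or following.isspace()):
        raise ValueError('"</" at offset %d is not the </%s> end tag' % (pos, name))
    return source[start:pos]
```

A lone "<" inside the content is fine; only the two-character sequence "</" is reserved for the end tag.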
> PIs are banned. As are marked sections.
Agreed. They both end up forming bogus comments.
> Doctypes with the SYSTEM id only are banned.
> The internal subset is banned.
> The HTML5 doctype passes silently.
> The HTML 4.01 Strict and Transitional doctypes cause a warning about the
> HTML5-centric nature of the parser.
> Doctypes whose public id starts with "-//W3C//DTD XHTML " are banned with a
> special message.
> Other doctypes are treated as errors as is the lack of a doctype.
> The lack of a system id in the HTML 4.01 Transitional doctype is treated as an
> error.
> The lack of a system id in the HTML 4.01 Strict doctype causes a warning even
> though the spec says "must" and gives a doctype with a system id.
> Failure to use the canonical system ids causes warnings even though the "must"
> in HTML 4.01 could be interpreted as banning these.
DOCTYPEs other than <!DOCTYPE HTML> (case-insensitive) all cause parse
errors, and may trigger quirks mode.
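In checker terms, that rule reduces to a single case-insensitive comparison. A minimal Python sketch follows; the function name is invented for the example, and the tolerance for extra whitespace inside the declaration is an assumption rather than something stated here.

```python
import re

# Only <!DOCTYPE HTML> passes, in any letter case. Allowing flexible
# whitespace inside the declaration is an assumption of this sketch.
HTML5_DOCTYPE = re.compile(r"<!DOCTYPE\s+HTML\s*>", re.IGNORECASE)


def is_conforming_doctype(doctype):
    """True if the DOCTYPE would not cause a parse error under this rule."""
    return HTML5_DOCTYPE.fullmatch(doctype.strip()) is not None
```

Whether a failing DOCTYPE also triggers quirks mode is a separate decision that this sketch does not model.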
> The internal character encoding information is not passed to the
> application as content for consistency with the XML declaration, which
> is not exposed through the SAX2 ContentHandler.
Nothing special is done for this.
> The BOM is sniffed.
> The lack of character encoding information (including the BOM) is treated as a
> fatal error.
This part of the spec needs work.
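To make the quoted behaviour concrete, here is a small Python sketch of BOM sniffing with a fatal error when no encoding information is available at all. The function names, the `declared` parameter (standing in for any out-of-band or in-document encoding label), and the choice to cover only UTF-8 and UTF-16 are assumptions of the sketch, not a description of the spec or of Henri's parser.

```python
import codecs

# Byte order marks this sketch recognises (UTF-8 and UTF-16 only).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_or_fail(data, declared=None):
    """Decode bytes using the BOM, else a declared encoding, else fail."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        encoding = declared
    if encoding is None:
        # No BOM and no declared encoding: treated as fatal, as quoted above.
        raise ValueError("no character encoding information")
    return data[skip:].decode(encoding)
```

Here the BOM takes precedence over the declared encoding; that ordering is also just a choice made for the example.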
> > But I haven't thought much about this yet. The way parsing is to be
> > defined I expect to just say "parsers should do this, and if they hit
> > this they should do this, and if they hit this it's an error and they
> > should do this", with confomance checkers having to do the same but
> > reporting the errors. If that makes sense.
>
> My parser is (almost) Draconian, so I don't intend to implement the
> elaborate error recovery that is needed for browsers. (I have no
> interest in competing with John Cowan's TagSoup.)
The spec explains how to recover from parse errors, but doesn't require
conformance checkers to recover.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'