[whatwg] Low-level conformance issues
hsivonen at iki.fi
Sun Aug 14 02:50:49 PDT 2005
On Jul 30, 2005, at 00:17, Ian Hickson wrote:
> On Fri, 29 Jul 2005, Henri Sivonen wrote:
>> I would like to add HTML (both 4 and 5) support to
>> http://hsivonen.iki.fi/validator/ .
I now have an initial version of a (mostly) Draconian HTML5
parser. It does not do tag inference, yet, so the documents have to be
The parser does not attempt to convert BASE into xml:base.
.class and .java available:
The class with main() is
fi.iki.hsivonen.htmlparser.test.HtmlParserTestDriver. The test driver
requires stuff from GNU JAXP in the classpath. The parser itself does
not depend on GNU JAXP. The test driver takes files whose name ends
with ".html" as arguments and *overwrites* corresponding ".xhtml" files
with the conversion result.
And to the spec-related point: I made the following decisions while
implementing. Hopefully the document conformance requirements will
& must start an NCR or an entity reference as in XML. (Rationale: Lone
& is likely a mistake anyway.)
&apos; is not considered conforming. (Rationale: Did not exist in HTML4
and is not supported by IE)
Entity references and NCRs have to be terminated explicitly with a
semicolon. (Rationale: Implicit termination is likely a mistake unless
the person who wrote the reference is an SGML pedant. Requiring the
semicolon makes things unambiguous for sure. Also, having an explicit
delimiter helps in avoiding lookahead/pushback in the parser.)
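
The character reference rules above could be checked roughly as follows.
This is only a sketch, not code from the parser: the class name
CharRefCheck and its method are hypothetical, and entity names are not
looked up in the HTML entity table, so an unknown name such as "&foo;"
passes here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the actual parser.
final class CharRefCheck {
    // An entity reference (&name;), a decimal NCR (&#169;) or a hex NCR
    // (&#xA9;), each explicitly terminated with a semicolon.
    private static final Pattern REFERENCE = Pattern.compile(
            "&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);");

    /** True if every '&' starts a terminated reference other than &apos;. */
    static boolean isConforming(String text) {
        Matcher m = REFERENCE.matcher(text);
        int i = 0;
        while ((i = text.indexOf('&', i)) != -1) {
            // The reference has to begin exactly at this ampersand.
            m.region(i, text.length());
            if (!m.lookingAt()) {
                return false; // lone '&' or missing semicolon
            }
            if (m.group().equals("&apos;")) {
                return false; // not in HTML4, not supported by IE
            }
            i = m.end(); // continue after the semicolon, no pushback needed
        }
        return true;
    }
}
```

Because the semicolon is mandatory, the check never has to look past the
end of a reference to decide where it stops.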
Astral non-characters are not banned. (They are not banned in XML 1.0,
Unescaped < and > in attributes are allowed without warning despite
folklore that warns about this breaking unspecified legacy UAs.
Unquoted attribute values must be of the form [a-zA-Z][a-zA-Z0-9-]*,
which is slightly restrictive in a semi-arbitrary way for
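
As an illustration, the pattern for unquoted attribute values can be
applied directly as a regular expression. The class and method names
below are made up for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the actual parser.
final class UnquotedAttr {
    // The form given above: a letter followed by letters, digits or hyphens.
    private static final Pattern UNQUOTED =
            Pattern.compile("[a-zA-Z][a-zA-Z0-9-]*");

    /** True if the value may appear without quotes under this rule. */
    static boolean mayBeUnquoted(String value) {
        return UNQUOTED.matcher(value).matches();
    }
}
```

Under this rule a value such as "left" may be unquoted, while "2" or
"100%" would have to be quoted.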
The elements script and style are treated as CDATA. The string "</" may
only occur as part of the end tag. (Rationale: This approach is both
compatible with SGML and the way browsers work. Also, this avoids
PIs are banned. As are marked sections.
Doctypes with the SYSTEM id only are banned.
The internal subset is banned.
The HTML5 doctype passes silently.
The HTML 4.01 Strict and Transitional doctypes cause a warning about
the HTML5-centric nature of the parser.
Doctypes whose public id starts with "-//W3C//DTD XHTML " are banned
with a special message.
Other doctypes are treated as errors, as is the lack of a doctype.
The lack of a system id in the HTML 4.01 Transitional doctype is
treated as an error.
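
The doctype rules listed so far can be summarized as a small decision
function. This is a sketch under assumptions, not the parser's code: the
class DoctypePolicy and the Verdict enum are hypothetical, the "HTML5
doctype" is assumed to be one with neither a public nor a system id,
public ids are compared exactly, and only the most severe outcome is
returned.

```java
// Hypothetical helper for illustration; not part of the actual parser.
final class DoctypePolicy {
    enum Verdict { OK, WARNING, ERROR }

    // Public ids as given in the HTML 4.01 specification.
    static final String STRICT_PUBLIC = "-//W3C//DTD HTML 4.01//EN";
    static final String TRANSITIONAL_PUBLIC =
            "-//W3C//DTD HTML 4.01 Transitional//EN";

    /** publicId and systemId are null when absent from the doctype. */
    static Verdict classify(boolean hasDoctype, String publicId,
                            String systemId) {
        if (!hasDoctype) {
            return Verdict.ERROR; // the lack of a doctype is an error
        }
        if (publicId == null) {
            // No ids at all: assumed HTML5 doctype. SYSTEM id only: banned.
            return systemId == null ? Verdict.OK : Verdict.ERROR;
        }
        if (publicId.startsWith("-//W3C//DTD XHTML ")) {
            return Verdict.ERROR; // banned, reported with a special message
        }
        if (publicId.equals(TRANSITIONAL_PUBLIC)) {
            // Transitional without a system id is an error; otherwise
            // warn about the HTML5-centric nature of the parser.
            return systemId == null ? Verdict.ERROR : Verdict.WARNING;
        }
        if (publicId.equals(STRICT_PUBLIC)) {
            return Verdict.WARNING; // HTML5-centric warning
        }
        return Verdict.ERROR; // any other doctype
    }
}
```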
The lack of a system id in the HTML 4.01 Strict doctype causes a
warning even though the spec says "must" and gives a doctype with a
Failure to use the canonical system ids causes warnings even though the
"must" in HTML 4.01 could be interpreted as banning these.
The internal character encoding information is not passed to the
application as content for consistency with the XML declaration, which
is not exposed through the SAX2 ContentHandler.
The BOM is sniffed.
The lack of character encoding information (including the BOM) is
treated as a fatal error.
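
BOM sniffing and the fatal error for missing encoding information might
look like the sketch below. The class BomSniffer is hypothetical, only
the UTF-8 and UTF-16 BOMs are recognized, and letting the BOM take
precedence over a declared encoding is an assumption of this example,
not something stated above.

```java
// Hypothetical helper for illustration; not part of the actual parser.
final class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or null if none. */
    static String sniff(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }
        return null;
    }

    /**
     * Picks the encoding from the BOM or, failing that, from the declared
     * encoding (null if none). Assumption: the BOM wins when both exist.
     */
    static String requireEncoding(byte[] head, String declared) {
        String fromBom = sniff(head);
        if (fromBom != null) {
            return fromBom;
        }
        if (declared != null) {
            return declared;
        }
        // No BOM and no declared encoding: treated as fatal.
        throw new IllegalStateException("No character encoding information");
    }
}
```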
>> Assuming that the supported syntax for HTML4 is constrained to exclude
>> minimizations that don't work in browsers, the biggest issue with
>> decoupling the parser from the HTML version seems to be the doctype.
> Makes sense. I would recommend treating the following syntax,
> case-insensitive, as being conformant:
> doctype ::= "<!" "doctype" whitespace+ "html" whitespace* ">"
> But I haven't thought much about this yet. The way parsing is to be
> defined I expect to just say "parsers should do this, and if they hit
> this they should do this, and if they hit this it's an error and they
> should do this", with conformance checkers having to do the same but
> reporting the errors. If that makes sense.
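
For illustration, the doctype production quoted above translates almost
directly into a case-insensitive regular expression. The class name is
hypothetical, and "whitespace" is assumed here to mean space, tab, LF,
FF and CR, which the quoted text does not define.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the actual parser.
final class SimpleDoctype {
    // doctype ::= "<!" "doctype" whitespace+ "html" whitespace* ">"
    // Whitespace is assumed to be space, tab, LF, FF or CR.
    private static final Pattern DOCTYPE = Pattern.compile(
            "<!doctype[ \\t\\n\\f\\r]+html[ \\t\\n\\f\\r]*>",
            Pattern.CASE_INSENSITIVE);

    /** True if s is exactly a doctype of the quoted form. */
    static boolean matches(String s) {
        return DOCTYPE.matcher(s).matches();
    }
}
```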
My parser is (almost) Draconian, so I don't intend to implement the
elaborate error recovery that is needed for browsers. (I have no
interest in competing with John Cowan's TagSoup.)
hsivonen at iki.fi