[whatwg] Low-level conformance issues

Sun Aug 14 02:50:49 PDT 2005

On Jul 30, 2005, at 00:17, Ian Hickson wrote:

> On Fri, 29 Jul 2005, Henri Sivonen wrote:
>>
>> I would like to add HTML (both 4 and 5) support to
>> http://hsivonen.iki.fi/validator/ .
>
> Great!

I now have an initial version of a (mostly) Draconian HTML5 
parser. It does not do tag inference yet, so the documents have to be 
fully tagged.

The parser does not attempt to convert BASE into xml:base.

The .class and .java files are available at: 
http://hsivonen.iki.fi/validator-about/htmlparser.jar
The class with main() is 
fi.iki.hsivonen.htmlparser.test.HtmlParserTestDriver. The test driver 
requires stuff from GNU JAXP in the classpath. The parser itself does 
not depend on GNU JAXP. The test driver takes files whose name ends 
with ".html" as arguments and *overwrites* corresponding ".xhtml" files 
with the conversion result.

And to the spec-related point: I made the following decisions while 
implementing. Hopefully the document conformance requirements will 
agree. :-)

& must start an NCR or an entity reference, as in XML. (Rationale: A 
lone & is likely a mistake anyway.)

&apos; is not considered conforming. (Rationale: Did not exist in HTML4 
and is not supported by IE)

Entity references and NCRs have to be terminated explicitly with a 
semicolon. (Rationale: Implicit termination is likely a mistake unless 
the person who wrote the reference is an SGML pedant. Requiring the 
semicolon makes things unambiguous for sure. Also, having an explicit 
delimiter helps in avoiding lookahead/pushback in the parser.)
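
To illustrate the three reference rules above (a lone & is an error, &apos; 
is rejected, and the semicolon is mandatory), here is a minimal sketch. It 
is not the code in htmlparser.jar. The class name, the method name and the 
acceptance of both "x" and "X" in hex NCRs are assumptions made for the 
example, and it does not check that a named entity is actually defined.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceCheck {
    // Assumed shape of a reference: decimal NCR, hex NCR, or a name,
    // always terminated by an explicit semicolon.
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);");

    /** Returns the index of the first non-conforming '&', or -1 if there is none. */
    static int firstBadAmpersand(String text) {
        Matcher m = REFERENCE.matcher(text);
        int i = text.indexOf('&');
        while (i >= 0) {
            m.region(i, text.length());
            if (!m.lookingAt()) {
                return i; // lone '&' or missing semicolon
            }
            if (m.group(1).equals("apos")) {
                return i; // &apos; is treated as non-conforming
            }
            i = text.indexOf('&', m.end());
        }
        return -1;
    }
}
```

Because every reference has to end in a semicolon, the check never has to 
guess where a reference stops, which is the point about avoiding 
lookahead/pushback.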

Astral non-characters are not banned. (They are not banned in XML 1.0, 
either.)

Unescaped < and > in attributes are allowed without warning despite 
folklore that warns about this breaking unspecified legacy UAs.

Unquoted attribute values must be of the form [a-zA-Z][a-zA-Z0-9-]*, 
which is slightly restrictive in a semi-arbitrary way for 
implementation convenience.
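
As a sketch, the restriction on unquoted values amounts to a whole-string 
match against that pattern. The class and method names are made up for 
illustration, and the real parser presumably checks this character by 
character rather than with java.util.regex.

```java
import java.util.regex.Pattern;

final class UnquotedAttributeValue {
    // The pattern quoted in the text: a letter, then letters, digits or hyphens.
    private static final Pattern ALLOWED = Pattern.compile("[a-zA-Z][a-zA-Z0-9-]*");

    /** True if the value may appear without quotes under the rule above. */
    static boolean isConforming(String value) {
        return ALLOWED.matcher(value).matches();
    }
}
```

So method=post and class=col-2 would pass, while width=100 or 
type=text/css would have to be quoted.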

The script and style elements are treated as CDATA. The string "</" may 
only occur as part of the end tag. (Rationale: This approach is 
compatible with both SGML and the way browsers work. Also, this avoids 
lookahead/lookback.)
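
A rough sketch of what this rule means for script and style content: the 
first "</" ends the content, and it is an error if it is not the start of 
the matching end tag. The names are hypothetical, and matching the end tag 
name case-insensitively is my assumption for the example. This version 
works on a whole string for brevity, whereas a streaming parser would do 
the same thing one character at a time.

```java
final class CdataContent {
    /**
     * Returns the index of the "</" that ends the element content.
     * Throws if "</" occurs anywhere other than at the matching end tag.
     */
    static int endOfContent(String source, int from, String elementName) {
        int i = source.indexOf("</", from);
        if (i < 0) {
            throw new IllegalArgumentException("Unterminated " + elementName);
        }
        int nameStart = i + 2;
        int nameEnd = nameStart + elementName.length();
        boolean isEndTag = nameEnd < source.length()
                && source.regionMatches(true, nameStart, elementName, 0, elementName.length())
                && (source.charAt(nameEnd) == '>'
                        || Character.isWhitespace(source.charAt(nameEnd)));
        if (!isEndTag) {
            throw new IllegalArgumentException(
                    "\"</\" inside " + elementName + " content at index " + i);
        }
        return i;
    }
}
```

Under this rule a script containing document.write('</b>') is an error, 
and the author has to write something like '<\/b>' instead.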

PIs are banned. As are marked sections.

Doctypes with the SYSTEM id only are banned.

The internal subset is banned.

The HTML5 doctype passes silently.

The HTML 4.01 Strict and Transitional doctypes cause a warning about 
the HTML5-centric nature of the parser.

Doctypes whose public id starts with "-//W3C//DTD XHTML " are banned 
with a special message.

Other doctypes are treated as errors, as is the lack of a doctype.

The lack of a system id in the HTML 4.01 Transitional doctype is 
treated as an error.

The lack of a system id in the HTML 4.01 Strict doctype causes a 
warning even though the spec says "must" and gives a doctype with a 
system id.

Failure to use the canonical system ids causes warnings even though the 
"must" in HTML 4.01 could be interpreted as banning these.

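The doctype decisions above boil down to a small decision table. Here is a 
sketch of it, assuming the doctype has already been split into an optional 
public id and an optional system id. The class, enum and method names are 
hypothetical, and the messages attached to each outcome are left out. 
Since the HTML 4.01 doctypes always trigger at least the "HTML5-centric" 
warning, a missing or non-canonical system id only changes the outcome for 
Transitional, where a missing system id is an error.

```java
final class DoctypeCheck {
    enum Verdict { OK, WARNING, ERROR }

    private static final String HTML401_STRICT = "-//W3C//DTD HTML 4.01//EN";
    private static final String HTML401_TRANSITIONAL =
            "-//W3C//DTD HTML 4.01 Transitional//EN";

    /** publicId and systemId are null when absent from the doctype. */
    static Verdict classify(boolean hasDoctype, String publicId, String systemId) {
        if (!hasDoctype) {
            return Verdict.ERROR;   // lack of a doctype
        }
        if (publicId == null) {
            // Bare HTML5 doctype passes; SYSTEM-only doctypes are banned.
            return systemId == null ? Verdict.OK : Verdict.ERROR;
        }
        if (publicId.startsWith("-//W3C//DTD XHTML ")) {
            return Verdict.ERROR;   // banned, with a special message
        }
        if (HTML401_STRICT.equals(publicId)) {
            return Verdict.WARNING; // with or without (canonical) system id
        }
        if (HTML401_TRANSITIONAL.equals(publicId)) {
            return systemId == null ? Verdict.ERROR : Verdict.WARNING;
        }
        return Verdict.ERROR;       // any other doctype
    }
}
```
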
The internal character encoding information is not passed to the 
application as content for consistency with the XML declaration, which 
is not exposed through the SAX2 ContentHandler.

The BOM is sniffed.

The lack of character encoding information (including the BOM) is 
treated as a fatal error.
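
For reference, BOM sniffing can be as small as the following sketch. Which 
BOMs the parser actually recognizes is not stated above, so the three 
handled here (UTF-8, UTF-16BE, UTF-16LE) are an assumption, as are the 
class and method names. A null result means "no BOM", in which case the 
caller would look for other encoding information and, failing that, report 
the fatal error.

```java
final class BomSniffer {
    /** Returns the encoding indicated by a leading BOM, or null if there is none. */
    static String sniff(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFE
                && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```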

>> Assuming that the supported syntax for HTML4 is constrained to exclude
>> minimizations that don't work in browsers, the biggest issue with
>> decoupling the parser from the HTML version seems to be the doctype.
>
> Makes sense. I would recommend treating the following syntax,
> case-insensitive, as being conformant:
>
>    doctype ::= "<!" "doctype" whitespace+ "html" whitespace* ">"

Thanks.
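
For what it's worth, the quoted production translates almost directly into 
a case-insensitive regular expression. This is only a sketch: the grammar 
does not say what counts as "whitespace", so the set used here (space, 
tab, LF, FF, CR) is my assumption, and the names are made up.

```java
import java.util.regex.Pattern;

final class Html5Doctype {
    // "<!" "doctype" whitespace+ "html" whitespace* ">", case-insensitive.
    private static final Pattern DOCTYPE = Pattern.compile(
            "<!doctype[ \\t\\n\\f\\r]+html[ \\t\\n\\f\\r]*>",
            Pattern.CASE_INSENSITIVE);

    /** True if the whole string is a doctype of the quoted form. */
    static boolean matches(String doctype) {
        return DOCTYPE.matcher(doctype).matches();
    }
}
```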

> But I haven't thought much about this yet. The way parsing is to be
> defined I expect to just say "parsers should do this, and if they hit
> this they should do this, and if they hit this it's an error and they
> should do this", with conformance checkers having to do the same but
> reporting the errors. If that makes sense.

My parser is (almost) Draconian, so I don't intend to implement the 
elaborate error recovery that is needed for browsers. (I have no 
interest in competing with John Cowan's TagSoup.)

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/