[whatwg] HTML5 Parsing spec first draft ready
hsivonen at iki.fi
Sun Feb 19 12:38:43 PST 2006
On Feb 16, 2006, at 00:56, Dan Brickley wrote:
> Discussing some related work (GRDDL) in the W3C SemWeb CG, I was
> wondering whether there is any way your parser spec could be
> specified as input for a GRDDL transform. GRDDL provides techniques
> transforming XML-based languages (including XHTML) into an RDF
> representation; typically by reference to an XSLT. If the WHATWG
> parser spec defined itself in terms of some XML-shaped output, the two
> should chain nicely together. Have you considered defining the parser
> behaviour in terms of XML concepts?
HTML5 parsing for browsers that support scripting needs to be defined
in such a way that a legacy-compatible HTML DOM is produced. However,
there are apps other than browsers (e.g. CMSs, conformance checkers
and search engines) that will, in my opinion, be better off if they
don't run their code against the HTML DOM but instead convert HTML
documents into equivalent XHTML documents as early as possible and
then work with XHTML internally. I guess whatever apps use GRDDL or
XSLT are likely to be in the class of apps that are better off
working with XHTML internally.
(In the conversion from HTML to XHTML, the XHTML serialization can be
optimized away and does not have to exist in memory at any stage.
With HTML 4.01 and Java, TagSoup would be appropriate for the job.)
To this end, I think it would be beneficial if for every conforming
HTML5 document there were an unambiguous equivalent representation in
canonicalized (per XML C14N) XHTML. I have not reviewed the spec
lately to see if this is already the case, but I expect it to be.
(Obviously, this cannot be the case for non-conforming documents,
since the output DOM of the parsing algorithm can contain e.g.
attribute names that are forbidden in XML 1.0.)
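To illustrate the kind of attribute name that cannot be carried over, here is a minimal sketch of a name check. The function name is hypothetical, and the pattern is a deliberately conservative ASCII-only approximation of the XML 1.0 Name production: it rejects some non-ASCII names that XML 1.0 actually allows, so it should not be read as the real rule.

```python
import re

# ASCII-only approximation of the XML 1.0 "Name" production (an assumption
# made for brevity): a letter, underscore or colon, followed by letters,
# digits, underscores, colons, periods or hyphens.
_ASCII_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*\Z")


def is_ascii_xml_name(name):
    """Return True if `name` is acceptable under the simplified pattern."""
    return _ASCII_XML_NAME.match(name) is not None
```

A tag-soup attribute name such as `"title` (with a stray quote) or `a=b` fails this check, which is why such a DOM has no direct XHTML equivalent.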
Off the top of my head, the changes from the HTML parsing output
involve (besides lowercasing names and putting elements in the XHTML
1.x namespace) getting rid of the meta element conveying character
encoding information, mapping the lang attribute to xml:lang, copying
the name of each boolean attribute into its value and perhaps dealing
with some issues with line breaks in attribute values.
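The mapping listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anything the spec defines: the input is a hypothetical (name, attrs, children) tuple tree standing in for the parser output, a boolean attribute is represented by a value of None, the encoding declaration is assumed to be a meta element with http-equiv="Content-Type", and line breaks in attribute values are not handled at all.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def is_charset_meta(node):
    """True for a meta element declaring the encoding via http-equiv (assumed form)."""
    name, attrs, _children = node
    if name.lower() != "meta":
        return False
    lowered = {key.lower(): value for key, value in attrs.items()}
    return (lowered.get("http-equiv") or "").lower() == "content-type"


def to_xhtml(node):
    """Convert a hypothetical (name, attrs, children) node into an XHTML Element.

    Children are either nested node tuples or plain text strings.
    """
    name, attrs, children = node
    # Lowercase the element name and put it in the XHTML namespace.
    elem = ET.Element("{%s}%s" % (XHTML_NS, name.lower()))
    for key, value in attrs.items():
        key = key.lower()
        if key == "lang":
            elem.set(XML_LANG, value)   # lang -> xml:lang
        elif value is None:
            elem.set(key, key)          # boolean attribute: name becomes value
        else:
            elem.set(key, value)
    last = None
    for child in children:
        if isinstance(child, str):
            # ElementTree keeps text on the parent (.text) or on the
            # preceding sibling (.tail).
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        elif not is_charset_meta(child):
            # Drop the encoding-declaring meta element, convert the rest.
            last = to_xhtml(child)
            elem.append(last)
    return elem
```

The resulting tree can be handed to an XSLT processor or serialized with ET.tostring(); as noted earlier, the serialization step is optional if the consumer works on the tree directly.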
Whether the spec needs to say any of this is another matter
altogether. For interop, speccing what browsers need to do is the
most important task.