[whatwg] Thesis draft about HTML5 conformance checking
hsivonen at iki.fi
Wed Mar 28 05:24:09 PDT 2007
On Mar 12, 2007, at 05:27, Olivier Thereaux wrote:
> On Mar 11, 2007, at 02:15 , Henri Sivonen wrote:
>> The draft of my master's thesis is available for commenting at:
> Henri, congratulations on your work on the HTML conformance checker
> and on the Thesis.
> It's been a truly informative and enlightening reading, especially
> the parts where you develop on the (im)possibility of using only
> schemas to describe conformance to the html5 specs. This is a
> question that has been bothering me for a long time, especially as
> there is only one (as of today) production-ready conformance
> checking tool not based on some kind (or combination) of schema-
> based parsers,
I take it that you mean the Feed Validator?
>> [2.3.2] I share the view of the Web that holds WebKit, Presto,
>> Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox
>> and IE, respectively) to be the most important browser engines.
> Did you have a chance to look at engines in authoring tools?
I didn't investigate them beyond mentioning three authoring tools
that have a RELAX NG-driven auto-completion feature.
> What type of parser do NVU, Amaya, golive etc work on?
For authoring tools, the key thing is that their serializers work
with browser parsers. The details of how authoring tools recover
from bad markup are not as crucial as recovery in browsers, because
with authoring tools the author has a chance to review the recovery
result.
> How about parsing engines for search engine robots? These are
> probably as important, if not more as some of the browser engines
> in defining the "generic" engine for the web today.
Search engines are secretive about what they do, but I would assume
that they'd want to be compatible with browsers in order to fight SEO
>> [4.1] The W3C Validator sticks strictly to the SGML validity
>> formalism. It is often argued that it would be inappropriate for a
>> program to be called a “validator” unless it checks exactly for
>> validity in the SGML sense of the word – nothing more, nothing less.
> That's very true, there's a strong reluctance from part of the
> validator tool user community to do anything else than formal
> validation, mostly (?) out of fear that it would eventually make
> the term of "validation" meaningless. The only thing the validator
> does beyond DTD validation are the preparse checks on encoding,
> presence of doctype, media type etc.
ISO and the W3C have already expanded the notion of validation to
cover schema languages other than DTDs. In colloquial usage
"validation" is already understood to mean checking in general. The
notion of a "schema" could be detached from a schema language to be
an abstract partitioning of the set of possible XML documents into
two disjoint sets: valid and invalid. Calling the process of deciding
which set a given document instance belongs to "validation" would
give a formal definition that matched the colloquial usage.
I do sympathize with Hixie's reluctance to call "HTML5 conformance
checking" "HTML5 validation", though. Calling it "conformance
checking" makes sure that others don't have a claim on defining what
it means. Fighting the colloquial usage will probably be futile,
though, outside spec lawyerism.
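To make the abstract notion of a schema concrete, here is a minimal Java sketch. It is not code from the conformance checker; the class name and both rules are hypothetical, chosen only for illustration. A "schema" is modeled as any predicate over parsed documents, and "validation" is simply deciding which side of the partition a document falls on, regardless of whether the predicate comes from a grammar or from hand-written code.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the conformance checker.
final class AbstractSchemaDemo {

    // Illustrative rule of the kind a grammar can express: the root element is "html".
    static final Predicate<Document> ROOT_IS_HTML =
            doc -> "html".equals(doc.getDocumentElement().getTagName());

    // Illustrative rule written as ordinary code: no two elements share an id value.
    static final Predicate<Document> IDS_ARE_UNIQUE = doc -> {
        Set<String> seen = new HashSet<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (el.hasAttribute("id") && !seen.add(el.getAttribute("id"))) {
                return false;
            }
        }
        return true;
    };

    // The abstract schema is the conjunction of the rules: one partition, two sets.
    static final Predicate<Document> SCHEMA = ROOT_IS_HTML.and(IDS_ARE_UNIQUE);

    /** "Validation": decide which of the two sets a document instance belongs to. */
    static boolean validate(Predicate<Document> schema, String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return schema.test(doc);
        } catch (Exception e) {
            // Input that is not well-formed XML is outside the set being partitioned.
            throw new IllegalArgumentException("not a well-formed XML document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(validate(SCHEMA, "<html><p id='a'/><p id='b'/></html>"));
        System.out.println(validate(SCHEMA, "<html><p id='a'/><p id='a'/></html>"));
    }
}
```

Under this view the caller cannot tell, and does not need to know, which rules came from a schema language and which did not.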
>> [6.1.3] Erroneous Source Is Not Shown
>> The error messages do not show the erroneous markup. For this
>> reason it is unnecessarily hard for the user to see where the
>> problem is.
> Was this by lack of time?
Yes. Showing the source code based on the SAX-reported line and
column numbers is useful but it isn't novel enough or central enough
to proving the feasibility of the chosen implementation approach for
it to delay the publication of the thesis.
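For illustration only, a rough sketch of the idea in Java follows. This is not the checker's code; the class and method names are made up, and it assumes the whole source is available as a string and that the parser's line count agrees with a simple split on line breaks. The handler looks up the line named by each SAXParseException and attaches it to the message.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the conformance checker.
final class SourceExcerptDemo {

    /** Records each SAX-reported problem together with the source line it points at. */
    static final class ExcerptHandler extends DefaultHandler {
        private final String[] lines;
        final List<String> reports = new ArrayList<>();

        ExcerptHandler(String source) {
            // Limit -1 keeps trailing empty lines so indexes match the line count.
            this.lines = source.split("\r\n|\r|\n", -1);
        }

        @Override public void warning(SAXParseException e) { record("warning", e); }
        @Override public void error(SAXParseException e) { record("error", e); }

        @Override public void fatalError(SAXParseException e) throws SAXException {
            record("fatal", e);
            throw e; // stop parsing after a fatal error
        }

        private void record(String level, SAXParseException e) {
            int line = e.getLineNumber();     // 1-based, or -1 if unavailable
            int column = e.getColumnNumber(); // 1-based, or -1 if unavailable
            String excerpt = (line >= 1 && line <= lines.length)
                    ? lines[line - 1] : "(source line unavailable)";
            reports.add(level + " at " + line + ":" + column + ": "
                    + e.getMessage() + "\n    " + excerpt);
        }
    }

    /** Parses the source as XML and returns one report per problem found. */
    static List<String> check(String source) {
        ExcerptHandler handler = new ExcerptHandler(source);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(source)), handler);
        } catch (SAXException e) {
            // Parse errors were already recorded by the handler.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.reports;
    }

    public static void main(String[] args) {
        check("<a>\n<b></c>\n</a>").forEach(System.out::println);
    }
}
```

A real implementation would also want to truncate long lines around the reported column and deal with the fact that SAX locations are only approximate.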
Observing the thesis projects of my friends who started before me has
taught me that it is a mistake to promise a complete software product
as a precondition for the completion of the thesis. Software always
has one more bug to fix or one more feature to add. On the other
hand, as far as the academic requirements go, one could even write a
thesis explaining why a project failed.
> Did you have a look at existing implementations?
On this particular point, not yet.
> Oh I see [ 8.10 Showing the Erroneous Source Markup] as future
> work. If you're looking for a decent, though by no means perfect,
> implementation, look for sub truncate_line in
Thanks. I'll keep this in mind.
>> [8.1] Even though the software developed in this project is Free
>> Software / Open Source, it has not been developed in a way that
>> would make it easily approachable to potential contributors.
>> Perhaps the most pressing need for change in order to move the
>> software forward after the completion of this thesis is moving the
>> software to a public version control system and making building
>> and deploying the software easy.
> Making it available on a more open-sourcey system, with a multi-
> user revision system will probably not create an explosion of code
> contributors (you've had very helpful contributions from e.g. Elika,
> and most OS projects, even successful ones, never have more than a
> handful of coders), but you may be able to create a healthy
> community of users, reviewers, bug spotters, translators, document
> editors, beyond the whatwg community.
I am not expecting an explosion of contributors. However, I have
reason to believe that my current arrangement has caused at least one
potential contributor to walk away. I'd rather avoid turning people
Also, in the future, I'd like to make it super-easy for CMS
developers to integrate the conformance checker back end into their
products. To enable this, the barrier to getting a runnable copy
should be low.
I'm very pessimistic about translations. Even the online markup
checkers whose authors have borne the burden of making the messages
translatable aren't getting numerous translation contributions.
> If you're interested in using w3c logistics, and benefit from the
> existing communities around w3c, I'm happy to invite you.
Thank you. I'll keep your offer in mind when it is time to figure out
where to put the source.
>> [8.8] To support the use of the conformance checker back end from
>> other applications (non-Java applications in particular), a Web
>> service would be useful.
> Indeed. Did you have a chance to look at EARL?
I did. I also had a look at the SOAP and Unicorn outputs of the W3C
Validator. I like EARL the least of the three, because its
assumptions about the nature of the checker software do not work well
with implementations that have a grammar-based schema inside.
Grammar-based implementations cannot cite an exact conformance
criterion when a derivation in the grammar fails, as demonstrated by
the EARL output of the W3C Validator. The SOAP and Unicorn formats,
even if crufty for my taste, match the SAX ErrorHandler interface
better.
I think I saw Relaxed having its own SAX ErrorHandler-friendly
format, but now I can't find it.
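To show what "ErrorHandler-friendly" means here, a small Java sketch follows. The output vocabulary (the messages element and the per-severity element names) is invented for this example and is not the format of the W3C Validator, Unicorn, Relaxed or my checker. The point is that a SAX ErrorHandler receives only a severity, a message and a location, so a flat list of such items can be produced directly, with no field that would require citing a conformance criterion.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; the output format is made up for illustration.
final class FlatReportDemo {

    /** Exactly the data the ErrorHandler callbacks provide: severity, text, location. */
    static final class Message {
        final String severity;
        final String text;
        final int line;
        final int column;

        Message(String severity, SAXParseException e) {
            this.severity = severity;
            this.text = e.getMessage();
            this.line = e.getLineNumber();
            this.column = e.getColumnNumber();
        }
    }

    /** Collects messages in the order reported and serializes them as a flat list. */
    static final class CollectingHandler implements ErrorHandler {
        final List<Message> messages = new ArrayList<>();

        @Override public void warning(SAXParseException e) { messages.add(new Message("warning", e)); }
        @Override public void error(SAXParseException e) { messages.add(new Message("error", e)); }
        @Override public void fatalError(SAXParseException e) { messages.add(new Message("fatal", e)); }

        String toXml() {
            StringBuilder out = new StringBuilder("<messages>\n");
            for (Message m : messages) {
                out.append("  <").append(m.severity)
                   .append(" line=\"").append(m.line)
                   .append("\" column=\"").append(m.column).append("\">")
                   .append(escape(m.text))
                   .append("</").append(m.severity).append(">\n");
            }
            return out.append("</messages>\n").toString();
        }
    }

    /** Escapes the characters that are significant in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        CollectingHandler handler = new CollectingHandler();
        handler.error(new SAXParseException("Stray end tag </c>", null, "doc.html", 2, 8));
        System.out.print(handler.toXml());
    }
}
```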
> I wrote some basic notes at http://lists.w3.org/Archives/Public/www-
Thanks. My notes are at
http://lists.w3.org/Archives/Public/www-validator/2006Dec/0060.html
and http://wiki.whatwg.org/wiki/
Thank you for your comments.
hsivonen at iki.fi