[whatwg] Thesis draft about HTML5 conformance checking
ot at w3.org
Sun Mar 11 20:27:12 PDT 2007
On Mar 11, 2007, at 02:15 , Henri Sivonen wrote:
> The draft of my master's thesis is available for commenting at:
Henri, congratulations on your work on the HTML conformance checker
and on the thesis. It has been a truly informative and enlightening
read, especially the parts where you elaborate on the (im)possibility
of using only schemas to describe conformance to the HTML5 spec. This
question has been bothering me for a long time, especially as there is
only one production-ready conformance checking tool (as of today) not
based on some kind (or combination) of schema-based parsers. Although,
as is often pointed out, no browser uses a DTD-based parser in its
engine today, I still think producing a schema representation of (most
of) the conformance criteria helps adoption and implementation.
Some comments, based on a first read-through of the thesis, are below.
I'm cross-posting them to the www-validator list at W3C, as I think
your thesis will be of interest to a number of subscribers of that list.
For www-validator, Henri's announcement and rfc -
> [2.3.2] I share the view of the Web that holds WebKit, Presto,
> Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox
> and IE, respectively) to be the most important browser engines.
Did you have a chance to look at the engines in authoring tools? What
type of parser do Nvu, Amaya, GoLive etc. use?
How about the parsing engines of search engine robots? These are
probably as important as some of the browser engines, if not more so,
in defining the "generic" engine for the web today.
> [4.1] The W3C Validator sticks strictly to the SGML validity
> formalism. It is often argued that it would be inappropriate for a
> program to be called a “validator” unless it checks exactly for
> validity in the SGML sense of the word – nothing more, nothing less.
That's very true. There's a strong reluctance on the part of the
validator's user community to do anything other than formal
validation, mostly (?) out of fear that it would eventually make the
term "validation" meaningless. The only things the validator does
beyond DTD validation are the preparse checks on encoding, presence
of doctype, media type etc.
I think this will change over time. In fact it's already changing, as
the innards of the validator have moved to SAX-based parsing. That is
going to be an opportunity to add data type checking and move closer
to a conformance checker than a validator. Work at W3C on Unicorn and
little modules such as the Appendix C checker for XHTML 1.0 also
goes in that direction.
> [6.1.3] Erroneous Source Is Not Shown
> The error messages do not show the erroneous markup. For this
> reason it is unnecessarily hard for the user to see where the
> problem is.
Was this for lack of time? Did you have a look at existing
implementations? Oh, I see [8.10 Showing the Erroneous Source Markup]
listed as future work. If you're looking for a decent, though by no
means perfect, implementation, look for sub truncate_line in
(this is to be modularized out of the check script and into a CPAN
module sooner or later, see )
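
To illustrate what showing the erroneous source could involve, here is a
minimal Python sketch. It is not the truncate_line routine mentioned
above, and the function name, parameters and "..." convention are all
assumptions made for this example. It cuts a window out of a long source
line around the reported column and builds a caret line under it.

```python
def excerpt(line, column, width=60):
    """Return (snippet, marker) for a source line and a 0-based error column.

    Hypothetical helper for illustration: `snippet` is a window of at most
    `width` characters around `column`, with "..." added where the line was
    cut; `marker` is a line of spaces ending in "^" under the error column.
    """
    column = max(0, min(column, len(line)))   # clamp the column to the line
    start = max(0, column - width // 2)       # centre the window on the error
    end = min(len(line), start + width)
    start = max(0, end - width)               # shift left near the line end
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(line) else ""
    snippet = prefix + line[start:end] + suffix
    marker = " " * (len(prefix) + column - start) + "^"
    return snippet, marker


if __name__ == "__main__":
    snippet, marker = excerpt("<p><b>bold</p>", 3)
    print(snippet)
    print(marker)
```

Printing the two returned strings one under the other gives the user the
offending markup with a pointer to the reported position.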
> [6.2] Instead of modifying the libraries themselves, an alternative
> approach to localization would be reverse templating. The English
> messages would be matched against known patterns that would allow
> the variable parts to be extracted. The variable parts could then
> be plugged into a translated message corresponding to the matched
This is something I have been looking at, and I came to the same
conclusion. I'm hoping to be able to reuse, in one way or another,
the existing localization of some of the libraries being used (e.g.
OpenSP, with all its issues, has a very impressive localization record).
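
A minimal Python sketch of the reverse templating idea follows. The
English patterns, the German templates and the function name are all
made up for illustration; none of them is taken from the libraries
discussed here. Each English message is matched against a known
pattern, the variable parts are extracted as named groups, and they are
plugged into the translated template.

```python
import re

# Hypothetical (pattern, translated template) pairs. The messages and the
# translations are invented for this example, not copied from any library.
PATTERNS = [
    (re.compile(r'^Element "(?P<name>[^"]+)" not allowed here\.$'),
     'Element "{name}" ist hier nicht erlaubt.'),
    (re.compile(r'^Attribute "(?P<attr>[^"]+)" not allowed on '
                r'element "(?P<name>[^"]+)"\.$'),
     'Attribut "{attr}" ist am Element "{name}" nicht erlaubt.'),
]


def translate(message):
    """Return the translated message, or the original if nothing matches."""
    for pattern, template in PATTERNS:
        match = pattern.match(message)
        if match:
            # Plug the extracted variable parts into the translated template.
            return template.format(**match.groupdict())
    return message  # fall back to the untranslated English message


if __name__ == "__main__":
    print(translate('Element "blink" not allowed here.'))
```

Falling back to the original message means an unrecognized or reworded
English message degrades to untranslated output rather than an error.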
> [8.1] Even though the software developed in this project is Free
> Software / Open Source, it has not been developed in a way that
> would make it easily approachable to potential contributors.
> Perhaps the most pressing need for change in order to move the
> software forward after the completion of this thesis is moving the
> software to a public version control system and making building and
> deploying the software easy.
Making it available on a more open-sourcey system, with a multi-user
revision control system, will probably not create an explosion of code
contributors (you've had very helpful contributions from e.g. Elika,
and most OS projects, even successful ones, never have more than a
handful of coders), but you may be able to create a healthy community
of users, reviewers, bug spotters, translators and document editors
beyond the whatwg community.
If you're interested in using W3C logistics, and benefiting from the
existing communities around W3C, I'm happy to invite you. SourceForge
would be another excellent choice, just with different tools and a
different community of users.
> [8.8] To support the use of the conformance checker back end from
> other applications (non-Java applications in particular), a Web
> service would be useful.
Indeed. Did you have a chance to look at EARL?
I wrote some basic notes at http://lists.w3.org/Archives/Public/www-
and the EARL WG staff contact helped me answer some questions, and
reaffirmed that validators/conformance checkers were one of their main
Hope these initial thoughts/comments can be useful.
Thanks again for your interesting work.