[whatwg] Writing authoring tools and validators for custom microdata vocabularies

Tue May 19 18:36:33 PDT 2009

One of the use cases I collected from the e-mails sent in over the past 
few months was the following:

   USE CASE: It should be possible to write generalized validators and
   authoring tools for the annotations described in the previous use case.

   SCENARIOS:
     * Mary would like to write a generalized software tool to help page
       authors express micro-data. One of the features that she would like to
       include is one that displays authoring information, such as vocabulary
       term description, type information, range information, and other
       vocabulary term attributes in-line so that authors have a better
       understanding of the vocabularies that they're using.
     * John would like to ensure that his indexing software only stores
       type-valid data. Part of the mechanism that he uses to check the
       incoming micro-data stream is type information that is embedded in the
       vocabularies that he uses.
     * Steve would like to provide warnings to the authors that use his
       vocabulary that certain vocabulary terms are experimental and may
       never become stable.

   REQUIREMENTS:
     * There should be a definitive location for vocabularies.
     * It should be possible for vocabularies to describe other vocabularies.
     * Originating vocabulary documents should be discoverable.
     * Machine-readable vocabulary information shouldn't be on a separate
       page than the human-readable explanation.
     * There must not be restrictions on the possible ways vocabularies can
       be expressed (e.g. the way DTDs restricted possible grammars in SGML).
     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.

I couldn't find a good solution to this problem.

The obvious solution is to use a schema language, such as RDFS or OWL. 
Indeed, that's probably the only solution that I can recommend. However, 
as we discovered with HTML5, schema languages aren't expressive enough. I 
wouldn't be surprised to find that no existing schema language could 
accurately describe the complete set of requirements that apply to the 
vCard, vEvent, and BibTeX vocabularies (though I haven't checked whether 
this is the case). 

For any widely used vocabulary, I think the best solution will be 
hard-coded constraints and context-sensitive help systems, as we have for 
HTML5 validators and HTML editors.
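
As a rough illustration of what "hard-coded constraints" could look like 
in a tool, here is a minimal Python sketch. Everything in it is an 
assumption made for the example: the CONSTRAINTS table, the property 
names, the "x-mood" experimental term, and the rules themselves are 
hypothetical and are not the real vEvent requirements. It also covers 
the "experimental term" warning from Steve's scenario.

```python
import re

# Hypothetical, hand-written rules for a vEvent-like vocabulary.
# The names and checks are illustrative only, not the real vEvent rules.
ISO_DATE = re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$")

CONSTRAINTS = {
    "summary": {
        "required": True,
        "experimental": False,
        "check": lambda v: bool(v.strip()),
    },
    "dtstart": {
        "required": True,
        "experimental": False,
        "check": lambda v: ISO_DATE.match(v) is not None,
    },
    "x-mood": {  # made-up experimental term
        "required": False,
        "experimental": True,
        "check": lambda v: True,
    },
}


def validate(item):
    """Check an item given as {property name: [string values]}.

    Returns a list of (level, message) tuples; an empty list means
    no problems were found under the hard-coded rules above.
    """
    messages = []
    # Required properties that are absent.
    for name, rule in CONSTRAINTS.items():
        if rule["required"] and name not in item:
            messages.append(("error", f"missing required property '{name}'"))
    # Properties that are present.
    for name, values in item.items():
        rule = CONSTRAINTS.get(name)
        if rule is None:
            messages.append(("warning", f"unknown property '{name}'"))
            continue
        if rule["experimental"]:
            messages.append(("warning", f"property '{name}' is experimental"))
        for value in values:
            if not rule["check"](value):
                messages.append(
                    ("error", f"invalid value {value!r} for '{name}'")
                )
    return messages
```

The same table could also carry a description string per term, which 
an authoring tool might show as in-line help.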

For other vocabularies, I recommend using RDFS and OWL, and having the 
tools support microdata as a serialisation of RDF. Microdata itself could 
probably be used to express the constraints, though possibly not directly 
in RDFS and OWL if these use features that microdata doesn't currently 
expose (like typed properties).
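
To make the "microdata as a serialisation of RDF" idea concrete, here 
is a small Python sketch. It is not the actual microdata-to-RDF mapping: 
the VOCAB base URI, the item_to_triples() helper, and the "pages" 
property are hypothetical. Because microdata values are untyped strings, 
the sketch checks only the lexical form of each value against the 
datatype named by rdfs:range. Note that RDFS formally defines rdfs:range 
as an inference rule; treating it as a validation constraint is a choice 
made by the tool here.

```python
import re

RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
VOCAB = "http://example.org/vocab#"  # hypothetical vocabulary base URI


def item_to_triples(subject, item, vocab=VOCAB):
    """Turn an item {property name: [string values]} into triples.

    Assumes (for this sketch only) that a property name maps to a
    predicate URI by appending it to the vocabulary base URI.
    """
    return [
        (subject, vocab + name, value)
        for name, values in item.items()
        for value in values
    ]


# A tiny schema, itself expressed as triples.
SCHEMA = [
    (VOCAB + "pages", RDFS_RANGE, XSD_INTEGER),
]

# Lexical-form checks for the datatypes this sketch knows about.
LEXICAL_CHECKS = {
    XSD_INTEGER: lambda s: re.fullmatch(r"[+-]?[0-9]+", s) is not None,
}


def range_violations(triples, schema):
    """Return the triples whose value fails its declared range check."""
    ranges = {s: o for (s, p, o) in schema if p == RDFS_RANGE}
    bad = []
    for (s, p, o) in triples:
        check = LEXICAL_CHECKS.get(ranges.get(p))
        if check is not None and not check(o):
            bad.append((s, p, o))
    return bad
```

An indexer like the one in John's scenario could drop or flag whatever 
range_violations() returns before storing the data.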

Regarding some of the requirements, I actually disagree that they are 
desirable. For example, having a definitive location for vocabularies has 
been shown to be a bad idea for scalability, with the W3C experiencing 
huge download volume for certain schemas. Similarly, I don't think that 
the "turtles all the way down" approach of describing vocabularies using 
the same syntax that the vocabularies themselves are written in 
(self-hosted schemas) is necessary or, frankly, particularly useful to 
the end-user (though it may have nice theoretical properties).

In conclusion: I recommend using an existing RDF-based schema language in 
conjunction with the mapping of microdata to RDF. Implementation 
experience with how this actually works in practice in end-user scenarios 
would be very useful in determining if something more is needed here.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'