[whatwg] Writing authoring tools and validators for custom microdata vocabularies

Wed May 20 00:27:10 PDT 2009

On May 20, 2009, at 04:36, Ian Hickson wrote:

>   REQUIREMENTS:
>     * There should be a definitive location for vocabularies.

If this means that vocabulary schemas should live in a predetermined URI  
subspace, I'm inclined to disagree with this requirement, because
  1) for non-predefined vocabularies it would leave vocabulary  
definition as decentralized but would make schemas centralized, which  
doesn't make sense
  2) for predefined vocabularies it would create a single point of  
failure by elevating a given dereferenceable URI to a special status.

>     * It should be possible for vocabularies to describe other
>       vocabularies.

I disagree with this requirement. Being able to define a schema  
language in microdata is sufficiently different from other microdata  
use cases that addressing this requirement could have adverse  
complicating effects on other use cases. Furthermore, it is completely  
unclear why schemas would need to be embedded in HTML pages.

>     * Originating vocabulary documents should be discoverable.

Does this mean something like xsi:schemaLocation? I thought that the  
RELAX NG community had shown this to be an anti-pattern in all cases  
except those analogous to the Emacs modeline (i.e. giving a generic  
XML editor a path to a *local* schema file in order to choose  
autocompletion rules on a per-document basis). See  
http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html

>     * Machine-readable vocabulary information shouldn't be on a separate
>       page than the human-readable explanation.

Why is this a requirement? It seems like a radical departure from the  
practice of having DTD / XSD / RELAX NG schemas in addition to spec  
prose in HTML or PDF.

>     * There must not be restrictions on the possible ways vocabularies can
>       be expressed (e.g. the way DTDs restricted possible grammars in SGML).

This seems to preclude any generic schema language as the One True  
schema language.

> For other vocabularies, I recommend using RDFS and OWL, and having the
> tools support microdata as a serialisation of RDF.

I'm inclined to think this recommendation may not be the best one.

It seems that RDFS or OWL are obviously applicable to the result of  
microdata to RDF conversion. However, RDFS and OWL are designed for  
the RDF model, which is more general than the microdata model. Since  
the microdata model is an array of trees (which may be considered one  
big tree with the root being of a different type than the other  
nodes), it would make sense--at a high level--to apply the same  
techniques one would apply with XML trees: tree automata (like RELAX  
NG), assertions on trees (like Schematron) and custom code operating  
on trees.
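
To make the "array of trees" view concrete, here is a minimal Python sketch of
the third option (custom code operating on trees). The Item class, the
check_required helper and the property names are all hypothetical and chosen
only for illustration; nothing here is taken from the microdata draft itself.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    # One microdata item: a tree node whose children are (name, value) pairs.
    # A value is either a string or a nested Item (hypothetical model).
    properties: list = field(default_factory=list)


def check_required(item, required, path):
    """Schematron-style assertion: each name in `required` must be present."""
    present = {name for name, _ in item.properties}
    return [f"{path}: missing property '{name}'"
            for name in sorted(required - present)]


# The document-level model: an array (list) of top-level item trees.
items = [
    Item([("name", "Example"), ("address", Item([("locality", "Exampleville")]))]),
    Item([("address", Item())]),
]

errors = []
for i, item in enumerate(items):
    errors += check_required(item, {"name"}, f"items[{i}]")

print(errors)
```

The check only inspects the top-level items; a real validator would presumably
recurse into nested items as well.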

While it would be possible to make new schema languages for microdata  
applying the ideas from RELAX NG and Schematron, it would be easier to  
use off-the-shelf RELAX NG and Schematron tools and to map microdata  
to an XML infoset for validation. However, in order to usefully apply  
RELAX NG or Schematron to a microdata-based infoset, the infoset  
conversion should turn property names into element names. Since XML  
places arbitrary limitations on element names (and element content),  
this mapping would have exactly the same complications as mapping  
microdata to RDF/XML.
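
As a rough sketch of that mapping and of where it breaks down, the following
Python turns property names into element names using the standard library's
ElementTree. The item_to_element helper, the (name, value) list representation
and the sample property names are assumptions for illustration, and the NCNAME
pattern is a deliberately simplified, ASCII-only approximation of what XML
allows in an element name.

```python
import re
import xml.etree.ElementTree as ET

# Simplified ASCII-only approximation of an XML NCName (assumption: the real
# production permits many more characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def item_to_element(props, tag="item"):
    """Map a list of (property name, value) pairs to an element tree.

    A value is either a string or a nested list of pairs (a nested item).
    """
    elem = ET.Element(tag)
    for name, value in props:
        # ElementTree does not check names itself, so check explicitly.
        if not NCNAME.match(name):
            raise ValueError(f"{name!r} is not usable as an XML element name")
        if isinstance(value, list):
            elem.append(item_to_element(value, tag=name))
        else:
            ET.SubElement(elem, name).text = value
    return elem


tree = item_to_element(
    [("name", "Example"), ("address", [("locality", "Exampleville")])]
)
xml_text = ET.tostring(tree, encoding="unicode")

# A property name containing "/" or ":" cannot become an element name as-is.
try:
    item_to_element([("http://example.org/name", "x")])
    rejected = False
except ValueError:
    rejected = True
```

The resulting xml_text could then be handed to an off-the-shelf RELAX NG or
Schematron validator; the rejected case shows the kind of name that would need
an escaping scheme first.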

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/