[whatwg] Writing authoring tools and validators for custom microdata vocabularies

Henri Sivonen hsivonen at iki.fi
Wed May 20 03:50:02 PDT 2009

On May 20, 2009, at 10:27, Henri Sivonen wrote:

> However, in order to usefully apply RELAX NG or Schematron to a  
> microdata-based infoset, the infoset conversion should turn property  
> names into element names. Since XML places arbitrary limitations on  
> element names (and element content), this mapping would have exactly  
> the same complications as mapping microdata to RDF/XML.

Here's an attempt at mapping microdata to XML:

  * Have a root element (it doesn't matter what it's called) with  
attribute xml:lang that has the language of the root element of the  
HTML document.
  * Have a child of root with local name 'title', namespace
'http://purl.org/dc/terms/' and content that is the content of the HTML
<title>.
  * For each link relation in the document, have a child of root that
has as its local name the ASCII-lowercased rel token (or
ALTERNATE-STYLESHEET for alternate stylesheet), namespace
http://www.w3.org/1999/xhtml/vocab# and no-namespace attribute 'url'
that contains the absolutized href of the link relation.
  * For each <meta name content>, have a child of root with the value
of the name attribute of the <meta> as local name, namespace
http://www.w3.org/1999/xhtml/vocab# and the value of the content
attribute as element content. If the language of the <meta> differs
from root, have xml:lang with the different language.
  * For cites, produce an element the same way as for link relations
above, analogously to how cites are handled in the RDF conversion.
  * For items and properties:
    - map the property name to XML namespace,local pair as follows and  
use the result as the element name for the 'property element':
      * If itemprop contains a colon: Locate the last # or /, whichever
comes last, that isn't the last character of the URI. Make the part up
to and including that character the namespace URI and the part after
it the local name.
      * Otherwise: Namespace is http://www.w3.org/1999/xhtml/custom#
and the itemprop token is the local name.
    - If the value is a URL, put the URL value in an attribute called
'url' on the property element.
    - If the value is itself an item, put the value of the item
attribute in a no-namespace attribute called 'type' on the property
element.
    - Otherwise, put the string value in the content of the property  
element and put the language of the property on the xml:lang attribute  
of the property element if different from its nearest ancestor xml:lang.
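
The item/property rules above can be sketched in Python roughly as
follows. This is only an illustration under assumptions, not a tested
converter: the names split_property_name and property_element are made
up for this sketch, and raising ValueError for a URI that contains
neither # nor / is my own choice, since the rules above don't say what
to do in that case. Nested items are reduced to the 'type' attribute
only.

```python
import xml.etree.ElementTree as ET

CUSTOM_NS = "http://www.w3.org/1999/xhtml/custom#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_property_name(itemprop):
    """Return (namespace, local name) for one itemprop token."""
    if ":" not in itemprop:
        # Plain token: fixed namespace, token is the local name.
        return CUSTOM_NS, itemprop
    # Find the last '#' or '/', ignoring the final character of the URI.
    head = itemprop[:-1]
    cut = max(head.rfind("#"), head.rfind("/"))
    if cut < 0:
        # Assumption: this case is not covered by the rules in the post.
        raise ValueError("no '#' or '/' to split on: %r" % itemprop)
    return itemprop[:cut + 1], itemprop[cut + 1:]


def property_element(itemprop, url=None, item_type=None, text="",
                     lang=None, inherited_lang=None):
    """Build the 'property element' for a single property value."""
    ns, local = split_property_name(itemprop)
    el = ET.Element("{%s}%s" % (ns, local))
    if url is not None:
        # URL value goes into a no-namespace 'url' attribute.
        el.set("url", url)
    elif item_type is not None:
        # Value is an item: its item attribute value goes into 'type'.
        el.set("type", item_type)
    else:
        # String value: element content, plus xml:lang if it differs
        # from the nearest ancestor's language.
        el.text = text
        if lang is not None and lang != inherited_lang:
            el.set(XML_LANG, lang)
    return el


example = property_element("http://purl.org/dc/terms/title",
                           text="Example", lang="en", inherited_lang="fi")
print(ET.tostring(example, encoding="unicode"))
```

Note that ElementTree accepts local names that are not NCNames without
complaint, so the breakage cases discussed below would still have to be
handled separately.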

I haven't actually tried it, but on the face of it, this kind of
mapping seems tractable for RELAX NG schemas.

And, as mentioned before, this breaks when:
  1) The local name is not an NCName.
  2) textContent in HTML contains non-XML characters.

Use the infoset coercion rules for those. However, the Uhhhhhh
notation may lead to collisions, because microdata property names aren't
Henri Sivonen
hsivonen at iki.fi
