[whatwg] Questions regarding microdata implementations.

Sun Jan 16 06:04:02 PST 2011

First of all, I would like to say hello to the whatwg community and
introduce myself. My name is Emiliano Martínez Luque and I used to be
(sort of) active in the microformats community (I wrote a
parser/extractor/validator: http://code.google.com/p/xmfp). I have
been reviewing the microdata specification (and the mailing list
archives) and I'm interested in writing a parser/extractor. Naturally,
I have a number of questions regarding implementation and certain
details of the microdata spec. Before I get to them, I want to say
that I consider the microdata specification to be a huge step forward.
As somebody who is interested in writing applications that consume
structured data from the web, I consider the clear separation of the
syntax for representing the data from the vocabularies being
represented a definite (and qualitative) advance. I would also like to
praise the simplicity and clarity of the specification.

These are my questions:

1) The specification does not define any mechanism for an application
using the microdata to deal with possible misuses of data
vocabularies.

For example, let's say a web developer intends to mark up a data
vocabulary for cats (I'm basing this on the examples in the spec). The
name-value pairs he intends to mark up are the following (expressed in
JSON notation):

{ name:"Hedral", color:"black" }

Based on the examples in the spec this could be marked up as:

<section itemscope itemtype="http://example.org/animals#cat">
 <h1 itemprop="name">Hedral</h1>
 <p itemprop="color">black</p>
</section>

However, we can assume that authors will sometimes mistype the names
of the item properties. For example:

<section itemscope itemtype="http://example.org/animals#cat">
 <h1 itemprop="nme">Hedral</h1>
 <p itemprop="colr">black</p>
</section>

Which a processor might interpret as:

{ nme:"Hedral", colr:"black" }
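
To make the concern concrete, here is a minimal sketch (in Python, and
in no way a conforming implementation of the spec's algorithm) of a
naive extractor. The class name FlatItemExtractor and the function
extract are my own, for illustration only. It assumes top-level items
only, no nested items, no itemref, and that every property value is the
element's text content; a repeated property name simply overwrites the
earlier value.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FlatItemExtractor(HTMLParser):
    """Collects itemprop name/text pairs for top-level items (sketch only)."""

    def __init__(self):
        super().__init__()
        self.items = []         # one dict per itemscope element found
        self.depth = 0          # current element nesting depth
        self.item_depth = None  # depth at which the open item started
        self.prop = None        # (name, depth, text chunks) of the open property

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.depth += 1
        attrs = dict(attrs)
        if self.item_depth is None:
            if "itemscope" in attrs:
                self.items.append({})
                self.item_depth = self.depth
        elif self.prop is None and "itemprop" in attrs:
            self.prop = (attrs["itemprop"], self.depth, [])

    def handle_data(self, data):
        if self.prop is not None:
            self.prop[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.prop is not None and self.prop[1] == self.depth:
            name, _, chunks = self.prop
            self.items[-1][name] = "".join(chunks).strip()
            self.prop = None
        if self.item_depth == self.depth:
            self.item_depth = None
        self.depth -= 1


def extract(html):
    """Return a list of {property name: text value} dicts, one per item."""
    parser = FlatItemExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Fed the mistyped markup above, an extractor like this has no way of
knowing that "nme" and "colr" are wrong: it reports them as perfectly
good property names.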

I can easily imagine other misuses, for example an itemprop that
should be a simple name-value pair being marked up as a full item with
itemscope, or vice versa.

Since the spec specifies no mechanism for defining and validating the
vocabularies being extracted from the microdata, what is the proposed
course of action for an implementation in a case like this? Or should
applications always assume that the data has been correctly marked up?

Which brings me to question 2.

2) The spec says that item types should be identified by URLs. It is
not completely clear (or at least not clear to me) whether the URL is
just a string that acts as a URI unambiguously identifying the item
type, the address of a document that defines that item type, or both.
Which is the case?

If it represents a document, I would like to know which formats are
being considered, and whether the general idea is to have a single
format or to let data vocabularies be defined in a variety of formats.
I would also like to know if there is any working
group/community/forum working specifically on a simple, machine
processable format for defining and validating data vocabularies, and
what documentation they are producing.

If there is no work on this I would like to propose the following. For
the purpose of simply validating:

- correct names
- correct types (whether it's a name:value pair or a full item)
- correct number of occurrences (whether it can be an array of values
or just a single value, whether it is required or not)

It would suffice to specify a data structure with the following
attributes: property_name, occurrences and childs, assuming that if a
property has childs then its value is a full item rather than a
simple text value. This could easily be represented in JSON with
something like:

{
	property_name:"name of the property as used in itemprop",
	occurrences:"*",
	childs:[ {}, {}, {}...  ]
}

Where childs could be an array of data property definitions, for example:

{
	property_name:"name of the property as used in itemprop",
	occurrences:"*",
	childs:[ {
			property_name:"name of the property as used in itemprop for the first child",
			occurrences:"1"
		 }, {
			property_name:"name of the property as used in itemprop for the second child",
			occurrences:"*",
			childs:[ {}, ...]
		 }
		]
}

This could even be represented in microdata itself:

<div itemscope itemtype="http://datavocabularies.com/microdata">
	<p itemprop="property_name">name of the property</p>
	<p itemprop="occurrences">*</p>
	<div itemprop="childs" itemscope itemtype="http://datavocabularies.com/microdata">
		<p itemprop="property_name">name of the property for the first child</p>
		<p itemprop="occurrences">1</p>
	</div>
	<div itemprop="childs" itemscope itemtype="http://datavocabularies.com/microdata">
		<p itemprop="property_name">name of the property for the second child</p>
		<p itemprop="occurrences">1</p>
	</div>
</div>

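An application could easily implement this. For example, an
implementation in C of this simple recursive data structure could be:

#define PROPERTY_NAME_MAX_LENGTH 64	/* assumed limit, for illustration */
#define DATA_PROPERTY_MAX_CHILDS 16	/* assumed limit, for illustration */

struct data_prop {
	char property_name[ PROPERTY_NAME_MAX_LENGTH ];	/* name as used in itemprop */
	char occurrences;	/* one of '*', '+', '?', '1' */
	struct data_prop *childs[ DATA_PROPERTY_MAX_CHILDS ];	/* child definitions, if any */
};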

Where occurrences could be represented by a subset of the Unix regexp
quantifiers, plus a literal 1 for "exactly once" (say: *, +, ?, 1).

(Of course an extra attribute of (int) number_of_childs would be
needed for this to be of any use for an actual C program, I'm just
trying to provide an example in a common language.)

In this sense an application consuming microdata could receive two
inputs: the html document containing the microdata and the set of
data vocabulary definitions against which to validate it. It would be
very simple to build a validator on top of this. Besides, having a
simple syntax for defining data vocabularies and validating microdata
would also be very helpful for coordinating the work of data
vocabulary authors.
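
As a rough illustration of how small such a validator could be, here
is a sketch in Python. Everything in it is an assumption of mine and
not something the spec defines: a vocabulary is taken to be a list of
property definitions in the JSON shape proposed above, an extracted
item is a dict mapping each property name to a list of values, and a
value is either a string or another such dict (a full item). The
function name validate and the LIMITS table are hypothetical.

```python
# Allowed (minimum, maximum) counts per occurrence marker; None means unbounded.
LIMITS = {"*": (0, None), "+": (1, None), "?": (0, 1), "1": (1, 1)}


def validate(item, definitions, path=""):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    known = {d["property_name"]: d for d in definitions}

    # Names that the vocabulary does not define (e.g. typos).
    for name in item:
        if name not in known:
            problems.append(f"{path}{name}: unknown property")

    for name, definition in known.items():
        values = item.get(name, [])
        low, high = LIMITS[definition.get("occurrences", "*")]
        if len(values) < low:
            problems.append(f"{path}{name}: missing required property")
        if high is not None and len(values) > high:
            problems.append(f"{path}{name}: too many values")

        # A definition with childs expects full items; otherwise plain values.
        childs = definition.get("childs")
        for value in values:
            if childs and isinstance(value, dict):
                problems.extend(validate(value, childs, f"{path}{name}."))
            elif childs:
                problems.append(f"{path}{name}: expected an item, found a plain value")
            elif isinstance(value, dict):
                problems.append(f"{path}{name}: expected a plain value, found an item")
    return problems


# A hypothetical vocabulary for the cat example: exactly one name, any number of colors.
cat_vocabulary = [
    {"property_name": "name", "occurrences": "1"},
    {"property_name": "color", "occurrences": "*"},
]
```

With this, validate({"nme": ["Hedral"], "colr": ["black"]},
cat_vocabulary) would flag both mistyped names as unknown and report
that the required name property is missing.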

Going further into this, we could also think about a datatype property
for specialised applications that may require them, etc. Again, if no
work has been done on this, I would like to know if there is interest
in the community in starting work on this (within the community forums
provided by the whatwg or outside as an independent project).

3) The specification states that itemref references a node within the
html tree by its id. However, it says nothing about how the referenced
node should be marked up. Since the referenced nodes may appear before
the itemrefs that point to them, an application discovering microdata
may have to make multiple passes through the html tree to extract this
information. I would like to know if any thought has been given to
using itemscope within the referenced node, i.e.:

<div itemscope id="a">
	<p itemprop="a1">value of a1</p>
	<p itemprop="a2">value of a2</p>
</div>

<div itemscope id="b">
	<p itemprop="b1">value of b1</p>
	<div itemscope id="d" itemref="a"></div>
</div>

Where a1="value of a1" and a2="value of a2" are children belonging to
the item identified as d, which is itself a child of b. The advantage
of this is that an application extracting the microdata could first
extract all elements marked up with itemscope and then merge them
according to their itemref references without having to make multiple
passes. This might not be very important, but it could improve
efficiency when extracting microdata from large numbers of documents
with deep reference chains or when dealing with limited resources.
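
The merge step I have in mind could be sketched like this in Python.
This assumes the convention proposed here (referenced nodes are
themselves itemscope elements with an id), which is not what the spec
currently says, and it assumes the items have already been collected
in a single pass into a dict keyed by id. The function name
merge_itemrefs and the "props"/"itemref" keys are hypothetical, and
only one level of references is followed.

```python
def merge_itemrefs(items):
    """Return {id: properties}, with each item's itemref targets merged in.

    `items` maps an id to {"props": {name: [values]}, "itemref": [ids]}.
    The input is not modified; unknown ids in itemref are ignored.
    """
    merged = {}
    for item_id, item in items.items():
        # Copy the item's own properties so the input stays untouched.
        props = {name: list(values) for name, values in item["props"].items()}
        for ref in item.get("itemref", []):
            target = items.get(ref)
            if target is None:
                continue
            for name, values in target["props"].items():
                props.setdefault(name, []).extend(values)
        merged[item_id] = props
    return merged


# The items from the example above, as a first pass might have collected them.
collected = {
    "a": {"props": {"a1": ["value of a1"], "a2": ["value of a2"]}, "itemref": []},
    "d": {"props": {}, "itemref": ["a"]},
}
```

Here merge_itemrefs(collected)["d"] ends up holding a1 and a2, without
the html tree being walked a second time.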

4) What is the intended behaviour of an application when it encounters
a loop within the itemref references? i.e.:

<div itemscope id="a" itemref="b c d"></div>

<p id="b"><span itemprop="x">x value</span></p>
<div id="c">
	<p>Y:<span itemprop="y">y value</span></p>
	<p>Z: <span itemprop="z">z value</span></p>
</div>
<div itemscope id="d" itemref="a"></div>

In a case like this, should the whole node with id="a" be discarded or
only the subnode with id="d"? Or is this up to the implementor?
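
Whatever the answer is, an implementation first has to notice the
loop. A minimal sketch in Python, assuming the itemref attributes have
already been reduced to a plain dict from an id to the list of ids it
references (a simplification: it ignores references that arise through
element nesting), and using a function name of my own invention:

```python
def find_itemref_loop(refs, start):
    """Return the chain of ids forming a loop reachable from `start`, or None.

    `refs` maps an id to the list of ids named in its itemref attribute;
    ids without an itemref may simply be absent from the dict.
    """
    def visit(node, chain):
        if node in chain:
            # We came back to an id already on the current chain: a loop.
            return chain[chain.index(node):] + [node]
        for target in refs.get(node, []):
            loop = visit(target, chain + [node])
            if loop is not None:
                return loop
        return None

    return visit(start, [])


# The references from the example above.
example_refs = {"a": ["b", "c", "d"], "d": ["a"]}
```

For the example, find_itemref_loop(example_refs, "a") returns
["a", "d", "a"]. What to discard once the loop has been found is
exactly the open question.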

I would like to point out that this is another reason to have some
(however loose) mechanism for data vocabulary validation for dealing
with user errors.

5) The specification states:

"The itemref attribute, if specified, must have a value that is an
unordered set of unique space-separated tokens that are
case-sensitive, consisting of IDs of elements in the same home
subtree."

(5.2.2 of http://www.whatwg.org/specs/web-apps/current-work/#microdata)

I would like to know if any thought has been given to referencing
fragments in an outside document. For example, a document with URL
http://www.personaldata.com/me.html might contain the following
fragment:

<div itemscope itemtype="http://www.datavocabulary.com/person">
	<p>My name is <span itemprop="name">Pepe</span> and I used to work at <a
itemprop="org" href="http://www.organization.com/about_us.html#org_data">organization</a></p>
</div>

While at http://www.organization.com/about_us.html#org_data you could
have the following fragment:

<div id="org_data" itemscope itemtype="http://www.datavocabulary.com/org">
	<p itemprop="legal_name">Organization XYZ</p>
....
</div>

Or something similar for referencing specific data vocabularies
outside of the node tree. Or maybe I'm missing something and this is
already covered by the general use of href? My question is whether
there is a mechanism for referencing items from a document outside the
home subtree as subproperties of a microdata item. Is it correct to
use href for this? And should an application dealing with microdata
be aware of this?
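
If an application did choose to follow such links (and, again, nothing
in the text quoted above says it should), the first step would be
splitting the href into the document to fetch and the id to look up in
it. A small Python sketch using the standard urllib.parse module; the
function name split_item_reference is hypothetical:

```python
from urllib.parse import urldefrag, urljoin


def split_item_reference(base_url, href):
    """Return (document URL, fragment id) for an href found at base_url."""
    # Resolve relative hrefs against the URL of the document containing them.
    absolute = urljoin(base_url, href)
    # Separate the part to fetch from the id to look up inside it.
    document_url, fragment = urldefrag(absolute)
    return document_url, fragment
```

For the example above, the href found in
http://www.personaldata.com/me.html would split into
http://www.organization.com/about_us.html and org_data.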

Other than that, thank you for this great spec and best regards,

-- 
Emiliano Martínez Luque
http://www.metonymie.com