[whatwg] Questions regarding microdata implementations.

Wed May 4 13:17:01 PDT 2011

On Sun, 16 Jan 2011, Emiliano Martinez Luque wrote:
>
> 1) The specification does not define any mechanism for an application 
> using the microdata to deal with possible misuses of data vocabularies.
> 
> For example, let's say a web developer intends to mark up a data 
> vocabulary for cats (I'm basing this on the examples in the spec). The 
> name-value pairs he intends to mark up are the following (expressed in 
> JSON notation):
> 
> { name:"Hedral", color:"black" }
> 
> Based on the examples in the spec this could be marked up as:
> 
> <section itemscope itemtype="http://example.org/animals#cat">
>  <h1 itemprop="name">Hedral</h1>
>  <p itemprop="color">black</p>
> </section>
> 
> However, we could assume that authors might sometimes mistype the names 
> of the item properties. In the example:
> 
> <section itemscope itemtype="http://example.org/animals#cat">
>  <h1 itemprop="nme">Hedral</h1>
>  <p itemprop="colr">black</p>
> </section>
> 
> Which a processor might interpret as:
> 
> { nme:"Hedral", colr:"black" }
> 
> I could easily imagine other misuses, like for example an itemprop that 
> should be represented as a simple name-value pair being represented as a 
> full item with item scope or vice versa, etc.
> 
> Since there is no mechanism specified in the spec for defining and 
> validating the vocabularies being extracted from the microdata, what is 
> the proposed course of action for an implementation in a case like this? 
> Or should applications always assume that the data has been correctly 
> marked up?

It depends on the vocabulary, in the same way that handling such errors 
for XML vocabularies depends on the vocabulary.

So for example, a vocabulary could say

   "User agents must ignore items that contain properties with names other 
    than "name" and "color"."

...Or:

   "User agents must ignore properties with unrecognised names."

...Or:

   "User agents must treat all properties whose names start with the 
    letter "n" as being equivalent to the property "name", with the first 
    property in lexical order having precedence."

...Or whatever.
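
As a rough sketch of how the first two policies differ in practice (this 
is illustration only, not taken from any vocabulary spec; the function 
names and the KNOWN set are made up for the cat example):

```python
# Hypothetical error-handling policies a vocabulary spec might define.
KNOWN = {"name", "color"}  # property names from the cat example


def drop_unknown_properties(item):
    """Policy: ignore properties with unrecognised names."""
    return {k: v for k, v in item.items() if k in KNOWN}


def drop_item_with_unknown_properties(item):
    """Policy: ignore the whole item if any property name is unknown."""
    return item if set(item) <= KNOWN else None


mistyped = {"nme": "Hedral", "colr": "black"}
partly_ok = {"name": "Hedral", "colr": "black"}
```

Under the first policy the fully mistyped item degrades to an item with 
no properties; under the second it is dropped entirely. Which of those 
is right is for the vocabulary to say.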

For some examples of how to write a vocabulary specification for 
microdata, see the vCard, vEvent, and Licensing works vocabularies:

   http://www.whatwg.org/specs/web-apps/current-work/multipage/links.html#mdvocabs

> Which brings me to question 2.
> 
> 2) The specs specify item types should be identified by URLs. It is not 
> completely clear (or at least not clear to me) whether they represent 
> the string of the URL as a URI for unambiguously representing the item 
> type, a URL for a document that defines that item type or both. Which is 
> the case?

The URL used for the itemtype="" attribute is an opaque string identifying 
the vocabulary. It could in reality be a URL that points to documentation 
for the vocabulary, or it could not resolve, or it could be anything else. 
For example, "http://microformats.org/profile/hcard" identifies the vCard 
Microdata vocabulary (as defined by the spec for that vocabulary, cited 
above), even though that URL does not have anything to do with microdata. 
The Licensing works vocabulary's type URL is "http://n.whatwg.org/work", 
which points to a text file that refers back to the spec.

The URL is not intended to be dereferenced by microdata processors.
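
Put differently, a processor can treat the type as a lookup key. A minimal 
sketch, assuming plain string equality is the comparison used (the table 
and the function name are hypothetical; the two type strings are the ones 
mentioned above):

```python
# Hypothetical registry: the itemtype string is only ever used as a key.
VOCABULARIES = {
    "http://microformats.org/profile/hcard": "vCard",
    "http://n.whatwg.org/work": "Licensing works",
}


def vocabulary_for(itemtype):
    """Look up a vocabulary by exact string match.

    Nothing is fetched and no URL normalisation is applied (assumption:
    "opaque" means the strings are compared as-is).
    """
    return VOCABULARIES.get(itemtype)
```

Note that under this assumption a trailing slash or a change of case 
yields a different, unknown type.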

> If there is no work on this I would like to propose the following. For 
> the purpose of simply validating:
> 
> - correct names
>
> - correct types (whether it's a name:value pair or a full item)
>
> - correct number of occurrences (Whether it can be an array of values or 
> just a single value, whether it is required or not)
> 
> It would suffice to [...]

Describing a format to describe formats is a problem that many people have 
attempted to address in the past century, with solutions such as ABNF, XML 
DTDs, XML Schema, various solutions for RDF, etc.

If a set of vocabularies can be completely specified using a particular 
syntax, that's great, and can be helpful for validators of those 
vocabularies. However, in practice, the range of conformance criteria is 
broad and no one readable syntax is going to be sufficient for all (or 
even many) vocabularies, so it's not a problem I think we should try to 
find a single solution for.

For example, in the vEvent microdata vocabulary, the "dtstart" property 
must have a value that matches a specific syntax, the "duration" property 
is mutually exclusive with the "dtend" property, and the "dtend" property 
is required to have a value that comes after the "dtstart" property in 
time. These are all conformance requirements that a validator is required 
to check if it supports the vEvent vocabulary, but that were not covered 
by your proposed metasyntax. The same problem has plagued HTML -- for a 
while its syntax was described in DTDs, for instance, which are completely 
inadequate to describe the actual conformance requirements of the 
language. With the contemporary HTML standard we've given up on using a 
DTD, and instead just describe everything in English. Implementations of 
validators then use whatever techniques they want to implement these; in 
the case of Henri's validator.nu, a combination of Schematron, RelaxNG, 
and Java code.
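
To illustrate why such checks end up in code, here is a sketch of the 
three vEvent constraints just mentioned. It is deliberately simplified: 
it assumes dtstart and dtend are timezone-free strings that Python's 
datetime.fromisoformat() accepts, which is not the vocabulary's actual 
syntax, and the function name and error strings are made up.

```python
from datetime import datetime


def vevent_errors(props):
    """Return a list of problems for a dict of vEvent property values.

    Simplification (assumption): dates are whatever
    datetime.fromisoformat() accepts, without time zones.
    """
    errors = []
    start = end = None
    try:
        start = datetime.fromisoformat(props["dtstart"])
    except KeyError:
        pass  # whether dtstart is required is left out of this sketch
    except ValueError:
        errors.append("dtstart: bad syntax")
    if "dtend" in props:
        if "duration" in props:
            errors.append("duration and dtend are mutually exclusive")
        try:
            end = datetime.fromisoformat(props["dtend"])
        except ValueError:
            errors.append("dtend: bad syntax")
    # A cross-property ordering check: hard to state in a simple schema.
    if start is not None and end is not None and end <= start:
        errors.append("dtend must come after dtstart")
    return errors
```

The last check depends on the values of two different properties, which 
is exactly the kind of rule a names/types/occurrences metasyntax cannot 
express.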

> In this sense an application consuming microdata could receive 2 inputs: 
> the html document containing the microdata and the set of 
> data-vocabularies definitions to validate the represented microdata.

A consuming application typically doesn't care if the content is valid. 
Only a validator typically checks that.

> Going further into this, we could also think about a datatype property 
> for specialised applications that may require them, etc.

Typically, data types are defined by the vocabulary specification. (Again, 
see the cited spec above for examples.) This doesn't need syntax.

> 3) The specification states that itemref references a node within the 
> html tree, referencing it by its id. However it specifies nothing 
> regarding how the referenced node should be marked up. Since the nodes 
> referenced may exist before the itemrefs, an application discovering 
> microdata may have to do multiple passes through the html tree to 
> extract this information. I would like to know, if any thought has been 
> given to using itemscope within the referenced node, ie:
> 
> <div itemscope id="a">
> 	<p itemprop="a1">value of a1</p>
> 	<p itemprop="a2">value of a2</p>
> </div>
> 
> <div itemscope id="b">
> 	<p itemprop="b1">value of b1</p>
> 	<div itemscope id="d" itemref="a"></div>
> </div>
> 
> Where a1="value of a1" and a2="value of a2" are children belonging to the 
> item identified as d which is itself a child of b. The advantage of this 
> is that an application extracting the microdata could then extract all 
> elements marked up with itemscope and then merge them according to 
> itemref references without having to do multiple passes. This might not 
> be very important but could help to have better efficiency when 
> extracting microdata from big quantities of deep referenced documents or 
> when dealing with limited resources.

I'm not sure I fully understand what you mean here.

As a general rule, HTML cannot be parsed in one pass. To parse an HTML 
document you must generate a tree in-memory, which the parser can mutate 
arbitrarily during parsing. As a particularly bad case, consider this 
(non-conforming) markup:

   <body> <span itemprop="a">1</span> <body itemscope>

...when parsed as HTML, this results in the following tree:

   |
   +- HTML
       |
       +- HEAD
       |
       +- BODY itemscope
           |
           +- SPAN itemprop=a
               |
               +- "1"

That is, exactly the same as:

   <html><head></head><body itemscope=""> <span itemprop="a">1</span> </body></html>

There is no sane way to parse this in one pass.

(Note: The same problem exists with anything based on HTML, including, 
e.g., microformats or RDFa.)

> 4) What is the intended behaviour of an application when encountering a 
> loop within the itemref references? ie:
> 
> <div itemscope id="a" itemref="b c d"></div>
> 
> <p id="b"><span itemprop="x">x value</span></p>
> <div id="c">
> 	<p>Y:<span itemprop="y">y value</span></p>
> 	<p>Z: <span itemprop="z">z value</span></p>
> </div>
> <div itemscope id="d" itemref="a"></div>
> 
> In a case like this, should the whole node with id="a" be discarded or 
> only the subnode with id="d"? Or is this up to the implementor?

You just follow these algorithms:

   http://www.whatwg.org/specs/web-apps/current-work/multipage/links.html#associating-names-with-items

There's no loop here. You get two items, one with three properties { x:"x 
value", y:"y value", z:"z value"}, and one with no properties.
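
To make that concrete, here is a simplified sketch of the property crawl 
applied to your markup. It is not the spec algorithm verbatim: it uses 
Python's xml.etree.ElementTree in place of a real HTML parser (so 
itemscope is written as itemscope=""), takes each property's value to be 
the element's leading text, keeps one value per name, and does not sort 
the results in tree order. The function name properties_of is made up.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<body>
  <div itemscope="" id="a" itemref="b c d"></div>
  <p id="b"><span itemprop="x">x value</span></p>
  <div id="c">
    <p>Y:<span itemprop="y">y value</span></p>
    <p>Z: <span itemprop="z">z value</span></p>
  </div>
  <div itemscope="" id="d" itemref="a"></div>
</body>
""")


def properties_of(root, doc):
    """Collect {name: text} for the item whose element is `root`."""
    by_id = {}
    for el in doc.iter():
        if el.get("id") is not None:
            by_id.setdefault(el.get("id"), el)  # first element per id wins
    memory = {id(root)}            # elements already visited
    pending = list(root)           # start with the root's child elements
    for ref in root.get("itemref", "").split():
        if ref in by_id:
            pending.append(by_id[ref])
    results = {}
    while pending:
        current = pending.pop()
        if id(current) in memory:
            continue               # seen before: skip, so no endless loop
        memory.add(id(current))
        if current.get("itemscope") is None:
            pending.extend(current)  # only descend into non-item elements
        if current.get("itemprop"):
            results[current.get("itemprop")] = (current.text or "").strip()
    return results


items = {el.get("id"): properties_of(el, DOC)
         for el in DOC.iter() if el.get("itemscope") is not None}
```

Here items["a"] ends up with x, y and z, and items["d"] is empty: the 
crawl from "d" reaches "a", but "a" has no itemprop and, having an 
itemscope, is not descended into.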

> 5) The specification states:
> 
> "The itemref attribute, if specified, must have a value that is an 
> unordered set of unique space-separated tokens that are case-sensitive, 
> consisting of IDs of elements in the same home subtree."
> 
> I would like to know if there has been any thought given to referencing 
> fragments on an outside document.

Yes, though we haven't added anything yet. Cross-document references are 
pretty complicated and introduce all kinds of trust issues, rules about 
when to fetch the other document, etc. If microdata gets much use, we 
might look into adding this later.

In the meantime, I recommend having documents reference each other using 
properties, e.g. in thisdoc.html:

  <link itemprop="more-properties" href="otherdoc.html#foo">

...where "foo" is the id="" of an element in otherdoc.html that has an 
itemscope and the same itemtype and itemid as the item in thisdoc.html, 
with the vocabulary defining that the properties of both items are to be 
merged by the user agent processing the microdata.
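
A sketch of what such a merge could look like once both documents have 
been processed. The item representation (a dict with "type", "id" and 
"properties" keys) and the function name are assumptions made for this 
example, not anything the spec defines:

```python
def merge_items(local, remote):
    """Merge two extracted items if itemtype and itemid both match.

    Each item is assumed to look like:
    {"type": str, "id": str, "properties": {name: [values]}}
    """
    if (local["type"], local["id"]) != (remote["type"], remote["id"]):
        return local  # not the same item: leave the local one untouched
    merged = {name: list(values)
              for name, values in local["properties"].items()}
    for name, values in remote["properties"].items():
        merged.setdefault(name, []).extend(values)
    return {"type": local["type"], "id": local["id"], "properties": merged}
```

Requiring both the type and the id to match is one way to avoid merging 
in properties from an unrelated item that merely happens to sit at the 
referenced fragment.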

> For example, a document with URL 
> http://www.personaldata.com/me.html might contain the following 
> fragment:
> 
> <div itemscope itemtype="http://www.datavocabulary.com/person">
> 	<p>My name is <span itemprop="name">Pepe</span> and I used to work at <a
> itemprop="org" href="http://www.organization.com/about_us.html#org_data">organization</a></p>
> </div>
> 
> While at http://www.organization.com/about_us.html#org_data you could 
> have the following fragment:
> 
> <div itemscope id="org_data" itemtype="http://www.datavocabulary.com/org">
> 	<p itemprop="legal_name">Organization XYZ</p>
> ....
> </div>

You can do that too, yup. You just need to define the "org" property in 
the "http://www.datavocabulary.com/person" vocabulary as accepting a URL 
that is processed using microdata.

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'