[whatwg] Issues with microdata and proposals for improvements

Thu Oct 11 04:26:41 PDT 2012

Hello,

I am writing a set of tools to work with microdata, and ran into a
number of issues. Is there still room at this point for discussion of,
and improvements to, the specification?

For what it is worth, here are some of the things I ran into, and
proposals to make it better:

== Usage of URLs that do not point to anything interesting ==

I'm not sure whether this has been discussed at length, though it
seems that Philip Jägenstedt brought it up once [1]. For a variety of
reasons, I would much rather use <data> and <a> than <meta> and <link>
for microdata: the markup is less ugly, scripts have easy access to the
user-visible representation of the data, CSS can style that
representation based on microdata attributes (itemref complicates
this; see below), etc.

However, for enumerations like http://schema.org/InStock a clickable
<a> would not be desirable, yet the use of <data> would violate the
microdata specification, section "Values":

"If a property's value, as defined by the property's definition, is an
absolute URL, the property must be specified using a URL property
element."

I do not see much merit in this requirement: the URL is already
absolute, so it does not need resolving and it is already defined to
be a URL by the property's definition. Therefore storing it in a
<data> element would not do much harm. Because there are many benefits
to being able to wrap visible content in a microdata property, I would
like to propose that this requirement is dropped, so the <data>
element may also carry an absolute URL.

Nevertheless, I see how it would be useful to store a URL in such a
way that it is clear it's a URL, and have it properly resolved. As far
as I can tell, no HTML element combines the following three
properties:
1. Stores a value that is unambiguously a URL,
2. Can have phrasing content,
3. Has no side effects (being clickable, etc.).

Therefore, as an alternative to dropping the requirement mentioned
above, I would also be in favor of allowing an additional attribute on
the <data> element (for example named 'url'), mutually exclusive with
the 'value' attribute, that is to be resolved the same way as the URLs
obtained from <a>, <link>, <img>, etc. are.
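
As a rough illustration of how a consumer might treat such an attribute,
here is a minimal Python sketch. The 'url' attribute is the hypothetical
one proposed above and is not part of HTML; the function name
data_property_value is made up for this example, and urljoin only
approximates the URL resolution a browser performs.

```python
from urllib.parse import urljoin


def data_property_value(attrs, base_url):
    """Return (kind, value) for the attributes of a <data> element.

    attrs maps attribute names to values. 'url' is the hypothetical
    attribute from the proposal, not an existing HTML attribute.
    """
    if "url" in attrs and "value" in attrs:
        # The proposal makes the two attributes mutually exclusive.
        raise ValueError("'url' and 'value' are mutually exclusive")
    if "url" in attrs:
        # Resolve against the document URL, as one would for href.
        return ("url", urljoin(base_url, attrs["url"]))
    return ("text", attrs.get("value", ""))


if __name__ == "__main__":
    base = "http://example.org/shop/item.html"
    print(data_property_value({"url": "http://schema.org/InStock"}, base))
    print(data_property_value({"url": "terms/InStock"}, base))
```

An absolute URL passes through unchanged, while a relative one is
resolved against the document's address, so the consumer always ends up
with a value it can treat as a URL.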

== Incompatible property names when using itemrefs ==

Consider the following piece of HTML:

<div itemscope itemtype="http://schema.org/Book" itemref="a"> ... </div>
<div itemscope itemtype="http://schema.org/LiteraryEvent" itemref="b">
... </div>
<div id="a" itemprop="author" itemscope
itemtype="http://schema.org/Person" itemref="c"></div>
<div id="b" itemprop="performer" itemscope
itemtype="http://schema.org/Person" itemref="c"></div>
<div id="c">
	 Name: <span itemprop="name">Amanda</span>
</div>

Actually, the 'Book' item and the 'LiteraryEvent' item both want to
refer to the same person: the first as the author, the second as a
performer. Because the property names differ, I can't seem to find a
proper way to do this using itemrefs, without either polluting other
items, or creating two 'Person' items (as I did above). Both
approaches are undesirable.

An alternative way of using the itemref attribute, which makes much
more sense to me, would lead to this:

<div itemscope itemtype="http://schema.org/Book">
	Author: <a itemprop="author" itemref href="#a">Amanda</a>
	...
</div>
<div id="b" itemscope itemtype="http://schema.org/LiteraryEvent">
	Speaking: <a itemprop="performer" itemref href="#a">Amanda</a>
	...
</div>
<div id="a" itemscope itemtype="http://schema.org/Person">
	Name: <span itemprop="name">Amanda</span>
	Near you: <a itemprop="performerIn" itemref href="#b">reading from
her new book</a>
</div>

Formally:
If an element has both the itemprop and itemref attributes, but not
itemscope, and itemref is empty, then its value should be a URL that
points to another element that is an item. This item, if it exists in
the same document, will be the property's value. If not, the URL
itself will be used.
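
To make the rule concrete, here is a minimal sketch in Python of how a
tool might apply it. It is only an illustration under a few assumptions:
the names ItemIndex and resolve_references are invented, only the href
attribute is considered as the source of the URL, and "same document" is
approximated by comparing URLs with the fragment removed.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class ItemIndex(HTMLParser):
    """Collect the ids of items and the proposed reference properties."""

    def __init__(self):
        super().__init__()
        self.item_ids = set()   # ids of elements that have itemscope
        self.references = []    # (property name, raw href) pairs

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "itemscope" in a and a.get("id"):
            self.item_ids.add(a["id"])
        # Proposed rule: itemprop plus an empty itemref, and no itemscope.
        if (
            "itemprop" in a
            and "itemref" in a
            and not (a["itemref"] or "").strip()
            and "itemscope" not in a
            and a.get("href")
        ):
            self.references.append((a["itemprop"], a["href"]))


def resolve_references(html, document_url):
    """Return (name, kind, value) triples; kind is 'item' or 'url'."""
    index = ItemIndex()
    index.feed(html)
    index.close()
    this_document, _ = urldefrag(document_url)
    resolved = []
    for name, href in index.references:
        absolute = urljoin(document_url, href)
        target_document, fragment = urldefrag(absolute)
        if target_document == this_document and fragment in index.item_ids:
            # The target is an item in this document: use the item itself.
            resolved.append((name, "item", fragment))
        else:
            # Otherwise fall back to the resolved URL.
            resolved.append((name, "url", absolute))
    return resolved


if __name__ == "__main__":
    page = (
        '<div itemscope itemtype="http://schema.org/Book">'
        '<a itemprop="author" itemref href="#a">Amanda</a></div>'
        '<div id="a" itemscope itemtype="http://schema.org/Person">'
        '<span itemprop="name">Amanda</span></div>'
    )
    print(resolve_references(page, "http://example.org/page.html"))
```

Keeping the kind ('item' or 'url') next to each value is one way a
consumer could tell a reference to an item apart from a plain URL.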

This has a few consequences:

1. It opens the door to pointing to microdata in other documents.
Although a browser probably shouldn't try to fetch it, this can be
useful for search engines.
2. If (1) were allowed, it would be best if the microdata DOM
API exposed whether a property value is intended as a reference to an
external item or is just a URL.
3. It makes more sense to allow cycles in the graph created by the
items in a page, as created with the 'performerIn' property on
'Person' in the example.

I think these changes are compatible with current use, because right
now itemref is not to be used on elements without itemscope.
The only issue I see is that the microdata DOM API could now present
cyclic graphs. It is not yet deployed anywhere, is it? Anyway, for
people using it on their own data it shouldn't be a problem.
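
For what it is worth, code that walks such a graph only has to remember
which items it has already visited. A minimal sketch in Python, assuming
the items have already been extracted into a plain dictionary (the
function name reachable_items and the "book" key are made up here, since
the 'Book' item in the example has no id):

```python
def reachable_items(graph, start):
    """Return the items reachable from start, in order of first visit.

    graph maps an item key to a list of (property name, target key)
    pairs. Cycles are handled by remembering the items already seen.
    """
    seen = set()
    order = []
    stack = [start]
    while stack:
        item = stack.pop()
        if item in seen:
            continue  # already visited, so a cycle cannot loop forever
        seen.add(item)
        order.append(item)
        for _name, target in graph.get(item, []):
            stack.append(target)
    return order


if __name__ == "__main__":
    # The example above: 'a' and 'b' refer to each other.
    graph = {
        "book": [("author", "a")],
        "b": [("performer", "a")],
        "a": [("performerIn", "b")],
    }
    print(reachable_items(graph, "book"))
```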

In my opinion, there are great benefits to the alternative itemref approach:

1. The issue with incompatible property names is eliminated.
2. It becomes possible to refer to external data.
3. For most purposes microdata would better match document structure
as presented to the user.
4. It more closely resembles common data models, making it easier to
serialize them into microdata.
5. It is possible to mark up more complex graphs in HTML documents this way.
6. With only this use of itemref, and forsaking nested items, CSS
styling based on microdata attributes becomes very feasible.

I'd be interested to hear what people think.

Thank you for reading,

Josh

[1] http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-November/024116.html