[whatwg] Possible bugs : Microdata Itemscope on <link/> and <meta/>

Sun Nov 29 05:28:05 PST 2009

On Sun, 29 Nov 2009 12:46:16 +0100, Tim van Oostrom <tim at depulz.nl> wrote:

> Philip Jägenstedt wrote:
>> On Thu, 26 Nov 2009 22:30:41 +0100, Tim van Oostrom <tim at depulz.nl>  
>> wrote:
>>
>>> Hi, I made a forumpost :  
>>> http://forums.whatwg.org/viewtopic.php?t=4176, concerning a possible  
>>> "microdata specification bug" and a bug in the james.html5.org  
>>> microdata extractor.
>>>
>>> Comes down to <link/> and <meta/> elements possibly being unfit for  
>>> use with the itemscope attribute.
>>>
>>> I made an example in the forum post with some nice ubb formatting .
>>>
>> There are some other issues with <link> and <meta> you might want to  
>> review first: [1]
> Ok
>> Your second example was:
>>
>> <div itemtype="http://url.to/geoVocab#country" itemscope>
>>    <span itemprop="http://xmlns.com/foaf/spec/index.rdf#name"  
>> lang="cn">中華人民共和國</span>
>>    <span itemprop="http://xmlns.com/foaf/spec/index.rdf#name"  
>> lang="en">China</span>
>>    <link itemprop="http://url.to/city" href="http://url.to/shanghai"  
>> itemscope itemref="city-shanghai" />
>>    <div id="city-shanghai">
>>       <span  
>> itemprop="http://xmlns.com/foaf/spec/index.rdf#name">Shanghai</span>
>>       <span itemprop="http://url.to/demoVocab#population">14.61 million  
>> people</span>
>>       <span itemprop="http://url.to/physicsVocab#time"  
>> datetime="2009-11-26 11:43">11:43 pm (CT)</span>
>>    </div>
>> </div>
>>
>> By using itemprop+itemscope, you're saying that the property is itself  
>> an item. Also specifying href="http://url.to/shanghai" does nothing.
> I also pointed that out in my forumpost.

I probably didn't read it all closely enough to see which parts were
misunderstandings and which parts were intentionally testing the limits of
the spec. Sorry if I hit you on the head with things you already knew.

>> <link>, <meta> and any other void elements are usually the wrong choice  
>> for itemprop+itemscope because they don't have child elements, so  
>> itemref is the only way to add properties.
> Yes, see forumpost. Shouldn't this be noted in the Spec then ?  (maybe i  
> read over it)

Yes, the spec certainly needs some notes on how to use <link> and <meta>.

>> What you've accidentally done above is add the 3 properties of Shanghai  
>> to both the top-level item and the sub-item, see [2] for details.
> Well, i did it in full awareness, i interpreted the itemref attribute  
> like this. But if it can't be used this way, isn't this a setback on the  
> flexibility of itemref? Or was it intended this way.
> According to this an "itemref" attribute can never be added to an "item"  
> within an itemscope of another "item" without the crawled prop/val pairs  
> also applying to the ancestors itemscope.
>

Ah, I think you've found the root of the problem. By allowing a property
to be part of several items at once, we get several kinds of strange
problems. Apart from messing up your example, it seems to be the real
cause of the infinite recursion bug I wrote about in [1]. At the time I was
so focused on the recursion that I suggested a rather complex solution to
detect loops in the microdata, when it seems it could be solved simply by
making sure that a property belongs to only one item. Detailed suggestion
below.

[snip Amanda example]

[snip most of what Shanghai is to China]

> Assigning "unique" properties to Subjects for RDF's sake doesn't seem  
> like a good idea to me.
> Ofcourse i can make other, more sensible, html markup but the whole  
> point of a solid annotating language is that i can apply it to my  
> existing markup without changing it.

Hehe, I wasn't suggesting a unique predicate for each combination of  
country and city, just trying to sort out what is the itemprop and what is  
the itemtype in your example. If you want RDF I suppose you have to find  
some vocabulary that has a predicate that makes sense as China *predicate*  
Shanghai.

>> If <http://url.to/shanghai> is a global identifier for Shanghai you  
>> should use itemid.
>
> Correct, but the href="" is ignored in that example. If it only  
> concerned a property i'd use :
>
> <link itemprop="http://url.to/city" href="http://url.to/shanghai" />
>
> and it would be valid. I used it to show my point about <link/> and  
> itemscope, see forumpost.
>

In this example, I really don't see why you would use <link> to make a  
sub-item to begin with. Perhaps we should make itemscope on <link> and  
<meta> invalid in order to warn people of the problem?

>>
>> I don't know what <http://url.to/physicsVocab#time> is, but note that  
>> an exact time isn't very useful without a timezone, so I added the PRC  
>> timezone for you. I'll also note that using traditional Chinese for the  
>> full name of the PRC is an odd choice, so I changed it to simplified  
>> Chinese above.
>>
>> Marking up the population as "14.61 million people" isn't terribly  
>> helpful if you want a computer to be able to find the city with the  
>> biggest population among several cities, unless your vocabulary defines  
>> how to parse "14.61 million people" into a number, which would be  
>> strange. In any case this is hidden metadata unless you want 14610000  
>> or some other easily machine-parsable representation to be visible in  
>> the page rendering.
>
> Ok but using content="" on a span, is that valid ? Your suggestion in  
> [1] would be nicer. But i prefer the use of <link/> and <meta/> elements.
>

Oops, my suggestion was broken. I meant to write:

<span hidden itemprop="http://url.to/demoVocab#population">14610000</span>

>> Finally, I think <http://xmlns.com/foaf/spec/index.rdf#name> should be  
>> <http://xmlns.com/foaf/0.1/name>. If you're going to use existing  
>> vocabularies like FOAF and want your data to play nice with the RDF  
>> world, make sure to check that the result of the RDF extraction  
>> algorithm [3] is what you intended.
> This remains unclear to me. For example : http://xmlns.com/foaf/0.1/name  
> redirects to an html page but : http://purl.org/dc/terms/title redirects  
> to an .rdf file. For readability, wouldn't  
> http://xmlns.com/foaf/spec/#term_name and  
> http://dublincore.org/documents/dcmi-terms/#elements-title be better  ?
> If i dereference these url's i find the information i want in one stop.

If you want to reuse an RDF vocabulary you have to use exactly the same
URIs, byte for byte. I agree that finding the correct URI is quite
difficult, because treating them as URLs and looking them up in a browser
will often redirect you. I got <http://xmlns.com/foaf/0.1/name> by looking
at the namespace declaration used for foaf and appending "name".
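
As a rough illustration of the "namespace plus local name" approach, here is
a minimal Python sketch. The helper name term() is hypothetical and only
serves the example; the FOAF namespace URI is the one used in FOAF's own
namespace declaration. The point is only that the comparison is a plain
string comparison, so a documentation anchor is a different property URI.

```python
# Namespace URI as declared for the "foaf" prefix.
FOAF_NS = "http://xmlns.com/foaf/0.1/"


def term(namespace: str, local_name: str) -> str:
    """Hypothetical helper: build a property URI by plain concatenation."""
    return namespace + local_name


foaf_name = term(FOAF_NS, "name")

# Property URIs are compared as exact strings, so the human-readable
# documentation anchor does not identify the same property.
doc_anchor_matches = foaf_name == "http://xmlns.com/foaf/spec/#term_name"

print(foaf_name)
print(doc_anchor_matches)
```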

>> In particular, you probably want to use itemid where possible and make  
>> sure that all your URIs are exactly correct. Personally, though, unless  
>> I could reuse existing vocabularies for every single item and property,  
>> I would only use a full URI for itemtype and point that to a vocabulary  
>> that defines what simpler property names like "name" and "city" mean  
>> and how to convert the vocabulary to RDF.
>>
>> [1]  
>> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-November/024116.html  
>> [2]  
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#the-properties-of-an-item  
>> [3]  
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/converting-html-to-other-formats.html#rdf
>
>

Now, back to the problem of one property, multiple items. The algorithm  
for finding the properties of an item [2] is an attempt at optimizing the  
search for properties starting at an item element. I think we should  
replace this algorithm with an algorithm for finding the item of a  
property. This was previously the case with the spec before the itemref  
mechanism. I would suggest something along these lines:

1. let current be the element with the itemprop attribute
2. if current has an ID, for each element e in document order:
   2.1. if e has an itemref attribute:
        2.1.1. split the value of that itemref attribute on spaces. for
               each resulting token, ID:
               2.1.1.1. if ID equals the ID of current, return e
3. reaching this step indicates that the item wasn't found via itemref on
   this element
4. let parent be the parent element of current
5. if parent is null, return null
6. if parent has the itemscope attribute, return parent
7. otherwise, let current be parent and jump to step 2.

This algorithm will find the parent item of a property, if there is one.
itemref'ing takes precedence over "parent-child linking", so in Tim's
example the properties of Shanghai would be applied only to the Shanghai
sub-item. I'm not convinced writing markup like that is a good idea, but
at least this way it has sane processing. HTMLPropertiesCollection on any
given element would simply match all elements in the document for which
the algorithm returns that very element. It should be invalid for
there to be any elements in the document with itemprop where this
algorithm returns null or the element itself.
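
To make the steps concrete, here is a minimal Python sketch of the proposed
algorithm. It is not the MicrodataJS code and not spec text: the Element
class is a hypothetical stand-in for a DOM element, document_order() and
find_item() are names invented for this example, and splitting itemref with
str.split() (any whitespace) is an assumption about what "split on spaces"
should mean. The tree at the end is a trimmed version of Tim's example.

```python
from __future__ import annotations

from typing import Iterator, Optional


class Element:
    """Hypothetical stand-in for a DOM element (not a real DOM API)."""

    def __init__(self, tag: str, attrs: Optional[dict] = None) -> None:
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.parent: Optional[Element] = None
        self.children: list = []

    def append(self, child: Element) -> Element:
        child.parent = self
        self.children.append(child)
        return child


def document_order(root: Element) -> Iterator[Element]:
    """Yield root and all of its descendants in pre-order."""
    yield root
    for child in root.children:
        yield from document_order(child)


def find_item(prop: Element, root: Element) -> Optional[Element]:
    """Return the item that prop belongs to, or None if there is none."""
    current = prop                                   # step 1
    while True:
        current_id = current.attrs.get("id")
        if current_id:                               # step 2
            for e in document_order(root):
                # Steps 2.1 to 2.1.1.1: an itemref naming current wins.
                if current_id in e.attrs.get("itemref", "").split():
                    return e
        parent = current.parent                      # step 4
        if parent is None:                           # step 5
            return None
        if "itemscope" in parent.attrs:              # step 6
            return parent
        current = parent                             # step 7


# Trimmed version of the China/Shanghai markup.
country = Element("div", {"itemscope": "",
                          "itemtype": "http://url.to/geoVocab#country"})
link = country.append(Element("link", {"itemprop": "http://url.to/city",
                                       "itemscope": "",
                                       "itemref": "city-shanghai"}))
city = country.append(Element("div", {"id": "city-shanghai"}))
population = city.append(
    Element("span", {"itemprop": "http://url.to/demoVocab#population"}))

print(find_item(population, country).tag)  # the <link> sub-item
print(find_item(link, country).tag)        # the top-level <div>
```

With this tree, the population property resolves to the <link> sub-item
only, and the <link> property itself resolves to the top-level item.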

I will try implementing this algorithm in MicrodataJS [3] and see if it
works OK. While it may look less efficient than the current algorithm,
consider that a browser won't implement either algorithm as written, only
act as if it did. The expensive step of going through all elements with
itemref attributes is actually no more expensive than e.g.
document.querySelector('.classname') if implemented natively.
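
One way an implementation could avoid repeating that scan, sketched in
Python purely as an assumption (nothing here is taken from a browser or from
MicrodataJS): walk the elements once and remember, for each ID named in an
itemref, the first element in document order that names it. The function
name build_itemref_index() and the (label, itemref value) input pairs are
hypothetical simplifications of a real element list.

```python
def build_itemref_index(elements_in_document_order):
    """Map each ID named in an itemref to the first element naming it.

    elements_in_document_order is an iterable of (label, itemref_value)
    pairs; the label stands in for the element itself in this sketch.
    """
    index = {}
    for label, itemref_value in elements_in_document_order:
        for token in itemref_value.split():
            # setdefault keeps the earliest element in document order.
            index.setdefault(token, label)
    return index


index = build_itemref_index([
    ("link-shanghai", "city-shanghai"),
    ("div-other", "a b"),
    ("span-late", "a"),
])
print(index)
```

Looking up an ID in the resulting dictionary then replaces the per-property
scan over all elements with itemref attributes.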

[1]  
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-November/024095.html
[2]  
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#the-properties-of-an-item
[3] http://gitorious.org/microdatajs

-- 
Philip Jägenstedt