[whatwg] Microdata feedback

Ian Hickson ian at hixie.ch
Mon Jan 18 04:58:16 PST 2010


On Thu, 12 Nov 2009, Philip Jägenstedt wrote:
>
> I've been playing with the microdata DOM APIs again, continuing the 
> JavaScript experimental implementation 
> <http://gitorious.org/microdatajs>. It's not small or elegant, but at 
> least some spec issues have come up in the process.
> 
> What is the http://www.w3.org/1999/xhtml/microdata# URI?

It provides a way to map microdata property names to URLs in an 
unambiguous way.



> http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#associating-names-with-items
> 
> "Otherwise, if one of the other elements in pending is an ancestor 
> element of candidate, and that element is scope, then remove candidate 
> from pending."
> 
> "Otherwise, if one of the other elements in pending is an ancestor 
> element of candidate, and that element also has scope as its nearest 
> ancestor element with an itemscope attribute specified, then remove 
> candidate from pending."
> 
> The intention of these requirements seems to be to eliminate redundant 
> elements in pending, but a comment on the intention of each in the spec 
> would be helpful as it's quite cryptic right now.

Added some brief explanations.



> http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#microdata-dom-api
> 
> itemtype and itemid are both URL attributes and therefore when getting
> itemType and itemId relative URLs should be resolved (even if only absolute
> URLs are valid). Correct?

That was a correct interpretation of the spec, but was only intended to 
be the case for itemid. I've corrected the spec to say that itemType is 
just a regular DOMString with no resolution.


> itemprop and itemref are both "unordered set of unique space-separated
> tokens", but in HTMLElement only itemProp is a DOMSettableTokenList while
> itemRef is a DOMString. This doesn't really make sense, so make itemRef a
> DOMSettableTokenList too?

Fixed. That was an oversight.


> From reading the spec it's not obvious (without following cross- 
> references) that itemProp isn't just a plain string. An example using 
> .itemProp.contains(name) or similar would make this more difficult to 
> miss.

Done.



> http://www.whatwg.org/specs/vocabs/current-work/#vcard
> 
> Having clickable cross-references in this spec would help a lot when
> reviewing!

I've put them back in the HTML5 spec, which makes this a moot point.


> Grammar: Let value *be* the result of collecting the first vCard 
> subproperty named value in subitem.

Fixed.


> "Let n1 be the value of the first property named family-name in subitem, or
> the empty string if there is no such property or the property's value is
> itself an item." Why not use "collecting the first vCard subproperty" here?
> Not doing so had me trying to find how the two were different, but I couldn't
> find any differences given that the values are later escaped.

Oops. Fixed.


> There's also the issue of how newlines from textContent values are escaped.
> Applying the vCard extraction algorithm to the spec example gives:
> 
> BEGIN:VCARD
> PROFILE:VCARD
> VERSION:3.0
> SOURCE:http://foolip.org/microdatajs/demo/vcard.html
> NAME:vCard demo
> FN:Jack Bauer
> PHOTO;VALUE=URI:http://foolip.org/microdatajs/demo/jack-bauer.jpg
> ORG:Counter-Terrorist Unit;Los Angeles Division
> ADR:;;10201 W. Pico Blvd.;Los Angeles;CA;90064;United States
> GEO:34.052339;-118.410623
> TEL;TYPE=work:+1 (310)\n  597 3781
> URL;VALUE=URI:http://en.wikipedia.org/wiki/Jack_Bauer
> URL;VALUE=URI:http://www.jackbauerfacts.com/
> EMAIL:j.bauer at la.ctu.gov.invalid
> TEL;TYPE=cell:+1 (310) 555\n  3781
> NOTE:If I'm out in the field\, you may be better off\n contacting Chloe O'B
> rian if it's about\n work\, or ask Tony Almeida if\n you're interested in
> the CTU five-a-side football team we're trying\n to get going.
> AGENT;VALUE=VCARD:BEGIN:VCARD\nPROFILE:VCARD\nVERSION:3.0\nSOURCE:http://fo
> olip.org/microdatajs/demo/vcard.html\nNAME:vCard demo\nEMAIL\;VALUE=URI:ma
> ilto:c.obrian at la.ctu.gov.invalid\nFN:Chloe O'Brian\nN:O'Brian\;Chloe\;\;\;
> \nEND:VCARD\n
> AGENT:Tony Almeida
> REV:2008-07-20T21:00:00+0100
> TEL;TYPE=home:01632 960 123
> N:Bauer;Jack;;;
> END:VCARD
> 
> TEL and NOTE have line breaks that are just because of how the HTML source is
> formatted. Importing this into Gmail preserves these linebreaks which looks
> quite broken. Unless we expect text fields to contain meaningful formatting,
> perhaps simply collapsing all whitespace into a single space is OK? In the
> best of worlds <br> would be converted to \n, but I'm not sure if it's worth
> the trouble.

We're screwed either way. If we convert newlines to " ", then we lose 
formatting from <pre>. If we don't convert newlines, we gain spurious 
linebreaks (and spaces). The latter is less destructive, which is why I 
picked it, but it's not ideal, I agree.

I'd like at some point to introduce some sort of "semantic" textContent 
that handles <br>, <pre>, <bdo>, dir="", <img alt>, <del>, space- 
collapsing, and newline elimination, but there hasn't been much enthusiasm 
around the idea, and it's not clear what else it would be good for.

I've changed the example, at least, to have it work ok, and added a 
comment in the example about it.
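
For consumers that prefer the other trade-off, the whitespace-collapsing alternative discussed above could be sketched as follows. This is only an illustration of the idea, not something the spec's algorithm does; the collapseWhitespace name is made up for this example, and it would destroy any meaningful formatting from <pre>.

```javascript
// Collapse every run of whitespace (including newlines) into one space
// and drop leading/trailing whitespace. Hypothetical helper, not part
// of the spec's extraction algorithm.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// The TEL value from the extracted vCard above, with its source line break.
var tel = collapseWhitespace('+1 (310)\n  597 3781');
```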


> Finally on vCard, the final part of the extraction algorithm goes to 
> great trouble to guess what is the family name and what is the given 
> name. This guess will be broken for transliterated east Asian names 
> (CJKV that I know of, maybe others too). Just saying. Also, why is it 
> important to explicitly add N:;;;; for organizations?

This is intended to be compatible with Microformats vCard, which has 
these weird rules. If you think we should remove them, please at least 
first speak to Tantek and see what he thinks.



> http://www.whatwg.org/specs/vocabs/current-work/#vevent
> 
> "Add an iCalendar line with the type name and the value value to output."
> 
> At this point value is undefined.

Fixed.


> Given the algorithm for extracting iCal, it seems that dtstart and dtend must
> be specified using <time datetime="">, as it's only for time elements that the
> time stamps will be properly formatted (stripping - and :)

Yes. I've made this explicit (as with the URL ones).


> There are some errors in the example. I got it working by applying this diff:
> 
> --- vevent.js.orig	2009-11-11 10:52:37.000000000 +0100
> +++ vevent.js	2009-11-11 23:54:15.000000000 +0100
> @@ -1,3 +1,3 @@
> function getCalendar(node) {
> -  while (node && (!node.nodeScope || !node.itemType == 'http://microformats.org/profile/hcalendar#vevent'))
> +  while (node && (!node.itemScope || !node.itemType == 'http://microformats.org/profile/hcalendar#vevent'))
>     node = node.parentNode;

Fixed.
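
Note that in both versions of that line, `!node.itemType == '...'` negates itemType before comparing, so the comparison is never true and the loop stops at the first node with itemScope set, whatever its type. A minimal sketch of what the loop presumably intends, using plain objects in place of DOM nodes (the findCalendarItem name and the mock nodes are assumptions for illustration, not the spec's example):

```javascript
var VEVENT_TYPE = 'http://microformats.org/profile/hcalendar#vevent';

// Walk up the parentNode chain until reaching an item whose type is
// the vevent type, or return null if there is none.
function findCalendarItem(node) {
  while (node && (!node.itemScope || node.itemType !== VEVENT_TYPE))
    node = node.parentNode;
  return node;
}

// Mock nodes standing in for elements: leaf -> other item -> vevent item.
var eventItem = { itemScope: true, itemType: VEVENT_TYPE, parentNode: null };
var otherItem = { itemScope: true, itemType: 'http://example.org/other', parentNode: eventItem };
var leaf = { itemScope: false, itemType: '', parentNode: otherItem };
var found = findCalendarItem(leaf);
```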


> @@ -26,3 +26,3 @@
>       value = value.replace(/;/g, '\\;');
> -      value = value.replace(/,/g, \\,');
> +      value = value.replace(/,/g, '\\,');
>       value = value.replace(/\n/g, '\\n');

Fixed.
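
As a self-contained illustration of the escaping step in that hunk, here is a sketch of a helper that escapes a text value for an iCalendar line. The escapeText name is made up, and escaping backslashes first is an assumption on my part (it keeps the later replacements from being double-escaped); the hunk above only shows the semicolon, comma and newline replacements.

```javascript
// Escape a text value for use in an iCalendar content line.
// Assumption: backslashes are escaped first, then ';', ',' and newlines.
function escapeText(value) {
  return value
    .replace(/\\/g, '\\\\')
    .replace(/;/g, '\\;')
    .replace(/,/g, '\\,')
    .replace(/\r\n|\r|\n/g, '\\n');
}

var escaped = escapeText('work, or ask Tony;\nbye');
```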


> @@ -31,3 +31,3 @@
>       var name = prop.itemProp[nameIndex];
> -      if (!name.match(':') && !name.match('.'))
> +      if (!name.match(':') && !name.match('\\.'))
>         calendar += name.toUpperCase() + parameters + ':' + value + '\r\n';
> 
> Perhaps /\./ would be better to make it clear that it's a regexp.

Fixed.
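
For anyone wondering why the original test was wrong: match() converts a string argument to a regular expression, so '.' matches any character rather than a literal full stop. A small sketch (the hasDot helper is just for illustration):

```javascript
// '.' as a pattern matches any character, so this is true for any
// non-empty name.
var looseMatch = 'summary'.match('.') !== null;

// Escaping the dot (or using a regexp literal) tests for a literal '.'.
var strictMatch = 'summary'.match('\\.') !== null;

// Equivalent check written with a regexp literal.
function hasDot(name) {
  return /\./.test(name);
}
```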


> Also: if (prop.date && prop.time)
> 
> date and time aren't properties on HTMLTimeElement, I don't know what this is.

Oops. We removed .date and .time on the basis that there were no use 
cases, apparently oblivious to the use case in the spec already!


> Is there or should there be a DOM API for determining if a string is a valid
> date string other than implementing those algorithms in script?

There isn't... if you can assume the input is valid (e.g. if you wrote it 
yourself) then distinguishing a date from a datetime is easy (just test it 
against .match(/T/)), but if you want to implement the algorithm in the 
spec, you need more code.
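
A sketch of the easy case, assuming the input is already known to be a valid date or date-time string (the looksLikeDateTime name is made up, and this does no validation at all):

```javascript
// Only distinguishes the two forms; it does not check validity.
function looksLikeDateTime(value) {
  return /T/.test(value);
}

var isDateTime = looksLikeDateTime('2009-11-12T10:00:00Z');
var isDateOnly = !looksLikeDateTime('2009-11-12');
```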



> http://www.whatwg.org/specs/vocabs/current-work/#licensing-works
> 
> What's the n in http://n.whatwg.org/work? If this URL is going to stick, 
> it would be nice if there were also something to be seen at that page.

I've made it redirect to the spec.


> Also, the conversion to RDF section isn't really useful and seems to 
> hide some assumptions about how the properties vocabulary should be 
> prefixed with http://n.whatwg.org/work and how the 
> http://www.w3.org/1999/xhtml/microdata# prefix is supposed to be used.

I've tried to add more information here.



> http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#domtokenlist
> 
> The DOM intro box doesn't explain the return value for .toggle(), you 
> have to consult the algorithm to figure it out.

Fixed.



On Sat, 14 Nov 2009, Philip Jägenstedt wrote:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/converting-html-to-other-formats.html#json
> 
> This was easy to implement, but the algorithm isn't guaranteed to terminate.
> 
> <div itemscope>
>  <div itemprop="foo" itemscope itemref="oops" id="oops"></div>
> </div>
> 
> This simple input causes the algorithm to recurse as the item references 
> itself.
>
> I went back to the vCard algorithm and found that it too will fail to
> terminate with this input:
> 
> <span itemscope itemtype="http://microformats.org/profile/hcard">
>  <span itemprop="agent" itemscope id="oops" itemref="oops"
>        itemtype="http://microformats.org/profile/hcard"></span>
> </span>
> 
> vEvent is safe as the algorithm never recurses, but the RDF conversion
> algorithm would hit the same problem.

Hmm. For RDF it's a non-problem, we could just make sure that you don't 
generate the triples for a particular item more than once. For vCard and 
JSON, though, it's indeed an issue.


> The itemref mechanism allows creating arbitrary graphs of items, rather than
> the tree of items that is the intended microdata model (right?).

Graphs are intended to be supported in v2, using a mechanism 


> It's certainly possible to create loops which are less easy to spot:
> 
> <div itemscope>
>  <div itemprop="prop1" itemscope itemref="id2" id="id1"></div>
>  <div itemprop="prop2" itemscope itemref="id3" id="id2"></div>
>  ...
>  <div itemprop="propn" itemscope itemref="id1" id="idn"></div>
> </div>
> 
> Or this:
> 
> <div itemscope>
>  <div itemprop="foo" itemscope id="a">
>    <div itemprop="bar" itemscope itemref="a"></div>
>  </div>
> </div>

Or:

   <div itemscope itemref="a"></div>
   <div itemprop="p" itemscope id="a" itemref="b"></div>
   <div itemprop="q" itemscope id="b" itemref="a"></div>

...which creates an infinite tree:

   root item
     p: item
         q: item
             p: item
                 q: item
                     ...ad nauseam...

I've changed the way itemref="" is processed so that it catches loops and 
drops on the floor any nodes involving loops.
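
To illustrate the general idea (this is not the spec's algorithm, just one possible policy), here is a sketch that walks a map of items and skips any reference pointing back to an item already on the current path, so the result is always a finite tree. The data layout, the buildTree name and the ids are all assumptions for the example.

```javascript
// items maps a hypothetical id to { refs: [ids of nested items] }.
// Returns a finite tree; a reference that would close a loop, or that
// points at an unknown id, is dropped.
function buildTree(items, id, path) {
  path = path || [];
  if (path.indexOf(id) !== -1 || !Object.prototype.hasOwnProperty.call(items, id))
    return null;
  var nextPath = path.concat([id]);
  var children = [];
  items[id].refs.forEach(function (ref) {
    var child = buildTree(items, ref, nextPath);
    if (child) children.push(child);
  });
  return { id: id, children: children };
}

// Same shape as the looping example above: root -> a -> b -> a -> ...
var items = {
  root: { refs: ['a'] },
  a: { refs: ['b'] },
  b: { refs: ['a'] }
};
var tree = buildTree(items, 'root');
```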


On Fri, 13 Nov 2009, Tab Atkins Jr. wrote:
> 
> Looping in data-graphs is often useful, so I'm not sure I want to throw 
> it out generally.  Your statement in the first paragraph I'm quoting, 
> though, says that you'd rather leave loops to be defined in the 
> vocabulary itself?  So loops would be done by, frex, itemprop'ing a link 
> to the other element rather than itemref'ing the other element directly?
> 
> That would probably be fine, and is compatible with a tree-based data 
> model like JSON.  Vocabs should know when loops are 
> permissible/desirable for themselves.

Indeed.


On Sat, 14 Nov 2009, Philip Jägenstedt wrote:
> 
> Yes, that's basically what I'm saying. One option is to simply use 
> microdata such that the RDF you extract is the graph you want (it will 
> probably look quite ugly though). Another is always referencing subitems 
> by a mechanism other than refid. For example, in the MusicBrainz XML 
> webservice when an artist contains a release which itself references 
> artists (e.g. as the producer), a stub item is used with only artist 
> name and id, rather than including all information recursively. In 
> microdata I would do:
> 
> <section itemscope
> itemtype="http://musicbrainz.org/artist/"
> itemid="http://musicbrainz.org/artist/4d5447d7-c61c-4120-ba1b-d7f471d385b9">
>  <h1 itemprop="name">John Lennon</h1>
>  <section>
>   <h1>Releases</h1>
>   <section itemprop="release"
>    itemscope
>    itemtype="http://musicbrainz.org/release/"
>    itemid="http://musicbrainz.org/release/f237e6a0-4b0e-4722-8172-66f4930198bc"> 
>    <h1>Imagine</h1>
>    Producer:
>    <span itemprop="producer"
>     itemscope
>     itemtype="http://musicbrainz.org/artist/"
>     itemid="http://musicbrainz.org/artist/e7b587f7-e678-47c1-81dd-e7bb7855b0f9"
>     ><span itemprop="name">Phil Spector</span></span>
>   </section>
>  </section>
> </section>
> 
> Even if John Lennon were the producer here, you don't get any looping in 
> the microdata itself. If you want to know everything about the producer, 
> you should just follow the itemid... I haven't looked that much at the 
> RDF extraction algorithm yet, but I think this example might even create 
> the proper graph with loops if the producer were John Lennon.

Yes. There'd be two John Lennon items, though.


On Tue, 17 Nov 2009, Philip Jägenstedt wrote:
>
> http://www.whatwg.org/specs/vocabs/current-work/#examples
> 
> The Jack Bauer example has validation issues (using http://validator.nu/)
> 
> My fix:
> 
> --- jack.html.orig	2009-11-17 11:03:03.000000000 +0100
> +++ jack.html	2009-11-17 11:03:19.000000000 +0100
> @@ -41,12 +41,12 @@
>  you're interested in the CTU five-a-side football team we're trying
>  to get going.</p>
> - <ins datetime="2008-07-20T21:00:00+0100">
> + <ins datetime="2008-07-20T21:00:00+01:00">

Fixed.
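
If you have existing content with the colon-less offset, a sketch of a one-off fix-up might look like this. It assumes the input is a date-time string ending in a four-digit offset; the normalizeOffset name is made up for the example.

```javascript
// Rewrite a trailing +hhmm / -hhmm offset as +hh:mm / -hh:mm.
// Strings that already use the colon form are left unchanged.
function normalizeOffset(value) {
  return value.replace(/([+-]\d{2})(\d{2})$/, '$1:$2');
}

var fixedOffset = normalizeOffset('2008-07-20T21:00:00+0100');
```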


>   <span itemprop="rev" itemscope>
>    <meta itemprop="type" content="date-time">
> -   <meta itemprop="value" content="2008-07-20T21:00:00+0100">
> +   <meta itemprop="value" content="2008-07-20T21:00:00+01:00">

Fixed.


>   </span>
>   <p itemprop="tel" itemscope><strong>Update!</strong>
>   My new <span itemprop="type">home</span> phone number is
> -  <span itemprop="value">01632 960 123</span>.
> +  <span itemprop="value">01632 960 123</span>.</p>
>  </ins>
> </section>

Fixed.


On Thu, 19 Nov 2009, Philip Jägenstedt wrote:
>
> In a (slightly edited) Jack Bauer example [1], Chrome, Firefox and 
> presumably Safari have the meta elements moved to head. This will 
> severely break script-based implementations of microdata, which are 
> likely to be used for the time being until the DOM API is implemented 
> natively. I can't see any workaround for this, so I suggest that <meta> 
> simply not be used for microdata, preferably by making it non-conforming 
> and removing it from the definitions/algorithms.

This is a short-term problem that only affects scripted implementations 
that are shipped with the pages, so the workaround is simple: don't use 
<meta> and <link>. Any implementations outside of the page can just fix 
their parser to be HTML5-compatible.


> For <link>, the rel attribute issue [2] needs to be settled. It seems to 
> me that sometimes requiring rel and sometimes not makes for a less 
> consistent language with more room for error.

No version of HTML has ever required rel="" (in the past <link> was valid, 
but even ignoring that, you could have used rev="" instead of rel=""). 
People seem to have survived. I don't think it's a huge problem to use it 
with itemprop="" instead.


> I hesitate to make an argument based on aesthetics, but I think 
> repurposing either <link> or <meta> for use in microdata is decidedly 
> ugly, mostly because my legacy understanding of them is as "<head> only 
> elements". In the usability study [3] there was only one example which 
> used <link> and <meta> [4]. Was there any indication then that any of 
> the test subjects were put off by either <link> or <meta>?

None that I recall, but I'd have to review the tapes to be sure.


> Both <meta> and <link> are used only to include non-visible metadata in 
> the item. Philip Taylor points out in IRC that these work equally well:
> 
> <span hidden itemprop=foo>bar</span> (instead of meta)

That's bogus. Ideally we'd in fact make the microdata algorithm actively 
drop everything inside a hidden="" block, since it means it's not 
relevant. That would require even more tree walking, though, so I'd rather 
not do that.


> <a itemprop=foo href=bar></a> (instead of link)

That's far more ugly than <link>, IMHO! :-)


On Thu, 26 Nov 2009, Tim van Oostrom wrote:
>
> Hi, I made a forumpost : http://forums.whatwg.org/viewtopic.php?t=4176, 
> concerning a possible "microdata specification bug" and a bug in the 
> james.html5.org microdata extractor.
> 
> Comes down to <link/> and <meta/> elements possibly being unfit for use 
> with the itemscope attribute.
> 
> I made an example in the forum post with some nice ubb formatting.

You're right that using itemscope="" with <link> and <meta> isn't 
particularly useful. It's even more pointless on <br>, or <script>, or 
<style>. You MAY use it on any element, but it's indeed not really useful 
in most cases. It's just easier for everyone if we don't say "it can be 
used on MOST elements, but here's a list of exceptions".

Regarding the second point, I wasn't sure what you meant. id="city" should 
be fine.

I didn't really follow your other e-mails; are there any changes you think 
should be made to the spec?


On Sun, 29 Nov 2009, Philip Jägenstedt wrote:
> 
> Yes, the spec certainly needs some notes on how to use <link> and 
> <meta>.

There are some examples that use them... What do you think should be 
mentioned in a note, if we add one?


> Now, back to the problem of one property, multiple items. The algorithm 
> for finding the properties of an item [2] is an attempt at optimizing 
> the search for properties starting at an item element. I think we should 
> replace this algorithm with an algorithm for finding the item of a 
> property. This was previously the case with the spec before the itemref 
> mechanism.

This would preclude using the same itemprops from multiple itemrefs. 
That's not a particularly strong use case, but it seems odd to disallow 
it given the markup feature.


On Mon, 30 Nov 2009, Philip Jägenstedt wrote:
>
> This way the microdata model is kept strictly tree-like.

The model is still tree-like. It's just that the mapping of the model to 
the DOM has multiple nodes mapped to the same DOM nodes.


On Tue, 22 Dec 2009, Futomi Hatano wrote:
> 
> I wonder if there is a typo in the example of Microdata Vocabularies: vCard.
> http://www.whatwg.org/specs/vocabs/current-work/#examples
> There are three examples. Could you see the second example?
> 
> <strong title="fn">Alfred Person</strong>
> 
> I wonder if the title attribute is incorrect.
> The itemprop attribute is correct, isn't it?
> 
> <strong itemprop="fn">Alfred Person</strong>

Thanks, fixed.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

