[whatwg] Exposing known data types in a reusable way

Tue May 19 16:07:15 PDT 2009

Some of the use cases I collected from the e-mails sent in over the past 
few months were the following:

   USE CASE: Exposing contact details so that users can add people to their
   address books or social networking sites.

   SCENARIOS:
     * Instead of giving a colleague a business card, someone gives their
       colleague a URL, and that colleague's user agent extracts basic
       profile information such as the person's name along with references to
       other people that person knows and adds the information into an
       address book.
     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about who he is to add it to
       their contact databases.
     * Fred copies the name of one of his Facebook friends and pastes it
       into his OS address book; the contact information is imported
       automatically.
     * Fred copies the name of one of his Facebook friends and pastes it
       into his Webmail's address book feature; the contact information is
       imported automatically.
     * David can use the data in a web page to generate a custom browser UI
       for including a person in our address book without using brittle
       screen-scraping.

   REQUIREMENTS:
     * A user joining a new social network should be able to identify himself
       to the new social network in a way that enables the new social network
       to bootstrap his account from existing published data (e.g. from
       another social network) rather than having to re-enter it, without the
       new site having to coordinate (or know about) the pre-existing site,
       without the user having to give either site's credentials to the other,
       and without the new site finding out about relationships that the user
       has intentionally kept secret.
       (http://w2spconf.com/2008/papers/s3p2.pdf)
     * Data should not need to be duplicated between machine-readable and
       human-readable forms (i.e. the human-readable form should be
       machine-readable).
     * Shouldn't require the consumer to write XSLT or server-side code to
       read the contact information.
     * Machine-readable contact information shouldn't be on a separate page
       than human-readable contact information.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, vCard) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.
     * Should be possible for different parts of a contact to be given in
       different parts of the page. For example, a page with contact details
       for people in columns (with each row giving the name, telephone
       number, etc) should still have unambiguous grouped contact details
       parseable from it.
     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.

   USE CASE: Exposing calendar events so that users can add those events to
   their calendaring systems.

   SCENARIOS:
     * A user visits the Avenue Q site and wants to make a note of when
       tickets go on sale for the tour's stop in his home town. The site says
       "October 3rd", so the user clicks this and selects "add to calendar",
       which causes an entry to be added to his calendar.
     * A student is making a timeline of important events in Apple's history.
       As he reads Wikipedia entries on the topic, he clicks on dates and
       selects "add to timeline", which causes an entry to be added to his
       timeline.
     * TV guide listings - browsers should be able to expose to the user's
       tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
     * Paul sometimes gives talks on various topics, and announces them on
       his blog. He would like to mark up these announcements with proper
       scheduling information, so that his readers' software can
       automatically obtain the scheduling information and add it to their
       calendar. Importantly, some of the rendered data might be more
       informal than the machine-readable data required to produce a calendar
       event.
     * David can use the data in a web page to generate a custom browser UI
       for adding an event to our calendaring software without using brittle
       screen-scraping.
     * http://livebrum.co.uk/: the author would like people to be able to
       grab events and event listings from his site and put them on their
       site with as much information as possible retained. "The fantasy would
       be that I could provide code that could be cut and pasted into someone
       else's HTML so the average blogger could re-use and re-share my data."
     * User should be able to subscribe to http://livebrum.co.uk/ then sort
       by date and see the items sorted by event date, not publication date.

   REQUIREMENTS:
     * Should be discoverable.
     * Should be compatible with existing calendar systems.
     * Should be unlikely to get out of sync with prose on the page.
     * Shouldn't require the consumer to write XSLT or server-side code to
       read the calendar information.
     * Machine-readable event data shouldn't be on a separate page than
       human-readable dates.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, iCalendar) in a consistent manner, so that tools that use
       this information separate from the pages on which it is found have a
       standard way of conveying the information.
     * Should be possible for different parts of an event to be given in
       different parts of the page. For example, a page with calendar events
       in columns (with each row giving the time, date, place, etc) should
       still have unambiguous calendar events parseable from it.
     * Should be possible for authors to find out if people are reusing the
       information on their site.
     * Code should not be ugly (e.g. should not be mixed in with markup used
       mostly for styling).
     * There should be "obvious parsing tools for people to actually do
       anything with the data (other than add an event to a calendar)".
     * Solution should not feel "disconnected" from the Web the way that
       calendar file downloads do.
     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.

   USE CASE: Allow users to maintain bibliographies or otherwise keep track
   of sources of quotes or references.

   SCENARIOS:
     * Frank copies a sentence from Wikipedia and pastes it in some word
       processor: it would be great if the word processor offered to
       automatically create a bibliographic entry.
     * Patrick keeps a list of his scientific publications on his web site.
       He would like to provide structure within this publications page so
       that Frank can automatically extract this information and use it to
       cite Patrick's papers without having to transcribe the bibliographic
       information.
     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about what he has published
       to add it to their bibliographic applications.
     * A scholar and teacher wants to publish scholarly documents or content
       that includes extensive citations that readers can then automatically
       extract so that they can find them in their local university library.
       These citations may be for a wide range of different sources: an
       interview posted on YouTube, a legal opinion posted on the Supreme
       Court web site, a press release from the White House.

   REQUIREMENTS:
     * Machine-readable bibliographic information shouldn't be on a separate
       page than human-readable bibliographic information.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, BibTeX) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.
     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.

The first two use cases can basically be done today using the hCard and 
hCalendar Microformats, but the parsing rules for these Microformats are 
somewhat vague, and they aren't easily extensible without hardcoding 
extensions into parsers.

I propose, therefore, to take the hCard and hCalendar vocabularies, and 
recast them onto the new microdata model.

   http://www.whatwg.org/specs/web-apps/current-work/#vcard
   http://www.whatwg.org/specs/web-apps/current-work/#vevent

I have used the knowledge and experience collected and carefully 
documented by the Microformats team on their wiki, and written a direct 
mapping of those vocabularies to microdata, along with very explicit 
definitions for how to convert this data to vCard and iCalendar files, 
something which was lacking in the hCard and hCalendar definitions:

   http://www.whatwg.org/specs/web-apps/current-work/#vcard-0
   http://www.whatwg.org/specs/web-apps/current-work/#icalendar

The third use case requires a vocabulary for citations, which isn't 
something for which a widely deployed solution exists in text/html yet. 

There are a large number of options:

 - Refer
 - RIS
 - BibTeX
 - Metadata Object Description Schema
 - Z39.80
 - Dublin Core and variants thereof
 - part of Journal Publishing Tag Set Tag Library
 - part of XML Resume
 - part of OOXML
 - part of ODF
 - part of DocBook
 - the Ann Arbor District Library XML format
 - SRU
 - My alma mater's format (University of Bath reference type)
 - Bibliontology
 - The Citation Oriented Bibliographic Vocabulary
 - ISBD
 - OpenURL COinS

...and many more.

A case could probably be made for any one of these. Based on availability 
of tools, simplicity in the format (just name-value pairs vs deeply nested 
trees of typed data), actual use in citation-happy fields, extensibility, 
use of an understandable vocabulary (e.g. "author" vs "%A"), etc, I ended 
up picking the BibTeX vocabulary. It isn't perfect; for example, it's not 
going to be a great solution for citing YouTube clips yet. But since it is 
relatively easy to extend (and indeed, it has historically been extended 
by several groups), it seems like if this feature gets good adoption, we 
will be able to extend it to support more types.

Thus, BibTeX vocabulary for microdata:

   http://www.whatwg.org/specs/web-apps/current-work/#bibtex

Exporting microdata to BibTeX:

   http://www.whatwg.org/specs/web-apps/current-work/#bibtex-0

The vocabularies and exports are pretty much useless on their own, though. 
Two things make them actually useful:

 - There's a scripting API that exposes the microdata, so people can 
   write generic client-side scripts to expose data on the page, and

 - User agents are now required to export vCard, iCalendar, and BibTeX 
   when someone drags a selection that includes data marked up with those 
   vocabularies.

The latter in particular is IMHO very important. Both of these features 
require browser implementation support, which IMHO is important to making 
anything like this work widely (and has been a sore point with previous 
solutions in this space).

I shall now go through the scenarios and requirements to show how they can 
now be addressed.

   USE CASE: Exposing contact details so that users can add people to their
   address books or social networking sites.

   SCENARIOS:
     * Instead of giving a colleague a business card, someone gives their
       colleague a URL, and that colleague's user agent extracts basic
       profile information such as the person's name along with references to
       other people that person knows and adds the information into an
       address book.

This is possible today without using HTML: just make the URL point to a 
vCard text/directory resource.

     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about who he is to add it to
       their contact databases.

This is now easy -- given microdata with a vCard, the scholars need but 
drag that information to their contact databases, and assuming those 
contact databases support vCard, they can import the information directly. 
Alternatively, a script can be written in less than 200 lines of code to 
convert the microdata to vCard (or other formats) for direct download. (I 
wrote proof-of-concept scripts using the APIs in the spec to export vCard, 
vEvent, and BibTeX data. The vCard one was about 140 lines; the BibTeX one 
was about 60 lines. The vEvent one is in the spec as an example -- search 
for getCalendar() -- and is less than 40 lines.)
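
To give a feel for how small the serialization half of such a script can be, 
here is a minimal sketch in Python. It is not the conversion algorithm from 
the spec and not one of my proof-of-concept scripts; the `props` dictionary 
is a hypothetical stand-in for properties already read from the page, and 
the names and values are made up. It only handles simple text properties 
(structured ones such as N and ADR use unescaped semicolons as separators), 
and it does no line folding.

```python
def escape_text(value: str) -> str:
    """Escape backslash, newline, comma and semicolon in a vCard text value."""
    return (value.replace("\\", "\\\\")
                 .replace("\n", "\\n")
                 .replace(",", "\\,")
                 .replace(";", "\\;"))


def to_vcard(props: dict[str, list[str]]) -> str:
    """Serialize simple name -> values pairs as vCard 3.0 text.

    `props` is a hypothetical input shape, not something the spec defines.
    A complete vCard 3.0 would also need an N property, omitted here.
    """
    lines = ["BEGIN:VCARD", "VERSION:3.0"]
    for name, values in props.items():
        for value in values:
            lines.append(f"{name.upper()}:{escape_text(value)}")
    lines.append("END:VCARD")
    # vCard content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


card = to_vcard({
    "fn": ["Fred Example"],
    "tel": ["+1 555 0100"],
    "note": ["Met at a conference, 2009"],
})
print(card)
```

The bulk of a real exporter is in reading the properties from the page and 
in handling the structured vCard properties, which this sketch skips.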

     * Fred copies the name of one of his Facebook friends and pastes it
       into his OS address book; the contact information is imported
       automatically.

Assuming the OS address book supports vCard, this is now supported 
natively -- all Facebook has to do is encode the data as vCard microdata.

     * Fred copies the name of one of his Facebook friends and pastes it
       into his Webmail's address book feature; the contact information is
       imported automatically.

If his Webmail supports HTML5 drag and drop (copy-and-paste is defined in 
terms of drag-and-drop), then an HTML5 user agent will include all the 
microdata of the copied selection in a JSON blob, including the vCard 
data. (Actual vCard will also be included.) This is thus now supported 
automatically, assuming that both sites use the same vocabulary and 
implement the drag-and-drop API, and that the user has an HTML5 browser.

     * David can use the data in a web page to generate a custom browser UI
       for including a person in our address book without using brittle
       screen-scraping.

The spec defines exactly how to get a vCard out of a random HTML page, so 
screen-scraping should no longer be necessary.

   REQUIREMENTS:
     * A user joining a new social network should be able to identify himself
       to the new social network in a way that enables the new social network
       to bootstrap his account from existing published data (e.g. from
       another social network) rather than having to re-enter it, without the
       new site having to coordinate (or know about) the pre-existing site,
       without the user having to give either site's credentials to the other,
       and without the new site finding out about relationships that the user
       has intentionally kept secret.
       (http://w2spconf.com/2008/papers/s3p2.pdf)

Assuming both sites support the same vocabulary and can identify people 
uniquely somehow, this is now possible using microdata (just as it has 
been possible using custom microformat-like vocabularies before, or RDFa 
and other embedded data formats before). Whether sites will support this 
is up to the sites in question; I see no way to force the issue.

As far as I can tell the privacy problem listed above is not intrinsically 
solved by the microdata solution. I cannot find a solution to those 
problems at the HTML level; they seem inherently application-bound.

     * Data should not need to be duplicated between machine-readable and
       human-readable forms (i.e. the human-readable form should be
       machine-readable).

By and large, this is met. For some of the more esoteric vEvent features 
(like repeating rules) I have opted for not really supporting them 
natively, but just allowing authors to use the vEvent rules directly. This 
is not really an issue as far as I can tell because those features aren't 
widely used (and even seem to be getting dropped in the newer version of 
iCalendar).

     * Shouldn't require the consumer to write XSLT or server-side code to
       read the contact information.

While it's possible for people to write custom code to process this data, 
the spec requires browsers to support this natively, making this 
unnecessary for these vocabularies.

     * Machine-readable contact information shouldn't be on a separate page
       than human-readable contact information.

This requirement is met.

     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, vCard) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.

I haven't defined a way to convert this data to XML, but I have provided 
explicit ways to convert to JSON, RDF, and vCard.

     * Should be possible for different parts of a contact to be given in
       different parts of the page. For example, a page with contact details
       for people in columns (with each row giving the name, telephone
       number, etc) should still have unambiguous grouped contact details
       parseable from it.

Using subject="", this is possible.
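
The grouping idea can be illustrated with a rough Python sketch. This is 
not the spec's extraction algorithm. It assumes a property attribute named 
itemprop alongside subject="", made-up ids (c1, c2) and made-up contact 
data, and it assumes that property elements contain only text. It simply 
collects properties scattered across table cells and groups them by the 
subject they point at.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SubjectGrouper(HTMLParser):
    """Group the text of elements that carry both itemprop and subject."""

    def __init__(self):
        super().__init__()
        self.records = defaultdict(dict)  # subject id -> {property: text}
        self._open = None                 # (subject, name, tag) while inside a property
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._open is None and "itemprop" in attr and "subject" in attr:
            self._open = (attr["subject"], attr["itemprop"], tag)
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[2]:
            subject, name, _ = self._open
            self.records[subject][name] = "".join(self._text).strip()
            self._open = None


# Hypothetical page: one person per column, one kind of detail per row.
PAGE = """
<table>
 <tr><th>Name</th>
     <td subject="c1" itemprop="fn">Alice Example</td>
     <td subject="c2" itemprop="fn">Bob Example</td></tr>
 <tr><th>Phone</th>
     <td subject="c1" itemprop="tel">555-0100</td>
     <td subject="c2" itemprop="tel">555-0101</td></tr>
</table>
"""

grouper = SubjectGrouper()
grouper.feed(PAGE)
grouper.close()
contacts = dict(grouper.records)
print(contacts)
```

Each column ends up as one grouped record even though its properties are 
spread over different rows of the markup.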

     * Parsing rules should be unambiguous.

I hope the parsing rules described in the spec are clear enough. Please 
let me know if there are any problems.

     * Should not require changes to HTML5 parsing rules.

The HTML5 parsing rules did not change.

   USE CASE: Exposing calendar events so that users can add those events to
   their calendaring systems.

   SCENARIOS:
     * A user visits the Avenue Q site and wants to make a note of when
       tickets go on sale for the tour's stop in his home town. The site says
       "October 3rd", so the user clicks this and selects "add to calendar",
       which causes an entry to be added to his calendar.

As demonstrated in the spec, it is now relatively easy to expose this data, 
and it requires little code to convert this data into a form supported by 
most calendars. In addition, this can also be supported using 
copy-and-paste or drag-and-drop if the source, destination, and browser 
all cooperate according to the spec.
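
As a rough illustration of the conversion step (this is not the 
getCalendar() example from the spec), the following Python sketch writes a 
single all-day event as iCalendar text. The function name, the UID, the 
PRODID string, and the year are all made up for the example; the summary is 
assumed to contain no characters that need escaping.

```python
from datetime import date, datetime, timezone


def to_icalendar(summary: str, start: date, uid: str, stamp: datetime) -> str:
    """Serialize one all-day event; `stamp` must be a timezone-aware datetime."""
    utc_stamp = stamp.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//microdata sketch//EN",  # hypothetical product id
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{utc_stamp}",
        f"DTSTART;VALUE=DATE:{start.strftime('%Y%m%d')}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


ics = to_icalendar(
    summary="Tickets go on sale",
    start=date(2009, 10, 3),                     # "October 3rd"; the year is assumed
    uid="tickets-on-sale@example.invalid",       # hypothetical identifier
    stamp=datetime(2009, 5, 19, 23, 7, 15, tzinfo=timezone.utc),
)
print(ics)
```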

     * A student is making a timeline of important events in Apple's history.
       As he reads Wikipedia entries on the topic, he clicks on dates and
       selects "add to timeline", which causes an entry to be added to his
       timeline.

I couldn't find a way to address this as described unless Wikipedia and 
the timeline utility cooperated directly. (Drag-and-drop and copy-and- 
paste cases can be easily supported, though.)

     * TV guide listings - browsers should be able to expose to the user's
       tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.

Assuming TV guide listings can be described in vEvent form, this is now 
possible using drag-and-drop and copy-and-paste.

     * Paul sometimes gives talks on various topics, and announces them on
       his blog. He would like to mark up these announcements with proper
       scheduling information, so that his readers' software can
       automatically obtain the scheduling information and add it to their
       calendar. Importantly, some of the rendered data might be more
       informal than the machine-readable data required to produce a calendar
       event.

This seems easily handled now.

     * David can use the data in a web page to generate a custom browser UI
       for adding an event to our calendaring software without using brittle
       screen-scraping.

The example in the spec demonstrates that this is now possible with 
relatively little code.

     * http://livebrum.co.uk/: the author would like people to be able to
       grab events and event listings from his site and put them on their
       site with as much information as possible retained. "The fantasy would
       be that I could provide code that could be cut and pasted into someone
       else's HTML so the average blogger could re-use and re-share my data."

I have included an example in the spec from livebrum.co.uk showing how 
this is possible.

     * User should be able to subscribe to http://livebrum.co.uk/ then sort
       by date and see the items sorted by event date, not publication date.

This isn't directly possible, but if a tool exists that can sort event 
data by date, then given the event data it seems possible to do this 
easily. For example, a Web Calendar product could support parsing 
microdata vEvents out of a Web page and then could offer to subscribe to 
such a page as a feed.
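
Once such a tool has the event data, the sorting itself is trivial. A small 
Python sketch, with made-up items and hypothetical field names (`published` 
for the feed's publication date, `dtstart` for the event's start date):

```python
from datetime import date

# Hypothetical items as a feed reader might hold them after parsing the page.
items = [
    {"summary": "Gallery opening", "published": date(2009, 5, 1), "dtstart": date(2009, 7, 20)},
    {"summary": "Jazz night", "published": date(2009, 5, 3), "dtstart": date(2009, 6, 2)},
    {"summary": "Film screening", "published": date(2009, 5, 2), "dtstart": date(2009, 6, 15)},
]

# Sort by when the event happens rather than by when it was announced.
by_event_date = sorted(items, key=lambda item: item["dtstart"])

for item in by_event_date:
    print(item["dtstart"].isoformat(), item["summary"])
```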

   REQUIREMENTS:
     * Should be discoverable.

This isn't met by the microdata vEvent vocabulary intrinsically. I expect 
that a convention will arise where people put little icons near their 
microdata saying "look, we have vEvent data you can drag to your 
calendar!" or some such.

     * Should be compatible with existing calendar systems.

The vEvent part of iCalendar is well established, so this seems met, at 
least in principle. The details (e.g. drag and drop support) probably need 
some work.

     * Should be unlikely to get out of sync with prose on the page.

By making the prose on the page the source for the microdata, this seems 
resolved.

     * Shouldn't require the consumer to write XSLT or server-side code to
       read the calendar information.

This is mostly met in the same way as for contact data.

     * Machine-readable event data shouldn't be on a separate page than
       human-readable dates.

This is achieved using inline microdata.

     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, iCalendar) in a consistent manner, so that tools that use
       this information separate from the pages on which it is found have a
       standard way of conveying the information.

Output in all those formats except raw XML is explicitly supported in the 
spec.

     * Should be possible for different parts of an event to be given in
       different parts of the page. For example, a page with calendar events
       in columns (with each row giving the time, date, place, etc) should
       still have unambiguous calendar events parseable from it.

subject="" supports this.

     * Should be possible for authors to find out if people are reusing the
       information on their site.

This isn't met. I couldn't find a good way to do this. When JavaScript is 
enabled, drag-and-drop, copy-and-paste, and other mechanisms can be 
detected and logged via script, but really there's no good way to detect 
all uses of microdata. (Providing a ping=""-like feature for this seems 
like overkill and wouldn't help with non-end-user use anyway.)

     * Code should not be ugly (e.g. should not be mixed in with markup used
       mostly for styling).

This appears to be met.

     * There should be "obvious parsing tools for people to actually do
       anything with the data (other than add an event to a calendar)".

There aren't any obvious tools yet, but since two separate implementations 
arose in less than 24 hours from the point where the microdata stuff was 
released, it seems like this will prove easy enough to do.

     * Solution should not feel "disconnected" from the Web the way that
       calendar file downloads do.

This seems met.

     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.

The same applies here as with vCard.

   USE CASE: Allow users to maintain bibliographies or otherwise keep track
   of sources of quotes or references.

   SCENARIOS:
     * Frank copies a sentence from Wikipedia and pastes it in some word
       processor: it would be great if the word processor offered to
       automatically create a bibliographic entry.

This will require new code in the word processor, but the data copied from 
an HTML5-compliant browser would, according to this proposal, include the 
information required to do this.

     * Patrick keeps a list of his scientific publications on his web site.
       He would like to provide structure within this publications page so
       that Frank can automatically extract this information and use it to
       cite Patrick's papers without having to transcribe the bibliographic
       information.

This seems to be handled directly now if the page is written using the 
BibTeX vocabulary.
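
For illustration, here is a minimal Python sketch of what writing out one 
extracted entry as BibTeX might look like. It is not the export algorithm 
defined in the spec; the function name, the citation key, and the field 
values are made up, and values are assumed not to contain unbalanced braces.

```python
def to_bibtex(entry_type: str, key: str, fields: dict[str, str]) -> str:
    """Format one entry as BibTeX, wrapping every field value in braces."""
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@{entry_type}{{{key},\n{body}\n}}\n"


entry = to_bibtex("article", "example2009", {
    "author": "Patrick Example",
    "title": "A Hypothetical Paper",
    "journal": "Journal of Examples",
    "year": "2009",
})
print(entry)
```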

     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about what he has published
       to add it to their bibliographic applications.

This seems met in the same way.

     * A scholar and teacher wants to publish scholarly documents or content
       that includes extensive citations that readers can then automatically
       extract so that they can find them in their local university library.
       These citations may be for a wide range of different sources: an
       interview posted on YouTube, a legal opinion posted on the Supreme
       Court web site, a press release from the White House.

Not all of these types are immediately supported by the BibTeX vocabulary. 
I recommend that we extend the BibTeX set over time if this feature gains 
a critical mass.

   REQUIREMENTS:
     * Machine-readable bibliographic information shouldn't be on a separate
       page than human-readable bibliographic information.

This is met.

     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, BibTeX) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.

This is met explicitly for three of those types; for the remaining type 
(XML) it can be done easily enough, though it is not defined in the spec.
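
Since the spec defines no XML form, any mapping is necessarily ad hoc. As 
one hypothetical possibility, sketched in Python, each field could become a 
child element named after the property. The element and attribute names 
here (`entry`, `type`) and the sample values are invented for the example.

```python
import xml.etree.ElementTree as ET


def to_xml(entry_type: str, fields: dict[str, str]) -> str:
    """Map name/value pairs onto an ad hoc XML element tree and serialize it."""
    root = ET.Element("entry", {"type": entry_type})
    for name, value in fields.items():
        # Field names are assumed to be valid XML element names.
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml("article", {"author": "Patrick Example", "title": "Cats & Dogs"})
print(xml_text)
```

Tools exchanging such XML would still have to agree on the mapping among 
themselves, which is the part a spec-defined conversion would settle.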

     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.

These are met in the same way as with vCard and vEvent microdata.

In conclusion, to address these use cases and scenarios I've introduced 
three vocabularies based on past practices -- vCard, vEvent, and BibTeX -- 
to the HTML5 specification, and I've defined how these vocabularies work 
in the context of the drag-and-drop model, which I believe is the core 
part of this proposal that has been lacking in other proposals previously.

A number of further use cases remain to be examined, including one with 
scenarios regarding validating custom vocabularies and allowing editors to 
provide help with custom vocabularies. I will send further e-mail next 
week as I address them.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'