[whatwg] Exposing known data types in a reusable way
Ian Hickson
ian at hixie.ch
Tue May 19 16:07:15 PDT 2009
Some of the use cases I collected from the e-mails sent in over the past
few months were the following:
USE CASE: Exposing contact details so that users can add people to their
address books or social networking sites.
SCENARIOS:
* Instead of giving a colleague a business card, someone gives their
colleague a URL, and that colleague's user agent extracts basic
profile information such as the person's name along with references to
other people that person knows and adds the information into an
address book.
* A scholar and teacher wants other scholars (and potentially students)
to be able to easily extract information about who he is to add it to
their contact databases.
* Fred copies the names of one of his Facebook friends and pastes it
into his OS address book; the contact information is imported
automatically.
* Fred copies the names of one of his Facebook friends and pastes it
into his Webmail's address book feature; the contact information is
imported automatically.
* David can use the data in a web page to generate a custom browser UI
for including a person in our address book without using brittle
screen-scraping.
REQUIREMENTS:
* A user joining a new social network should be able to identify himself
to the new social network in way that enables the new social network
to bootstrap his account from existing published data (e.g. from
another social nework) rather than having to re-enter it, without the
new site having to coordinate (or know about) the pre-existing site,
without the user having to give either sites credentials to the other,
and without the new site finding out about relationships that the user
has intentionally kept secret.
(http://w2spconf.com/2008/papers/s3p2.pdf)
* Data should not need to be duplicated between machine-readable and
human-readable forms (i.e. the human-readable form should be
machine-readable).
* Shouldn't require the consumer to write XSLT or server-side code to
read the contact information.
* Machine-readable contact information shouldn't be on a separate page
than human-readable contact information.
* The information should be convertible into a dedicated form (RDF,
JSON, XML, vCard) in a consistent manner, so that tools that use this
information separate from the pages on which it is found have a
standard way of conveying the information.
* Should be possible for different parts of a contact to be given in
different parts of the page. For example, a page with contact details
for people in columns (with each row giving the name, telephone
number, etc) should still have unambiguous grouped contact details
parseable from it.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.
USE CASE: Exposing calendar events so that users can add those events to
their calendaring systems.
SCENARIOS:
* A user visits the Avenue Q site and wants to make a note of when
tickets go on sale for the tour's stop in his home town. The site says
"October 3rd", so the user clicks this and selects "add to calendar",
which causes an entry to be added to his calendar.
* A student is making a timeline of important events in Apple's history.
As he reads Wikipedia entries on the topic, he clicks on dates and
selects "add to timeline", which causes an entry to be added to his
timeline.
* TV guide listings - browsers should be able to expose to the user's
tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
* Paul sometimes gives talks on various topics, and announces them on
his blog. He would like to mark up these announcements with proper
scheduling information, so that his readers' software can
automatically obtain the scheduling information and add it to their
calendar. Importantly, some of the rendered data might be more
informal than the machine-readable data required to produce a calendar
event.
* David can use the data in a web page to generate a custom browser UI
for adding an event to our calendaring software without using brittle
screen-scraping.
* http://livebrum.co.uk/: the author would like people to be able to
grab events and event listings from his site and put them on their
site with as much information as possible retained. "The fantasy would
be that I could provide code that could be cut and pasted into someone
else's HTML so the average blogger could re-use and re-share my data."
* User should be able to subscribe to http://livebrum.co.uk/ then sort
by date and see the items sorted by event date, not publication date.
REQUIREMENTS:
* Should be discoverable.
* Should be compatible with existing calendar systems.
* Should be unlikely to get out of sync with prose on the page.
* Shouldn't require the consumer to write XSLT or server-side code to
read the calendar information.
* Machine-readable event data shouldn't be on a separate page than
human-readable dates.
* The information should be convertible into a dedicated form (RDF,
JSON, XML, iCalendar) in a consistent manner, so that tools that use
this information separate from the pages on which it is found have a
standard way of conveying the information.
* Should be possible for different parts of an event to be given in
different parts of the page. For example, a page with calendar events
in columns (with each row giving the time, date, place, etc) should
still have unambiguous calendar events parseable from it.
* Should be possible for authors to find out if people are reusing the
information on their site.
* Code should not be ugly (e.g. should not be mixed in with markup used
mostly for styling).
* There should be "obvious parsing tools for people to actually do
anything with the data (other than add an event to a calendar)".
* Solution should not feel "disconnected" from the Web the way that
calendar file downloads do.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.
USE CASE: Allow users to maintain bibliographies or otherwise keep track
of sources of quotes or references.
SCENARIOS:
* Frank copies a sentence from Wikipedia and pastes it in some word
processor: it would be great if the word processor offered to
automatically create a bibliographic entry.
* Patrick keeps a list of his scientific publications on his web site.
He would like to provide structure within this publications page so
that Frank can automatically extract this information and use it to
cite Patrick's papers without having to transcribe the bibliographic
information.
* A scholar and teacher wants other scholars (and potentially students)
to be able to easily extract information about what he has published
to add it to their bibliographic applications.
* A scholar and teacher wants to publish scholarly documents or content
that includes extensive citations that readers can then automatically
extract so that they can find them in their local university library.
These citations may be for a wide range of different sources: an
interview posted on YouTube, a legal opinion posted on the Supreme
Court web site, a press release from the White House.
REQUIREMENTS:
* Machine-readable bibliographic information shouldn't be on a separate
page than human-readable bibliographic information.
* The information should be convertible into a dedicated form (RDF,
JSON, XML, BibTex) in a consistent manner, so that tools that use this
information separate from the pages on which it is found have a
standard way of conveying the information.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.
The first two use cases can basically be done today using the hCard and
hCalendar Microformats, but the parsing rules for these Microformats are
somewhat vague, and they aren't easily extensible without hardcoding
extensions into parsers.
I propose, therefore, to take the hCard and vCalendar vocabularies, and
recast them onto the new microdata model.
http://www.whatwg.org/specs/web-apps/current-work/#vcard
http://www.whatwg.org/specs/web-apps/current-work/#vevent
I have used the knowledge and experience collected and carefully
documented by the Microformats team on their wiki, and written a direct
mapping of those vocabularies to microdata, along with very explicit
definitions for how to convert this data to vCard and iCalendar files,
something which was lacking in the hCard and hCalendar definitions:
http://www.whatwg.org/specs/web-apps/current-work/#vcard-0
http://www.whatwg.org/specs/web-apps/current-work/#icalendar
The third use case requires a vocabulary for citations, which isn't
something for which a widely deployed solution exists in text/html yet.
There are a large number of options:
- Refer
- RIS
- BibTeX
- Metadata Object Description Schema
- Z39.80
- Dublin Core and variants thereof
- part of Journal Publishing Tag Set Tag Library
- part of XML Resume
- part of OOXML
- part of ODF
- part of DocBook
- the Ann Arbor District Library XML format
- SRU
- My alma mater's format (University of Bath reference type)
- Bibliontology
- The Citation Oriented Bibliographic Vocabulary
- ISBD
- OpenURL COinS
...and many more.
A case could probably be made for any one of these. Based on availability
of tools, simplicity in the format (just name-value pairs vs deeply nested
trees of typed data), actual use in citation-happy fields, extensibility,
use of an understandable vocabulary (e.g. "author" vs "%A"), etc, I ended
up picking the BibTeX vocabulary. It isn't perfect; for example, it's not
going to be a great solution for citing YouTube clips yet. But since it is
relatively easy to extend (and indeed, it has historically been extended
by several groups), it seems like if this feature gets good adoption, we
will be able to extend it to support more types.
Thus, BibTeX vocabulary for microdata:
http://www.whatwg.org/specs/web-apps/current-work/#bibtex
Exporting microdata to BibTeX:
http://www.whatwg.org/specs/web-apps/current-work/#bibtex-0
The vocabularies and exports are pretty much useless on their own, though.
There are two ways that make this actually useful:
- There's a scripting API that exposes the microdata and so people can
write generic client-side scripts to expose data on the page, and
- User agents are now required to export vCard, iCalendar, and BibTeX
when someone drags a selection that includes data marked up with those
vocabularies.
The latter in particular is IMHO very important. Both of these features
require browser implementation support, which IMHO is important to making
anything like this work widely (and has been a sore point with previous
solutions in this space).
I shall now go through the scenarios and requirements to show how they can
now be addressed.
USE CASE: Exposing contact details so that users can add people to their
address books or social networking sites.
SCENARIOS:
* Instead of giving a colleague a business card, someone gives their
colleague a URL, and that colleague's user agent extracts basic
profile information such as the person's name along with references to
other people that person knows and adds the information into an
address book.
This is possible today without using HTML, just make the URL point to a
vCard text/directory resource.
* A scholar and teacher wants other scholars (and potentially students)
to be able to easily extract information about who he is to add it to
their contact databases.
This is now easy -- given microdata with a vCard, the scholars need but
drag that information to their contact databases, and assuming those
contact databases support vCard, they can import the information directly.
Alternatively, a script can be written in less than 200 lines of code to
convert the microdata to vCard (or other formats) for direct download. (I
wrote proof-of-concept scripts using the APIs in the spec to export vCard,
vEvent, and BibTeX data. The vCard one was about 140 lines; the BibTeX one
was about 60 lines. The vEvent one is in the spec as an example -- search
for getCalendar() -- and is less than 40 lines.)
* Fred copies the names of one of his Facebook friends and pastes it
into his OS address book; the contact information is imported
automatically.
Assuming the OS address book supports vCard, this is now supported
natively -- all Facebook has to do is encode the data as vCard microdata.
* Fred copies the names of one of his Facebook friends and pastes it
into his Webmail's address book feature; the contact information is
imported automatically.
If his Webmail supports HTML5 drag and drop (copy-and-paste is defined in
terms of drag-and-drop), then an HTML5 user agent will include all the
microdata of the copied selection in a JSON blob, including the vCard
data. (Actual vCard will also be included.) This is now thus automatically
supported assuming that the sites both use the same vocabulary, implement
the drag-and-drop API, and the user has an HTML5 browser.
* David can use the data in a web page to generate a custom browser UI
for including a person in our address book without using brittle
screen-scraping.
The spec defines exactly how to get a vCard out of a random HTML page, so
screen-scraping should no longer be necessary.
REQUIREMENTS:
* A user joining a new social network should be able to identify himself
to the new social network in way that enables the new social network
to bootstrap his account from existing published data (e.g. from
another social nework) rather than having to re-enter it, without the
new site having to coordinate (or know about) the pre-existing site,
without the user having to give either sites credentials to the other,
and without the new site finding out about relationships that the user
has intentionally kept secret.
(http://w2spconf.com/2008/papers/s3p2.pdf)
Assuming both sites support the same vocabulary and can identify people
uniquely somehow, this is now possible using microdata (just as it has
been possible using custom microformat-like vocabularies before, or RDFa
and other embedded data formats before). Whether sites will support this
is up to the sites in question; I see no way to force the issue.
As far as I can tell the privacy problem listed above is not intrinsicly
solved by the microdata solution. I cannot find a solution to those
problems at the HTML level; they seem inherently application-bound.
* Data should not need to be duplicated between machine-readable and
human-readable forms (i.e. the human-readable form should be
machine-readable).
By and large, this is met. For some of the more esoteric vEvent features
(like repeating rules) I have opted for not really supporting them
natively, but just allowing authors to use the vEvent rules directly. This
is not really an issue as far as I can tell because those features aren't
widely used (and even seem to be getting dropped in the newer version of
iCalendar).
* Shouldn't require the consumer to write XSLT or server-side code to
read the contact information.
While it's possible for people to write custom code to process this data,
the spec requires browsers to support this natively, making this
unnecessary for these vocabularies.
* Machine-readable contact information shouldn't be on a separate page
than human-readable contact information.
This requirement is met.
* The information should be convertible into a dedicated form (RDF,
JSON, XML, vCard) in a consistent manner, so that tools that use this
information separate from the pages on which it is found have a
standard way of conveying the information.
I haven't defined a way to convert this data to XML, but I have provided
explicit ways to convert to JSON, RDF, and vCard.
* Should be possible for different parts of a contact to be given in
different parts of the page. For example, a page with contact details
for people in columns (with each row giving the name, telephone
number, etc) should still have unambiguous grouped contact details
parseable from it.
Using subject="", this is possible.
* Parsing rules should be unambiguous.
I hope the parsing rules described in the spec are clear enough. Please
let me know if there are any problems.
* Should not require changes to HTML5 parsing rules.
The HTML5 parsing rules did not change.
USE CASE: Exposing calendar events so that users can add those events to
their calendaring systems.
SCENARIOS:
* A user visits the Avenue Q site and wants to make a note of when
tickets go on sale for the tour's stop in his home town. The site says
"October 3rd", so the user clicks this and selects "add to calendar",
which causes an entry to be added to his calendar.
As demonstrated in the spec, it is not relatively easy to expose this data
and requires little code to convert this data into a form supported by
most calendars. In addition, this can also be supported using
copy-and-paste or drag-and-drop if the source, destination, and browser
all cooperate according to the spec.
* A student is making a timeline of important events in Apple's history.
As he reads Wikipedia entries on the topic, he clicks on dates and
selects "add to timeline", which causes an entry to be added to his
timeline.
I couldn't find a way to address this as described unless Wikipedia and
the timeline utility cooperated directly. (Drag-and-drop and copy-and-
paste cases can be easily supported, though.)
* TV guide listings - browsers should be able to expose to the user's
tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
Assuming TV guide listings can be described in vEvent form, this is now
possible using drag-and-drop and copy-and-paste.
* Paul sometimes gives talks on various topics, and announces them on
his blog. He would like to mark up these announcements with proper
scheduling information, so that his readers' software can
automatically obtain the scheduling information and add it to their
calendar. Importantly, some of the rendered data might be more
informal than the machine-readable data required to produce a calendar
event.
This seems easily handled now.
* David can use the data in a web page to generate a custom browser UI
for adding an event to our calendaring software without using brittle
screen-scraping.
The example in the spec demonstrates that this is now possible with
relatively little code.
* http://livebrum.co.uk/: the author would like people to be able to
grab events and event listings from his site and put them on their
site with as much information as possible retained. "The fantasy would
be that I could provide code that could be cut and pasted into someone
else's HTML so the average blogger could re-use and re-share my data."
I have included an example in the spec from livebrum.co.uk showing how
this is possible.
* User should be able to subscribe to http://livebrum.co.uk/ then sort
by date and see the items sorted by event date, not publication date.
This isn't directly possible, but if a tool exists that can sort event
data by date, then given the event data it seems possible to do this
easily. For example, a Web Calendar product could support parsing
microdata vEvents out of a Web page and then could offer to subscribe to
such a page as a feed.
REQUIREMENTS:
* Should be discoverable.
This isn't met by the microdata vEvent vocabulary intrinsically. I expect
that a convention will arise where people put little icons near their
microdata saying "look, we have vEvent data you can drag to your
calendar!" or some such.
* Should be compatible with existing calendar systems.
The vEvent part of iCalendar is well established, so this seems met, at
least in principle. The details (e.g. drag and drop support) probably need
some work.
* Should be unlikely to get out of sync with prose on the page.
By making the prose on the page the source for the microdata, this seems
resolved.
* Shouldn't require the consumer to write XSLT or server-side code to
read the calendar information.
This is mostly met in the same way as for contact data.
* Machine-readable event data shouldn't be on a separate page than
human-readable dates.
This is achieved using inline microdata.
* The information should be convertible into a dedicated form (RDF,
JSON, XML, iCalendar) in a consistent manner, so that tools that use
this information separate from the pages on which it is found have a
standard way of conveying the information.
Output in all those formats except raw XML is explicitly supported in the
spec.
* Should be possible for different parts of an event to be given in
different parts of the page. For example, a page with calendar events
in columns (with each row giving the time, date, place, etc) should
still have unambiguous calendar events parseable from it.
subject="" supports this.
* Should be possible for authors to find out if people are reusing the
information on their site.
This isn't met. I couldn't find a good way to do this. When JavaScript is
enabled, drag-and-drop, copy-and-paste, and other mechanisms can be
detected and logged via script, but really there's no good way to detect
all uses of microdata. (Providing a ping=""-like feature for this seems
like overkill and wouldn't help with non-end-user use anyway.)
* Code should not be ugly (e.g. should not be mixed in with markup used
mostly for styling).
This appears to be met.
* There should be "obvious parsing tools for people to actually do
anything with the data (other than add an event to a calendar)".
There aren't any obvious tools yet, but since two separate implementations
arose in less than 24 hours from the point where the microdata stuff was
released, it seems like this will prove easy enough to do.
* Solution should not feel "disconnected" from the Web the way that
calendar file downloads do.
This seems met.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.
The same applies here as with vCard.
USE CASE: Allow users to maintain bibliographies or otherwise keep track
of sources of quotes or references.
SCENARIOS:
* Frank copies a sentence from Wikipedia and pastes it in some word
processor: it would be great if the word processor offered to
automatically create a bibliographic entry.
This will require new code in the word processor, but the information, in
an HTML5-compliant browser according to this proposal, would include the
information required to do this.
* Patrick keeps a list of his scientific publications on his web site.
He would like to provide structure within this publications page so
that Frank can automatically extract this information and use it to
cite Patrick's papers without having to transcribe the bibliographic
information.
This seems to be handled directly now if the page is written using the
BibTeX vocabulary.
* A scholar and teacher wants other scholars (and potentially students)
to be able to easily extract information about what he has published
to add it to their bibliographic applications.
This seems met in the same way.
* A scholar and teacher wants to publish scholarly documents or content
that includes extensive citations that readers can then automatically
extract so that they can find them in their local university library.
These citations may be for a wide range of different sources: an
interview posted on YouTube, a legal opinion posted on the Supreme
Court web site, a press release from the White House.
Not all of these types are immediately supported by the BibTeX vocabulary.
I recommend that we extend the BibTeX set over time if this feature gains
a critical mass.
REQUIREMENTS:
* Machine-readable bibliographic information shouldn't be on a separate
page than human-readable bibliographic information.
This is met.
* The information should be convertible into a dedicated form (RDF,
JSON, XML, BibTex) in a consistent manner, so that tools that use this
information separate from the pages on which it is found have a
standard way of conveying the information.
This is met explicitly for three of those types; for other types it can
be done easily enough also though it is not defined in the spec.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.
These are met in the same way as with vCard and vEvent microdata.
In conclusion, to address these use cases and scenarios I've introduced
three vocabularies based on past practices -- vCard, vEvent, and BibTeX --
to the HTML5 specification, and I've defined how these vocabularies work
in the context of the drag-and-drop model, which I believe is the core
part of this proposal that has been lacking in other proposals previously.
A number of further use cases remain to be examined, including one with
scenarios regarding validating custom vocabularies and allowing editors to
provide help with custom vocabularies. I will send further e-mail next
week as I address them.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
More information about the whatwg
mailing list