[whatwg] Exposing known data types in a reusable way

Eduard Pascual herenvardo at gmail.com
Thu May 21 04:57:15 PDT 2009


Interesting.
Despite my PoV against the microdata proposal, I've taken a look at it
and found a minor typo:

Within "5.4.1 vCard", near the end of the "n" property description, the
spec reads:
"The value of the fn property a name in one of the following forms:"
shouldn't it read:
"The value of the fn property is a name in one of the following forms:" ?

Maybe this will earn me a place for posterity in the acknowledgements
section =P.

On Wed, May 20, 2009 at 1:07 AM, Ian Hickson <ian at hixie.ch> wrote:
>
> Some of the use cases I collected from the e-mails sent in over the past
> few months were the following:
>
>   USE CASE: Exposing contact details so that users can add people to their
>   address books or social networking sites.
>
>   SCENARIOS:
>     * Instead of giving a colleague a business card, someone gives their
>       colleague a URL, and that colleague's user agent extracts basic
>       profile information such as the person's name along with references to
>       other people that person knows and adds the information into an
>       address book.
>     * A scholar and teacher wants other scholars (and potentially students)
>       to be able to easily extract information about who he is to add it to
>       their contact databases.
>     * Fred copies the name of one of his Facebook friends and pastes it
>       into his OS address book; the contact information is imported
>       automatically.
>     * Fred copies the name of one of his Facebook friends and pastes it
>       into his Webmail's address book feature; the contact information is
>       imported automatically.
>     * David can use the data in a web page to generate a custom browser UI
>       for including a person in our address book without using brittle
>       screen-scraping.
>
>   REQUIREMENTS:
>     * A user joining a new social network should be able to identify himself
>       to the new social network in a way that enables the new social network
>       to bootstrap his account from existing published data (e.g. from
>       another social network) rather than having to re-enter it, without the
>       new site having to coordinate (or know about) the pre-existing site,
>       without the user having to give either site's credentials to the other,
>       and without the new site finding out about relationships that the user
>       has intentionally kept secret.
>       (http://w2spconf.com/2008/papers/s3p2.pdf)
>     * Data should not need to be duplicated between machine-readable and
>       human-readable forms (i.e. the human-readable form should be
>       machine-readable).
>     * Shouldn't require the consumer to write XSLT or server-side code to
>       read the contact information.
>     * Machine-readable contact information shouldn't be on a separate page
>       than human-readable contact information.
>     * The information should be convertible into a dedicated form (RDF,
>       JSON, XML, vCard) in a consistent manner, so that tools that use this
>       information separate from the pages on which it is found have a
>       standard way of conveying the information.
>     * Should be possible for different parts of a contact to be given in
>       different parts of the page. For example, a page with contact details
>       for people in columns (with each row giving the name, telephone
>       number, etc) should still have unambiguous grouped contact details
>       parseable from it.
>     * Parsing rules should be unambiguous.
>     * Should not require changes to HTML5 parsing rules.
>
>
>   USE CASE: Exposing calendar events so that users can add those events to
>   their calendaring systems.
>
>   SCENARIOS:
>     * A user visits the Avenue Q site and wants to make a note of when
>       tickets go on sale for the tour's stop in his home town. The site says
>       "October 3rd", so the user clicks this and selects "add to calendar",
>       which causes an entry to be added to his calendar.
>     * A student is making a timeline of important events in Apple's history.
>       As he reads Wikipedia entries on the topic, he clicks on dates and
>       selects "add to timeline", which causes an entry to be added to his
>       timeline.
>     * TV guide listings - browsers should be able to expose to the user's
>       tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
>     * Paul sometimes gives talks on various topics, and announces them on
>       his blog. He would like to mark up these announcements with proper
>       scheduling information, so that his readers' software can
>       automatically obtain the scheduling information and add it to their
>       calendar. Importantly, some of the rendered data might be more
>       informal than the machine-readable data required to produce a calendar
>       event.
>     * David can use the data in a web page to generate a custom browser UI
>       for adding an event to our calendaring software without using brittle
>       screen-scraping.
>     * http://livebrum.co.uk/: the author would like people to be able to
>       grab events and event listings from his site and put them on their
>       site with as much information as possible retained. "The fantasy would
>       be that I could provide code that could be cut and pasted into someone
>       else's HTML so the average blogger could re-use and re-share my data."
>     * User should be able to subscribe to http://livebrum.co.uk/ then sort
>       by date and see the items sorted by event date, not publication date.
>
>   REQUIREMENTS:
>     * Should be discoverable.
>     * Should be compatible with existing calendar systems.
>     * Should be unlikely to get out of sync with prose on the page.
>     * Shouldn't require the consumer to write XSLT or server-side code to
>       read the calendar information.
>     * Machine-readable event data shouldn't be on a separate page than
>       human-readable dates.
>     * The information should be convertible into a dedicated form (RDF,
>       JSON, XML, iCalendar) in a consistent manner, so that tools that use
>       this information separate from the pages on which it is found have a
>       standard way of conveying the information.
>     * Should be possible for different parts of an event to be given in
>       different parts of the page. For example, a page with calendar events
>       in columns (with each row giving the time, date, place, etc) should
>       still have unambiguous calendar events parseable from it.
>     * Should be possible for authors to find out if people are reusing the
>       information on their site.
>     * Code should not be ugly (e.g. should not be mixed in with markup used
>       mostly for styling).
>     * There should be "obvious parsing tools for people to actually do
>       anything with the data (other than add an event to a calendar)".
>     * Solution should not feel "disconnected" from the Web the way that
>       calendar file downloads do.
>     * Parsing rules should be unambiguous.
>     * Should not require changes to HTML5 parsing rules.
>
>
>   USE CASE: Allow users to maintain bibliographies or otherwise keep track
>   of sources of quotes or references.
>
>   SCENARIOS:
>     * Frank copies a sentence from Wikipedia and pastes it in some word
>       processor: it would be great if the word processor offered to
>       automatically create a bibliographic entry.
>     * Patrick keeps a list of his scientific publications on his web site.
>       He would like to provide structure within this publications page so
>       that Frank can automatically extract this information and use it to
>       cite Patrick's papers without having to transcribe the bibliographic
>       information.
>     * A scholar and teacher wants other scholars (and potentially students)
>       to be able to easily extract information about what he has published
>       to add it to their bibliographic applications.
>     * A scholar and teacher wants to publish scholarly documents or content
>       that includes extensive citations that readers can then automatically
>       extract so that they can find them in their local university library.
>       These citations may be for a wide range of different sources: an
>       interview posted on YouTube, a legal opinion posted on the Supreme
>       Court web site, a press release from the White House.
>
>   REQUIREMENTS:
>     * Machine-readable bibliographic information shouldn't be on a separate
>       page than human-readable bibliographic information.
>     * The information should be convertible into a dedicated form (RDF,
>       JSON, XML, BibTeX) in a consistent manner, so that tools that use this
>       information separate from the pages on which it is found have a
>       standard way of conveying the information.
>     * Parsing rules should be unambiguous.
>     * Should not require changes to HTML5 parsing rules.
>
>
> The first two use cases can basically be done today using the hCard and
> hCalendar Microformats, but the parsing rules for these Microformats are
> somewhat vague, and they aren't easily extensible without hardcoding
> extensions into parsers.
>
> I propose, therefore, to take the hCard and vCalendar vocabularies, and
> recast them onto the new microdata model.
>
>   http://www.whatwg.org/specs/web-apps/current-work/#vcard
>   http://www.whatwg.org/specs/web-apps/current-work/#vevent
>
> I have used the knowledge and experience collected and carefully
> documented by the Microformats team on their wiki, and written a direct
> mapping of those vocabularies to microdata, along with very explicit
> definitions for how to convert this data to vCard and iCalendar files,
> something which was lacking in the hCard and hCalendar definitions:
>
>   http://www.whatwg.org/specs/web-apps/current-work/#vcard-0
>   http://www.whatwg.org/specs/web-apps/current-work/#icalendar
>
> The third use case requires a vocabulary for citations, which isn't
> something for which a widely deployed solution exists in text/html yet.
>
> There are a large number of options:
>
>  - Refer
>  - RIS
>  - BibTeX
>  - Metadata Object Description Schema
>  - Z39.80
>  - Dublin Core and variants thereof
>  - part of Journal Publishing Tag Set Tag Library
>  - part of XML Resume
>  - part of OOXML
>  - part of ODF
>  - part of DocBook
>  - the Ann Arbor District Library XML format
>  - SRU
>  - My alma mater's format (University of Bath reference type)
>  - Bibliontology
>  - The Citation Oriented Bibliographic Vocabulary
>  - ISBD
>  - OpenURL COinS
>
> ...and many more.
>
> A case could probably be made for any one of these. Based on availability
> of tools, simplicity in the format (just name-value pairs vs deeply nested
> trees of typed data), actual use in citation-happy fields, extensibility,
> use of an understandable vocabulary (e.g. "author" vs "%A"), etc, I ended
> up picking the BibTeX vocabulary. It isn't perfect; for example, it's not
> going to be a great solution for citing YouTube clips yet. But since it is
> relatively easy to extend (and indeed, it has historically been extended
> by several groups), it seems like if this feature gets good adoption, we
> will be able to extend it to support more types.
>
> Thus, BibTeX vocabulary for microdata:
>
>   http://www.whatwg.org/specs/web-apps/current-work/#bibtex
>
> Exporting microdata to BibTeX:
>
>   http://www.whatwg.org/specs/web-apps/current-work/#bibtex-0
>
>
> The vocabularies and exports are pretty much useless on their own, though.
> There are two ways that make this actually useful:
>
>  - There's a scripting API that exposes the microdata and so people can
>   write generic client-side scripts to expose data on the page, and
>
>  - User agents are now required to export vCard, iCalendar, and BibTeX
>   when someone drags a selection that includes data marked up with those
>   vocabularies.
>
> The latter in particular is IMHO very important. Both of these features
> require browser implementation support, which IMHO is important to making
> anything like this work widely (and has been a sore point with previous
> solutions in this space).
>
>
> I shall now go through the scenarios and requirements to show how they can
> now be addressed.
>
>   USE CASE: Exposing contact details so that users can add people to their
>   address books or social networking sites.
>
>   SCENARIOS:
>     * Instead of giving a colleague a business card, someone gives their
>       colleague a URL, and that colleague's user agent extracts basic
>       profile information such as the person's name along with references to
>       other people that person knows and adds the information into an
>       address book.
>
> This is possible today without using HTML, just make the URL point to a
> vCard text/directory resource.
>
>
>     * A scholar and teacher wants other scholars (and potentially students)
>       to be able to easily extract information about who he is to add it to
>       their contact databases.
>
> This is now easy -- given microdata with a vCard, the scholars need but
> drag that information to their contact databases, and assuming those
> contact databases support vCard, they can import the information directly.
> Alternatively, a script can be written in less than 200 lines of code to
> convert the microdata to vCard (or other formats) for direct download. (I
> wrote proof-of-concept scripts using the APIs in the spec to export vCard,
> vEvent, and BibTeX data. The vCard one was about 140 lines; the BibTeX one
> was about 60 lines. The vEvent one is in the spec as an example -- search
> for getCalendar() -- and is less than 40 lines.)
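
As a rough illustration of the kind of conversion described above, here is a minimal Python sketch. It is not the spec's algorithm and not the proof-of-concept scripts mentioned in the quoted text. It assumes the microdata has already been extracted into a plain dict; the function names, the dict keys, and the choice of vCard 3.0 are all illustrative assumptions, and line folding of long values is not handled.

```python
def escape_text(value: str) -> str:
    """Escape characters that are special in vCard text values."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def to_vcard(props: dict) -> str:
    """Serialize a flat dict of already-extracted properties as a vCard.

    `props` is a hypothetical input shape: "fn" is a string, "n" is a
    (family, given) tuple, "tel" and "email" are lists of strings.
    """
    lines = ["BEGIN:VCARD", "VERSION:3.0"]
    lines.append("FN:" + escape_text(props["fn"]))
    family, given = props.get("n", ("", ""))
    # N has five components: family; given; additional; prefixes; suffixes.
    lines.append("N:" + ";".join(
        [escape_text(family), escape_text(given), "", "", ""]))
    for tel in props.get("tel", []):
        lines.append("TEL:" + escape_text(tel))
    for email in props.get("email", []):
        lines.append("EMAIL:" + escape_text(email))
    lines.append("END:VCARD")
    # vCard content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"
```

The resulting string could be offered as a download or handed to an address book that accepts vCard input.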
>
>
>     * Fred copies the name of one of his Facebook friends and pastes it
>       into his OS address book; the contact information is imported
>       automatically.
>
> Assuming the OS address book supports vCard, this is now supported
> natively -- all Facebook has to do is encode the data as vCard microdata.
>
>
>     * Fred copies the name of one of his Facebook friends and pastes it
>       into his Webmail's address book feature; the contact information is
>       imported automatically.
>
> If his Webmail supports HTML5 drag and drop (copy-and-paste is defined in
> terms of drag-and-drop), then an HTML5 user agent will include all the
> microdata of the copied selection in a JSON blob, including the vCard
> data. (Actual vCard will also be included.) This is now thus automatically
> supported assuming that the sites both use the same vocabulary, implement
> the drag-and-drop API, and the user has an HTML5 browser.
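
To make the receiving side of this concrete, the following Python sketch shows how a consumer might pick contact items out of such a JSON blob. The blob layout used here (a top-level "items" list whose entries carry "type" and "properties") and the type label "vcard" are assumptions made for illustration only; the actual serialization is whatever the spec defines.

```python
import json


def vcard_items(blob: str, vcard_type: str = "vcard") -> list:
    """Return the property maps of all items whose type matches.

    The JSON layout is a hypothetical stand-in for the format that a
    user agent would place on the clipboard or in a drag operation.
    """
    data = json.loads(blob)
    return [
        item.get("properties", {})
        for item in data.get("items", [])
        if item.get("type") == vcard_type
    ]
```

A Webmail address book could run something like this over the pasted data and ignore any items that use vocabularies it does not understand.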
>
>
>     * David can use the data in a web page to generate a custom browser UI
>       for including a person in our address book without using brittle
>       screen-scraping.
>
> The spec defines exactly how to get a vCard out of a random HTML page, so
> screen-scraping should no longer be necessary.
>
>
>   REQUIREMENTS:
>     * A user joining a new social network should be able to identify himself
>       to the new social network in a way that enables the new social network
>       to bootstrap his account from existing published data (e.g. from
>       another social network) rather than having to re-enter it, without the
>       new site having to coordinate (or know about) the pre-existing site,
>       without the user having to give either site's credentials to the other,
>       and without the new site finding out about relationships that the user
>       has intentionally kept secret.
>       (http://w2spconf.com/2008/papers/s3p2.pdf)
>
> Assuming both sites support the same vocabulary and can identify people
> uniquely somehow, this is now possible using microdata (just as it has
> been possible using custom microformat-like vocabularies before, or RDFa
> and other embedded data formats before). Whether sites will support this
> is up to the sites in question; I see no way to force the issue.
>
> As far as I can tell the privacy problem listed above is not intrinsically
> solved by the microdata solution. I cannot find a solution to those
> problems at the HTML level; they seem inherently application-bound.
>
>
>     * Data should not need to be duplicated between machine-readable and
>       human-readable forms (i.e. the human-readable form should be
>       machine-readable).
>
> By and large, this is met. For some of the more esoteric vEvent features
> (like repeating rules) I have opted for not really supporting them
> natively, but just allowing authors to use the vEvent rules directly. This
> is not really an issue as far as I can tell because those features aren't
> widely used (and even seem to be getting dropped in the newer version of
> iCalendar).
>
>
>     * Shouldn't require the consumer to write XSLT or server-side code to
>       read the contact information.
>
> While it's possible for people to write custom code to process this data,
> the spec requires browsers to support this natively, making this
> unnecessary for these vocabularies.
>
>
>     * Machine-readable contact information shouldn't be on a separate page
>       than human-readable contact information.
>
> This requirement is met.
>
>
>     * The information should be convertible into a dedicated form (RDF,
>       JSON, XML, vCard) in a consistent manner, so that tools that use this
>       information separate from the pages on which it is found have a
>       standard way of conveying the information.
>
> I haven't defined a way to convert this data to XML, but I have provided
> explicit ways to convert to JSON, RDF, and vCard.
>
>
>     * Should be possible for different parts of a contact to be given in
>       different parts of the page. For example, a page with contact details
>       for people in columns (with each row giving the name, telephone
>       number, etc) should still have unambiguous grouped contact details
>       parseable from it.
>
> Using subject="", this is possible.
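
The effect of this on a consumer can be sketched as a grouping step. The Python below assumes that a parser has already resolved each property to the identifier of the item it belongs to and emits (item_id, name, value) triples in document order; that triple shape and the function name are hypothetical, and how subject="" is resolved to an item is left to the spec.

```python
from collections import defaultdict


def group_by_item(triples):
    """Collect scattered properties into one property map per item.

    `triples` is an iterable of (item_id, name, value) tuples, a
    hypothetical intermediate form produced by a microdata parser.
    """
    items = defaultdict(lambda: defaultdict(list))
    for item_id, name, value in triples:
        # A property name may occur more than once, so keep a list.
        items[item_id][name].append(value)
    return {item_id: dict(props) for item_id, props in items.items()}
```

With a page laid out in columns, the names would typically arrive first and the telephone numbers later, yet each contact still ends up with its own complete set of properties.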
>
>
>     * Parsing rules should be unambiguous.
>
> I hope the parsing rules described in the spec are clear enough. Please
> let me know if there are any problems.
>
>
>     * Should not require changes to HTML5 parsing rules.
>
> The HTML5 parsing rules did not change.
>
>
>   USE CASE: Exposing calendar events so that users can add those events to
>   their calendaring systems.
>
>   SCENARIOS:
>     * A user visits the Avenue Q site and wants to make a note of when
>       tickets go on sale for the tour's stop in his home town. The site says
>       "October 3rd", so the user clicks this and selects "add to calendar",
>       which causes an entry to be added to his calendar.
>
> As demonstrated in the spec, it is now relatively easy to expose this data
> and requires little code to convert this data into a form supported by
> most calendars. In addition, this can also be supported using
> copy-and-paste or drag-and-drop if the source, destination, and browser
> all cooperate according to the spec.
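
For a sense of how little code the conversion step needs, here is a minimal Python sketch that wraps one event in an iCalendar document. It is not the getCalendar() example from the spec. The function names, the PRODID string, and the choice to emit only UID, DTSTAMP, DTSTART, and SUMMARY are assumptions; both datetimes are expected to be timezone-aware.

```python
from datetime import datetime, timezone


def escape_text(value: str) -> str:
    """Escape characters that are special in iCalendar text values."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def to_icalendar(uid: str, summary: str,
                 start: datetime, stamp: datetime) -> str:
    """Serialize a single event as an iCalendar document (UTC times)."""
    fmt = "%Y%m%dT%H%M%SZ"
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//microdata-sketch//EN",  # placeholder value
        "BEGIN:VEVENT",
        "UID:" + uid,
        "DTSTAMP:" + stamp.astimezone(timezone.utc).strftime(fmt),
        "DTSTART:" + start.astimezone(timezone.utc).strftime(fmt),
        "SUMMARY:" + escape_text(summary),
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"
```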
>
>
>     * A student is making a timeline of important events in Apple's history.
>       As he reads Wikipedia entries on the topic, he clicks on dates and
>       selects "add to timeline", which causes an entry to be added to his
>       timeline.
>
> I couldn't find a way to address this as described unless Wikipedia and
> the timeline utility cooperated directly. (Drag-and-drop and copy-and-
> paste cases can be easily supported, though.)
>
>
>     * TV guide listings - browsers should be able to expose to the user's
>       tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
>
> Assuming TV guide listings can be described in vEvent form, this is now
> possible using drag-and-drop and copy-and-paste.
>
>
>     * Paul sometimes gives talks on various topics, and announces them on
>       his blog. He would like to mark up these announcements with proper
>       scheduling information, so that his readers' software can
>       automatically obtain the scheduling information and add it to their
>       calendar. Importantly, some of the rendered data might be more
>       informal than the machine-readable data required to produce a calendar
>       event.
>
> This seems easily handled now.
>
>
>     * David can use the data in a web page to generate a custom browser UI
>       for adding an event to our calendaring software without using brittle
>       screen-scraping.
>
> The example in the spec demonstrates that this is now possible with
> relatively little code.
>
>
>     * http://livebrum.co.uk/: the author would like people to be able to
>       grab events and event listings from his site and put them on their
>       site with as much information as possible retained. "The fantasy would
>       be that I could provide code that could be cut and pasted into someone
>       else's HTML so the average blogger could re-use and re-share my data."
>
> I have included an example in the spec from livebrum.co.uk showing how
> this is possible.
>
>
>     * User should be able to subscribe to http://livebrum.co.uk/ then sort
>       by date and see the items sorted by event date, not publication date.
>
> This isn't directly possible, but if a tool exists that can sort event
> data by date, then given the event data it seems possible to do this
> easily. For example, a Web Calendar product could support parsing
> microdata vEvents out of a Web page and then could offer to subscribe to
> such a page as a feed.
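
The sorting step itself is small once the event data is available. The Python sketch below assumes each event has already been parsed into a dict with a "dtstart" value in ISO 8601 date form (YYYY-MM-DD); the dict shape and function name are illustrative assumptions.

```python
from datetime import date


def sort_by_event_date(events):
    """Return events ordered by when they happen, not when published."""
    return sorted(events, key=lambda e: date.fromisoformat(e["dtstart"]))
```

A feed reader that understood the event vocabulary could apply this ordering instead of the usual publication-date ordering.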
>
>
>
>   REQUIREMENTS:
>     * Should be discoverable.
>
> This isn't met by the microdata vEvent vocabulary intrinsically. I expect
> that a convention will arise where people put little icons near their
> microdata saying "look, we have vEvent data you can drag to your
> calendar!" or some such.
>
>
>     * Should be compatible with existing calendar systems.
>
> The vEvent part of iCalendar is well established, so this seems met, at
> least in principle. The details (e.g. drag and drop support) probably need
> some work.
>
>
>     * Should be unlikely to get out of sync with prose on the page.
>
> By making the prose on the page the source for the microdata, this seems
> resolved.
>
>
>     * Shouldn't require the consumer to write XSLT or server-side code to
>       read the calendar information.
>
> This is mostly met in the same way as for contact data.
>
>
>     * Machine-readable event data shouldn't be on a separate page than
>       human-readable dates.
>
> This is achieved using inline microdata.
>
>
>     * The information should be convertible into a dedicated form (RDF,
>       JSON, XML, iCalendar) in a consistent manner, so that tools that use
>       this information separate from the pages on which it is found have a
>       standard way of conveying the information.
>
> Output in all those formats except raw XML is explicitly supported in the
> spec.
>
>
>     * Should be possible for different parts of an event to be given in
>       different parts of the page. For example, a page with calendar events
>       in columns (with each row giving the time, date, place, etc) should
>       still have unambiguous calendar events parseable from it.
>
> subject="" supports this.
>
>
>     * Should be possible for authors to find out if people are reusing the
>       information on their site.
>
> This isn't met. I couldn't find a good way to do this. When JavaScript is
> enabled, drag-and-drop, copy-and-paste, and other mechanisms can be
> detected and logged via script, but really there's no good way to detect
> all uses of microdata. (Providing a ping=""-like feature for this seems
> like overkill and wouldn't help with non-end-user use anyway.)
>
>
>     * Code should not be ugly (e.g. should not be mixed in with markup used
>       mostly for styling).
>
> This appears to be met.
>
>
>     * There should be "obvious parsing tools for people to actually do
>       anything with the data (other than add an event to a calendar)".
>
> There aren't any obvious tools yet, but since two separate implementations
> arose in less than 24 hours from the point where the microdata stuff was
> released, it seems like this will prove easy enough to do.
>
>
>     * Solution should not feel "disconnected" from the Web the way that
>       calendar file downloads do.
>
> This seems met.
>
>
>     * Parsing rules should be unambiguous.
>     * Should not require changes to HTML5 parsing rules.
>
> The same applies here as with vCard.
>
>
>   USE CASE: Allow users to maintain bibliographies or otherwise keep track
>   of sources of quotes or references.
>
>   SCENARIOS:
>     * Frank copies a sentence from Wikipedia and pastes it in some word
>       processor: it would be great if the word processor offered to
>       automatically create a bibliographic entry.
>
> This will require new code in the word processor, but the data exposed by
> an HTML5-compliant browser under this proposal would include the
> information required to do this.
>
>
>     * Patrick keeps a list of his scientific publications on his web site.
>       He would like to provide structure within this publications page so
>       that Frank can automatically extract this information and use it to
>       cite Patrick's papers without having to transcribe the bibliographic
>       information.
>
> This seems to be handled directly now if the page is written using the
> BibTeX vocabulary.
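
On the consuming side, turning extracted name-value pairs into a BibTeX entry is mostly string formatting. The Python sketch below assumes the fields have already been pulled from the page into a dict of strings; the function name and input shape are hypothetical, and no escaping of braces or special characters inside values is attempted.

```python
def to_bibtex(entry_type: str, key: str, fields: dict) -> str:
    """Format already-extracted fields as a single BibTeX entry.

    `fields` maps BibTeX field names to string values, for example
    {"author": "A. Author and B. Author", "year": "2009"}.
    """
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items()
    )
    return f"@{entry_type}{{{key},\n{body}\n}}"
```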
>
>
>     * A scholar and teacher wants other scholars (and potentially students)
>       to be able to easily extract information about what he has published
>       to add it to their bibliographic applications.
>
> This seems met in the same way.
>
>
>     * A scholar and teacher wants to publish scholarly documents or content
>       that includes extensive citations that readers can then automatically
>       extract so that they can find them in their local university library.
>       These citations may be for a wide range of different sources: an
>       interview posted on YouTube, a legal opinion posted on the Supreme
>       Court web site, a press release from the White House.
>
> Not all of these types are immediately supported by the BibTeX vocabulary.
> I recommend that we extend the BibTeX set over time if this feature gains
> a critical mass.
>
>
>   REQUIREMENTS:
>     * Machine-readable bibliographic information shouldn't be on a separate
>       page than human-readable bibliographic information.
>
> This is met.
>
>
>     * The information should be convertible into a dedicated form (RDF,
>       JSON, XML, BibTeX) in a consistent manner, so that tools that use this
>       information separate from the pages on which it is found have a
>       standard way of conveying the information.
>
> This is met explicitly for three of those types; for the others it can
> also be done easily enough, though it is not defined in the spec.
>
>
>     * Parsing rules should be unambiguous.
>     * Should not require changes to HTML5 parsing rules.
>
> These are met in the same way as with vCard and vEvent microdata.
>
>
> In conclusion, to address these use cases and scenarios I've introduced
> three vocabularies based on past practices -- vCard, vEvent, and BibTeX --
> to the HTML5 specification, and I've defined how these vocabularies work
> in the context of the drag-and-drop model, which I believe is the core
> part of this proposal that has been lacking in other proposals previously.
>
>
> A number of further use cases remain to be examined, including one with
> scenarios regarding validating custom vocabularies and allowing editors to
> provide help with custom vocabularies. I will send further e-mail next
> week as I address them.
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>


