[whatwg] Please review use cases relating to embedding micro-data in text/html

Ian Hickson ian at hixie.ch
Thu Apr 23 13:46:09 PDT 2009


[bcc'ed previous participants in this discussion]

Earlier this year I asked for use cases that HTML5 did not yet cover, with 
an emphasis on use cases relating to semantic microdata. I list below the 
use cases and requirements that I derived from the response to that 
request, and from related discussions.

I would appreciate it if people could review this list for errors or 
important omissions, before I go through the list to work out whether 
these use cases already have solutions, or whether we should have 
solutions for these use cases in HTML, or whether we should address these 
use cases with other technologies, or whatnot.

I encourage people to focus on the use cases themselves, rather than on 
potential solutions; various solutions to all these use cases have already 
been argued in great detail and I have already read all those e-mails, 
blog comments, wiki faqs, etc, carefully.

My primary concern right now is in making sure that these are indeed the 
use cases people care about, so that whatever we add to the spec can be 
carefully evaluated to make sure it is in fact solving the problems that 
we want solved.

==============================================================================

Exposing known data types in a reusable way

   USE CASE: Exposing calendar events so that users can add those events to
   their calendaring systems.

   SCENARIOS:

     * A user visits the Avenue Q site and wants to make a note of when
       tickets go on sale for the tour's stop in his home town. The site says
       "October 3rd", so the user clicks this and selects "add to calendar",
       which causes an entry to be added to his calendar.
     * A student is making a timeline of important events in Apple's history.
       As he reads Wikipedia entries on the topic, he clicks on dates and
       selects "add to timeline", which causes an entry to be added to his
       timeline.
     * TV guide listings - browsers should be able to expose to the user's
       tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
     * Paul sometimes gives talks on various topics, and announces them on
       his blog. He would like to mark up these announcements with proper
       scheduling information, so that his readers' software can
       automatically obtain the scheduling information and add it to their
       calendar. Importantly, some of the rendered data might be more
       informal than the machine-readable data required to produce a calendar
       event. Also of importance: Paul may want to annotate his event with a
       combination of existing vocabularies and a new vocabulary of his own
       design. (why?)
     * David can use the data in a web page to generate a custom browser UI
       for adding an event to his calendaring software without using brittle
       screen-scraping.

   REQUIREMENTS:

     * Should be discoverable.
     * Should be compatible with existing calendar systems.
     * Should be unlikely to get out of sync with prose on the page.
     * Shouldn't require the consumer to write XSLT or server-side code to
       read the calendar information.
     * Machine-readable event data shouldn't be on a separate page from the
       human-readable dates.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, iCalendar) in a consistent manner, so that tools that use
       this information separate from the pages on which it is found have a
       standard way of conveying the information.
     * Should be possible for different parts of an event to be given in
       different parts of the page. For example, a page with calendar events
       in columns (with each row giving the time, date, place, etc) should
       still have unambiguous calendar events parseable from it.
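   As an illustration of the conversion requirement above, a consumer that
   has already extracted event fields from a page (by whatever markup
   convention is eventually chosen) could serialise them consistently. The
   sketch below uses hypothetical field names and a minimal subset of
   iCalendar; it is not a proposal for the extraction syntax itself:

```python
# Sketch: serialise an event, already extracted from a page, as a
# minimal iCalendar (RFC 5545) object. Field names are hypothetical.

def event_to_icalendar(event):
    """Convert a dict of extracted event fields to an iCalendar string."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "BEGIN:VEVENT",
        "SUMMARY:" + event["summary"],
        "DTSTART:" + event["start"],  # e.g. 20091003T100000Z
    ]
    if "location" in event:
        lines.append("LOCATION:" + event["location"])
    lines += ["END:VEVENT", "END:VCALENDAR"]
    # iCalendar content lines are CRLF-separated.
    return "\r\n".join(lines)

print(event_to_icalendar({
    "summary": "Avenue Q tickets go on sale",
    "start": "20091003T100000Z",
}))
```

   The point of requiring a consistent mapping is that any tool could emit
   this same output for the same page, regardless of which parser it uses.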

   ---------------------------------------------------------------------------

   USE CASE: Exposing contact details so that users can add people to their
   address books or social networking sites.

   SCENARIOS:

     * Instead of giving a colleague a business card, someone gives their
       colleague a URL, and that colleague's user agent extracts basic
       profile information such as the person's name along with references to
       other people that person knows and adds the information into an
       address book.
     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about who he is to add it to
       their contact databases.
     * Fred copies the name of one of his Facebook friends and pastes it
       into his OS address book; the contact information is imported
       automatically.
     * Fred copies the name of one of his Facebook friends and pastes it
       into his Webmail's address book feature; the contact information is
       imported automatically.
     * David can use the data in a web page to generate a custom browser UI
       for including a person in his address book without using brittle
       screen-scraping.

   REQUIREMENTS:

     * A user joining a new social network should be able to identify himself
       to the new social network in a way that enables the new social network
       to bootstrap his account from existing published data (e.g. from
       another social network) rather than having to re-enter it, without the
       new site having to coordinate with (or know about) the pre-existing
       site, without the user having to give either site's credentials to the
       other, and without the new site finding out about relationships that
       the user has intentionally kept secret.
       (http://w2spconf.com/2008/papers/s3p2.pdf)
     * Data should not need to be duplicated between machine-readable and
       human-readable forms (i.e. the human-readable form should be
       machine-readable).
     * Shouldn't require the consumer to write XSLT or server-side code to
       read the contact information.
     * Machine-readable contact information shouldn't be on a separate page
       from the human-readable contact information.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, vCard) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.
     * Should be possible for different parts of a contact entry to be given
       in different parts of the page. For example, a page with contact
       details for people in columns (with each row giving the name,
       telephone number, etc) should still have unambiguous grouped contact
       details parseable from it.
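   As with events, the conversion requirement implies a consistent mapping
   to a dedicated form. A sketch of the vCard side, again with hypothetical
   field names standing in for whatever the extraction syntax provides:

```python
# Sketch: serialise contact details, already extracted from a page, as a
# minimal vCard 3.0 (RFC 2426) entry. Field names are hypothetical.

def contact_to_vcard(contact):
    """Convert a dict of extracted contact fields to a vCard string."""
    lines = ["BEGIN:VCARD", "VERSION:3.0",
             "FN:" + contact["name"]]
    if "tel" in contact:
        lines.append("TEL:" + contact["tel"])
    if "email" in contact:
        lines.append("EMAIL:" + contact["email"])
    lines.append("END:VCARD")
    return "\r\n".join(lines)

print(contact_to_vcard({"name": "Fred Example", "tel": "+1-555-0100"}))
```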

   ---------------------------------------------------------------------------

   USE CASE: Allow users to maintain bibliographies or otherwise keep track
   of sources of quotes or references.

   SCENARIOS:

     * Frank copies a sentence from Wikipedia and pastes it in some word
       processor: it would be great if the word processor offered to
       automatically create a bibliographic entry.
     * Patrick keeps a list of his scientific publications on his web site.
       He would like to provide structure within this publications page so
       that Frank can automatically extract this information and use it to
       cite Patrick's papers without having to transcribe the bibliographic
       information.
     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about what he has published
       to add it to their bibliographic applications.
     * A scholar and teacher wants to publish scholarly documents or content
       that includes extensive citations that readers can then automatically
       extract so that they can find them in their local university library.
       These citations may be for a wide range of different sources: an
       interview posted on YouTube, a legal opinion posted on the Supreme
       Court web site, a press release from the White House.
     * A blog, say htmlfive.net, copies content wholesale from another, say
       blog.whatwg.org (as permitted and encouraged by the license). The
       author of the original content would like the reader of the reproduced
       content to know the provenance of the content. The reader would like
       to find the original blog post so he can leave comments for the
       original author.
     * Chaals could improve the Opera intranet if he had a mechanism for
       identifying the original source of various parts of a page. (why?)

   REQUIREMENTS:

     * Machine-readable bibliographic information shouldn't be on a separate
       page from the human-readable bibliographic information.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML, BibTeX) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.

   ---------------------------------------------------------------------------

   USE CASE: Help people searching for content to find content covered by
   licenses that suit their needs.

   SCENARIOS:

     * A user looking for pie recipes to reproduce on his blog might want
       to exclude from his results any recipes that are not available under
       a license allowing non-commercial reproduction.
     * Lucy wants to publish her papers online. She includes an abstract of
       each one in a page, but because they are under different copyright
       rules, she needs to clarify what the rules are. A harvester such as
       the Open Access project can actually collect and index some of them
       with no problem, but may not be allowed to index others. Meanwhile, a
       human finds it more useful to see the abstracts on a page than have to
       guess from a bunch of titles whether to look at each abstract.
     * There are mapping organisations, data producers, and people who take
       photos, and each may attach different policies to their work. Keeping
       that policy information with the data helps people making further
       mashups avoid violating a policy. For example, if GreatMaps.com has a
       public domain
       policy on their maps, CoolFotos.org has a policy that you can use data
       other than images for non-commercial purposes, and Johan Ichikawa has
       a photo there of my brother's cafe, which he has licensed as "must pay
       money", then it would be reasonable for me to copy the map and put it
       in a brochure for the cafe, but not to copy the data and photo from
       CoolFotos. On the other hand, if I am producing a non-commercial guide
       to cafes in Melbourne, I can add the map and the location of the cafe
       photo, but not the photo itself.
     * At the University of Mary Washington, many faculty encourage students
       to blog about their studies, using an instance of WordPress MultiUser,
       to foster more discussion. A student with a blog might write posts
       relevant to more than one class. Professors would like to aggregate
       the relevant posts into one blog.
     * Tara runs a video sharing web site for people who want licensing
       information to be included with their videos. When Paul wants to blog
       about a video, he can paste a fragment of HTML provided by Tara
       directly into his blog. The video is then available inline in his
       blog, along with any licensing information about the video.
     * Fred's browser can tell him what license a particular video on a site
       he is reading has been released under, and advise him on what the
       associated permissions and restrictions are (can he redistribute this
       work for commercial purposes, can he distribute a modified version of
       this work, how should he assign credit to the original author, what
       jurisdiction the license assumes, whether the license allows the work
       to be embedded into a work that uses content under various other
       licenses, etc).

   REQUIREMENTS:

     * Content on a page might be covered by a different license than other
       content on the same page.
     * When a license covers only a subpart of the page, existing
       implementations must not assume that the license applies to the
       whole page.
     * License proliferation should be discouraged.
     * License information should be able to survive from one site to another
       as the data is transferred.
     * It should be easy for content creators, publishers, and redistributors
       to express copyright licensing terms.
     * It should be more convenient for the users (and tools) to find and
       evaluate copyright statements and licenses than it is today.
     * Shouldn't require the consumer to write XSLT or server-side code to
       process the license information.
     * Machine-readable licensing information shouldn't be on a separate page
       from the human-readable licensing information.
     * There should not be ambiguous legal implications.

==============================================================================

Annotations

   USE CASE: Annotate structured data that HTML has no semantics for, and
   which nobody has annotated before, and may never again, for private use or
   use in a small self-contained community.

   SCENARIOS:

     * A group of users want to mark up their iguana collections so that they
       can write a script that collates all their collections and presents
       them in a uniform fashion.
     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about what he teaches to add
       it to their custom applications.
     * The list of specifications produced by W3C, for example, and various
       lists of translations, are produced by scraping source pages and
       outputting the result. This is brittle. It would be easier if the data
       was unambiguously obtainable from the source pages. This is a custom
       set of properties, specific to this community.
     * Chaals wants to make a list of the people who have translated W3C
       specifications or other documents, and then use this to search for
       people who are familiar with a given technology at least at some
       level, and happen to speak one or more languages of interest.
     * Chaals wants to have a reputation manager that can determine which of
       the many emails sent to the WHATWG list might be "more than usually
       valuable", and would like to seed this reputation manager from
       information gathered from the same source as the scraper that
       generates the W3C's TR/ page.
     * A user wants to write a script that finds the price of a book from an
       Amazon page.
     * Todd sells an HTML-based content management system, where all
       documents are processed and edited as HTML, sent from one editor to
       another, and eventually published and indexed. He would like to build
       up the editorial metadata used by the system within the HTML documents
       themselves, so that it is easier to manage and less likely to be lost.
     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush, and has either convinced
       the people making the statements that they should use a common
       language-neutral machine-readable vocabulary to describe their
       thoughts, or has convinced some other people to come in after them and
       process the thoughts manually to get them into a computer-readable
       form.

   REQUIREMENTS:

     * Vocabularies can be developed in a manner that won't clash with future
       more widely-used vocabularies, so that those future vocabularies can
       later be used in a page making use of private vocabularies without
       making the earlier annotations ambiguous.
     * Using the data should not involve learning a plethora of new APIs,
       formats, or vocabularies (today it is possible, e.g., to get the price
       of an Amazon product, but it requires learning a new API; similarly
       it's possible to get information from sites consistently using 'class'
       values in a documented way, but doing so requires learning a new
       vocabulary).
     * Shouldn't require the consumer to write XSLT or server-side code to
       process the annotated data.
     * Machine-readable annotations shouldn't be on a separate page from the
       human-readable annotations.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.
     * Should be possible for different parts of an item's data to be given
       in different parts of the page, for example two items described in the
       same paragraph. ("The two lamps are A and B. The first is $20, the
       second $30. The first is 5W, the second 7W.")
     * It should be possible to define globally-unique names, but the syntax
       should be optimised for a set of predefined vocabularies.
     * Adding this data to a page should be easy.
     * The syntax for adding this data should encourage the data to remain
       accurate when the page is changed.
     * The syntax should be resilient to intentional copy-and-paste
       authoring: people copying data into the page from a page that already
       has data should not have to know about any declarations far from the
       data.
     * The syntax should be resilient to unintentional copy-and-paste
       authoring: people copying markup from the page who do not know about
       these features should not inadvertently mark up their page with
       inapplicable data.

   ---------------------------------------------------------------------------

   USE CASE: Allow authors to annotate their documents to highlight the key
   parts, e.g. the way a student highlights parts of a printed page, but in
   a hypertext-aware fashion.

   SCENARIOS:

     * Fred writes a page about Napoleon. He can highlight the word Napoleon
       in a way that indicates to the reader that that is a person. Fred can
       also annotate the page to indicate that Napoleon and France are
       related concepts.

==============================================================================

Search

   USE CASE: Site owners want a way to provide enhanced search results to the
   engines, so that an entry in the search results page is more than just a
   bare link and snippet of text, and provides additional resources for users
   straight on the search page without them having to click into the page and
   discover those resources themselves.

   SCENARIOS:

     * For example, in response to a query for a restaurant, a search engine
       might want to have the result from yelp.com provide additional
       information, e.g. info on price, rating, and phone number, along with
       links to reviews or photos of the restaurant.

   REQUIREMENTS:

     * Information for the search engine should be on the same page as
       information that would be shown to the user if the user visited the
       page.

   ---------------------------------------------------------------------------

   USE CASE: Search engines and other site categorisation and aggregation
   engines should be able to determine the contents of pages with more
   accuracy than today.

   SCENARIOS:

     * Students and teachers should be able to discover each other -- both
       within an institution and across institutions -- via their blogging.
     * A blogger wishes to categorise his posts such that he can see them in
       the context of other posts on the same topic, including posts by
       unrelated authors (i.e. not via a pre-agreed tag or identifier, not
       via a single dedicated and preconfigured aggregator).
     * A user whose grandfather is called "Napoleon" wishes to ask Google the
       question "Who is Napoleon", and get as his answer a page describing
       his grandfather.
     * A user wants to ask about "Napoleon" but, instead of getting an
       answer, wants the search engine to ask him which Napoleon he wants to
       know about.

   REQUIREMENTS:

     * Should not disadvantage pages that are more useful to the user but
       that have not made any effort to help the search engine.
     * Should not be more susceptible to spamming than today's markup.

   ---------------------------------------------------------------------------

   USE CASE: Web browsers should be able to help users find information
   related to the items discussed by the page that they are looking at.

   SCENARIOS:

     * Finding more information about a movie when looking at a page about
       the movie, when the page contains detailed data about the movie.
          * For example, where the movie is playing locally.
          * For example, what your friends thought of it.
     * Exposing music samples on a page so that a user can listen to all the
       samples.
     * Students and teachers should be able to discover each other -- both
       within an institution and across institutions -- via their blogging.
     * David can use the data in a web page to generate a custom browser UI
       for calling a phone number using his cellphone without using brittle
       screen-scraping.

   REQUIREMENTS:

     * Should be discoverable, because otherwise users will not use it, and
       thus users won't be helped.
     * Should be consistently available, because if it only works on some
       pages, users will not use it (see, for instance, the rel=next story).
     * Should be bootstrappable (rel=next failed because UAs didn't expose
       it because authors didn't use it because UAs didn't expose it).

   ---------------------------------------------------------------------------

   USE CASE: Finding distributed comments on audio and video media.

   SCENARIOS:

     * Sam has posted a video tutorial on how to grow tomatoes on his video
       blog. Jane uses the tutorial and would like to leave feedback to
       others that view the video regarding certain parts of the video she
       found most helpful. Since Sam has comments disabled on his blog, his
       users cannot comment on the particular sections of the video other
       than linking to it from their blog and entering the information there.
       Jane uses a video player that aggregates all the comments about the
       video found on the Web, and displays them as subtitles while she
       watches the video.

   REQUIREMENTS:

     * It shouldn't be possible for Jane to be exposed to spam comments.
     * The comment-aggregating video player shouldn't need to crawl the
       entire Web for each user independently.

   ---------------------------------------------------------------------------

   USE CASE: Allow users to price-check digital media (music, TV shows, etc)
   and purchase such content without having to go through a special website
   or application to acquire it, and without particular retailers being
   selected by the content's producer or publisher.

   SCENARIOS:

     * Joe wants to sell his music, but he doesn't want to sell it through a
       specific retailer, he wants to allow the user to pick a retailer. So
       he forgoes the chance of an affiliate fee, negotiates to have his
       music available in all retail stores that his users might prefer, and
       then puts a generic link on his page that identifies the product but
       doesn't identify a retailer. Kyle, a fan, visits his page, clicks
       the link, and Amazon charges his credit card and puts the music into
       his Amazon album downloader. Leo instead clicks on the link and is
       automatically charged by Apple, and finds later that the music is in
       his iTunes library.
     * Manu wants to go to Joe's website but check the price of the offered
       music against the various retailers that sell it, without going to
       those retailers' sites, so that he can pick the cheapest retailer.
     * David can use the data in a web page to generate a custom browser UI
       for buying a song from his favourite online music store without using
       brittle screen-scraping.

   REQUIREMENTS:

     * Should not be easily prone to clickjacking (sites shouldn't be able to
       charge the user without the user's consent).
     * Should not make transactions harder when the user hasn't yet picked a
       favourite retailer.

==============================================================================

Cross-site communication

   USE CASE: Copy-and-paste should work between Web apps and native apps and
   between Web apps and other Web apps.

   SCENARIOS:

     * Fred copies an e-mail from Apple Mail into GMail, and the e-mail
       survives intact, including headers, attachments, and multipart/related
       parts.
     * Fred copies an e-mail from GMail into Hotmail, and the e-mail survives
       intact, including headers, attachments, and multipart/related parts.

   ---------------------------------------------------------------------------

   USE CASE: Allow users to share data between sites (e.g. between an online
   store and a price comparison site).

   SCENARIOS:

     * Lucy is looking for a new apartment and some items with which to
       furnish it. She browses various web pages, including apartment
       listings, furniture stores, kitchen appliances, etc. Every time she
       finds an item she likes, she points to it and transfers its details to
       her apartment-hunting page, where her picks can be organized, sorted,
       and categorized.
     * Lucy uses a website called TheBigMove.com to organize all aspects of
       her move, including items that she is tracking for the move. She goes
       to her "To Do" list and adds some of the items she collected during
       her visits to various Web sites, so that TheBigMove.com can handle the
       purchasing and delivery for her.

   REQUIREMENTS:

     * Should be discoverable, because otherwise users will not use it, and
       thus users won't be helped.
     * Should be consistently available, because if it only works on some
       pages, users will not use it (see, for instance, the rel=next story).
     * Should be bootstrappable (rel=next failed because UAs didn't expose
       it because authors didn't use it because UAs didn't expose it).
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.
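   A sketch of the last requirement: however the item details are
   extracted, handing them between sites in a dedicated form such as JSON
   keeps the tools interoperable. The field names and URL here are
   hypothetical:

```python
# Sketch: an extracted item, handed from one site (a furniture store)
# to another (TheBigMove.com) as JSON. Field names are hypothetical.
import json

item = {
    "type": "furniture",
    "name": "Two-seat sofa",
    "price": "499.00",
    "currency": "EUR",
    "source": "http://store.example.com/sofas/42",
}
print(json.dumps(item, indent=2, sort_keys=True))
```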

==============================================================================

Blogging

   USE CASE: Remove the need for feeds to restate the content of HTML pages
   (i.e. replace Atom with HTML).

   SCENARIOS:

     * Paul maintains a blog and wishes to write his blog in such a way that
       tools can pick up his blog post tags, authors, titles, and his
       blogroll directly from his blog, so that he does not need to maintain
       a parallel version of his data in a "structured format." In other
       words, his HTML blog should be usable as its own structured feed.

   ---------------------------------------------------------------------------

   USE CASE: Allow users to compare subjects of blog entries when the
   subjects are hard to tersely identify relative to other subjects in the
   same general area.

   SCENARIOS:

     * Paul blogs about proteins and genes. His colleagues also blog about
       proteins and genes. Proteins and genes are identified by long
       hard-to-compare strings, but Paul and his colleagues can determine if
       they are talking about the same things by having their user agent
       compare some sort of flags embedded in the blogs.
     * Rob wants to publish a large vocabulary in RDFS and/or OWL. Rob also
       wants to provide a clear, human-readable description of the same
       vocabulary that mixes the terms with descriptive text in HTML.

==============================================================================

Data extraction from uncooperative sources

   USE CASE: Getting data out of poorly written Web pages, so that the user
   can find more information about the page's contents.

   SCENARIOS:

     * Alfred merges data from various sources in a static manner, generating
       a new set of data. Bob later uses this static data in conjunction with
       other data sets to generate yet another set of static data. Julie then
       visits Bob's page and wants to know where and when the various
       sources of data Bob used came from, so that she can evaluate their
       quality. (In this instance, Alfred and Bob are assumed to be
       uncooperative, since creating a static mashup would be an example of a
       poorly-written page.)
     * TV guide listings - If the TV guide provider does not render a link to
       IMDB, the browser should recognise TV shows and give implicit links.
       (In this instance, it is assumed that the TV guide provider is
       uncooperative, since it isn't providing the links the user wants.)
     * Students and teachers should be able to discover each other -- both
       within an institution and across institutions -- via their blogging.
       (In this instance, it is assumed that the teachers and students aren't
       cooperative, since they would otherwise be able to find each other by
       listing their blogs in a common directory.)
     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush. (In this instance, it is
       assumed that the people writing the statements aren't cooperative,
       since if they were they could just add the data straight into the
       knowledge base.)

   REQUIREMENTS:

     * Does not need cooperation of the author (if the page author was
       cooperative, the page would be well-written).
     * Shouldn't require the consumer to write XSLT or server-side code to
       derive this information from the page.

   ---------------------------------------------------------------------------

   USE CASE: Remove the need for RDF users to restate information in online
   encyclopedias (i.e. replace DBpedia).

   SCENARIOS:

     * A user wants to have information in RDF form. The user visits
       Wikipedia, and his user agent can obtain the information without
       relying on DBpedia's interpretation of the page.

   REQUIREMENTS:

     * All the data exposed by DBpedia should be derivable from Wikipedia
       without using DBpedia.

==============================================================================

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


