[whatwg] Please review use cases relating to embedding micro-data in text/html

Fri Apr 24 04:35:05 PDT 2009

The contacts section uses "event" where it meant "contact" (in its last
requirement: "different parts of an event").

On 4/23/09, Ian Hickson <ian at hixie.ch> wrote:
>
> [bcc'ed previous participants in this discussion]
>
> Earlier this year I asked for use cases that HTML5 did not yet cover, with
> an emphasis on use cases relating to semantic microdata. I list below the
> use cases and requirements that I derived from the response to that
> request, and from related discussions.
>
> I would appreciate it if people could review this list for errors or
> important omissions, before I go through the list to work out whether
> these use cases already have solutions, or whether we should have
> solutions for these use cases in HTML, or whether we should address these
> use cases with other technologies, or whatnot.
>
> I encourage people to focus on the use cases themselves, rather than on
> potential solutions; various solutions to all these use cases have already
> been argued in great detail and I have already read all those e-mails,
> blog comments, wiki faqs, etc, carefully.
>
> My primary concern right now is in making sure that these are indeed the
> use cases people care about, so that whatever we add to the spec can be
> carefully evaluated to make sure it is in fact solving the problems that
> we want solving.
>
> ==============================================================================
>
> Exposing known data types in a reusable way
>
>    USE CASE: Exposing calendar events so that users can add those events to
>    their calendaring systems.
>
>    SCENARIOS:
>
>      * A user visits the Avenue Q site and wants to make a note of when
>        tickets go on sale for the tour's stop in his home town. The site says
>        "October 3rd", so the user clicks this and selects "add to calendar",
>        which causes an entry to be added to his calendar.
>      * A student is making a timeline of important events in Apple's history.
>        As he reads Wikipedia entries on the topic, he clicks on dates and
>        selects "add to timeline", which causes an entry to be added to his
>        timeline.
>      * TV guide listings - browsers should be able to expose to the user's
>        tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
>      * Paul sometimes gives talks on various topics, and announces them on
>        his blog. He would like to mark up these announcements with proper
>        scheduling information, so that his readers' software can
>        automatically obtain the scheduling information and add it to their
>        calendar. Importantly, some of the rendered data might be more
>        informal than the machine-readable data required to produce a calendar
>        event. Also of importance: Paul may want to annotate his event with a
>        combination of existing vocabularies and a new vocabulary of his own
>        design. (why?)
>      * David can use the data in a web page to generate a custom browser UI
>        for adding an event to our calendaring software without using brittle
>        screen-scraping.
>
>    REQUIREMENTS:
>
>      * Should be discoverable.
>      * Should be compatible with existing calendar systems.
>      * Should be unlikely to get out of sync with prose on the page.
>      * Shouldn't require the consumer to write XSLT or server-side code to
>        read the calendar information.
>      * Machine-readable event data shouldn't be on a separate page than
>        human-readable dates.
>      * The information should be convertible into a dedicated form (RDF,
>        JSON, XML, iCalendar) in a consistent manner, so that tools that use
>        this information separate from the pages on which it is found have a
>        standard way of conveying the information.
>      * Should be possible for different parts of an event to be given in
>        different parts of the page. For example, a page with calendar events
>        in columns (with each row giving the time, date, place, etc) should
>        still have unambiguous calendar events parseable from it.
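
To make the "convertible into a dedicated form ... in a consistent manner"
requirement concrete, here is a minimal sketch of the last step only: turning
already-extracted event data into iCalendar text. The dict keys, the UID, the
PRODID and the year are assumptions made for illustration (the scenario only
says "October 3rd"); how the data gets out of the page is exactly what the use
case leaves open.

```python
from datetime import datetime, timezone


def to_icalendar(event: dict) -> str:
    """Serialize one extracted event as a minimal iCalendar document.

    Sketch only: text values are not escaped and long lines are not folded.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//use-case-sketch//EN",  # hypothetical product id
        "BEGIN:VEVENT",
        f"UID:{event['uid']}",
        f"DTSTAMP:{stamp}",
        # All-day event: an ISO date such as 2009-10-03 becomes 20091003.
        f"DTSTART;VALUE=DATE:{event['date'].replace('-', '')}",
        f"SUMMARY:{event['summary']}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are separated by CRLF.
    return "\r\n".join(lines) + "\r\n"


# Hypothetical data, as a tool might extract it from the Avenue Q page.
event = {
    "uid": "tickets-on-sale@example.org",
    "date": "2009-10-03",
    "summary": "Avenue Q tickets go on sale",
}
ics = to_icalendar(event)
```

Whatever syntax ends up carrying the data in the page, a consumer would need
no more than these few fields to hand the event to an existing calendar
system.
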
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Exposing contact details so that users can add people to their
>    address books or social networking sites.
>
>    SCENARIOS:
>
>      * Instead of giving a colleague a business card, someone gives their
>        colleague a URL, and that colleague's user agent extracts basic
>        profile information such as the person's name along with references to
>        other people that person knows and adds the information into an
>        address book.
>      * A scholar and teacher wants other scholars (and potentially students)
>        to be able to easily extract information about who he is to add it to
>        their contact databases.
>      * Fred copies the name of one of his Facebook friends and pastes it
>        into his OS address book; the contact information is imported
>        automatically.
>      * Fred copies the name of one of his Facebook friends and pastes it
>        into his Webmail's address book feature; the contact information is
>        imported automatically.
>      * David can use the data in a web page to generate a custom browser UI
>        for including a person in our address book without using brittle
>        screen-scraping.
>
>    REQUIREMENTS:
>
>      * A user joining a new social network should be able to identify himself
>        to the new social network in a way that enables the new social network
>        to bootstrap his account from existing published data (e.g. from
>        another social network) rather than having to re-enter it, without the
>        new site having to coordinate (or know about) the pre-existing site,
>        without the user having to give either site credentials to the other,
>        and without the new site finding out about relationships that the user
>        has intentionally kept secret.
>        (http://w2spconf.com/2008/papers/s3p2.pdf)
>      * Data should not need to be duplicated between machine-readable and
>        human-readable forms (i.e. the human-readable form should be
>        machine-readable).
>      * Shouldn't require the consumer to write XSLT or server-side code to
>        read the contact information.
>      * Machine-readable contact information shouldn't be on a separate page
>        than human-readable contact information.
>      * The information should be convertible into a dedicated form (RDF,
>        JSON, XML, vCard) in a consistent manner, so that tools that use this
>        information separate from the pages on which it is found have a
>        standard way of conveying the information.
>      * Should be possible for different parts of an event to be given in
>        different parts of the page. For example, a page with contact details
>        for people in columns (with each row giving the name, telephone
>        number, etc) should still have unambiguous grouped contact details
>        parseable from it.
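
As an illustration of the columns example in the last requirement, here is a
small sketch of regrouping a table that has one field per row and one person
per column into unambiguous per-person records. The field labels and the
sample values are made up; the point is only that the grouping has to be
recoverable from the page.

```python
def contacts_from_rows(rows):
    """Group a field-per-row table into one dict per person (per column)."""
    # First cell of each row is the field label, the rest are the values.
    fields = [label for label, *_ in rows]
    # Transpose the value cells so that each tuple holds one person's data.
    columns = zip(*(values for _, *values in rows))
    return [dict(zip(fields, column)) for column in columns]


# Hypothetical table: each row gives one field for every person.
rows = [
    ("name", "Fred", "Paul"),
    ("tel", "+1-555-0100", "+1-555-0101"),
]
contacts = contacts_from_rows(rows)
```

Each resulting dict could then be fed to a vCard or address-book exporter.
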
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Allow users to maintain bibliographies or otherwise keep track
>    of sources of quotes or references.
>
>    SCENARIOS:
>
>      * Frank copies a sentence from Wikipedia and pastes it in some word
>        processor: it would be great if the word processor offered to
>        automatically create a bibliographic entry.
>      * Patrick keeps a list of his scientific publications on his web site.
>        He would like to provide structure within this publications page so
>        that Frank can automatically extract this information and use it to
>        cite Patrick's papers without having to transcribe the bibliographic
>        information.
>      * A scholar and teacher wants other scholars (and potentially students)
>        to be able to easily extract information about what he has published
>        to add it to their bibliographic applications.
>      * A scholar and teacher wants to publish scholarly documents or content
>        that includes extensive citations that readers can then automatically
>        extract so that they can find them in their local university library.
>        These citations may be for a wide range of different sources: an
>        interview posted on YouTube, a legal opinion posted on the Supreme
>        Court web site, a press release from the White House.
>      * A blog, say htmlfive.net, copies content wholesale from another, say
>        blog.whatwg.org (as permitted and encouraged by the license). The
>        author of the original content would like the reader of the reproduced
>        content to know the provenance of the content. The reader would like
>        to find the original blog post so he can leave comments for the
>        original author.
>      * Chaals could improve the Opera intranet if he had a mechanism for
>        identifying the original source of various parts of a page. (why?)
>
>    REQUIREMENTS:
>
>      * Machine-readable bibliographic information shouldn't be on a separate
>        page than human-readable bibliographic information.
>      * The information should be convertible into a dedicated form (RDF,
>        JSON, XML, BibTeX) in a consistent manner, so that tools that use this
>        information separate from the pages on which it is found have a
>        standard way of conveying the information.
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Help people searching for content to find content covered by
>    licenses that suit their needs.
>
>    SCENARIOS:
>
>      * If a user is looking for recipes of pies to reproduce on his blog, he
>        might want to exclude from his results any recipes that are not
>        available under a license allowing non-commercial reproduction.
>      * Lucy wants to publish her papers online. She includes an abstract of
>        each one in a page, but because they are under different copyright
>        rules, she needs to clarify what the rules are. A harvester such as
>        the Open Access project can actually collect and index some of them
>        with no problem, but may not be allowed to index others. Meanwhile, a
>        human finds it more useful to see the abstracts on a page than have to
>        guess from a bunch of titles whether to look at each abstract.
>      * There are mapping organisations and data producers and people who take
>        photos, and each may apply different policies. Being able to keep that
>        policy information helps people making further mashups avoid
>        violating a policy. For example, if GreatMaps.com has a public domain
>        policy on their maps, CoolFotos.org has a policy that you can use data
>        other than images for non-commercial purposes, and Johan Ichikawa has
>        a photo there of my brother's cafe, which he has licensed as "must pay
>        money", then it would be reasonable for me to copy the map and put it
>        in a brochure for the cafe, but not to copy the data and photo from
>        CoolFotos. On the other hand, if I am producing a non-commercial guide
>        to cafes in Melbourne, I can add the map and the location of the cafe
>        photo, but not the photo itself.
>      * At University of Mary Washington, many faculty encourage students to
>        blog about their studies to encourage more discussion using an
>        instance of WordPress MultiUser. A student with a blog might be
>        writing posts relevant to more than one class. Professors would like
>        to then aggregate relevant posts into one blog.
>      * Tara runs a video sharing web site for people who want licensing
>        information to be included with their videos. When Paul wants to blog
>        about a video, he can paste a fragment of HTML provided by Tara
>        directly into his blog. The video is then available inline in his
>        blog, along with any licensing information about the video.
>      * Fred's browser can tell him what license a particular video on a site
>        he is reading has been released under, and advise him on what the
>        associated permissions and restrictions are (can he redistribute this
>        work for commercial purposes, can he distribute a modified version of
>        this work, how should he assign credit to the original author, what
>        jurisdiction the license assumes, whether the license allows the work
>        to be embedded into a work that uses content under various other
>        licenses, etc).
>
>    REQUIREMENTS:
>
>      * Content on a page might be covered by a different license than other
>        content on the same page.
>      * When licensing a subpart of the page, existing implementations must
>        not just assume that the license applies to the whole page rather than
>        just part of it.
>      * License proliferation should be discouraged.
>      * License information should be able to survive from one site to another
>        as the data is transferred.
>      * Expressing copyright licensing terms should be easy for content
>        creators, publishers, and redistributors to provide.
>      * It should be more convenient for the users (and tools) to find and
>        evaluate copyright statements and licenses than it is today.
>      * Shouldn't require the consumer to write XSLT or server-side code to
>        process the license information.
>      * Machine-readable licensing information shouldn't be on a separate page
>        than human-readable licensing information.
>      * There should not be ambiguous legal implications.
>
> ==============================================================================
>
> Annotations
>
>    USE CASE: Annotate structured data that HTML has no semantics for, and
>    which nobody has annotated before, and may never again, for private use or
>    use in a small self-contained community.
>
>    SCENARIOS:
>
>      * A group of users want to mark up their iguana collections so that they
>        can write a script that collates all their collections and presents
>        them in a uniform fashion.
>      * A scholar and teacher wants other scholars (and potentially students)
>        to be able to easily extract information about what he teaches to add
>        it to their custom applications.
>      * The list of specifications produced by W3C, for example, and various
>        lists of translations, are produced by scraping source pages and
>        outputting the result. This is brittle. It would be easier if the data
>        was unambiguously obtainable from the source pages. This is a custom
>        set of properties, specific to this community.
>      * Chaals wants to make a list of the people who have translated W3C
>        specifications or other documents, and then use this to search for
>        people who are familiar with a given technology at least at some
>        level, and happen to speak one or more languages of interest.
>      * Chaals wants to have a reputation manager that can determine which of
>        the many emails sent to the WHATWG list might be "more than usually
>        valuable", and would like to seed this reputation manager from
>        information gathered from the same source as the scraper that
>        generates the W3C's TR/ page.
>      * A user wants to write a script that finds the price of a book from an
>        Amazon page.
>      * Todd sells an HTML-based content management system, where all
>        documents are processed and edited as HTML, sent from one editor to
>        another, and eventually published and indexed. He would like to build
>        up the editorial metadata used by the system within the HTML documents
>        themselves, so that it is easier to manage and less likely to be lost.
>      * Tim wants to make a knowledge base seeded from statements made in
>        Spanish and English, e.g. from people writing down their thoughts
>        about George W. Bush and George H.W. Bush, and has either convinced
>        the people making the statements that they should use a common
>        language-neutral machine-readable vocabulary to describe their
>        thoughts, or has convinced some other people to come in after them and
>        process the thoughts manually to get them into a computer-readable
>        form.
>
>    REQUIREMENTS:
>
>      * Vocabularies can be developed in a manner that won't clash with future
>        more widely-used vocabularies, so that those future vocabularies can
>        later be used in a page making use of private vocabularies without
>        making the earlier annotations ambiguous.
>      * Using the data should not involve learning a plethora of new APIs,
>        formats, or vocabularies (today it is possible, e.g., to get the price
>        of an Amazon product, but it requires learning a new API; similarly
>        it's possible to get information from sites consistently using 'class'
>        values in a documented way, but doing so requires learning a new
>        vocabulary).
>      * Shouldn't require the consumer to write XSLT or server-side code to
>        process the annotated data.
>      * Machine-readable annotations shouldn't be on a separate page than
>        human-readable annotations.
>      * The information should be convertible into a dedicated form (RDF,
>        JSON, XML) in a consistent manner, so that tools that use this
>        information separate from the pages on which it is found have a
>        standard way of conveying the information.
>      * Should be possible for different parts of an item's data to be given
>        in different parts of the page, for example two items described in the
>        same paragraph. ("The two lamps are A and B. The first is $20, the
>        second $30. The first is 5W, the second 7W.")
>      * It should be possible to define globally-unique names, but the syntax
>        should be optimised for a set of predefined vocabularies.
>      * Adding this data to a page should be easy.
>      * The syntax for adding this data should encourage the data to remain
>        accurate when the page is changed.
>      * The syntax should be resilient to intentional copy-and-paste
>        authoring: people copying data into the page from a page that already
>        has data should not have to know about any declarations far from the
>        data.
>      * The syntax should be resilient to unintentional copy-and-paste
>        authoring: people copying markup from the page who do not know about
>        these features should not inadvertently mark up their page with
>        inapplicable data.
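
The lamps example in the requirements above can be sketched as follows: the
properties of two items arrive interleaved in prose order, and a consumer
regroups them per item before converting to JSON. The triple representation
and the property names "price" and "power" are assumptions for illustration,
not a proposed syntax.

```python
import json
from collections import defaultdict

# (item, property, value) triples in the order they appear in the prose:
# "The first is $20, the second $30. The first is 5W, the second 7W."
statements = [
    ("A", "price", "$20"),
    ("B", "price", "$30"),
    ("A", "power", "5W"),
    ("B", "power", "7W"),
]


def group_items(statements):
    """Collect interleaved property statements into one record per item."""
    items = defaultdict(dict)
    for item, prop, value in statements:
        items[item][prop] = value
    return dict(items)


items = group_items(statements)
# Same grouped data in a dedicated interchange form.
as_json = json.dumps(items, indent=2, sort_keys=True)
```

The hard part, which this sketch skips, is how the page markup says which
value belongs to which lamp when both are described in the same paragraph.
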
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Allow authors to annotate their documents to highlight the key
>    parts, e.g. as when a student highlights parts of a printed page, but in a
>    hypertext-aware fashion.
>
>    SCENARIOS:
>
>      * Fred writes a page about Napoleon. He can highlight the word Napoleon
>        in a way that indicates to the reader that that is a person. Fred can
>        also annotate the page to indicate that Napoleon and France are
>        related concepts.
>
> ==============================================================================
>
> Search
>
>    USE CASE: Site owners want a way to provide enhanced search results to the
>    engines, so that an entry in the search results page is more than just a
>    bare link and snippet of text, and provides additional resources for users
>    straight on the search page without them having to click into the page and
>    discover those resources themselves.
>
>    SCENARIOS:
>
>      * For example, in response to a query for a restaurant, a search engine
>        might want to have the result from yelp.com provide additional
>        information, e.g. info on price, rating, and phone number, along with
>        links to reviews or photos of the restaurant.
>
>    REQUIREMENTS:
>
>      * Information for the search engine should be on the same page as
>        information that would be shown to the user if the user visited the
>        page.
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Search engines and other site categorisation and aggregation
>    engines should be able to determine the contents of pages with more
>    accuracy than today.
>
>    SCENARIOS:
>
>      * Students and teachers should be able to discover each other -- both
>        within an institution and across institutions -- via their blogging.
>      * A blogger wishes to categorise his posts such that he can see them in
>        the context of other posts on the same topic, including posts by
>        unrelated authors (i.e. not via a pre-agreed tag or identifier, not
>        via a single dedicated and preconfigured aggregator).
>      * A user whose grandfather is called "Napoleon" wishes to ask Google the
>        question "Who is Napoleon", and get as his answer a page describing
>        his grandfather.
>      * A user wants to ask about "Napoleon" but, instead of getting an
>        answer, wants the search engine to ask him which Napoleon he wants to
>        know about.
>
>    REQUIREMENTS:
>
>      * Should not disadvantage pages that are more useful to the user but
>        that have not made any effort to help the search engine.
>      * Should not be more susceptible to spamming than today's markup.
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Web browsers should be able to help users find information
>    related to the items discussed by the page that they are looking at.
>
>    SCENARIOS:
>
>      * Finding more information about a movie when looking at a page about
>        the movie, when the page contains detailed data about the movie.
>           * For example, where the movie is playing locally.
>           * For example, what your friends thought of it.
>      * Exposing music samples on a page so that a user can listen to all the
>        samples.
>      * Students and teachers should be able to discover each other -- both
>        within an institution and across institutions -- via their blogging.
>      * David can use the data in a web page to generate a custom browser UI
>        for calling a phone number using our cellphone without using brittle
>        screen-scraping.
>
>    REQUIREMENTS:
>
>      * Should be discoverable, because otherwise users will not use it, and
>        thus users won't be helped.
>      * Should be consistently available, because if it only works on some
>        pages, users will not use it (see, for instance, the rel=next story).
>      * Should be bootstrappable (rel=next failed because UAs didn't expose it
>        because authors didn't use it because UAs didn't expose it).
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Finding distributed comments on audio and video media.
>
>    SCENARIOS:
>
>      * Sam has posted a video tutorial on how to grow tomatoes on his video
>        blog. Jane uses the tutorial and would like to leave feedback to
>        others that view the video regarding certain parts of the video she
>        found most helpful. Since Sam has comments disabled on his blog, his
>        users cannot comment on the particular sections of the video other
>        than linking to it from their blog and entering the information there.
>        Jane uses a video player that aggregates all the comments about the
>        video found on the Web, and displays them as subtitles while she
>        watches the video.
>
>    REQUIREMENTS:
>
>      * It shouldn't be possible for Jane to be exposed to spam comments.
>      * The comment-aggregating video player shouldn't need to crawl the
>        entire Web for each user independently.
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Allow users to price-check digital media (music, TV shows, etc)
>    and purchase such content without having to go through a special website
>    or application to acquire it, and without particular retailers being
>    selected by the content's producer or publisher.
>
>    SCENARIOS:
>
>      * Joe wants to sell his music, but he doesn't want to sell it through a
>        specific retailer, he wants to allow the user to pick a retailer. So
>        he forgoes the chance of an affiliate fee, negotiates to have his
>        music available in all retail stores that his users might prefer, and
>        then puts a generic link on his page that identifies the product but
>        doesn't identify a retailer. Kyle, a fan, visits his page, clicks
>        the link, and Amazon charges his credit card and puts the music into
>        his Amazon album downloader. Leo instead clicks on the link and is
>        automatically charged by Apple, and finds later that the music is in
>        his iTunes library.
>      * Manu wants to go to Joe's website but check the price of the offered
>        music against the various retailers that sell it, without going to
>        those retailers' sites, so that he can pick the cheapest retailer.
>      * David can use the data in a web page to generate a custom browser UI
>        for buying a song from our favorite online music store without using
>        brittle screen-scraping.
>
>    REQUIREMENTS:
>
>      * Should not be easily prone to clickjacking (sites shouldn't be able to
>        charge the user without the user's consent).
>      * Should not make transactions harder when the user hasn't yet picked a
>        favourite retailer.
>
> ==============================================================================
>
> Cross-site communication
>
>    USE CASE: Copy-and-paste should work between Web apps and native apps and
>    between Web apps and other Web apps.
>
>    SCENARIOS:
>
>      * Fred copies an e-mail from Apple Mail into GMail, and the e-mail
>        survives intact, including headers, attachments, and multipart/related
>        parts.
>      * Fred copies an e-mail from GMail into Hotmail, and the e-mail survives
>        intact, including headers, attachments, and multipart/related parts.
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Allow users to share data between sites (e.g. between an online
>    store and a price comparison site).
>
>    SCENARIOS:
>
>      * Lucy is looking for a new apartment and some items with which to
>        furnish it. She browses various web pages, including apartment
>        listings, furniture stores, kitchen appliances, etc. Every time she
>        finds an item she likes, she points to it and transfers its details to
>        her apartment-hunting page, where her picks can be organized, sorted,
>        and categorized.
>      * Lucy uses a website called TheBigMove.com to organize all aspects of
>        her move, including items that she is tracking for the move. She goes
>        to her "To Do" list and adds some of the items she collected during
>        her visits to various Web sites, so that TheBigMove.com can handle the
>        purchasing and delivery for her.
>
>    REQUIREMENTS:
>
>      * Should be discoverable, because otherwise users will not use it, and
>        thus users won't be helped.
>      * Should be consistently available, because if it only works on some
>        pages, users will not use it (see, for instance, the rel=next story).
>      * Should be bootstrappable (rel=next failed because UAs didn't expose it
>        because authors didn't use it because UAs didn't expose it).
>      * The information should be convertible into a dedicated form (RDF,
>        JSON, XML) in a consistent manner, so that tools that use this
>        information separate from the pages on which it is found have a
>        standard way of conveying the information.
>
> ==============================================================================
>
> Blogging
>
>    USE CASE: Remove the need for feeds to restate the content of HTML pages
>    (i.e. replace Atom with HTML).
>
>    SCENARIOS:
>
>      * Paul maintains a blog and wishes to write his blog in such a way that
>        tools can pick up his blog post tags, authors, titles, and his
>        blogroll directly from his blog, so that he does not need to maintain
>        a parallel version of his data in a "structured format." In other
>        words, his HTML blog should be usable as its own structured feed.
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Allow users to compare subjects of blog entries when the
>    subjects are hard to tersely identify relative to other subjects in the
>    same general area.
>
>    SCENARIOS:
>
>      * Paul blogs about proteins and genes. His colleagues also blog about
>        proteins and genes. Proteins and genes are identified by long
>        hard-to-compare strings, but Paul and his colleagues can determine if
>        they are talking about the same things by having their user agent
>        compare some sort of flags embedded in the blogs.
>      * Rob wants to publish a large vocabulary in RDFS and/or OWL. Rob also
>        wants to provide a clear, human readable description of the same
>        vocabulary, that mixes the terms with descriptive text in HTML.
>
> ==============================================================================
>
> Data extraction from uncooperative sources
>
>    USE CASE: Getting data out of poorly written Web pages, so that the user
>    can find more information about the page's contents.
>
>    SCENARIOS:
>
>      * Alfred merges data from various sources in a static manner, generating
>        a new set of data. Bob later uses this static data in conjunction with
>        other data sets to generate yet another set of static data. Julie then
>        visits Bob's page later, and wants to know where and when the various
>        sources of data Bob used come from, so that she can evaluate its
>        quality. (In this instance, Alfred and Bob are assumed to be
>        uncooperative, since creating a static mashup would be an example of a
>        poorly-written page.)
>      * TV guide listings - If the TV guide provider does not render a link to
>        IMDB, the browser should recognise TV shows and give implicit links.
>        (In this instance, it is assumed that the TV guide provider is
>        uncooperative, since it isn't providing the links the user wants.)
>      * Students and teachers should be able to discover each other -- both
>        within an institution and across institutions -- via their blogging.
>        (In this instance, it is assumed that the teachers and students aren't
>        cooperative, since they would otherwise be able to find each other by
>        listing their blogs in a common directory.)
>      * Tim wants to make a knowledge base seeded from statements made in
>        Spanish and English, e.g. from people writing down their thoughts
>        about George W. Bush and George H.W. Bush. (In this instance, it is
>        assumed that the people writing the statements aren't cooperative,
>        since if they were they could just add the data straight into the
>        knowledge base.)
>
>    REQUIREMENTS:
>
>      * Does not need cooperation of the author (if the page author was
>        cooperative, the page would be well-written).
>      * Shouldn't require the consumer to write XSLT or server-side code to
>        derive this information from the page.
>
>
> ---------------------------------------------------------------------------
>
>    USE CASE: Remove the need for RDF users to restate information in online
>    encyclopedias (i.e. replace DBpedia).
>
>    SCENARIOS:
>
>      * A user wants to have information in RDF form. The user visits
>        Wikipedia, and his user agent can obtain the information without
>        relying on DBpedia's interpretation of the page.
>
>    REQUIREMENTS:
>
>      * All the data exposed by DBpedia should be derivable from Wikipedia
>        without using DBpedia.
>
> ==============================================================================
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>
