[whatwg] Getting data out of poorly written Web pages

Mon May 4 20:01:51 PDT 2009

One of the use cases I collected from the e-mails sent in over the past 
few months was the following:

   USE CASE: Getting data out of poorly written Web pages, so that the user
   can find more information about the page's contents.

   SCENARIOS:
     * Alfred merges data from various sources in a static manner, generating
       a new set of data. Bob later uses this static data in conjunction with
       other data sets to generate yet another set of static data. Julie then
       visits Bob's page later, and wants to know where and when the various
       sources of data Bob used come from, so that she can evaluate its
       quality. (In this instance, Alfred and Bob are assumed to be
       uncooperative, since creating a static mashup would be an example of a
       poorly-written page.)
     * TV guide listings - If the TV guide provider does not render a link to
       IMDB, the browser should recognise TV shows and give implicit links.
       (In this instance, it is assumed that the TV guide provider is
       uncooperative, since it isn't providing the links the user wants.)
     * Students and teachers should be able to discover each other -- both
       within an institution and across institutions -- via their blogging.
       (In this instance, it is assumed that the teachers and students aren't
       cooperative, since they would otherwise be able to find each other by
       listing their blogs in a common directory.)
     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush. (In this instance, it is
       assumed that the people writing the statements aren't cooperative,
       since if they were they could just add the data straight into the
       knowledge base.)

   REQUIREMENTS:
     * Does not need cooperation of the author (if the page author was
       cooperative, the page would be well-written).
     * Shouldn't require the consumer to write XSLT or server-side code to
       derive this information from the page.

One class of solutions that was proposed to address this is the idea 
of getting the author to mark up microdata (small bits of data) in the 
page, annotating the information that is needed to complete the scenarios. 
Such formats could be RDFa, Microformats, n3, a custom format for HTML5, 
or any number of other syntaxes. However, it's not clear that this would 
help in this case, since the underlying assumption with these particular 
problems is that the author isn't actively cooperating with the user 
(likely due to ignorance, of course, not malice).

Let's examine these use cases with a microdata solution in mind:

     * Alfred merges data from various sources in a static manner, generating
       a new set of data. Bob later uses this static data in conjunction with
       other data sets to generate yet another set of static data. Julie then
       visits Bob's page later, and wants to know where and when the various
       sources of data Bob used come from, so that she can evaluate its
       quality.

Here, Julie is two steps removed from the original data. Since we are 
assuming here that Alfred and Bob are not cooperating with Julie, we must 
also assume that they haven't included this information on the page. If 
they haven't included it, then microdata doesn't help, as there is nothing 
to mark up.

     * TV guide listings - If the TV guide provider does not render a link to
       IMDB, the browser should recognise TV shows and give implicit links.
       (In this instance, it is assumed that the TV guide provider is
       uncooperative, since it isn't providing the links the user wants.)

If the TV guide listing page were cooperative, it would just provide the 
links to IMDB that the user wants. It isn't; we cannot, therefore, 
assume that it would be ready to include microdata that would let the user 
find the relevant page on IMDB using a tool that consumes microdata.

     * Students and teachers should be able to discover each other -- both
       within an institution and across institutions -- via their blogging.

The obvious solution here is for the students and teachers to simply 
register their blogs in a common directory. However, assuming that they 
are not even doing that, it is unlikely that they _would_ include some 
kind of microdata in their pages to solve the problem.

     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush.

If the people writing down their thoughts were to be "hip" enough to write 
their thoughts using microdata annotations, they'd also be able to just 
add them to the knowledge base directly. So again, we must assume that 
this is a case where we can't rely on microdata.

Thus, we have our first requirement:

     * Does not need cooperation of the author.

If we take the author out of the equation, there are three other parties 
that could help solve this problem:

 1. The user.

 2. The user's client tool provider (e.g. browser vendor).

 3. Third party tool providers (e.g. web sites, search engines).

Relying on the user to solve these problems is somewhat missing the point 
of solving the problems, so let's focus on the browser and on other tools.

The other requirement listed above, from someone who presumably wishes to 
avoid the user having to do any extra work, is:

     * Shouldn't require the consumer to write XSLT or server-side code to
       derive this information from the page.

This is worth bearing in mind as we look at how browsers and other tools 
might help solve the problem.

First let's look at the scenarios again, from the perspective of the 
client software:

     * Alfred merges data from various sources in a static manner, generating
       a new set of data. Bob later uses this static data in conjunction with
       other data sets to generate yet another set of static data. Julie then
       visits Bob's page later, and wants to know where and when the various
       sources of data Bob used come from, so that she can evaluate its
       quality.

From the browser's point of view, Julie is viewing a Web page with some 
data, say, some HTML <table>s, and requests the browser's help in 
identifying the source of the data.

It's not clear to me that the browser could do _anything_ at this point 
that would solve the problem. Without help from the page, finding the 
origin of data is a search problem, and the browser doesn't really have 
anywhere to begin from.

     * TV guide listings - If the TV guide provider does not render a link to
       IMDB, the browser should recognise TV shows and give implicit links.

From the browser's point of view, the user is visiting a page with various 
bits of text on it.

There has been some work in the area of having browsers give implicit 
links, but that has historically not been successful at all:

   http://en.wikipedia.org/wiki/Smart_tag_(Microsoft)

However, as that Wikipedia page points out, what _has_ been moderately 
successful (so far) is the idea of having the browser offer links when the 
user selects some text. Thus, if the user is an IMDB user, he could select 
the TV show title, and select "IMDB" from the resulting menu.

This solution does solve the problem without XSLT or server-side consumer 
code. Thus, this appears to be a solution to this particular scenario.
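
To make the select-then-offer-links idea concrete, here is a minimal Python 
sketch of what such a feature might do with the selected text. The provider 
table, the function name, and the URL templates are all illustrative 
assumptions, not any browser's actual configuration or API.

```python
from urllib.parse import quote_plus

# Hypothetical provider table: the labels and URL templates are
# illustrative assumptions, not a real browser's configuration format.
PROVIDERS = {
    "IMDB": "https://www.imdb.com/find?q={terms}",
    "Web search": "https://www.google.com/search?q={terms}",
}


def links_for_selection(selected_text, providers=PROVIDERS):
    """Return {menu label: URL} for the text the user has selected."""
    # Collapse stray whitespace picked up by the selection.
    terms = " ".join(selected_text.split())
    if not terms:
        return {}
    encoded = quote_plus(terms)
    return {label: template.format(terms=encoded)
            for label, template in providers.items()}
```

The page author does nothing here: the only inputs are the user's selection 
and the user's own list of preferred providers.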

     * Students and teachers should be able to discover each other -- both
       within an institution and across institutions -- via their blogging.

From the browser's point of view, the user is browsing one page, and 
desires other pages that are similar in a particular way. Again, this is 
fundamentally a search problem, so it's not clear that there's anything 
that could be done to address it from the browser.

     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush.

Here the client is not a browser, but some other tool, whose job it is to 
populate a knowledge base from statements in Spanish and English. Almost 
by definition then, it seems like this tool should, as part of its 
operation, be able to convert English and Spanish into the knowledge 
base's format. Such tools currently are not widely available to the 
general public. That's probably ok, though, since to be honest, the 
general public is unlikely to make direct use of knowledge bases at this 
point anyway.

Whether this requires some code from the user (as opposed to being 
automatic) depends on the software, but software that can interpret such 
statements (i.e. AI or NLP software) would presumably do so without help 
from the user.

Unfortunately, such solutions are somewhat hypothetical at this point. 
Thus client software is a possible solution, but not a great one.

Let's look at the scenarios again from the point of view of a third-party 
software provider, e.g. a search engine:

     * Alfred merges data from various sources in a static manner, generating
       a new set of data. Bob later uses this static data in conjunction with
       other data sets to generate yet another set of static data. Julie then
       visits Bob's page later, and wants to know where and when the various
       sources of data Bob used come from, so that she can evaluate its
       quality.

This is a problem that can be solved by a provider with a large amount of 
data. For prose, e.g. to search for the original source of a quote or 
syndicated blog post, this problem is in fact mostly solved -- search for 
a representative string from the document, and use your search engine's 
features to focus on the document that first used that phrase. This is a 
harder problem for numbers, though, and is even harder if the mashup 
doesn't include the original data.
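
For the prose case, the "representative string" step could be sketched 
roughly as below. This is only an assumption about one simple heuristic 
(take the longest sentence and quote its first few words as an exact-phrase 
query); the function names are hypothetical, and finding the earliest 
document that used the phrase is left to the search engine.

```python
import re


def representative_phrase(text, max_words=12):
    """Pick the longest sentence and keep its first max_words words."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    if not sentences:
        return ""
    longest = max(sentences, key=lambda s: len(s.split()))
    words = longest.rstrip(".!?").split()[:max_words]
    return " ".join(words)


def exact_phrase_query(text, max_words=12):
    """Wrap the representative phrase in quotes for exact-phrase search."""
    phrase = representative_phrase(text, max_words)
    # Drop embedded double quotes so the query stays a single phrase.
    return '"{}"'.format(phrase.replace('"', "")) if phrase else ""
```

Nothing comparable works for bare numbers, which is why the numeric case 
remains the harder one.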

     * TV guide listings - If the TV guide provider does not render a link to
       IMDB, the browser should recognise TV shows and give implicit links.

A search engine can be used to search for the show, which will then 
provide links that the user wants relating to that show. This solution has 
the advantage of already being part of a user's daily routine.

     * Students and teachers should be able to discover each other -- both
       within an institution and across institutions -- via their blogging.

Search engines are in a unique position to find pages similar to each 
other, and indeed many search engines have had a "similar pages" feature 
for some time. It seems plausible that tools could be developed that 
specifically search for blogs that appear to be from students or teachers, 
if there is a demand for this.

     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush.

Third party tools do not seem useful for solving this problem (except 
insofar as they could be used to augment the solution described for this 
problem earlier with client-side tools).

In conclusion:

Some of these scenarios are easy to solve with existing technology. Others 
will require advances in natural language processing or other large-corpus 
technologies. None of this particular set of use cases, it seems, is 
particularly well addressed by microdata markup.

For finding information about a page, such as what data an analysis is 
based on, or finding other pages from people with similar interests, when 
the page author isn't really interested in participating in a system that 
would aid this, tools such as search engines seem like the most promising 
solution. These solutions are still immature, though, and much work in 
this area is to be expected going forward.

For using information on a page, such as a TV show's title, to navigate to 
other sites, such as IMDB, the browser's own UI is probably the best 
starting point. Features such as IE8's Accelerators and Mac OS X's "Search 
in Google" address this use case adequately and in an extensible and 
simple fashion that the user can tune to his or her preferences.

Finally, turning text in multiple languages into machine-processable data 
is a problem that will likely be the continued target of focused research 
both in corporate R&D and in academia, and efforts like Wolfram Alpha 
indicate that this is in fact an area with growing interest.

Thus, I haven't added anything to HTML5 to address the above use cases.

A number of further use cases remain to be examined. I will send further 
e-mail hopefully this week as I address them.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'