[whatwg] Getting data out of poorly written Web pages
Ian Hickson
ian at hixie.ch
Mon May 4 20:01:51 PDT 2009
One of the use cases I collected from the e-mails sent in over the past
few months was the following:
USE CASE: Getting data out of poorly written Web pages, so that the user
can find more information about the page's contents.
SCENARIOS:
* Alfred merges data from various sources in a static manner, generating
a new set of data. Bob later uses this static data in conjunction with
other data sets to generate yet another set of static data. Julie then
visits Bob's page later, and wants to know where and when the various
sources of data Bob used come from, so that she can evaluate its
quality. (In this instance, Alfred and Bob are assumed to be
uncooperative, since creating a static mashup would be an example of a
poorly-written page.)
* TV guide listings - If the TV guide provider does not render a link to
IMDB, the browser should recognise TV shows and give implicit links.
(In this instance, it is assumed that the TV guide provider is
uncooperative, since it isn't providing the links the user wants.)
* Students and teachers should be able to discover each other -- both
within an institution and across institutions -- via their blogging.
(In this instance, it is assumed that the teachers and students aren't
cooperative, since they would otherwise be able to find each other by
listing their blogs in a common directory.)
* Tim wants to make a knowledge base seeded from statements made in
Spanish and English, e.g. from people writing down their thoughts
about George W. Bush and George H.W. Bush. (In this instance, it is
assumed that the people writing the statements aren't cooperative,
since if they were they could just add the data straight into the
knowledge base.)
REQUIREMENTS:
* Does not need cooperation of the author (if the page author was
cooperative, the page would be well-written).
* Shouldn't require the consumer to write XSLT or server-side code to
derive this information from the page.
One class of solutions that was proposed to address this is the idea
of getting the author to mark up microdata (small bits of data) in the
page, annotating the information that is needed to complete the scenarios.
Such formats could be RDFa, Microformats, n3, a custom format for HTML5,
or any number of other syntaxes. However, it's not clear that this would
help in this case, since the underlying assumption with these particular
problems is that the author isn't actively cooperating with the user
(likely due to ignorance, of course, not malice).
Let's examine these use cases with a microdata solution in mind:
* Alfred merges data from various sources in a static manner, generating
a new set of data. Bob later uses this static data in conjunction with
other data sets to generate yet another set of static data. Julie then
visits Bob's page later, and wants to know where and when the various
sources of data Bob used come from, so that she can evaluate its
quality.
Here, Julie is two steps removed from the original data. Since we are
assuming here that Alfred and Bob are not cooperating with Julie, we must
also assume that they haven't included this information on the page. If
they haven't included it, then microdata doesn't help, as there is nothing
to mark up.
* TV guide listings - If the TV guide provider does not render a link to
IMDB, the browser should recognise TV shows and give implicit links.
(In this instance, it is assumed that the TV guide provider is
uncooperative, since it isn't providing the links the user wants.)
If the TV guide listing page were cooperative, it would just provide the
links to IMDB that the user wants. It isn't; we cannot, therefore,
assume that it would be ready to include microdata that would let the user
find the relevant page on IMDB using a tool that consumes microdata.
* Students and teachers should be able to discover each other -- both
within an institution and across institutions -- via their blogging.
The obvious solution here is for the students and teachers to simply
register their blogs in a common directory. However, assuming that they
are not even doing that, it is unlikely that they _would_ include some
kind of microdata in their pages to solve the problem.
* Tim wants to make a knowledge base seeded from statements made in
Spanish and English, e.g. from people writing down their thoughts
about George W. Bush and George H.W. Bush.
If the people writing down their thoughts were to be "hip" enough to write
their thoughts using microdata annotations, they'd also be able to just
add them to the knowledge base directly. So again, we must assume that
this is a case where we can't rely on microdata.
Thus, we have our first requirement:
* Does not need cooperation of the author.
If we take the author out of the equation, there are three other parties
that could help solve this problem:
1. The user.
2. The user's client tool provider (e.g. browser vendor).
3. Third party tool providers (e.g. web sites, search engines).
Relying on the user to solve these problems is somewhat missing the point
of solving the problems, so let's focus on the browser and on other tools.
The other requirement listed above, from someone who presumably wishes to
avoid the user having to do any extra work, is:
* Shouldn't require the consumer to write XSLT or server-side code to
derive this information from the page.
This is worth bearing in mind as we look at how browsers and other tools
might help solve the problem.
First let's look at the scenarios again, from the perspective of the
client software:
* Alfred merges data from various sources in a static manner, generating
a new set of data. Bob later uses this static data in conjunction with
other data sets to generate yet another set of static data. Julie then
visits Bob's page later, and wants to know where and when the various
sources of data Bob used come from, so that she can evaluate its
quality.
From the browser's point of view, Julie is viewing a Web page with some
data, say, some HTML <table>s, and requests the browser's help in
identifying the source of the data.
It's not clear to me that the browser could do _anything_ at this point
that would solve the problem. Without help from the page, finding the
origin of data is a search problem, and the browser doesn't really have
anywhere to begin from.
* TV guide listings - If the TV guide provider does not render a link to
IMDB, the browser should recognise TV shows and give implicit links.
From the browser's point of view, the user is visiting a page with various
bits of text on it.
There has been some work in the area of having browsers give implicit
links, but that has historically not been successful at all:
http://en.wikipedia.org/wiki/Smart_tag_(Microsoft)
However, as that Wikipedia page points out, what _has_ been moderately
successful (so far) is the idea of having the browser offer links when the
user selects some text. Thus, if the user is an IMDB user, he could select
the TV show title, and select "IMDB" from the resulting menu.
This solution does solve the problem without XSLT or server-side consumer
code. Thus, this appears to be a solution to this particular scenario.
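As a rough illustration of the select-then-link idea, here is a minimal
Python sketch of how a client might turn a text selection into a menu of
links. The provider names, the URL templates, and the function name
links_for_selection are assumptions made for this example; they are not
taken from any particular browser's implementation.

```python
from urllib.parse import quote_plus

# Hypothetical provider table: labels and URL templates are illustrative
# assumptions, standing in for whatever the user has configured.
PROVIDERS = {
    "IMDB": "https://www.imdb.com/find?q={query}",
    "Web search": "https://www.google.com/search?q={query}",
}


def links_for_selection(selected_text, providers=PROVIDERS):
    """Return a {menu label: URL} mapping for the text the user selected."""
    # Collapse runs of whitespace, then URL-encode the selection.
    query = quote_plus(" ".join(selected_text.split()))
    if not query:
        # Nothing meaningful was selected, so offer no links.
        return {}
    return {
        label: template.format(query=query)
        for label, template in providers.items()
    }
```

The point of the sketch is that nothing here depends on the page: the
selection comes from the user and the provider list comes from the user's
own preferences, so no cooperation from the author is needed.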
* Students and teachers should be able to discover each other -- both
within an institution and across institutions -- via their blogging.
From the browser's point of view, the user is browsing one page, and
desires other pages that are similar in a particular way. Again, this is
fundamentally a search problem, so it's not clear that there's anything
that could be done to address it from the browser.
* Tim wants to make a knowledge base seeded from statements made in
Spanish and English, e.g. from people writing down their thoughts
about George W. Bush and George H.W. Bush.
Here the client is not a browser, but some other tool, whose job it is to
populate a knowledge base from statements in Spanish and English. Almost
by definition then, it seems like this tool should, as part of its
operation, be able to convert English and Spanish into the knowledge
base's format. Such tools currently are not widely available to the
general public. That's probably ok, though, since to be honest, the
general public is unlikely to make direct use of knowledge bases at this
point anyway.
Whether this requires some code from the user (as opposed to being
automatic) depends on the software, but software that can interpret such
statements (i.e. AI or NLP software) would presumably do so without help
from the user.
Unfortunately, such solutions are somewhat hypothetical at this point.
Thus client software is a possible solution, but not a great one.
Let's look at the scenarios again from the point of view of a third-party
software provider, e.g. a search engine:
* Alfred merges data from various sources in a static manner, generating
a new set of data. Bob later uses this static data in conjunction with
other data sets to generate yet another set of static data. Julie then
visits Bob's page later, and wants to know where and when the various
sources of data Bob used come from, so that she can evaluate its
quality.
This is a problem that can be solved by a provider with a large amount of
data. For prose, e.g. to search for the original source of a quote or
syndicated blog post, this problem is in fact mostly solved -- search for
a representative string from the document, and use your search engine's
features to focus on the document that first used that phrase. This is a
harder problem for numbers, though, and is even harder if the mashup
doesn't include the original data.
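To make the "representative string" step concrete, here is a minimal
Python sketch that picks a phrase from a document and wraps it in quotes
for an exact-phrase search. Choosing the longest sentence is only a
heuristic assumed for this example, and the function names and the
max_words cutoff are hypothetical; finding the document that first used
the phrase is still left to the search engine.

```python
import re


def representative_phrase(text, max_words=12):
    """Pick a phrase likely to be distinctive: the start of the longest sentence."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Assumed heuristic: the sentence with the most words is the most distinctive.
    longest = max(sentences, key=lambda s: len(s.split()))
    return " ".join(longest.split()[:max_words])


def exact_phrase_query(text, max_words=12):
    """Wrap the representative phrase in quotes for an exact-phrase search."""
    return '"' + representative_phrase(text, max_words) + '"'
```

This only covers prose. As noted above, nothing comparable works for
tables of numbers.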
* TV guide listings - If the TV guide provider does not render a link to
IMDB, the browser should recognise TV shows and give implicit links.
A search engine can be used to search for the show, which will then
provide links that the user wants relating to that show. This solution has
the advantage of already being part of a user's daily routine.
* Students and teachers should be able to discover each other -- both
within an institution and across institutions -- via their blogging.
Search engines are in a unique position to find pages similar to each
other, and indeed many search engines have had a "similar pages" feature
for some time. It seems plausible that tools could be developed that
specifically search for blogs that appear to be from students or teachers,
if there is a demand for this.
* Tim wants to make a knowledge base seeded from statements made in
Spanish and English, e.g. from people writing down their thoughts
about George W. Bush and George H.W. Bush.
Third party tools do not seem useful for solving this problem (except
insofar as they could be used to augment the solution described for this
problem earlier with client-side tools).
In conclusion:
Some of these scenarios are easy to solve with existing technology. Others
will require advances in natural language processing or other large-corpus
technologies. None of the use cases in this particular set, it seems, is
well addressed by microdata markup.
For finding information about a page, such as what data an analysis is
based on, or finding other pages from people with similar interests, when
the page author isn't really interested in participating in a system that
would aid this, tools such as search engines seem like the most promising
solution. These solutions are still immature, though, and much work in
this area is to be expected going forward.
For using information on a page, such as a TV show's title, to navigate to
other sites, such as IMDB, the browser's own UI is probably the best
starting point. Features such as IE8's Accelerators and Mac OS X's "Search
in Google" address this use case adequately and in an extensible and
simple fashion that the user can tune to his or her preferences.
Finally, turning text in multiple languages into machine-processable data
is a problem that will likely be the continued target of focused research
both in corporate R&D and in academia, and efforts like Wolfram Alpha
indicate that this is in fact an area with growing interest.
Thus, I haven't added anything to HTML5 to address the above use cases.
A number of further use cases remain to be examined. I hope to send
further e-mail this week as I address them.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'