[whatwg] Trying to work out the problems solved by RDFa

Wed Dec 31 20:41:19 PST 2008

Summary:

I believe that there are use cases for RDFa - and that they are precisely  
the sort of thing that Yahoo, Google, Ask, and their ilk are not going to  
be interested in, since they are based on solving problems that those  
search engines do not efficiently solve, such as (among others) using  
private data or dealing with trustworthy data to answer very specific  
questions automatically.

If Ian needs to understand the Semantic Web Industry and why people have  
invested in the RDFa proposal, then it is important to identify the right  
questions, and having him alone identify the sub-questions when he doesn't  
understand the issue isn't going to help him make a well-informed decision.

Some of Ian's questions are discussed here. I cut the mail "short" since I  
think it is already too long for many people, which means that the debate  
will simply pass without their reading or input.

On Wed, 31 Dec 2008 20:46:01 +1100, Ian Hickson <ian at hixie.ch> wrote:

> One of the outstanding issues for HTML5 is the question of whether HTML5
> should solve the problem that RDFa solves, e.g. by embedding RDFa
...
> Before I can determine whether we should solve this problem, and before I
> can evaluate proposals for solving this problem, I need to learn what the
> problem is.
>
> Earlier this year, there was a thread on RDFa on the WHATWG list. Very
> little of the thread focused on describing the problem. This e-mail is an
> attempt to work out what the problem is based on that feedback, on
> discussions at the recent TPAC, and on other research I have done.
>
>
> On Mon, 25 Aug 2008, Manu Sporny wrote:
>> Ian Hickson wrote:
>> > I have no idea what problem RDFa is trying to solve. I have no idea
>> > what the requirements are.
>>
>> Web browsers currently do not understand the meaning behind human
>> statements or concepts on a web page. If web browsers could understand
>> that a particular page was describing a piece of music, a movie, an
>> event, a person or a product, the browser could then help the user find
>> more information about the particular item in question. It would help
>> automate the browsing experience. Not only would the browsing experience
>> be improved, but search engine indexing quality would be better due to a
>> spider's ability to understand the data on the page with more accuracy.
>
> Let's see if I can rephrase that in terms of requirements.
>
> * Web browsers should be able to help users find information related to
>   the items that page they are looking at discusses.
>
> * Search engines should be able to determine the contents of pages with
>   more accuracy than today.
>
> Is that right?
>
> Are those the only requirements/problems that RDFa is attempting to
> address? If not, what other requirements are there?

I don't think so. I think there are some other requirements:

A standard way to include arbitrary data in a web page and extract it for
machine processing, without the parties involved having to pre-coordinate
their data models.

Since many people use RDF as an interchange, storage and processing format
for this kind of data (because it provides for automated mapping of data
from one schema to many others, without requiring anyone to touch the
original schemata or agree in advance how they should be created), I
believe there is a requirement for a method that allows third parties to
include RDF data in, and extract it from, information encoded within an
HTML page.
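
As a rough illustration of what "extract it from an HTML page" could mean
in practice, here is a minimal sketch that pulls subject/property/value
triples out of markup. It assumes a heavily simplified subset of
RDFa-style attributes (`about`, `property`, `content`); the class name,
the `extract` helper and the example.org URLs are hypothetical, and a real
RDFa processor handles far more (prefixes, `rel`, `resource`, `typeof`,
datatypes, properly scoped subjects).

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (subject, property, value) triples from a simplified
    subset of RDFa-style attributes. Not a conforming RDFa processor."""

    def __init__(self, base):
        super().__init__()
        self.subject = base      # current subject; defaults to the page
        self.triples = []
        self._pending = None     # property waiting for its text value

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # Simplification: `about` changes the subject for everything that
        # follows, instead of being scoped to the element's subtree.
        if "about" in a:
            self.subject = a["about"]
        if "property" in a:
            if "content" in a:
                self.triples.append((self.subject, a["property"], a["content"]))
            else:
                self._pending = a["property"]

    def handle_data(self, data):
        # Use the first non-empty text after a `property` as its value.
        if self._pending and data.strip():
            self.triples.append((self.subject, self._pending, data.strip()))
            self._pending = None


def extract(html, base="http://example.org/page"):
    parser = PropertyExtractor(base)
    parser.feed(html)
    parser.close()
    return parser.triples
```

The point of the sketch is that the extractor knows nothing about books,
events or any other data model; it only knows the generic attribute
conventions.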

>> The Microformats community has done a remarkable job of working on the
>> web semantics problem, creating several different methods of expressing
>> common human concepts (contact information (hCard), events (hCalendar),
>> and audio recordings (hAudio)).
>
> Right; with Microformats, each Microformat has its own problem space and
> thus each one can be evaluated separately. It is much harder to evaluate
> something when the problem space is as generic as it appears RDFa's is.

The point is that there is a very large set of very small problem spaces,
each relevant to a small group at a time. Like RDF itself, RDFa is meeting
the problem of allowing these people to share machine-processable data
without previously coordinating their approach.

>> The results of the first set of Microformats efforts were some pretty
>> cool applications, like the following one demonstrating how a web
>> browser could forward event information from your PC web browser to your
>> phone via Bluetooth:
>>
>> http://www.youtube.com/watch?v=azoNnLoJi-4
>
> It's a technically very interesting application. What has the adoption
> rate been like? How does it compare to other solutions to the problem,
> like CalDav, iCal, or Microsoft Exchange? Do people publish calendar
> events much? There are a lot of Web-based calendar systems, like MobileMe
> or WebCalendar. Do people expose data on their Web page that can be used
> to import calendar data to these systems?

In some cases this data is indeed exposed in Web pages. However, anecdotal
evidence (which unfortunately is all that is available when trying to  
study the enormous collections of data in private intranets) suggests that  
this is significantly more valuable when it can be done within a  
restricted access website.

...
>> In short, RDFa addresses the problem of a lack of a standardized
>> semantics expression mechanism in HTML family languages.
>
> A standardized semantics expression mechanism is a solution. The lack of  
> a solution isn't a problem description. What's the problem that a
> standardized semantics expression mechanism solves?

There are many many small problems involving encoding arbitrary data in  
pages - apparently at least enough to convince you that the data-*  
attributes are worth incorporating.

There are many cases where being able to extract that data with a simple  
toolkit from someone else's content, or using someone else's toolkit  
without having to tell them about your data model, solves a local problem.  
The data-* attributes, because they do not represent a formal model that  
can be manipulated, are insufficient to enable sharing of tools which can  
extract arbitrary modelled data.

RDF, in particular, also provides established ways of merging existing data
encoded in different existing schemata.
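
A minimal sketch of that kind of merge, assuming data is held as plain
(subject, property, value) tuples rather than in a real RDF store. The
`a:`/`b:` vocabularies and the `merge` helper are hypothetical;
`owl:equivalentProperty` is a real OWL term, but here it is applied in one
direction only, as a simple rewrite rule rather than with full OWL
semantics.

```python
EQUIV = "owl:equivalentProperty"


def merge(*graphs):
    """Union several sets of triples, rewriting any property that a
    mapping statement declares equivalent to another one."""
    merged = set().union(*graphs)
    # Mapping statements are themselves just triples in the data.
    alias = {s: o for (s, p, o) in merged if p == EQUIV}
    return {(s, alias.get(p, p), o) for (s, p, o) in merged if p != EQUIV}


# Two datasets created independently, with different schemata.
library_a = {("book:1", "a:title", "Moby Dick")}
library_b = {("book:2", "b:name", "Walden")}

# A third party supplies the mapping later, without touching either source.
mapping = {("b:name", EQUIV, "a:title")}

combined = merge(library_a, library_b, mapping)
```

Neither original dataset has to change, and the mapping can come from
someone who owns neither of them.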

There are many cases where people build their own dataset and queries to  
solve a local problem. As an example, Opera is not interested in asking
Google to index data related to internal developer documents, and use it  
to produce further documentation we need. However, we do automatically  
extract various kinds of data from internal documents and re-use it. While  
Opera does not in fact use the RDF toolstack for that process, there are  
many other large companies and organisations who do, and who would benefit  
from being able to use RDFa in that process.

>> RDFa not only enables the use cases described in the videos listed
>> above, but all use cases that struggle with enabling web browsers and
>> web spiders understand the context of the current page.
>
> It would be helpful if we could list these use cases clearly and in  
> detail so that we could evaluate the solutions proposed against them.
>
> Here's a list of the use cases and requirements so far in this e-mail:
>
> * Web browsers should be able to help users find information related to
>   the items that page they are looking at discusses.
>
> * Search engines should be able to determine the contents of pages with
>   more accuracy than today.
>
> * Exposing calendar events so that users can add those events to their
>   calendaring systems.
>
> * Exposing music samples on a page so that a user can listen to all the
>   samples.
>
> * Getting data out of poorly written Web pages, so that the user can find
>   more information about the page's contents.
>
> * Finding more information about a movie when looking at a page about the
>   movie, when the page contains detailed data about the movie.
>
> Can we list some more use cases?
>
>
> Here are some other questions that I would like the answers to so that I
> can better understand what is being proposed here:
>
> Does it make sense to solve all these problems with the same syntax?

That depends on the answers to your next two questions.

Moreover, that is not actually a very good question in this case. I think  
the judgement call should be whether a syntax that allows people to solve  
the identified problem set consistently is sufficiently valuable (measured  
in terms of the advantages weighed against the disadvantages) to justify  
being part of HTML5.

> What are the disadvantanges of doing so?

I am not sure.

> What are the advantages?

Many people will be able to use standard tools which are part of their  
existing infrastructure to manipulate important data. They will be able to  
store that data in a visible form, in web pages. They will also be able to  
present the data easily in a form that does not force them to lose  
important semantics.

People will be able to build toolkits that allow for processing of data  
from webpages without knowing, a priori, the data model used for that
information.

> What is the
> opportunity cost of encouraging everyone to expose data in the same way?

I don't know. I don't see much of an opportunity cost.

> What is the cost of having different data use specialised formats?

If the data model, or a part of it, is not explicit as in RDF but is
implicit in code made to treat it (as is the case with using scripts to
process things stored in arbitrarily named data-* attributes, and is also
the case in using undocumented or semi-documented XML formats), it requires
people to understand the code as well as the data model in order to use
the data. In a corporate situation where hundreds or tens of thousands of
people are required to work with the same data, this makes the data model
very fragile.

Such considerations also apply to larger communities, for example those  
dealing with complex scientific information.

> Do publishers actually want to use a common data format?

It would appear so - even in cases where they don't want to publish their  
data in such an easy-to-use format for commercial reasons.

> How have past efforts in creating data formats fared?

Some have been pretty successful. Dublin Core is a general format for  
labelling content that is widely used. MARC records have been very  
successful.

> Are enough data providers actually willing to expose their data in a
> machine readable manner for this to be truly useful?

To make this truly useful it doesn't need to be exposed to the public. It  
would appear that organisations are prepared to make large investments in  
RDF data whether they expose them or not (and some very large ones do  
expose data), which suggests that this data is truly useful.

> If data providers
> will be willing to expose their data as RDFa, why are they not already
> exposing their data in machine-readable form today?
>
>  - For example, why doesn't Amazon expose a CSV file of your usage
>    history, or an Atom feed of the comments for each product, or an
>    hProduct annotated form of their product data? (Or do they? And if so,
>    do we know if users use this data?)

Why would they need to?

>  - As another example, why doesn't Craigslist like their data being  
>    reused in mashups? Would they be willing to allow their users to reuse
>    their data in these new and exciting ways, or would they go out of
>    their way to prevent the data from being accessible as soon as a
>    critical mass of users started using it?

This is a key question. Why *should* a data provider be required to offer
their product (data) for other people to use, in order to demonstrate that
the data is useful? Google, a large provider of data, insists on certain
conditions being met before it makes its services available, and that
seems perfectly reasonable to me.

Whether Craigslist actively attempts to make their data easier to
aggregate, or actively avoids facilitating that process, strikes me as
irrelevant to the question of whether there is value in enabling them to
do so, because large organisations specialising in gathering people's
data, from Flickr to Google and Facebook to Government taxation
departments, are not the only consumers and producers of data that
determine value for users.

It would seem important that the Web easily enable small-time users of
data to efficiently communicate with one another, without the need to have
one of the giants as an intermediary. When libraries in the Dominican
Republic want to share data, and librarians in Léon want to use that data,
it seems that the Web should facilitate that without resorting to
intermediaries like Amazon or Yahoo!. Since we already have the
technology to do so in a way that enables very powerful data models to be
used without requiring coordination, it seems odd that you don't even
understand how this could be valuable.

> What will the licensing situation be like for this data? Will the  
> licenses allow for the reuse being proposed to solve the problems and
> use cases listed above?

In some cases yes, and in some cases no. In other words, making such data  
available does not distort natural market conditions one way or another.

> How are Web browsers going to expose user interfaces to answer user
> questions?

I am glad to see that you think user interface behaviour is in fact  
important to the process of specifying HTML (I had been under the  
impression that you believed the spec should not touch on it). There are  
various query systems already available in browsers, from the search  
engine in Opera that lets you do a free-text search on pages stored in  
your history to Tabulator - a substantial RDF browser available as a  
Widget for Opera or as an extension to Firefox, that allows for a variety  
of pre-configured questions as well as free-form questions.

> Can only previously configured, hard-coded questions be asked,
> or will Web browsers be able to answer arbitrary free-form questions from
> users using the data exposed by RDFa?

Both of these are possible. The value of RDFa is that it actually supports  
the possibility of asking free-form questions by using a data model that  
is sufficiently well specified to enable construction of tools that are
not dependent on being preconfigured to recognise the exact type of data  
being queried (unlike, say, microformats, which require an intermediate  
agreement to enable people to extract the data, and don't provide for  
merging data of different types for rich queries).
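
To make the "not preconfigured" point concrete, here is a minimal sketch
of a generic pattern query over triples. The `match` and `events_near`
helpers, the identifiers and the property names are all hypothetical, and
a real system would use an RDF store and a query language such as SPARQL
rather than Python lists.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical event data and contact data, published independently
# but sharing one generic (subject, property, value) model.
triples = [
    ("event:1", "cal:summary", "Standards meetup"),
    ("event:1", "cal:location", "Oslo"),
    ("person:1", "foaf:name", "Alice"),
    ("person:1", "contact:city", "Oslo"),
    ("person:2", "foaf:name", "Bob"),
    ("person:2", "contact:city", "Madrid"),
]


def events_near(triples, person):
    """Events located in the city recorded for `person`: a query that
    joins two kinds of data the query tool was never told about."""
    cities = {o for (_, _, o) in match(triples, s=person, p="contact:city")}
    return sorted(
        s for (s, _, o) in match(triples, p="cal:location") if o in cities
    )
```

`match` has no knowledge of events or people; the question being asked is
expressed entirely in the pattern, which is what allows new questions to
be posed without new extraction code.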

> How are Web browsers that expose this data going to handle data that is
> not exposed in the same format? For example, if a site exposes data in
> JSON or CSV format rather than RDFa, will that data be available to the
> user in the same way?

Who cares? But for those who do, this is up to Web browsers. They can  
choose to implement transformations between some particular CSV data and  
RDFa. The difficulty here (and therefore an illustration of the value of
RDFa) is that CSV data has important details of the meaning of the data  
only available out of band in looking at how the data is recorded, while  
RDF allows for automating the process of merging data originally encoded  
in different RDFa vocabularies.
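
A minimal sketch of such a transformation, to show where the out-of-band
knowledge enters. The column names, the `COLUMN_MEANING` table and the
`csv_to_triples` helper are hypothetical; the property names borrow the
Dublin Core `dc:` prefix purely for illustration.

```python
import csv
import io

# The out-of-band part: nothing in the CSV file says what "ttl" or "yr"
# mean, so a person who knows the data has to write this table by hand.
COLUMN_MEANING = {"ttl": "dc:title", "yr": "dc:date"}


def csv_to_triples(text, id_column, column_meaning):
    """Turn CSV rows into (subject, property, value) triples using a
    hand-written mapping from column names to properties."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row[id_column]
        for column, prop in column_meaning.items():
            triples.append((subject, prop, row[column]))
    return triples


sample = "id,ttl,yr\nbook:1,Moby Dick,1851\n"
rows = csv_to_triples(sample, "id", COLUMN_MEANING)
```

A different CSV file needs a different hand-written table, whereas data
that already carries its property names needs no such per-source step.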

...

> What is the expected strategy to fight spam in these systems? Is it
> expected that user agents will just collect data in the background? If  
> so, how are user agents expected to distinguish between pages that have
> reliable data and pages that expose data that is misleading or wrong?

Aggregating data in real-time is relatively expensive, so it is a strategy
better suited to dealing with new questions. Typical systems so far
have aggregated data in the background to deal with known queries (one  
example is Google, which crawls pages in advance, anticipating searches  
that match terms against the content of those pages), and use live  
querying for cases where the result cannot reliably be stored (e.g.  
airline reservation systems like TravelJungle or LastMinute which  
determine price and availability based on constantly changing data).

Different use cases will imply different strategies for fighting spam.
Some obvious ones are to rely on trusted sites and secured and signed
data, to use reputation managers, and to follow the "shape" of data over
time so that anomalies can be highlighted and checked more carefully (in
the manner of Bayesian filters for email). Some use cases don't care much
about spam, or are not very interesting to spammers. Some use cases are  
private data anyway.

>  - Systems like Yahoo! Search and Live Search expend extraordinary  
>    amounts of resources on spam fighting technology; such technology
>    would not be accessible to Web browsers unless they interacted with
>    anti-spam services much like browsers today interact with
>    anti-phishing services.

Actually, at least Opera already incorporates anti-spam technology in its  
mail client. Where browsers are the primary consumers of data there is  
nothing at all to suggest that they cannot incorporate anti-spam  
technology directly. (Indeed, the POWDER specification is designed in part  
to make that easy - and it is exactly the sort of data that might  
sometimes be usefully encoded in RDFa since it is based on an RDF model).

>    Yet anti-phishing services have been controversial, since they involve
>    exposing the user's browsing history to third parties; anti-spam
>    services would be a significantly greater problem due to the vastly
>    greater level of spamming compared to phishing. What is the solution
>    proposed to tackle this problem?

It is not clear that this problem is any different in the context of RDFa  
to the general problem already faced by the Web. In general, the solutions  
proposed are the same as those already used on the Web, and of course  
those in development.

>  - Even with a mechanism to distinguish trusted sites from spammy sites,
>    how would Web browsers deal with trusted sites that have been subject
>    to spamming attacks? This is common, for instance, on blogs or wikis.

Right. But that doesn't mean we question whether browsers should enable  
blogs or wikis. Why would RDFa data be different enough to make this  
question relevant?

> These are not rhetorical questions, and I don't know the answers to them.

Some of them seem to be poorly phrased, although if you don't understand  
why people have been working on this technology and why they think it  
would be valuable to have it available in HTML I guess that is almost  
inevitable.

> We need detailed answers to all those questions before we can really
> evaluate the various proposals that have been made here.

No, we apparently need you to personally understand the Semantic Web  
Industry. Determining answers to the questions which are important is  
probably helpful, but also helpful is explaining when your questions are  
irrelevant because they are based on a lack of understanding. This is not  
intended as a slight, but to clarify the process required to have  
something as large as the "Semantic Web" (capital letters, implying the
whole W3C activity, the industry based around RDF, and so on) evaluated  
for potential inclusion in the HTML5 specification.

I presume the same would apply if the "Web Services" people came and asked  
to have all of their things included in HTML, and offered a specification  
that could be used to achieve their desires.
...

[not clear what the context was here, so citing as it was]
>> > I don't think more metadata is going to improve search engines. In
>> > practice, metadata is so highly gamed that it cannot be relied upon.
>> > In fact, search engines probably already "understand" pages with far
>> > more accuracy than most authors will ever be able to express.
>>
>> You are correct, more erroneous metadata is not going to improve search
>> engines. More /accurate/ metadata, however, IS going to improve search
>> engines. Nobody is going to argue that the system could not be gamed. I
>> can guarantee that it will be gamed.
>>
>> However, that's the reality that we have to live with when introducing
>> any new web-based technology. It will be mis-used, abused and corrupted.
>> The question is, will it do more good than harm? In the case of RDFa
>> /and/ Microformats, we do think it will do more good than harm.
>
> For search engines, I am not convinced. Google's experience is that
> natural language processing of the actual information seen by the actual
> end user is far, far more reliable than any source of metadata. Thus from
> Google's perspective, investing in RDFa seems like a poorer investment
> than investing in natural language processing.

Indeed. But Google is something of an edge case, since they can afford to  
run a huge organisation with massive computer power and many engineers to  
address a problem where a "near-enough" solution brings them the users
who are in turn the product they sell to advertisers. There are many other  
use cases where a small group of people want a way to reliably search  
trusted data.

From global virtual library systems to a single website, there are many
others who find that processing structured data is more efficient for
their needs than doing free-text analysis of web pages (something that
they effectively contract out to Google, Ask, Yahoo! and their many
competitors who specialise in it). Some of these are the people who have
decided that investing in RDFa is a far more valuable exercise than trying
to out-invest Google in natural language processing.

This email is already too long for most people to get through :( I
believe that this discussion is going to last for some time (I cannot  
imagine why, given the HTML timeline, it would need to be resolved before  
June), so there will be time for others to discuss more fully the many  
points Ian raises as ones he would like to understand.

cheers

Chaals

-- 
Charles McCathieNevile  Opera Software, Standards Group
     je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals       Try Opera: http://www.opera.com