[whatwg] Trying to work out the problems solved by RDFa

Wed Dec 31 01:46:01 PST 2008

One of the outstanding issues for HTML5 is the question of whether HTML5 
should solve the problem that RDFa solves, e.g. by embedding RDFa straight 
into HTML5, or by some other method.

Before I can determine whether we should solve this problem, and before I 
can evaluate proposals for solving this problem, I need to learn what the 
problem is.

Earlier this year, there was a thread on RDFa on the WHATWG list. Very 
little of the thread focused on describing the problem. This e-mail is an 
attempt to work out what the problem is based on that feedback, on 
discussions at the recent TPAC, and on other research I have done.

On Mon, 25 Aug 2008, Manu Sporny wrote:
> Ian Hickson wrote:
> > I have no idea what problem RDFa is trying to solve. I have no idea 
> > what the requirements are.
> 
> Web browsers currently do not understand the meaning behind human 
> statements or concepts on a web page. If web browsers could understand 
> that a particular page was describing a piece of music, a movie, an 
> event, a person or a product, the browser could then help the user find 
> more information about the particular item in question. It would help 
> automate the browsing experience. Not only would the browsing experience 
> be improved, but search engine indexing quality would be better due to a 
> spider's ability to understand the data on the page with more accuracy.

Let's see if I can rephrase that in terms of requirements.

* Web browsers should be able to help users find information related to 
  the items that the page they are looking at discusses.

* Search engines should be able to determine the contents of pages with 
  more accuracy than today.

Is that right?

Are those the only requirements/problems that RDFa is attempting to 
address? If not, what other requirements are there?

> The Microformats community has done a remarkable job of working on the 
> web semantics problem, creating several different methods of expressing 
> common human concepts (contact information (hCard), events (hCalendar), 
> and audio recordings (hAudio)).

Right; with Microformats, each Microformat has its own problem space and 
thus each one can be evaluated separately. It is much harder to evaluate 
something when the problem space is as generic as it appears RDFa's is.

> The results of the first set of Microformats efforts were some pretty 
> cool applications, like the following one demonstrating how a web 
> browser could forward event information from your PC web browser to your 
> phone via Bluetooth:
> 
> http://www.youtube.com/watch?v=azoNnLoJi-4

It's a technically very interesting application. What has the adoption 
rate been like? How does it compare to other solutions to the problem, 
like CalDAV, iCal, or Microsoft Exchange? Do people publish calendar 
events much? There are a lot of Web-based calendar systems, like MobileMe 
or WebCalendar. Do people expose data on their Web page that can be used 
to import calendar data to these systems?

> Here is another demonstration of how one could use music metadata 
> embedded in a web page to find more information about your favorite 
> band:
> 
> http://www.youtube.com/watch?v=oPWNgZ4peuI

There are two main demos in that video.

The first one shows a way to solve the problem of getting all the sample 
tracks from a bitmunk page. Here are the steps that the video shows:

 * Go to the bitmunk Web page.
 * Notice that the Web page has a music note icon in the location bar.
 * Click that icon, and then select the album from the drop down menu.
 * Click the "Get Sample" button on the auto-generated dialog.

Here are the steps that users do today to solve the same problem:

 * Go to the bitmunk Web page.
 * Click the "Play all samples" link.

The second demo shows how to solve the problem of getting data out of a 
poorly written page. However, the example seems contrived; why would an 
author manage to write accurate RDFa statements but fail so utterly to 
write a usable Web page otherwise?

Also, the example goes on to show how given some RDFa, one can do a custom 
search on another site without having to type in any search keywords. But 
that is already possible without RDFa; for example, one can select any 
text on Mac OS X and search for that string in Google ([Start Wearing 
Purple] returns a number of hits for lyrics, videos, tabs, etc about the 
song; [Start Wearing Purple Gogol] returns even more). IE8 has even more 
detailed features along these lines: select some text and you get an 
"accelerator" menu which can be extended to include whatever searches or 
tools you want to use.

So it's not clear that RDFa solves this particular problem better than 
other existing solutions. In particular, it is not clear that RDFa would 
be able to solve the problem at all in the case actually put forward by 
that video -- namely, a poorly written page -- whereas the other solutions 
of today would not be hampered by poor markup.

> or how one could use movie metadata on a web page to find more 
> information about a movie:
> 
> http://www.youtube.com/watch?v=PVGD9HQloDI

The problems shown in that video -- finding out more about a movie -- can 
already be solved today with just as much success simply by using search 
engines. Indeed search engines will typically do a better job -- not only 
do they not need any additional markup to be usable, but they aren't 
hard-coded to particular sites, so if a particular movie is better 
represented by a Wikipedia page, that will be shown, but if IMDB does a 
better job of another, that will be returned instead. The user doesn't 
have to guess which site to use, and the browser doesn't have to have a 
near-infinite list of Web sites for each topic; instead, only 
general-purpose search engines need be supported.

> The Mozilla Labs Aurora demos also show that semantic web markup is 
> necessary in order to execute upon some of the ideas demonstrated in 
> their future browsers project: [...]

I don't agree that these videos show that any special markup is necessary 
beyond things like <table>. The majority of the solutions proposed in that 
video actually boil down to exposing APIs, not data; the data that is 
manipulated tends to be in the form of raw data, not annotations on Web 
pages. Thus the technical solutions that would address the problems 
suggested by the Aurora series would probably argue more for dedicated 
data formats, with mechanisms for publishing and subscribing to URIs based 
on the APIs they provide, rather than anything like Microformats or RDFa.

(Also, I'd want to see usability studies on these ideas before really 
basing anything in HTML5 on them.)

> In short, RDFa addresses the problem of a lack of a standardized 
> semantics expression mechanism in HTML family languages.

A standardized semantics expression mechanism is a solution. The lack of a 
solution isn't a problem description. What's the problem that a 
standardized semantics expression mechanism solves?

> RDFa not only enables the use cases described in the videos listed 
> above, but all use cases that struggle with enabling web browsers and 
> web spiders to understand the context of the current page.

It would be helpful if we could list these use cases clearly and in detail 
so that we could evaluate the solutions proposed against them.

Here's a list of the use cases and requirements so far in this e-mail:

* Web browsers should be able to help users find information related to 
  the items that the page they are looking at discusses.

* Search engines should be able to determine the contents of pages with 
  more accuracy than today.

* Exposing calendar events so that users can add those events to their 
  calendaring systems.

* Exposing music samples on a page so that a user can listen to all the 
  samples.

* Getting data out of poorly written Web pages, so that the user can find 
  more information about the page's contents.

* Finding more information about a movie when looking at a page about the
  movie, when the page contains detailed data about the movie.

Can we list some more use cases?

Here are some other questions that I would like the answers to so that I 
can better understand what is being proposed here:

Does it make sense to solve all these problems with the same syntax? What 
are the disadvantages of doing so? What are the advantages? What is the 
opportunity cost of encouraging everyone to expose data in the same way? 
What is the cost of having different data use specialised formats?

Do publishers actually want to use a common data format? How have past 
efforts in creating data formats fared?

Are enough data providers actually willing to expose their data in a 
machine readable manner for this to be truly useful? If data providers 
will be willing to expose their data as RDFa, why are they not already 
exposing their data in machine-readable form today?

 - For example, why doesn't Amazon expose a CSV file of your usage 
   history, or an Atom feed of the comments for each product, or an 
   hProduct annotated form of their product data? (Or do they? And if so, 
   do we know if users use this data?)

 - As another example, why doesn't Craigslist like their data being reused 
   in mashups? Would they be willing to allow their users to reuse their 
   data in these new and exciting ways, or would they go out of their way 
   to prevent the data from being accessible as soon as a critical mass of 
   users started using it?

 - Would the people contributing to Wikipedia be willing to annotate their 
   edits using a structured data annotation syntax? Would they understand 
   what this meant? If not, is DBpedia enough?

What will the licensing situation be like for this data? Will the licenses 
allow for the reuse being proposed to solve the problems and use cases 
listed above?

How are Web browsers going to expose user interfaces to answer user 
questions? Can only previously configured, hard-coded questions be asked, 
or will Web browsers be able to answer arbitrary free-form questions from 
users using the data exposed by RDFa?

How are Web browsers that expose this data going to handle data that is 
not exposed in the same format? For example, if a site exposes data in 
JSON or CSV format rather than RDFa, will that data be available to the 
user in the same way?

What data will a user be able to interact with? The examples above didn't 
really show any data that the user couldn't access now just using HTML 
without any help from the Web browser; are we expecting Web browsers to 
start exposing more data than that? What will the interface be for this 
data? How will that interface be generated? Is that interface usable?

What is the expected strategy to fight spam in these systems? Is it 
expected that user agents will just collect data in the background? If so, 
how are user agents expected to distinguish between pages that have 
reliable data and pages that expose data that is misleading or wrong? 

 - Systems like Yahoo! Search and Live Search expend extraordinary amounts 
   of resources on spam fighting technology; such technology would not be 
   accessible to Web browsers unless they interacted with anti-spam 
   services much like browsers today interact with anti-phishing services. 
   Yet anti-phishing services have been controversial, since they involve 
   exposing the user's browsing history to third parties; anti-spam 
   services would be a significantly greater problem due to the vastly 
   greater level of spamming compared to phishing. What is the solution 
   proposed to tackle this problem?

 - Even with a mechanism to distinguish trusted sites from spammy sites, 
   how would Web browsers deal with trusted sites that have been subject 
   to spamming attacks? This is common, for instance, on blogs or wikis.

These are not rhetorical questions, and I don't know the answers to them. 
We need detailed answers to all those questions before we can really 
evaluate the various proposals that have been made here.

On Tue, 26 Aug 2008, Ben Adida wrote:
> 
> Here's one example. This is not the only way that RDFa can be helpful, 
> but it should help make things more concrete:
> 
>   http://developer.yahoo.com/searchmonkey/
> 
> Using semantic markup in HTML (microformats and, soon, RDFa), you, as a 
> publisher, can choose to surface more relevant information straight into 
> Yahoo search results.

This doesn't seem to require RDFa or any generic data syntax at all. Since 
the system is site-specific anyway (you have to list the URLs you wish to 
act against), the same kind of mechanism could be done by just extracting 
the data straight out of the page. This would have the advantage of 
working with any Web page without requiring the page to be written using a 
particular syntax.
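
As a rough sketch of what "extracting the data straight out of the page" 
could look like for a site-specific tool: the host name, class names, and 
sample page below are all made up for illustration, and the code uses only 
Python's standard html.parser module. It is not how SearchMonkey works; it 
only shows that per-site rules can pull values out of ordinary markup.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: which class names carry the data we want.
SITE_RULES = {"shop.example": ["product-name", "price"]}

# Elements that never have a closing tag; they must not go on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect text inside elements whose class is in a wanted set.

    This naive version assumes the page's tags are properly nested.
    """

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.open_elements = []  # matched class name (or None) per open tag
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self.open_elements.append(match)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.open_elements:
            self.open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to the innermost open element that matched a rule.
        for match in reversed(self.open_elements):
            if match is not None:
                self.found.setdefault(match, []).append(text)
                return


def extract_for_site(host, html):
    """Apply the rules registered for `host` to a page's HTML."""
    parser = ClassTextExtractor(SITE_RULES[host])
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.found.items()}


SAMPLE_PAGE = """
<div class="product">
  <h1 class="product-name">Example Widget</h1>
  <span class="price">$9.99</span><br>
  <p>Ships in two days.</p>
</div>
"""

product = extract_for_site("shop.example", SAMPLE_PAGE)
```

The trade-off is visible in SITE_RULES: the rules have to be written and 
maintained per site, but the page author does not have to do anything.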

However, if SearchMonkey is an example of a use case, then we should 
determine the requirements for this feature. It seems, based on reading 
the documentation, that it basically boils down to:

 * Pages should be able to expose nested lists of name-value pairs on a 
   page-by-page basis.

 * It should be possible to define globally-unique names, but the syntax 
   should be optimised for a set of predefined vocabularies.

 * Adding this data to a page should be easy.

 * The syntax for adding this data should encourage the data to remain 
   accurate when the page is changed.

 * The syntax should be resilient to intentional copy-and-paste authoring:
   people copying data into the page from a page that already has data 
   should not have to know about any declarations far from the data.

 * The syntax should be resilient to unintentional copy-and-paste 
   authoring: people copying markup from the page who do not know about 
   these features should not inadvertently mark up their page with
   inapplicable data.

Are there any other requirements that we can derive from SearchMonkey?
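
To make the first two requirements above a little more concrete, here is a 
minimal sketch of "nested lists of name-value pairs" with names that can 
be either full URLs or short forms from a predefined vocabulary. The 
prefixes, URLs, and sample data are placeholders invented for this 
example; they are not taken from SearchMonkey or RDFa.

```python
# Hypothetical predefined vocabularies; the URLs are placeholders.
PREDEFINED_VOCABULARIES = {
    "music": "http://example.org/vocab/music#",
    "event": "http://example.org/vocab/event#",
}


def expand_name(name):
    """Turn a short name like 'music:title' into a globally unique name.

    Names that are already full URLs are returned unchanged.
    """
    if name.startswith(("http://", "https://")):
        return name
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREDEFINED_VOCABULARIES:
        return PREDEFINED_VOCABULARIES[prefix] + local
    raise ValueError(f"unknown vocabulary in name: {name!r}")


def expand_item(item):
    """Recursively expand every name in a nested list of (name, value) pairs."""
    expanded = []
    for name, value in item:
        if isinstance(value, list):
            value = expand_item(value)  # a nested list of pairs
        expanded.append((expand_name(name), value))
    return expanded


# One page's worth of data: an album containing one nested track.
album = [
    ("music:title", "Example Album"),
    ("music:track", [
        ("music:title", "Example Track"),
        ("http://example.com/my-terms#sample", "sample.mp3"),
    ]),
]

expanded_album = expand_item(album)
```

Because the prefix table is fixed in advance rather than declared in the 
page, a copied fragment carries no dependency on a declaration elsewhere, 
which is the property the copy-and-paste requirements are asking for.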

In the context of interacting with Amazon:

On Tue, 26 Aug 2008, Manu Sporny wrote:
> 
> 1. "Computer, find more information on this artist."
> 
> 2. "Computer, find the cheapest price for this musical track."
> 
> 3. "Computer, find a popular blog talking about this album."
> 
> 4. "Computer, what other artists has this artist worked with?"
> 
> 5. "Computer, is this a popular track?"

How does the computer expose the UI for these questions? Is it free-form, 
natural language queries? Can you show me an example of software that my 
partner could use which would allow my partner to ask the computer these 
questions?

> Without some form of semantic markup, the computer cannot answer any of 
> those questions for the user.

That's not true. I can do #1 and #3 trivially using a Google search; I can 
do #5 trivially by just looking at the Amazon page itself. #2 can be done 
with a product search on any number of product search engines, and I'm not 
at all convinced that what has been described so far for RDFa could 
actually answer that question anyway (if it could, could you please walk 
me through the exact steps by which the computer would find an answer?). 
I don't know how I would answer #4 today (for movies I would use IMDB's 
specialised search); how would an RDFa Web browser be able to answer that 
question?

> 6. "Computer, I'd like to go to see this movie this week, what times is
>    it playing at the Megaplex 30 that doesn't conflict with my events in 
>    my Google Calendar?"

Is this something that people want to do? I asked my partner sitting next 
to me and the reply was that while they often check sites for movie show 
times, they hadn't considered asking the computer to avoid clashes with 
calendar events -- they would just have the calendar open and not pick a 
clashing time. I personally would tend to do the same thing -- my calendar 
is full of things that can be moved around, and I wouldn't expect the 
computer to know how much time I would need around my other events to make 
it from them to the movie theatre.

Is this something that computers will be able to expose in a generic way? 
It's certainly possible technically for a computer with all the relevant 
information to answer that question, but how do we expect this question to 
be exposed to the user? How do we expect the computer to gather the 
relevant information? Is the constraint to be solved by the user's Web 
browser or by one of the sites involved?
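
For what it's worth, the constraint itself is simple once all the data is 
in one place. The sketch below is only an illustration: the function name, 
the sample times, and the idea of a single fixed travel "buffer" are 
assumptions made for the example. It shows that the answer depends on a 
value (the buffer) that neither the theatre's page nor the calendar 
contains.

```python
from datetime import datetime, timedelta


def non_clashing_showtimes(showtimes, events, film_length, buffer):
    """Return the showtimes that do not overlap any calendar event.

    `events` is a list of (start, end) pairs. `buffer` is the time the
    user needs before and after the film, which has to come from the user.
    """
    free = []
    for start in showtimes:
        busy_from = start - buffer
        busy_until = start + film_length + buffer
        # Two intervals overlap when each starts before the other ends.
        clashes = any(ev_start < busy_until and busy_from < ev_end
                      for ev_start, ev_end in events)
        if not clashes:
            free.append(start)
    return free


day = datetime(2008, 12, 31)
showtimes = [day.replace(hour=14), day.replace(hour=17), day.replace(hour=20)]
events = [(day.replace(hour=16), day.replace(hour=17, minute=30))]
film_length = timedelta(hours=2)

with_travel = non_clashing_showtimes(
    showtimes, events, film_length, timedelta(minutes=30))
without_travel = non_clashing_showtimes(
    showtimes, events, film_length, timedelta(0))
```

With a 30-minute buffer only the 20:00 showing survives; with no buffer 
the 14:00 showing is also offered, even though it ends exactly when the 
16:00 event starts.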

> > It would be helpful if you could walk me through some examples of what 
> > UI you are envisaging in terms of "helping the user find more 
> > information".
> 
> A first cut at a UI can be found in the Fuzzbot demo videos:
> 
> http://www.youtube.com/watch?v=oPWNgZ4peuI 
> http://www.youtube.com/watch?v=PVGD9HQloDI
> 
> However, it's rough and we have spent a total of 4 days on the UI for 
> expressing semantics on a page. Operator is another example of a UI that 
> does semantics detection and display now:
> 
> http://www.youtube.com/watch?v=Kjp4BaJOd0M
> 
> [and see also Aurora]

If we are to add this feature to HTML, we need to have clear evidence that 
this is not still at the research project stage. We don't add features 
because we think that one day maybe people will be able to use them; we 
add features once they have been proved to be workable and proved to be 
useful. This isn't meant as any kind of slight against RDFa; in fact, it 
is intended to safeguard against the feature becoming unusable.

The problem is that if we expose a feature widely before the state of the 
art is able to take advantage of it, the majority of the usage will 
end up being bogus and wrong, which will poison the feature and basically 
mean we can never use it. As an example of this, see longdesc="" on the 
<img> element -- because it was not widely exposed, pages that use it are 
almost uniformly using it in a pointless or wrong way. The result is that 
the feature is basically unusable and we've had to drop it in HTML5.

If RDFa isn't ready for primetime -- which it seems it isn't, if the UI 
aspect is so immature, as you suggest -- then it is in RDFa's best 
interests to not have RDFa see wide deployment yet.

> > Why is Safari's "select text and then right click to search on Google" 
> > not good enough?
> 
> The browser has no understanding of the text in that approach. This is a 
> fundamental difference between a regular web page and one that contains 
> embedded semantics. The computer doesn't know how to deal with the 
> plain-text example in any particular way... other than asking the user 
> "What should I do with this amorphous blob of text you just 
> highlighted?"

Right, but it still seems to work pretty well. Why does the computer need 
to know how to deal with plain text if Google search does a good job? (Or 
any other search engine; I only single out Google here because it's the 
one that Mac OS X exposes.)

Look at IE8's "accelerators" -- they work on unstructured plain text, yet 
they seem to work just fine.

> A page with semantics allows the browser to, in the very least, give the 
> user an option of web services/sites that match the type of data being 
> manipulated.

In practice, it seems users interact with few enough services that just 
exposing them all all the time works reasonably well.

Exposing them all has another advantage, which is that if the page isn't 
marked up with these annotations, it all still works -- the user doesn't 
need to hope the album title is marked up as such, they just highlight the 
album title and invoke their "music search" accelerator.
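
A minimal sketch of why this works on unmarked text: an accelerator of 
this kind needs nothing more than the selected string and a URL template. 
The accelerator names and URLs below are invented for the example and are 
not IE8's actual configuration format.

```python
from urllib.parse import quote_plus

# Hypothetical accelerators: a name mapped to a search URL template.
ACCELERATORS = {
    "web search": "http://search.example/?q={query}",
    "music search": "http://music.example/search?q={query}",
}


def accelerator_url(name, selected_text):
    """Build the URL an accelerator would open for the selected text."""
    # Collapse runs of whitespace, then URL-encode the selection.
    query = quote_plus(" ".join(selected_text.split()))
    return ACCELERATORS[name].format(query=query)


url = accelerator_url("music search", "Start Wearing  Purple")
```

The user, not the markup, supplies the "type" of the text by choosing 
which accelerator to invoke.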

> > Have any usability studies been made to test these ideas? (For 
> > example, paper prototype usability studies?) What were the results?
> 
> Yes/maybe to the first two questions - there is frequent feedback to 
> Mike Kaply and us on how to improve the UIs for Operator and Fuzzbot, 
> respectively.

I don't mean feedback; I mean actual usability studies with 
inexperienced users done by a usability researcher.

> However - the UI ideas are quite different from the fundamental concept 
> of marking up semantic data. While we can talk about the UIs and dream a 
> little, it will be very hard to get to the UI stage unless there is some 
> way to express semantics in HTML5.

I disagree; it's far easier to fake the data and to experiment with the UI 
aspect than it is to fake the UI and experiment with the data aspect. 
Without the state of the art on the UI side being something we expect 
users to understand and use, and without it being something that actually 
exposes the ability to ask the questions that are being used to justify 
adding these features, we really shouldn't add the data side -- because 
then how can we know if the data side is actually what we need?

> As for the results, those are ongoing. People are downloading and using 
> Operator and Fuzzbot. My guess is that they are being used mostly as a 
> curiosity at this point - no REAL work is getting done using those 
> plug-ins since the future is uncertain for web semantics. It always is 
> until a standard is finalized and a use for that standard is identified. 
> These are the early days, however - nobody is quite sure what the ideal 
> user experience is yet.

This is backwards from how most features are being added to HTML5 -- take 
<video>, for instance, which was widely implemented using Flash before we 
added the feature to HTML5; or <datagrid>, which had a number of 
implementations in JavaScript before we considered adding it to HTML5.

> No amount of polishing is going to turn the steaming pile of web 
> semantics that we have today into the semantic web that we know can 
> exist with the proper architecture in place.

Do we _know_ it can exist, or do we _hope_ it can exist? It takes more 
than architecture to get something like this usefully deployed; we also 
need buy-in from authors, for instance.

> >> Not only would the browsing experience be improved, but search engine 
> >> indexing quality would be better due to a spider's ability to 
> >> understand the data on the page with more accuracy.
> > 
> > This I can speak to directly, since I work for a search engine and 
> > have learnt quite a bit about how it works.
> > 
> > I don't think more metadata is going to improve search engines. In 
> > practice, metadata is so highly gamed that it cannot be relied upon. 
> > In fact, search engines probably already "understand" pages with far 
> > more accuracy than most authors will ever be able to express.
> 
> You are correct, more erroneous metadata is not going to improve search 
> engines. More /accurate/ metadata, however, IS going to improve search 
> engines. Nobody is going to argue that the system could not be gamed. I 
> can guarantee that it will be gamed.
> 
> However, that's the reality that we have to live with when introducing 
> any new web-based technology. It will be mis-used, abused and corrupted. 
> The question is, will it do more good than harm? In the case of RDFa 
> /and/ Microformats, we do think it will do more good than harm.

For search engines, I am not convinced. Google's experience is that 
natural language processing of the actual information seen by the actual 
end user is far, far more reliable than any source of metadata. Thus from 
Google's perspective, investing in RDFa seems like a poorer investment 
than investing in natural language processing.

> We have put a great deal of thought into anti-gaming strategies for 
> search engines with regards to the semantic web. Most of them follow the 
> same principles that Google, Yahoo and others use to prevent link-based 
> and keyword-based gaming strategies.

Could you elaborate on this?

> What I, and many others in the semantic web communities, do think is 
> that there are a number of compelling use cases for a method of semantic 
> expression in HTML. I think documenting those use cases would be a more 
> effective use of everybody's time. What are your thoughts on that 
> strategy?

I think that would be very helpful.

> >> If we are to automate the browsing experience and deliver a more 
> >> usable web experience, we must provide a mechanism for describing, 
> >> detecting and processing semantics.
> > 
> > This statement seems obvious, but actually I disagree with it. It is 
> > not the case that providing a mechanism for describing, detecting, and 
> > processing semantics is the only way to let browsers understand the 
> > meaning behind human statements or concepts on a web page. In fact, I 
> > would argue it's not even the most plausible solution.
> > 
> > A mechanism for describing, detecting, and processing semantics; that 
> > is, new syntax, new vocabularies, new authoring requirements, 
> > fundamentally relies on authors actually writing the information using 
> > this new syntax.
> 
> I don't believe it does - case in point: My Space, Facebook, Flickr, 
> Google Maps, Google Calendar, LinkedIn. Those are all examples of 
> websites where the users don't write a bit of code, but instead use 
> interfaces to add people, places, events, photos, locations and a 
> complex web of links between each concept without writing any code.

The authors of those sites (the people at those companies) still have to 
actually expose that information using the new syntax.

> Neither RDFa nor Microformats force authors to use the new syntax or 
> vocabularies if they do not want to do so. If the author doesn't care 
> about semantics, they don't have to use the RDFa-specific properties.

If they don't, then we haven't solved the problem of letting browsers 
understand the meaning behind human statements or concepts on that page.

> > If there's anything we can learn from the Web today, however, it is 
> > that authors will reliably output garbage at the syntactic level. They 
> > misuse HTML semantics and syntax uniformly (to the point where 90%+ of 
> > pages are invalid in some way). Use of metadata mechanisms is at a 
> > pitifully low level, and when used is inaccurate (Content-Type headers 
> > for non-HTML data and character encoding declarations for all text 
> > types are both widely wrong, to the point where browsers have 
> > increasingly complex heuristics to work around the errors). Even 
> > "successful" formats for metadata publishing like hCard have woefully 
> > low penetration.
> 
> Yes, I agree with you on all points.
> 
> > Yet, for us to automate the browsing experience by having computers 
> > understand the Web, for us to have search engines be significantly 
> > more accurate by understanding pages, the metadata has to be 
> > widespread, detailed, and reliable.
> 
> I agree that it has to be reliable, but not that the metadata has to be 
> widespread or that detailed.

If it doesn't have to be widespread or detailed, then why do we need to 
support it in HTML5? If our target audience is only authors who do the 
right thing, then an embedded block of n3 is quite adequate.

> The use cases that are enabled by merely having the type and title of a 
> creative work are profound.

Could you elaborate on these? I have yet to see any profound use cases 
solved by exposing that data.

> I can go into detail on this as well, if this community would like to 
> hear about it?

Yes, please. Knowing which problems are solved is the most important 
information we need in order to evaluate the solutions.

> Getting past the inherent greed and evilness of hostile authors is 
> something that many standards on the web deal with - how is HTML5 or 
> XHTML2 going to deal with hostile authors? Blackhat SEOs? People that 
> don't know better?

HTML5 has quite an elaborate security model to deal with hostile content. 
I can discuss the mechanisms used for each feature if you like, but 
basically each feature is designed around the various attack models we can 
imagine. We've rejected several features on the basis that they can't be 
made secure against hostile authors.

> If the standards we are creating need to get past the inherent greed and 
> evilness of a small minority of the world, then we are all surely 
> doomed. It is a good thing that most of us are optimists here, otherwise 
> nothing like HTML5 would have ever been started in the first place!

We have to be realistic as well as optimistic. Richard Feynman once said 
"For a successful technology, reality must take precedence over public 
relations, for Nature cannot be fooled". The same applies here, though 
replacing public relations with optimism.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'