[whatwg] RDFa is to structured data, like canvas is to bitmap and SVG is to vector

Sat Jan 17 08:55:01 PST 2009

The debate about RDFa highlights a disconnect in the decision-making 
related to HTML5.

The purpose behind RDFa is to provide a way to embed complex information 
into a web document, in such a way that a machine can extract this 
information and combine it with other data extracted from other web 
pages. It is not a way to document private data, or data that is meant 
to be used by some JavaScript-based application. The sole purpose of the 
data is for external extraction and combination.
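
To make "extraction and combination" concrete, here is a minimal sketch 
in Python. It assumes an RDFa parser has already turned two pages into 
subject/predicate/object triples; the URLs, the FOAF-style property 
names, and the helper functions are all hypothetical and only for 
illustration. It is not how any particular RDFa processor works.

```python
from typing import Set, Tuple

# A triple is (subject, predicate, object). All values below are
# made-up examples, standing in for what a parser might extract.
Triple = Tuple[str, str, str]

page_a: Set[Triple] = {
    ("http://example.org/alice", "foaf:name", "Alice"),
    ("http://example.org/alice", "foaf:knows", "http://example.org/bob"),
}
page_b: Set[Triple] = {
    ("http://example.org/bob", "foaf:name", "Bob"),
    ("http://example.org/alice", "foaf:name", "Alice"),  # duplicate fact
}


def merge(*graphs: Set[Triple]) -> Set[Triple]:
    """Combine triples from several pages; identical facts collapse."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def names_of_friends(graph: Set[Triple], person: str) -> list:
    """Return the sorted names of everyone `person` is said to know."""
    friends = {o for s, p, o in graph if s == person and p == "foaf:knows"}
    return sorted(o for s, p, o in graph if s in friends and p == "foaf:name")


combined = merge(page_a, page_b)
print(names_of_friends(combined, "http://example.org/alice"))
```

Neither page alone can answer "what are the names of Alice's friends"; 
the answer only appears once the data from both pages is combined, 
because both use the same model and the same identifiers.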

An earlier email exchange between Martin Atkins and Ian Hickson included 
the following:

"On Sun, 11 Jan 2009, Martin Atkins wrote:
 >
 > One problem this can solve is that an agent can, given a URL that
 > represents a person, extract some basic profile information such as the
 > person's name along with references to other people that person knows.
 > This can further be applied to allow a user who provides his own URL
 > (for example, by signing in via OpenID) to bootstrap his account from
 > existing published data rather than having to re-enter it.
 >
 > So, to distill that into a list of requirements:
 >
 > - Allow software agents to extract profile information for a person as
 > often exposed on social networking sites from a page that "represents"
 > that person.
 >
 > - Allow software agents to determine who a person lists as their friends
 > given a page that "represents" that person.
 >
 > - Allow the above to be encoded without duplicating the data in both
 > machine-readable and human-readable forms.
 >
 > Is this the sort of thing you're looking for, Ian?

Yes, the above is perfect. (I cut out the bits that weren't really "the
problem" from the quote above -- the above is what I'm looking for.)

The most critical part is "allow a user who provides his own URL to
bootstrap his account from existing published data rather than having to
re-enter it". The one thing I would add would be a scenario that one would
like to be able to play out, so that we can see if our solution would
enable that scenario.

For example:

   "I have an account on social networking site A. I go to a new social
   networking site B. I want to be able to automatically add all my
   friends from site A to site B."

There are presumably other requirements, e.g. "site B must not ask the
user for the user's credentials for site A" (since that would train people
to be susceptible to phishing attacks). Also, "site A must not publish the
data in a manner that allows unrelated users to obtain privacy-sensitive
data about the user", for example we don't want to let other users
determine relationships that the user has intentionally kept secret [1].

It's important that we have these scenarios so that we can check if the
solutions we consider are actually able to solve these problems, these
scenarios, within the constraints and requirements we have."

It would seem that Ian agrees both that a) there is a need to provide a 
way to document complex information in a consistent, machine-readable 
form, and that b) the purpose of this data is external consumption, 
rather than internal use. Where the disconnect comes in is that he 
believes RDF, and its web page serialization technique, RDFa, are only 
one of a set of possible solutions.

Yet at the same time, he references how the MathML and SVG people 
provide sufficient use cases to justify the inclusion of both of these 
in HTML5. But what is MathML? What does it solve? A way to include 
mathematical formulas in a document in a formatted manner. What is SVG? 
A way to embed vector graphics into a web page, in such a way that the 
individual elements described by the graphics can become part of the 
overall DOM.

So, why accept that we have to use MathML in order to solve the problems 
of formatting mathematical formulas? Why not start from scratch, and 
devise a new approach?

So, why accept that we have to use SVG in order to solve the problems of 
vector graphics? Why not start from scratch, and devise a new approach?

Come to think of it, I think we should also question the use of the 
canvas element. After all, if the problem set is that we need the 
ability to animate graphics in a web page using a non-proprietary 
technology, then wouldn't something like SVG work for this purpose? 
Isn't the canvas element redundant? But then, perhaps we should start 
over from the beginning and just create a new graphics capability from 
scratch, and reject both canvas and SVG.

We don't reject MathML, though. Neither do we reject SVG or canvas. Or 
any other of a number of entities being included in HTML5, including 
SQL. Why? Because they have a history of use, extensive documentation as 
to purpose and behavior, and there are a considerable number of 
implementations that support the specifications. It doesn't make sense 
to start from scratch. It makes more sense to make use of what already 
works.

I have to ask, then: why do we isolate RDF and RDFa for special 
handling? If we can accept that SQL is a natural database query 
mechanism, and SVG is a natural for vector graphics, and the canvas 
element is the proper choice for script-enabled bitmaps, and 
MathML...well, you get the picture--if we can accept these mature, 
well-documented representatives of each of their genres as the de facto 
implementation, enough to incorporate each into HTML5, why then do we 
demand that RDF and its web page serialization technique, RDFa, must 
"prove" themselves, when we don't demand the same from other external 
objects and specifications?

To do so is not consistent. To continue to do so demonstrates that 
perhaps other issues are at play with regard to RDF/RDFa.

Martin provided a use case that Ian acknowledges is justified. Ipso 
facto, we do not need to continue providing use cases for this type of 
requirement. We have established that the requirement/need/desire to 
incorporate data into a web page that is consistently machine readable, 
which can be consistently extracted, and consistently combined with data 
from other documents using automated processes is a legitimate need. RDF 
was designed specifically for this purpose, is a mature specification, 
with extensive documentation, and one can find many different 
implementations of its use. The use of RDF for FOAF is just one of many 
uses; RSS 1.0 was another, as are a version of RDF embedded within 
photos and CC licensing--these are all based on the same model.

In other words, if we accept that SVG is the de facto implementation of 
vector graphics (as compared to something such as, say, VML), and we 
accept the same for MathML, the canvas element, SQL, and so on, then to 
not accept RDF as the de facto implementation for the purpose for which 
it was designed is to single out RDF/RDFa for "special handling" within 
the group. It is to demand more from it than has been demanded from any 
other element included in HTML5.

In particular, as has been documented elsewhere, very little is needed 
to support RDFa within HTML5. The requirements are much less than those 
for the canvas element, SVG, MathML, and even SQL. So the task, itself, 
is not daunting. Not as daunting as, say, the alt attribute.

This then returns us to my earlier supposition: To not support RDF/RDFa 
as the de facto implementation of complex, structured data is not 
consistent. To continue to do so demonstrates that perhaps other issues 
are at play with regard to RDF/RDFa. Such inconsistencies do not serve 
the best interests of a new specification meant for widespread use on 
the web. If, as I believe, the inconsistency reflects an underlying bias 
against the concept behind RDF, which is that true web semantics is 
based on structured data, and not (or not exclusively) on natural 
language processing, then I believe it's important to highlight such 
bias, and deal with it accordingly.

Shelley