[whatwg] Trying to work out the problems solved by RDFa

Fri Jan 2 08:49:02 PST 2009

On Wed, Dec 31, 2008 at 10:41 PM, Charles McCathieNevile
<chaals at opera.com> wrote:
> A standard way to include arbitrary data in a web page and extract it for
> machine processing, without having to pre-coordinate their data models.

This isn't a requirement (or in other words, a problem), it's a
solution.  What are the problems that need to be solved, and for which
having a standard way to include arbitrary data in a web page and have
it easily extractable would be helpful?  (Note:  I think there
certainly *are* problems that *would* find this helpful, I'm just
trying to lead your argument in the right direction.)  (As well,
since the discussion is about RDFa specifically, not data-markup in
general, what are the problems that need RDFa *specifically* as a
solution, as compared to the myriad other ways to embed data?)

> Since many people use RDF as an interchange, storage and processing format
> for this kind of data (because it provides for automated mapping of data
> from one schema to many others, without requiring anyone to touch the
> original schemata or agree in advance how they should be created), I believe
> there is a requirement for a method that allows third parties to include RDF
> data in, and extract it from information encoded within an HTML page.

Solutions for this already exist; embedded N3 in a <script> tag, just
to name something that Ian already mentioned, allows you to mash RDF
data into a page in a machine-extractable way, and brings in any of
the specific ancillary benefits of RDF.
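
To make "machine-extractable" concrete, here is a minimal sketch of
pulling such a block back out of a page with nothing but a stock HTML
parser.  The type="text/n3" convention, the sample page, and the names
N3ScriptExtractor and extract_n3 are assumptions made for this
illustration, not something mandated by any spec; the extracted text
would then be handed to whatever N3 parser you already use.

```python
from html.parser import HTMLParser

# Hypothetical sample page. The N3 payload and the type="text/n3"
# convention are assumptions made for this illustration.
SAMPLE_PAGE = """<html><head>
<script type="text/n3">
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/book> dc:title "An Example Book" .
</script>
</head><body><p>Visible content.</p></body></html>"""


class N3ScriptExtractor(HTMLParser):
    """Collects the raw text of every <script type="text/n3"> block."""

    def __init__(self):
        super().__init__()
        self._in_n3 = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "text/n3":
            self._in_n3 = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_n3:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_n3:
            self.blocks.append("".join(self._buffer).strip())
            self._in_n3 = False


def extract_n3(html_text):
    """Return the text of each embedded N3 block, in document order."""
    parser = N3ScriptExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    for block in extract_n3(SAMPLE_PAGE):
        print(block)
```

Note that the extraction step needs no cooperation from the HTML
language beyond the existing <script> element.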

>>> The Microformats community has done a remarkable job of working on the
>>> web semantics problem, creating several different methods of expressing
>>> common human concepts (contact information (hCard), events (hCalendar),
>>> and audio recordings (hAudio)).
>>
>> Right; with Microformats, each Microformat has its own problem space and
>> thus each one can be evaluated separately. It is much harder to evaluate
>> something when the problem space is as generic as it appears RDFa's is.
>
> The point is that there are a very large set of very small problem spaces
> relevant to a small group at a time. Like RDF itself, RDFa is meeting the
> problem of allowing these people to share machine-processable data without
> previously coordinating their approach.

Not quite correct.  Again, the problem of embedded shareable data in a
web page has been solved multiple times.  The specific problem of
sharing *RDF* data (due to needing/wanting the specific benefits RDF
can offer) has also been solved.  What are the precise problems that
require *RDFa* as a solution?

(I won't belabor this point, though it could be brought up several
times more in your email.  This is and was the primary point of
contention between RDFa supporters and those of us who aren't
convinced it belongs in the HTML5 spec.  It is the major thrust of
much of Ian's email; he's trying to help you (RDFa supporters in
general, that is) find exactly what the problem is that RDFa
specifically is trying to solve.)

> Moreover, that is not actually a very good question in this case. I think
> the judgement call should be whether a syntax that allows people to solve
> the identified problem set consistently is sufficiently valuable (measured
> in terms of the advantages weighed against the disadvantages) to justify
> being part of HTML5.

Well, there are many things that would offer more advantages than
disadvantages by themselves.  We can't possibly include all of them in
the spec; you can think about this as including a hidden large
disadvantage of 'will grow the size of the spec and the amount of work
implementors have to do'.  Thus the advantages must generally be
significantly larger than the disadvantages; this is why the best
argument for including something in the spec is often "there are
already widespread hacks to accomplish this".  <video>, for example,
was included based on pretty much precisely that argument.

Of course, that just means that we've identified a problem that is
significant enough to be solved in the spec.  There is still
significant work involved in ensuring that we identify a solution that
actually hits the problem squarely; the existing hacks are usually
inadequate, not through any true fault of their own, but merely
because they had not considered the problem broadly enough, or lacked
enough eyes to find rough edges and missing spots.

>> What are the advantages?
>
> Many people will be able to use standard tools which are part of their
> existing infrastructure to manipulate important data. They will be able to
> store that data in a visible form, in web pages. They will also be able to
> present the data easily in a form that does not force them to lose important
> semantics.
>
> People will be able to build toolkits that allow for processing of data from
> webpages without knowing, a priori, the data model used for that
> information.

Part of the point of Ian's email is that this is not a problem that is
solved by RDFa, it's a problem that's solved by *any* sufficient data
format.  Many solutions currently exist which don't require any
addition to the spec.

>> What is the
>> opportunity cost of encouraging everyone to expose data in the same way?
>
> I don't know. I don't see much of an opportunity cost.

There is no perfect data model, or perfect representation method.
Every group of data is different, has different ideal representations,
and incurs some degree of cost when forced into an existing data model
(that is, one not tailored to the data's specs).  This must thus be
considered.

>>  - As another example, why doesn't Craigslist like their data being
>> reused in mashups? Would they be willing to allow their users to reuse
>>   their data in these new and exciting ways, or would they go out of
>>   their way to prevent the data from being accessible as soon as a
>>   critical mass of users started using it?
>
> This is a key question. Why *should* a data provider be required to offer
> their product (data) for other people to use, in order to demonstrate that
> the data is useful. Google, a large provider of data, insists on certain
> conditions being met before it makes its services available, and that seems
> perfectly reasonable to me.
>
> Whether Craigslist actively attempts to make their data easier to aggregate,
> or actively avoids facilitating that process, strikes me as irrelevant to
> the question of whether there is value in enabling them to do so. Because
> large organisations specialising in gathering people's data, from Flickr to
> Google and Facebook to Government taxation departments are not the only
> consumers and producers of data that determine value for users.
>
> It would seem important that the Web easily enable small-time users of data
> to efficiently communicate with one another, without the need to have one of
> the giants as an intermediary. When libraries in the Dominican Republic want
> to share data, and librarians in Léon want to use that data, it seems that
> the Web should facilitate that without resorting to intermediaries like
> Amazon or Yahoo! and since we already have the technology to do so in a way
> that enables very powerful data models to be used without requiring
> coordination, it seems odd that you don't even understand how this could be
> valuable.

This is a key question precisely because many of the arguments that
RDFa supporters have brought up (specifically, in the last flurry of
emails to the group on this subject) claim that having RDFa will allow
web users to query their browsers, which can then seek out structured data
to answer their questions.  If large websites are not willing to
provide their data to the web-at-large in a structured format, though,
then all the data formats in the world won't accomplish the goal.

In this email, though, you are largely arguing for smaller, more
personal use cases.  Most of the questions are still valid, however.
Problem: Librarians across the world want to share data.  What are the
requirements here?  How does RDFa meet those requirements?  Are there
other solutions which meet those requirements better?  Are existing
solutions adequate if deployed consistently (thus negating the need
for a new technology)?

Specifically, small-time users seem (to me, at least) to need RDFa as
a solution the least.  They can negotiate a shared data format
themselves, or at least present an API that can be engineered against
by others.  RDF itself may be a useful tool here, if it allows reuse
of existing tools and thus simplifies the process of sharing and
consuming the data, but RDFa specifically is a solution for embedding
this data within a web page and allowing browsers to digest it as they
encounter it.  This is not an appropriate solution for the sharing of
catalog data between libraries; it *may* be a solution for the average
web user to have their browser grab the embedded information on a page
for a specific book and query for reviews on the product across the
web.

This, though, then once again brings up the traditional questions.  Is
RDFa the best solution for this?  Are there existing solutions to
this?  Ian specifically mentioned simply Googling for the book title;
this is indeed often quite adequate for a web user.  Does the use of
RDFa and the active involvement of the browser in this process offer
enough of a benefit above just typing a phrase into the search bar to
justify inclusion into the spec?  If you believe so, can you explain
precisely why?

>> Can only previously configured, hard-coded questions be asked,
>> or will Web browsers be able to answer arbitrary free-form questions from
>> users using the data exposed by RDFa?
>
> Both of these are possible. The value of RDFa is that it actually supports
> the possibility of asking free-form questions by using a data model that is
> sufficiently well specified to enable constructions of tools that are not
> dependent on being preconfigured to recognise the exact type of data being
> queried (unlike, say, microformats, which require an intermediate agreement
> to enable people to extract the data, and don't provide for merging data of
> different types for rich queries).

This is not a benefit of RDFa.  It *may* be a benefit of RDF.  What
does RDFa bring to the table that other solutions do not?  What does
it take away?

> Aggregating data in real-time is relatively expensive, so is a strategy more
> suited to dealing with asking new questions. Typical systems so far have
> aggregated data in the background to deal with known queries (one example is
> Google, which crawls pages in advance, anticipating searches that match
> terms against the content of those pages),

Google is a large company, and can indeed invest resources into
trawling and recording such data.  This is explicitly not an option
for the smaller uses you seem to be highlighting in this email,
though.  RDFa is specifically a (very) distributed data storage
system.  Can it address these sorts of problems, if the small-time
users simply can't trawl the entire web for matching information?
When the info is relatively contained (such that finding and reading
the pages it exists on is feasible), is trawling the pages for RDFa
data the best solution?  Are there other solutions which would work
better (such as providing an API for hitting a database)?  Are there
existing solutions which work adequately?

> and use live querying for cases
> where the result cannot reliably be stored (e.g. airline reservation systems
> like TravelJungle or LastMinute which determine price and availability based
> on constantly changing data).

Similarly, would these sites work by trawling reservation sites for
RDFa data?  As well, what if the reservation sites aren't interested
in providing the data in a machine-readable format (for example, if
they want users to go directly to their sites)?  Would it be better
for these types of sites to hit an API provided by the reservation
sites directly?  Would it be better for the discount sites to trawl
with custom algorithms that don't require the cooperation of the
reservation sites?  Within the space of page-embedded data, are there
better solutions, or existing adequate solutions?

>>  - Systems like Yahoo! Search and Live Search expend extraordinary
>> amounts of resources on spam fighting technology; such technology
>>   would not be accessible to Web browsers unless they interacted with
>>   anti-spam services much like browsers today interact with
>>   anti-phishing services.
>
> Actually, at least Opera already incorporates anti-spam technology in its
> mail client. Where browsers are the primary consumers of data there is
> nothing at all to suggest that they cannot incorporate anti-spam technology
> directly. (Indeed, the POWDER specification is designed in part to make that
> easy - and it is exactly the sort of data that might sometimes be usefully
> encoded in RDFa since it is based on an RDF model).

Fighting email spam is a different problem from fighting black-hat SEO
spamming.  The attack surfaces presented by RDFa are much closer to
the latter than the former.

>>  - Even with a mechanism to distinguish trusted sites from spammy sites,
>>   how would Web browsers deal with trusted sites that have been subject
>>   to spamming attacks? This is common, for instance, on blogs or wikis.
>
> Right. But that doesn't mean we question whether browsers should enable
> blogs or wikis. Why would RDFa data be different enough to make this
> question relevant?

Users are interacting with blogs/wikis on a human level, and thus can
exercise their own (admittedly poor in practice) judgement.  This is a
different problem from the browser automatically parsing data on a
page and removing the spam.

> I presume the same would apply if the "Web Services" people came and asked
> to have all of their things included in HTML, and offered a specification
> that could be used to achieve their desires.

Yes, they would be subject to the same questions as the RDFa spec
is.

> ...
>
> [not clear what the context was here, so citing as it was]
>>>
>>> > I don't think more metadata is going to improve search engines. In
>>> > practice, metadata is so highly gamed that it cannot be relied upon.
>>> > In fact, search engines probably already "understand" pages with far
>>> > more accuracy than most authors will ever be able to express.
>>>
>>> You are correct, more erroneous metadata is not going to improve search
>>> engines. More /accurate/ metadata, however, IS going to improve search
>>> engines. Nobody is going to argue that the system could not be gamed. I
>>> can guarantee that it will be gamed.
>>>
>>> However, that's the reality that we have to live with when introducing
>>> any new web-based technology. It will be mis-used, abused and corrupted.
>>> The question is, will it do more good than harm? In the case of RDFa
>>> /and/ Microformats, we do think it will do more good than harm.
>>
>> For search engines, I am not convinced. Google's experience is that
>> natural language processing of the actual information seen by the actual
>> end user is far, far more reliable than any source of metadata. Thus from
>> Google's perspective, investing in RDFa seems like a poorer investment
>> than investing in natural language processing.
>
> Indeed. But Google is something of an edge case, since they can afford to
> run a huge organisation with massive computer power and many engineers to
> address a problem where a "near-enough" solution brings them the users who
> are in turn the product they sell to advertisers. There are many other use
> cases where a small group of people want a way to reliably search trusted
> data.
>
> From global virtual library systems to single websites, there are many
> others who find that processing structured data is more efficient for their
> needs than doing free-text analysis of web pages (something that they
> effectively contract out to Google, Ask, Yahoo! and their many competitors
> who specialise in it). Some of these are the people who have decided that
> investing in RDFa is a far more valuable exercise than trying to out-invest
> Google in natural language processing.

"Processing structured data" is something that can be done without
RDFa.  The reason for the resistance to RDFa from this working group
so far is the lack of sufficient significant problems that are best
solved by RDFa specifically.

As well, the use cases for in-the-small data interchange and
in-the-large data interchange are significantly different.  Again,
RDFa is a very distributed data storage format; you don't see the
entire 'database' until you've trawled all the pages which include it.
This is why there is such a focus on whether RDFa is a decent
solution for search engines - they *see* the web better than anyone
else, and thus appear to be able to utilize such a distributed data
format more effectively than anyone else.  However, Ian is pointing
out that those same search engines (at least Google, though I expect
Yahoo, etc. feel the same) believe that natural-language processing is
a far more effective method of gathering information.  It is less
prone to gaming (natural language being naturally unstructured, it's
harder to emit spam data that has the same statistical
characteristics), and allows for extracting far more data
automatically than any one user would ever think to include.

> This email is already too long for most people to get through it :( I
> believe that this discussion is going to last for some time (I cannot
> imagine why, given the HTML timeline, it would need to be resolved before
> June), so there will be time for others to discuss more fully the many
> points Ian raises as ones he would like to understand.

The HTML timeline is partially a joke (2023 is the date for 'full
compliance'; there isn't a single browser yet that has fully
implemented *HTML4* ^_^).  We still would like things resolved with
all due speed; the faster they hit the spec, the faster they'll be
integrated into browsers.

Conclusion
==========

There is significant confusion (or at least lack of distinction) in
your email (and generally in the arguments from RDFa supporters in my
experience) between RDFa and RDF, RDF and the general concept of data
interchange formats, distributed and centralized data storage,
in-the-small data interchange and in-the-large data interchange, and
personal use (i.e. web users) and organization use (i.e. search engines).
Each of these individually confuses the argument; when brought together
as they typically are, they render many arguments completely useless.

Separating RDFa from RDF
------------------------

The bonuses/maluses of RDF itself are completely irrelevant to this
discussion.  This is because there already exist several methods in
active use for embedding RDF in a web page.  In other words, whatever
problem requires you to embed RDF in a webpage has been *solved*, and
without any necessity of cooperation from the HTML language itself.
RDFa is specifically a proposal to embed structured data in a web page
using attributes on elements.  *This* is the solution we need to find
problems for if we want RDFa merged into the spec.
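
For readers who haven't seen the attribute-based approach, here is a
deliberately simplified sketch of what "structured data using
attributes on elements" looks like to a consumer.  It reads only the
about and property attributes and the element text; it is NOT a
conforming RDFa processor (no prefix resolution, no rel/resource/
typeof/content handling), and the sample fragment, vocabulary prefix,
and the names AttributeDataExtractor and extract_attribute_data are
assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical fragment; the subject URL, prefix, and values are made up.
SAMPLE_FRAGMENT = """<div about="http://example.org/book">
  <span property="dc:title">An Example Book</span>
  <span property="dc:creator">A. N. Author</span>
</div>"""

# Elements that never have an end tag, so they must not be pushed.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class AttributeDataExtractor(HTMLParser):
    """Collects (subject, property, text) triples from attributes.

    Simplified on purpose: only `about` and `property` are read, and
    the object is always the element's text content.
    """

    def __init__(self):
        super().__init__()
        self._subject_stack = [None]  # subject in scope per open element
        self._property = None         # property currently being read
        self._property_depth = 0      # stack depth of that element
        self._text = []
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attrs = dict(attrs)
        # An element inherits its parent's subject unless it sets `about`.
        self._subject_stack.append(attrs.get("about", self._subject_stack[-1]))
        if "property" in attrs:
            self._property = attrs["property"]
            self._property_depth = len(self._subject_stack)
            self._text = []

    def handle_data(self, data):
        if self._property is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or len(self._subject_stack) == 1:
            return
        if (self._property is not None
                and len(self._subject_stack) == self._property_depth):
            self.triples.append((self._subject_stack[-1], self._property,
                                 "".join(self._text).strip()))
            self._property = None
        self._subject_stack.pop()


def extract_attribute_data(html_text):
    """Return the triples found in html_text, in document order."""
    parser = AttributeDataExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    for triple in extract_attribute_data(SAMPLE_FRAGMENT):
        print(triple)
```

The point of the sketch is only to show where the data lives (mixed
into the visible markup rather than in a separate block), which is
the part of the proposal that actually touches the HTML language.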

Separating RDF from general data interchange formats
----------------------------------------------------

Many of the problems that can be solved by using a common data
interchange format don't require specifically what RDF brings to the
table.  As noted earlier in this email, every collection of data has
its own shape, and its own particular 'ideal' representation.  RDF
forces a particular method of representation.  This has its bonuses
and maluses, but they are *completely separate* from the
bonuses/maluses of generically using a data interchange format.
Libraries don't need RDF to exchange data, they just need *some*
agreement on data representation.  What problems are specifically
solved by RDF and its specific representation being favored in the
spec over a more general method of data representation?

Separating distributed and centralized data storage
----------------------------------------------------

RDFa is a distributed data storage format - a single page includes
only a fraction of the relevant data.  The opposite possibility is
centralized data storage - a single entity holding the data in a
particular place (such as a database on their servers).  The latter is
very common, simple, and natural.  To get at the data, you just run
queries against the single database.  This does require the entity
with the data to produce an API to run queries against, but the same
is required for use of a distributed data format (the company in
charge of the site has to specifically code to expose that data in the
given format).  Both storage methods, though, allow sharing of data
and enable all manner of useful web services.  What problems are
specifically solved by a distributed data strategy which are solved
worse or not at all by a centralized data strategy?
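
As a toy illustration of the difference (all data, URLs, and function
names here are hypothetical, and real page fetching is replaced by an
in-memory dictionary), the same question can be answered by one lookup
against a central store, or only after visiting every page that might
carry a fragment of the data:

```python
# Hypothetical data: each "page" exposes only the triples embedded in it.
PAGES = {
    "http://example.org/a": [("book:1", "title", "First Book")],
    "http://example.org/b": [("book:1", "review", "Worth reading")],
    "http://example.org/c": [("book:2", "title", "Second Book")],
}

# Centralized: one entity already holds all the triples in one place.
CENTRAL_STORE = [t for triples in PAGES.values() for t in triples]


def query(triples, subject):
    """Return (property, value) pairs recorded for the given subject."""
    return [(p, o) for (s, p, o) in triples if s == subject]


def query_distributed(pages, subject):
    """Answer the same query when the data is spread across pages.

    Every page has to be visited before the answer is known to be
    complete, so the number of pages visited is returned as well.
    """
    collected = []
    visited = 0
    for _url, triples in pages.items():
        visited += 1
        collected.extend(triples)
    return query(collected, subject), visited


if __name__ == "__main__":
    print(query(CENTRAL_STORE, "book:1"))
    print(query_distributed(PAGES, "book:1"))
```

Both paths return the same answer; what differs is who pays the cost
of assembling the data, and when.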

Separating in-the-small and in-the-large data interchange
---------------------------------------------------------

In-the-small data interchange involves a small number of entities who
can trust each other and generally receive a direct benefit from
structuring and sharing their data.  In-the-large data interchange
involves a large number of disparate entities who *can't* trust each
other and won't generally receive direct benefit for structuring their
data.  What problems are shared by these two situations?  Which are
best solved by RDFa?  Are there existing solutions to these problems
that are adequate?

If RDFa is intended to be for one or the other of these situations, it
would be convenient for advocates to agree which it is, so that we can
then focus the discussion on that.  As it is we are getting into
useless arguments where someone is talking about one situation, and
then someone else brings up a "Yes, but..." involving the other
situation.

Separating personal consumption from corporate consumption
----------------------------------------------------------

It has already been noted that existing search engines have found
metadata to be generally unreliable, and instead rely on
natural-language processing to extract information from pages.  Can
RDFa offer better solutions to the problems of search engines than
they currently employ?

Personal use is an entirely different issue.  RDFa is often touted as
making it easy for users to look up information about data on the
page.  It has also been noted, though, that simply highlighting some
text (say, a song title) and selecting "Search Google for the text
'...'" (specific text is from my machine; your experience may vary)
does essentially the same thing, and possibly offers much more.  As
well, new features such as IE8's accelerators offer even more advanced
functionality when you need it, such as allowing you to search
IMDB.com specifically for your highlighted text, using IMDB's own
search form.  Are there significant problems left in this space?  Does
RDFa solve them?  Are they better solved by other solutions?

~TJ