[whatwg] RDFa Problem Statement (was: Creative Commons Rights Expression Language)
Ian Hickson
ian at hixie.ch
Tue Aug 26 03:02:29 PDT 2008
On Mon, 25 Aug 2008, Manu Sporny wrote:
>
> Web browsers currently do not understand the meaning behind human
> statements or concepts on a web page. While this may seem academic, it
> has direct implications on website usability. If web browsers could
> understand that a particular page was describing a piece of music, a
> movie, an event, a person or a product, the browser could then help the
> user find more information about the particular item in question.
Is this something that users actually want? How would this actually work?
Personally I find that if I'm looking at a site with music tracks, say
Amazon's MP3 store, I don't have any difficulty working out what the
tracks are or interacting with the page. Why would I want to ask the
computer to do something with the tracks?
It would be helpful if you could walk me through some examples of what UI
you are envisaging in terms of "helping the user find more information".
Why is Safari's "select text and then right click to search on Google" not
good enough? Have any usability studies been made to test these ideas?
(For example, paper prototype usability studies?) What were the results?
> It would help automate the browsing experience.
Why does the browsing experience need automating?
> Not only would the browsing experience be improved, but search engine
> indexing quality would be better due to a spider's ability to understand
> the data on the page with more accuracy.
This I can speak to directly, since I work for a search engine and have
learnt quite a bit about how it works.
I don't think more metadata is going to improve search engines. In
practice, metadata is so highly gamed that it cannot be relied upon. In
fact, search engines probably already "understand" pages with far more
accuracy than most authors will ever be able to express.
You started by saying:
> Web browsers currently do not understand the meaning behind human
> statements or concepts on a web page.
This is true, and I even agree that fixing this problem, letting browsers
understand the meaning behind human statements and concepts, would open up
a giant number of potentially killer applications. I don't think
"automating the browser experience" is necessarily that killer app, but
let's assume that it is for the sake of argument.
You continue:
> If we are to automate the browsing experience and deliver a more usable
> web experience, we must provide a mechanism for describing, detecting
> and processing semantics.
This statement seems obvious, but actually I disagree with it. It is not
the case that providing a mechanism for describing, detecting, and
processing semantics is the only way to let browsers understand the
meaning behind human statements or concepts on a web page. In fact, I
would argue it's not even the most plausible solution.
A mechanism for describing, detecting, and processing semantics (that is,
new syntax, new vocabularies, new authoring requirements) fundamentally
relies on authors actually writing the information using this new syntax.
If there's anything we can learn from the Web today, however, it is that
authors will reliably output garbage at the syntactic level. They misuse
HTML semantics and syntax uniformly (to the point where 90%+ of pages are
invalid in some way). Use of metadata mechanisms is at a pitifully low
level, and when used is inaccurate (Content-Type headers for non-HTML data
and character encoding declarations for all text types are both widely
wrong, to the point where browsers have increasingly complex heuristics to
work around the errors). Even "successful" formats for metadata publishing
like hCard have woefully low penetration.
Yet, for us to automate the browsing experience by having computers
understand the Web, for us to have search engines be significantly more
accurate by understanding pages, the metadata has to be widespread,
detailed, and reliable.
So to get this data into Web pages, we have to get past the laziness and
incompetence of authors.
Furthermore, even if we could get authors to reliably put out this data
widely, we would have to then find a way to deal with spammers and black
hat SEOs, who would simply put inaccurate data into their pages in an
attempt to game search engines and browsers.
So to get this data into Web pages, we have to get past the inherent greed
and evilness of hostile authors.
As I mentioned earlier, there is another solution, one that doesn't rely
on either getting authors to be any more accurate or precise than they are
now, one that doesn't require any effort on the part of authors, and one
that can be used in conjunction with today's anti-spam tools to avoid
being gamed by them and potentially to in fact dramatically improve them:
have the computers learn the human languages themselves.
Instead of making all the humans of the world learn a computer language,
or tools for writing that computer language, have the computers learn the
human language. Not only does this not require us to solve a fundamentally
unsolvable pair of problems (making humans not be lazy and making humans
not be evil), but it also means that the computers would also gain an
understanding of all the legacy content that would otherwise never be seen
by computers.
This kind of thing is already being done, for example with automated
language translation where the software learns for itself how to translate
text, or in search engines that extract information like byline dates and
author credits, without the need for pages to have special markup, or in
data clustering, where tools can examine large sets of data and sort the
content into buckets based on topics without any special markup or user
intervention. Similarly, developments in image processing are making huge
steps, with tools that can derive depth mapping data from moving video, or
that can convert a set of static 2D images to a 3D point field. It's clear
that over the coming years, this will only get better and better.
However, let's pretend for now that we can find a way to solve laziness
and evilness and continue with your e-mail:
> If one understands web semantics to be an important part of the web's
> future, the question then becomes, why RDFa? Why not Microformats?
>
> While there are a number of technical merits that speak in favor of RDFa
> over Microformats (fully qualified vocabulary terms
Why is this better?
> prefix short-hand via CURIEs
This is definitely not better.
> accessibility-friendly
How is not reusing HTML semantics better than using them? With the
exception of the now-resolved <time> issue, it seems like Microformats has
the better accessibility story.
> unified processing rules, etc)
Microformats could certainly benefit from a more consistent parsing model,
but that can be obtained without going to RDFa.
> [...] this issue really boils down to one of centralized innovation vs.
> distributed innovation.
I don't see what the syntax has to do with whether the formats
are developed centrally or not. Nothing is stopping anyone from creating
another Microformats-like organisation that does the same thing without
going through the Microformats.org process. There could be millions of
them, in fact. So long as they pick names that are suitably unique (e.g.
URIs, or Java-like identifiers), or so long as they don't promote the use
of their formats outside of their own site, I don't see a problem.
In fact, this is happening every day, with each author making up his own
class values for use on his own site.
> The Microformats community, and all communities like it, require a group
> of people to come together, collaborate and create a standard vocabulary
> to express ALL semantics.
Well, for any one person to do anything useful with the data on the Web,
they have to have a core vocabulary (or set of vocabularies) that they
understand. So a set of standard vocabularies to express all the semantics
that that one person is interested in is needed, yes.
It doesn't have to be _all_ semantics, however. I might want to have a
format for annotating Stargate analysis Web pages that me and my friends
write, but so long as me and my friends agree on it it doesn't have to
involve anyone else.
> A somewhat strained analogy would be bringing in representatives from
> all of the cultures of the world and having them agree on a universal
> vocabulary.
That's pretty much exactly what Unicode did. Or what we're doing with
HTML. That doesn't seem untenable, it seems quite reasonable.
However, I'm not suggesting that it should be necessary.
> In short, RDFa addresses the problem of a lack of a standardized
> semantics expression mechanism in HTML family languages. RDFa not only
> enables the use cases described in the videos listed above, but all use
> cases that struggle with enabling web browsers and web spiders
> understand the context of the current page.
I'm not convinced the problem you describe can be solved in the manner you
describe. It seems to rely on getting authors to do something that they
have shown themselves incapable of doing over and over since the Web
started. It seems like a much better solution would be to get computers to
understand what humans are doing already.
Even if we ignore that, it doesn't seem like the above discussion would
lead one to a set of requirements that would lead one to design a language
like RDFa.
Thanks for the explanation, by the way. This is by far the most useful
explanation of RDFa that I have ever seen.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'