[whatwg] Trying to work out the problems solved by RDFa

Charles McCathieNevile chaals at opera.com
Sat Jan 3 17:14:49 PST 2009


On Sat, 03 Jan 2009 04:52:35 +1100, Tab Atkins Jr. <jackalmage at gmail.com>  
wrote:

> On Fri, Jan 2, 2009 at 12:12 AM, Charles McCathieNevile
> <chaals at opera.com> wrote:
>> On Fri, 02 Jan 2009 05:43:05 +1100, Andi Sidwell <andi at takkaria.org>  
>> wrote:
>>
>>> On 2009-01-01 15:24, Toby A Inkster wrote:
>>>>
>>>> The use cases for RDFa are pretty much the same as those for
>>>> Microformats.
>>>
>>> Right, but microformats can be used without any changes to the HTML
>>> language, whereas RDFa requires such changes.  If they fulfill the  
>>> same use
>>> cases, then there's not much point in adding RDFa.
>>
>> ...
>
> Why the non-response?

Because the response comes in the next paragraph, to the first question  
that was worth asking.

>>>> So why RDFa and not Microformats?
>>
>> (I think the question should be why RDFa is needed *as well as*  
>> µformats)
>
> This is correct.  Microformats exist already.  They solve current
> problems.

(Elsewhere in this thread you wrote
[[[
It has not yet been established that there is a problem worth solving that  
metadata would address at all.
]]]
Do you consider that µformats do not encode metadata? Otherwise, I am not
sure how to reconcile these statements. In any case I would greatly
appreciate clarification of what you think microformats do, since I do
believe that microformats are very explicitly directed at allowing the
encoding of metadata, and therefore it is not clear that we are
discussing from similar premises).

>  Are there further problems that Microformats don't address
> which can be solved well by RDFa?  Are these problems significant
> enough to authors to be worth addressing in the spec, or can we wait
> and let the community work out its own solutions further before we
> make a move?

In my opinion, yes there are further problems µformats don't solve (that  
RDFa does), yes they are significant, and the community has come up with a  
way to solve them - RDFa.

> Microformats are the metadata equivalent of Flash-based video players.
>  They are hacks used to allow authors to accomplish something not
> explicitly accounted for in the language.  Are there significant
> problems with this approach?

Yes. The problem is that they rely on precoordination on a per-vocabulary
basis before you can do anything useful with the data. In practice they
depend on choosing attribute names that hopefully don't clash with
anything - in other words, they try to solve the disambiguation problem
that namespaces solve, but by picking names weird enough not to clash, or
by circumscribing the problem space so narrowly that no clashes can be
expected.

(This is hardly news, by the way).
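
To make the contrast concrete, a minimal sketch, using FOAF as the
illustrative vocabulary:

   µformat:  <span class="fn">Charles McCathieNevile</span>

   RDFa:     <span xmlns:foaf="http://xmlns.com/foaf/0.1/"
                   property="foaf:name">Charles McCathieNevile</span>

The first works only because "fn" has been reserved by community
convention; in the second, the prefix binds "name" to one vocabulary URI,
so another vocabulary's "name" cannot collide with it.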

> Is metadata embedding used widely enough
> to justify extending the language for it, or are the current hacks
> (Microformats, in this case) enough?  Are current metadata embedding
> practices mature enough that we can be relatively sure we're solving
> actual problems with our extension?

Current metadata embedding is done using µformats, and it's pretty clear
that they are not sufficient. A large body of work uses RDF data models
(Dublin Core, IMS, LOM, FOAF and POWDER are all large-scale formats; the
people who are testing RDF engines with hundreds of millions of triples
and more are doing it with real data, not data generated for the
experiment).

It is also clear that people would like to develop further small-scale
formats, and that the µformats process, with its requirement for
community consultation, is effectively too heavyweight for the purposes
of many developers.

>  These are all questions that must
> be asked of any extension to the language.
>
>>>> Firstly, RDFa provides a single unified parsing algorithm that
>>>> Microformats do not. ...
>>
>>> This is not necessarily beneficial.  If you have separate parsing
>>> algorithms, you can code in shortcuts for common use-cases and thus  
>>> optimise the authoring experience.
>>
>> On the other hand, you cannot parse information until you know how it is
>> encoded, and information encoded in RDFa can be parsed without knowing  
>> more.
>>
>> And not only can you optimise your parsing for a given algorithm, you  
>> can also do so for a known vocabulary - or you can optimise the
>> post-parsing treatment.
>
> What is the benefit to authors of having an easily machine-parsed
> format?

Assuming that the format is sufficiently easy to write, and to generate, I  
am not sure what isn't obvious about the answer to the question.

(In case I am somehow very clever, and others aren't: the benefit is that
it is easy to machine-parse and use the information.)
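
Concretely: given markup such as

   <p xmlns:dc="http://purl.org/dc/elements/1.1/"
      about="http://example.org/report">
     <span property="dc:title">Annual report</span>
   </p>

(example.org standing in for a real address), any off-the-shelf RDFa
processor yields the triple

   <http://example.org/report> dc:title "Annual report" .

without knowing anything about the vocabulary in advance.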

> Are they greater than the benefits of a
> format that is harder to parse, but easier for authors to write?

For a certain set of authors, yes the benefits are greater.

>>>  Also, as has been pointed out before in the distributed extensibility
>>> debate, parsing is a very small part of doing useful things with  
>>> content.
>>
>> Yes. However many of the use cases that I think justify the inclusion of
>> RDFa are already very small on their own, and valuable when several
>> vocabularies are combined. So being able to do off-the-shelf parsing is
>> valuable, compared to working out how to parse a combination of formats
>> together.
>
> Can you provide these use-cases?  The discussion has an astonishing
> dearth of use-cases by which we can evaluate the effectiveness of
> proposals.

The small-scale use cases are difficult to provide, since they are based
on the fact that people do something quickly because they need it. One set
of potential use cases is all the microformats that haven't been blessed
by the µformats community as formally agreed "standards" - writing them in
RDFa is sufficient to make them usable.

Another use case is noting the source of data in mashups. This enables  
information to be carried about the licensing, the date at which the data  
was mashed (or smushed, to use the older terminology from the Semantic  
Web), and so on.
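
A sketch of what that can look like, with example.org standing in for the
real sources:

   <div about="#chart" xmlns:dc="http://purl.org/dc/elements/1.1/">
     <img src="chart.png" alt="Browser statistics" />
     Source: <a rel="dc:source" href="http://example.org/stats">stats</a>,
     mashed on <span property="dc:date" content="2009-01-03">3 Jan</span>,
     under a <a rel="license"
       href="http://creativecommons.org/licenses/by/3.0/">CC-BY licence</a>.
   </div>

The about attribute scopes all three statements to the chart rather than
to the page as a whole.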

Another (the second time I have noted it in two emails) is to provide  
information useful for improving the accessibility of Web content.

The set of use cases that led to the development of GRDDL are also use
cases for RDFa - since RDFa allows direct extraction to RDF without
having to develop a new parser for each data model, authors can simplify
the way they extract data by using RDFa to encode it, saving themselves
the bother of explaining how to extract it. The time saved means that
they can afford to develop a smaller, more specialised vocabulary.

> Is there any indication that use of
> ambiguous names produces significant problems for authors?

Not that I am aware of, although I think the question is poorly considered  
so I haven't given it much thought. There is plenty of evidence (for  
example the attempts to use Dublin Core within existing HTML mechanisms)  
that it causes problems for data consumers.
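
For example, the long-standing convention of writing

   <meta name="DC.title" content="Annual report" />

only means Dublin Core's title to a consumer that already guesses what
the "DC." prefix is supposed to stand for; nothing in the document itself
binds the name to the vocabulary, which is exactly the disambiguation
problem described above.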

>>>> It can be argued that going through a
>>>> community to develop vocabularies is beneficial, as it allows the
>>>> vocabulary to be built by "many minds" - RDFa does not prevent this,  
>>>> it
>>>> just gives people alternatives to community development.
>>>
>>> RDFa does not give anything over what the class attribute does in  
>>> terms of
>>> community vs individual development, so this doesn't really speak in  
>>> RDFa's
>>> favour.
>>
>> In principle no, but in real world usage the class attribute is  
>> considered something that is primarily local, whereas RDFa is generally
>> used by people who have a broader outlook on the desirable permanence
>> and re-usability of their data.
>
> Can we extract a requirement from this, then?

A poor formulation (I hope that those who are better at very detailed  
requirements can help improve my phrasing) could be:

Provide an easy mechanism to encode new data in a way that can be  
machine-extracted without requiring any explanation of the data model.
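
As a sketch of what satisfying that requirement looks like, someone could
coin a recipe vocabulary on the spot (the example.org namespace is
invented for the example):

   <div xmlns:rec="http://example.org/recipe#" about="#pie"
        typeof="rec:Recipe">
     <span property="rec:ingredient">rhubarb</span>
   </div>

Any generic RDFa processor can already extract the data - no
registration, no community process, and no new parser.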

>>>> Lastly, there are a lot of parsing ambiguities for many Microformats.
>>>> One area which is especially fraught is that of scoping. The editors  
>>>> of
>>>> many current draft Microformats[1] would like to allow page authors to
>>>> embed licensing data - e.g. to say that a particular recipe for a pie  
>>>> is
>>>> licensed under a Creative Commons licence. However, it has been noted
>>>> that the current rel=license Microformat can not be re-used within  
>>>> these
>>>> drafts, because virtually all existing rel=license implementations  
>>>> will
>>>> just assume that the license applies to the whole page rather than  
>>>> just
>>>> part of it. RDFa has strong and unambiguous rules for scoping - a
>>>> license, for example, could apply to a section of the page, or one
>>>> particular image.
>>>
>>> Are there other cases where this granularity of scoping would be  
>>> genuinely
>>> helpful?  If not, it would seem better to work out a solution for  
>>> scoping
>>> licence information...
>>
>> Yes.
>>
>> Being able to describe accessibility of various parts of content, or  
>> point
>> to potential replacement content for particular use cases, benefits
>> enormously from such scoping (this is why people who do industrial-scale
>> accessibility often use RDF as their infrastructure). ARIA has already  
>> taken
>> the approach of looking for a special-purpose way to do this, which
>> significantly bloats HTML but at least allows important users to satisfy
>> their needs to be able to produce content with certain information  
>> included.
>>
>> Government and large enterprises produce content that needs to be
>> maintained, and being able to include production, cataloguing, and  
>> similar
>> metadata directly, scoped to the document, would be helpful. As a  
>> trivial
>> example, it would be useful to me in working to improve the Web content  
>> we
>> produce at Opera to have a nice mechanism for identifying the original
>> source of various parts of a page.
>
> Can we distill this into use-cases, then?

Sure. It just takes a small amount of thinking. How many use cases do you
think would be sufficient to demonstrate that this is important? Or do
you measure it by how many people each use case applies to? (It is far
easier to justify the cost of developing use cases where there is more
clarity about the goals for those use cases - and that clarity lets
people decide whether to develop their own, or to go and find the people
who are doing this and ask them to provide the information.)

>  You, as an author, want to
> be able to specify the original source of a piece of content.  What's
> the practical use of this?  Does it require an embedded,
> machine-readable vocabulary to function?  Are existing solutions
> adequate (frex, footnotes)?
...
> Not quite.  Specifically, is there any practical use for marking up
> various sections of a site with licensing information specific to that
> section *in an embedded, machine-readable manner*?  Are the existing
> solutions adequate (frex, simply putting a separate copyright notice
> on each section, or noting the various copyrights on a licensing
> page)?

Let me treat these as the same question, since I don't think they
introduce anything usefully different. I will also fold in Henri's
questions, published elsewhere in this thread, about my use case for
this.

A practical use case is an organisation where different people are
responsible for different parts of the content. Instead of having to look
up myself who is responsible for each piece, and what rights are
associated with it, I can automate the process. (This is one of the value
propositions offered by content management systems; I hope we can agree
that these are sufficiently widely used for us to assume a use case a
priori, but if not please say so.) It means that instead of manually
checking many pages for things like accessibility or being up to date,
and then having to find which part of the page was produced by which part
of the organisation (which is what I do at Opera), I can simply have this
information trawled and presented as I please by a program (which many
large organisations already do, at least partially).
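
The markup needed for this is not onerous. A sketch, with the fragment
name and team name invented:

   <div about="#support" xmlns:dc="http://purl.org/dc/elements/1.1/"
        property="dc:creator" content="Documentation team">
     ... the support section itself ...
   </div>

A crawler can then report, across thousands of pages, which section
belongs to which team, without anyone having to read the pages.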

Another example is that certain W3C pages (the list of specifications
produced by W3C, for example, and various lists of translations) are
produced from RDF data that is scraped from each page through a
customised and thus fragile scraping mechanism. Being able to use RDFa
would free authors of the draconian constraints on the source-code
formatting of specifications, and merely require them to use the right
attributes, in order to maintain this data.
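
Sketched for one entry in a list of translations (the addresses are
invented):

   <li about="http://example.org/TR/spec/fr"
       xmlns:dc="http://purl.org/dc/elements/1.1/">
     <span property="dc:language" content="fr">French</span> translation,
     by <a rel="dc:creator"
           href="http://example.org/people/translator">the translator</a>
   </li>

The data that the current scrapers reconstruct from source-code
formatting is simply present in the attributes.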

An example of how this data can be re-used is that it is possible to  
determine many of the people who have translated W3C specifications or  
other documents - and thus to search for people who are familiar with a  
given technology at least at some level, and happen to speak one or more  
languages of interest. This is at least as important to me in looking for  
potential people to recruit as any free-text search I can do - and has the  
benefit that while I don't have the resources to develop large-scale  
free-text searching, I do have the resources to develop simple queries  
based on a standardised data model and an encoding of it.

Alternatively I could use the same information to seed a reputation
manager, so I can determine which of the many emails I have no time to
read in the WHATWG might be more than usually valuable.

cheers

Chaals

-- 
Charles McCathieNevile  Opera Software, Standards Group
     je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals       Try Opera: http://www.opera.com


