[whatwg] Trying to work out the problems solved by RDFa

Thu Jan 8 17:54:08 PST 2009

Charles McCathieNevile wrote:
> On Sun, 04 Jan 2009 03:51:53 +1100, Calogero Alex Baldacchino 
> <alex.baldacchino at email.it> wrote:
>
>> Charles McCathieNevile wrote:
>> ... it shouldn't be too difficult to create a custom parser, 
>> conforming to RDFa spec and availing of data-* attributes...
>>
>> That is, since RDFa can be "emulated" somehow in HTML5 and tested 
>> without changing current specification, perhaps there isn't a strong 
>> need for an early adoption of the former, and instead an "emulated" 
>> mergence might be tested first within current timeline.
>
> In principle this is possible. But the data-* attributes are designed 
> for private usage, and introducing a public usage means creating a 
> risk of clashes that pollute RDFa data gathered this way. In other 
> words, this is indeed feasible, but one would expect it to show that 
> the data generated was unreliable (unless privately nobody is 
> interested in basic terms like about). 

This is why I was thinking of something like "data-rdfa-about", 
"data-rdfa-property", "data-rdfa-content" and so on. For the purposes 
of an RDFa processor working on top of HTML5 UAs (perhaps in a test 
phase, if needed at all, of course), an element's dataset would then 
give access to "rdfa-about" instead of just "about". The "rdfa-" 
prefix would act like a namespace prefix in XML, as if the markup 
contained "rdfa:about" instead of "data-rdfa-about".
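
As a rough illustration of the data-rdfa-* idea above, here is a 
minimal Python sketch that collects such attributes from text/html 
markup and exposes them under their plain RDFa names. It is not an 
RDFa processor: the class name, the collect_rdfa() helper and the 
sample markup are hypothetical, and only the attribute renaming is 
shown.

```python
from html.parser import HTMLParser

# Assumption: "data-rdfa-" is the agreed, namespace-like prefix.
PREFIX = "data-rdfa-"


class DataRdfaCollector(HTMLParser):
    """Collects data-rdfa-* attributes, keyed by their RDFa name."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Keep only prefixed attributes and strip the prefix, so that
        # "data-rdfa-about" is reported as "about".
        rdfa = {
            name[len(PREFIX):]: value
            for name, value in attrs
            if name.startswith(PREFIX)
        }
        if rdfa:
            self.found.append((tag, rdfa))


def collect_rdfa(markup):
    """Return a list of (tag, {rdfa_name: value}) pairs."""
    parser = DataRdfaCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


sample = (
    '<p data-rdfa-about="#me" data-rdfa-property="foaf:name" '
    'data-note="private">Alex</p>'
)
print(collect_rdfa(sample))
```

Unprefixed private attributes such as data-note are ignored, which is 
the point of the prefix: only attributes that opt in to the "rdfa-" 
convention are treated as public metadata.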

This way, the public exposure of RDFa attributes on top of the generic 
and normally private dataset feature might be circumscribed enough to 
avoid clashes. That is, if RDFa shows its best benefits when used to 
address small-scale needs involving trusted/reliable (meta-)data, it 
should be fair to assume that all involved parties know the others are 
using RDFa, and aren't just running an RDFa processor in the hope of 
gathering enough information.

From this point of view, it should be quite unlikely to find people 
using "data-rdfa-about" to express different semantics in the same page 
(whereas data-property might cause ambiguity, for instance), just as it 
is (or should be) quite unlikely to find different namespaces bound to 
the very same prefix in the same XML document. That is, I think that 
choosing a name which includes a namespace-like prefix for a data-* 
attribute (and also for a class on a generic container such as a div or 
a span, to tell that it represents an external element) can replicate 
XML extensibility quite safely for custom uses, to some extent, without 
requiring wide support for it in text/html documents. This matters 
since XHTML extensibility does not seem to be a major concern, or at 
least not enough of one to be worth merging into HTML.

Just an idea, though.

However, AIUI, the actual XML serialization (XHTML5) allows the use of 
namespaces and prefixed attributes, so couldn't a proper namespace be 
introduced for RDFa attributes, so that they can be used, if needed, in 
XHTML5 documents? I think this might be a valuable choice. It seems to 
me that RDFa attributes can be used to address cases where metadata 
must stay as close as possible to the corresponding data, but a mistake 
in a piece of markup may trigger the adoption agency or foster 
parenting algorithms, eventually separating metadata from content and 
thus possibly breaking the reliability of the gathered information. 
From this perspective, a parser stopping on the very first error might 
give quicker feedback than one rearranging misnested elements as far as 
reasonably possible (which does not affect, and indeed improves, 
content presentation and users' "direct" experience, but may cause 
side effects with metadata).
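
To illustrate the "stop on the very first error" behaviour, here is a 
minimal Python sketch using the strict XML parser in the standard 
library (xml.etree.ElementTree). The helper name and the sample markup 
are hypothetical, and the unprefixed "about" and "property" attributes 
only stand in for whatever RDFa attributes would actually be used.

```python
import xml.etree.ElementTree as ET


def first_xml_error(markup):
    """Return the parser's error message, or None if well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        # A strict XML parser gives up at the first well-formedness
        # error instead of rearranging the misnested elements.
        return str(exc)
    return None


good = '<p about="#me"><b><i property="foaf:name">Alex</i></b></p>'
bad = '<p about="#me"><b><i property="foaf:name">Alex</b></i></p>'

print(first_xml_error(good))
print(first_xml_error(bad))
```

With the misnested sample, the author gets an error pointing at the 
mistake right away, rather than a repaired tree in which the metadata 
may have silently ended up attached to a different piece of content.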

Also, if the above is true, using namespaced and prefixed attributes, 
instead of ones lying in the same namespace shared (in theory) by both 
HTML5 and XHTML5, might prevent the use of such metadata in a document 
whose parsing rules might lead to side effects.

> Such results have been used to suggest that poorly implemented 
> features should be dropped, but this hypothetical case suggests to me 
> that the argument is wrong, and that if in the face of reasons why the 
> data would be bad people use them, one might expect better usage by 
> formalising the status of such features and getting decent 
> implementations.
>

Generally speaking, I think reasoning in terms of "poor implementation" 
vs "rare usage" is like a dog chasing its own tail, because poorly 
implemented features are necessarily rarely used, and rarely used 
features can't convince UA developers to implement them (in general). 
But if a feature is widely needed, several hacks may be born, thus 
providing evidence of a global problem to be solved in a certain manner 
by implementing a certain, well-conceived feature.

As far as I've understood it, that's the main guideline for changing 
the current specification, which moves forward through a tracer-bullet 
kind of evolution (perhaps weighted by the need for completely new 
features, as a balance between the need for innovation and that for 
backward compatibility), rather than through a "cathedral-wise" 
definition of what is or can be a useful feature to implement. For 
this reason, I think that mapping RDFa attributes onto data-rdfa-* 
attributes, to experiment with a convergence between RDFa attributes 
and HTML5-specific features, might be a starting point for getting RDFa 
attributes both specified and widely supported by implementations. 
They could end up either as defined in the W3C Recommendation, or in 
the form of data-rdfa-* (hence dealt with differently from other data-* 
attributes) for backward compatibility with such early implementations. 
A slightly different (or somehow prefixed) name shouldn't be much of a 
problem, as long as the name is not a problem per se (e.g. it is not 
prone to clashes) and allows a one-to-one correspondence.

However, if a custom/small-scale solution gained wide support and deep 
integration into major browsers, misuses and abuses (which a proper 
formalisation couldn't prevent) might become widespread, making the 
disadvantages be (or appear) greater than the advantages when measured 
on a wider scale (the same scale as the implementation). Therefore, I 
think a good starting point is to introduce partial support on top of 
existing features (in the case of RDFa, either through well-groomed, 
custom data-* attributes in HTML5, or by defining a proper namespace 
with a proper prefix for XHTML5), without requiring deep integration of 
a processor for the new feature. The processor could instead be a 
(custom) plugin/extension, or an API for a (custom) web application 
needing it. A person just wishing to access some content, without 
caring about metadata and its reliability, could just visit a page, 
while an organisation wishing to exchange RDFa-modelled data with 
another one could run a separate processor (possibly a webapp based on 
a browser built-in API, or a plugin, to create a suitable interface for 
queries) to extract and merge information.

>>>> What is the cost of having different data use specialised formats?
>>>
>>> If the data model, or a part of it, is not explicit as in RDF but is 
>>> implicit in code made to treat it (as is the case with using scripts 
>>> to process things stored in arbitrarily named data-* attributes, and 
>>> is also the case in using undocumented or semi-documented XML 
>>> formats), it requires people to understand the code as well as the 
>>> data model in order to use the data. In a corporate situation where 
>>> hundreds or tens of thousands of people are required to work with 
>>> the same data, this makes the data model very fragile.
>>>
>>
>> I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds 
>> (xml) properties and attributes (in the form of curies) to RDF 
>> concepts, modelling a certain kind of relationships, whereas it 
>> relies on external schemata to define such properties. Any 
>> undocumented or semi-documented XML formats may lead to misuses and, 
>> thus, to unreliably modelled data,
> ...
>
>> I think the same applies to data-* attributes, because _they_ 
>> describe data (and data semantics) in a custom model and thus _they_ 
>> need to be documented for others to be able to manipulate them; the 
>> use of a custom script rather than a built-in parser does not change 
>> much from this point of view.
>
> RDFa binds data to RDF. RDF provides a well-known schema language with 
> machine-processable definition of vocabularies, and how to merge 
> information between them. In other words, if you get the underlying 
> model for your data right enough, people will be able to use it 
> without needing to know what you do.
>
> Naturally not everyone will get their data model right, and naturally 
> not all information will be reliable anyway. However, it would seem to 
> me that making it harder to merge the data in the first place does not 
> assist in determining whether it is useful. On the other hand, certain 
> forms of RDF data such as POWDER, FOAF, Dublin Core and the like have 
> been very carefully modelled, and are relatively well-known and 
> re-used in other data models. Making it easy to parse this data and 
> merge it, according to the existing well-developed models seems valuable.
>

I admit I'm not an expert in RDF use, so I have a few questions. 
Specifically, I can guess the advantages when using the same 
(carefully modelled and well-known) vocabularies; but when two 
organizations develop their own vocabularies, similar yet different, to 
model the same kind of information, is merging the data enough? Can a 
processor give more than a collection of triples, to be then 
interpreted based on knowledge of the vocabularies used?

I mean, I assume my tools can extract RDF(a) data from any document, 
but my query interface is based on my own vocabulary: when I merge 
information expressed in an external vocabulary, do I need to translate 
one vocabulary into the other (or at least to modify the query backend, 
so that certain CURIEs are recognized as representing the same 
concepts, e.g. to tell my software that 'foaf:name' and 'ex:someone' 
are equivalent for my purposes)? If so, merging data might be the 
minor part of the work I need to do, compared with non-RDF(a) metadata. 
That is, I'd have tools to extract and merge data anyway, and once I 
had translated external metadata into my format, I could use my own 
tools to merge the data, especially if the same model is used both by 
my organization and by the external one (which makes the translation 
easier).
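
To make the question above concrete, here is a minimal Python sketch of 
the kind of translation step I have in mind. Triples are plain tuples, 
no RDF library is involved, and the alias table, the function names and 
the sample data are all hypothetical; the 'foaf:name' / 'ex:someone' 
equivalence is just the example from the paragraph above.

```python
# Assumption: my software is told by hand which external predicates
# mean the same thing as predicates in my own vocabulary.
EQUIVALENT = {"ex:someone": "foaf:name"}


def normalise(triple, aliases=EQUIVALENT):
    """Rewrite the predicate of a triple into my own vocabulary."""
    subject, predicate, obj = triple
    return (subject, aliases.get(predicate, predicate), obj)


def merge(*graphs, aliases=EQUIVALENT):
    """Merge sets of triples after translating their predicates."""
    merged = set()
    for graph in graphs:
        merged.update(normalise(t, aliases) for t in graph)
    return merged


def query(graph, predicate):
    """Return the (subject, object) pairs for one predicate, sorted."""
    return sorted((s, o) for s, p, o in graph if p == predicate)


mine = {("#alex", "foaf:name", "Alex")}
theirs = {("#chaals", "ex:someone", "Charles")}

print(query(merge(mine, theirs), "foaf:name"))
```

Without the alias table the merge itself still works, since both sides 
are just sets of triples, but a query written against my vocabulary 
would miss the data expressed in the external one.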

Thus, I'm thinking the most valuable benefit of using RDF/RDFa is the 
certainty that both parties are using the very same data model, despite 
the possible use of different vocabularies. It seems to me that the 
concept of triples consisting of a subject, a predicate and an object 
is somewhat similar to a many-to-many association in a database, 
whereas one might prefer a one-to-many approach; though the former 
might be a natural choice to model data which are usually sparse, as in 
a document's prose.

>
>>> Ian wrote:
>>>> For search engines, I am not convinced. Google's experience is that
>>>> natural language processing of the actual information seen by the 
>>>> actual end user is far, far more reliable than any source of metadata.
>>>> Thus from Google's perspective, investing in RDFa seems like a poorer
>>>> investment than investing in natural language processing.
>>>
>>> Indeed. But Google is something of an edge case, since they can 
>>> afford to run a huge organisation with massive computer power and 
>>> many engineers to address a problem where a "near-enough" solution 
>>> brings them the users who are in turn the product they sell to 
>>> advertisers. There are many other use cases where a small group of 
>>> people want a way to reliably search trusted data.
>>>
>>
>> I think the point with general purpose search engines is another one: 
>> natural language processing, whereas being expensive, grants a far 
>> more accurate solution than RDFa and/or any other kind of metadata 
>> can bring to a problem requiring data must never need to be trusted 
>> (and, instead, a data processor must be able to determine data's 
>> level of trust without any external aid).
>
> No, I don't think so. Google searches based on analysis of the open 
> web are *not* generally more reliable than faceted searches over a 
> reliable dataset, and in some instances are less reliable.
>
> The point is that only a few people can afford to invest in being a 
> general-purpose search engine, whereas many can afford to run a 
> metadata-based search system over a chosen dataset, that responds to 
> their needs (and doesn't require either publishing their data, or 
> paying Google to index it).
>

My point is that the assumptions one can make about dataset reliability 
mark the borderline between two scales. One is wide-scale data 
extraction/classification, which is the main problem solved by a 
general-purpose search engine, and where the best default assumption is 
that dataset reliability is uncertain. The other is (very) small-scale 
data modelling, where a direct and immediate evaluation of dataset 
reliability is possible and easy to do, so that a custom search engine 
could reliably be based on such metadata. I think no comparison is 
possible between the two scales, and thus no generalization is possible 
when trying to guess whether metadata can do more good than harm. 
Instead, each case should be analysed separately, and everyone should 
agree on which context (possibly both) is the best one for RDFa, in 
order to understand the best way to implement it and whether it is 
worth introducing into HTML5. As far as I can tell, both of us agree 
that small-scale is the main context.

But perhaps some edge case should be considered to draw a better 
picture. For instance, one such case might be a browser making use of 
metadata to search for a resource in its local history, or within a web 
page and related/linked pages (to a certain degree and level of depth). 
Its scale would be small with respect to the effective number of 
scanned resources, but wide with respect to the potential number of 
sources for those resources. That is, a browser implementing a metadata 
extraction and merging engine, and a query interface to look through 
the gleaned information, would deal with a *limited number* of 
*heterogeneous sources* at a given time.

Once major browsers provided (and exposed by default) such 
functionality, a growing number of users would (try and) use it, and so 
a growing number of sites would experiment with metadata. At the 
beginning everything might work fine, since only honest sites (such as 
wikis, for instance) would experiment with honest metadata. But once 
the number of sites and users making use of metadata reached a 
threshold, spammers would start including spam metadata in their own 
sites (with otherwise trustworthy content) and in other sites through 
advertisements. Such a scenario might lead to a bad balance between 
benefits and disadvantages for the average user, pushing (some) browser 
vendors to limit or even wholly drop native support, and I guess this 
is not a desirable outcome for the Semantic Web industry.

That is, choosing a proper level of integration for RDF(a) support in 
a web browser might make the difference between success and failure. I 
don't know what the best possible level is, but I guess the deepest may 
be the worst. Starting from external support through plugins, or 
scripts to be embedded in a webapp, working on top of other features, 
might work fine and lead to better, native support by all vendors, yet 
limited to an API for custom applications. By contrast, changes to HTML 
to include RDFa attributes would be fully meaningful only if they led 
to full support and exposed features making use of metadata, which I 
don't think is much of a benefit for the great majority of (home) 
users.

All of the above, IMHO.

WBR, Alex
