[whatwg] Trying to work out the problems solved by RDFa
Calogero Alex Baldacchino
alex.baldacchino at email.it
Thu Jan 8 17:54:08 PST 2009
Charles McCathieNevile wrote:
> On Sun, 04 Jan 2009 03:51:53 +1100, Calogero Alex Baldacchino
> <alex.baldacchino at email.it> wrote:
>> Charles McCathieNevile wrote:
>> ... it shouldn't be too difficult to create a custom parser,
>> conforming to the RDFa spec and making use of data-* attributes...
>> That is, since RDFa can be "emulated" somehow in HTML5 and tested
>> without changing the current specification, perhaps there isn't a
>> strong need for an early adoption of the former, and instead an
>> "emulated" merger might be tested first within the current timeline.
> In principle this is possible. But the data-* attributes are designed
> for private usage, and introducing a public usage means creating a
> risk of clashes that pollute RDFa data gathered this way. In other
> words, this is indeed feasible, but one would expect it to show that
> the data generated was unreliable (unless privately nobody is
> interested in basic terms like about).
This is why I was thinking of something like "data-rdfa-about",
"data-rdfa-property", "data-rdfa-content" and so on, so that, for the
purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a
test phase, if needed at all, of course), an element's dataset would
give access to "rdfa-about" instead of just "about". That is, the
prefix "rdfa-" would act like a namespace prefix in xml (as if there
were "rdfa:about" instead of "data-rdfa-about" in the markup).
This way, the public exposure of RDFa attributes on top of the generic
and normally private dataset feature might be circumscribed enough to
avoid clashes. That is, if RDFa shows its best benefits when used to
address small-scale needs involving trusted/reliable (meta-)data, it
should be fair to assume that all involved parties are aware that each
one is using RDFa, and aren't just running an RDFa processor in the
hope of gathering enough information.
From this point of view, it should be quite unlikely to find people
using "data-rdfa-about" to express different semantics in the same page
(whereas data-property might cause ambiguity, for instance), just as it
is (or should be) quite unlikely to find different namespaces using the
very same prefix in the same xml document. That is, I think choosing a
name that includes a namespace-like prefix for a data-* attribute (and
also for a class on a generic container such as a div or a span, to
tell that it represents an external element) can replicate xml
extensibility for custom uses quite safely, to some extent, without
requiring wide support for it in text/html documents - since it seems
that xhtml extensibility is not a major concern, at least not enough to
be worth merging it into html.
Just an idea, though.
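To make the idea a bit more concrete, here is a minimal sketch in Python
of a collector that only looks at data-rdfa-* attributes and ignores any
other (private) data-* attribute. It is not an RDFa processor: the
"data-rdfa-" prefix is just the naming convention proposed above, and
the class name, the helper and the sample markup are made up for
illustration.

```python
from html.parser import HTMLParser

# Hypothetical naming convention from this message, not part of any spec.
PREFIX = "data-rdfa-"


class RdfaDataCollector(HTMLParser):
    """Collects data-rdfa-* attributes and skips every other attribute."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, {name_without_prefix: value})

    def handle_starttag(self, tag, attrs):
        # Keep only the attributes carrying the agreed prefix, and strip it,
        # much as an xml processor would resolve "rdfa:about" to "about".
        rdfa = {name[len(PREFIX):]: value
                for name, value in attrs
                if name.startswith(PREFIX)}
        if rdfa:
            self.found.append((tag, rdfa))


def collect_rdfa(html):
    """Return the prefixed attributes found in the markup, in document order."""
    parser = RdfaDataCollector()
    parser.feed(html)
    parser.close()
    return parser.found


# "data-about" stands for a private, page-specific use that must not clash.
sample = ('<div data-rdfa-about="#me" data-about="private-widget-state">'
          '<span data-rdfa-property="foaf:name">Alex</span></div>')

for tag, attributes in collect_rdfa(sample):
    print(tag, attributes)
```

The private "data-about" attribute never reaches the collected data,
which is the kind of separation the prefix is meant to give.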
However, AIUI, the actual xml serialization (xhtml5) allows the use of
namespaces and prefixed attributes, so couldn't a proper namespace be
introduced for RDFa attributes, so that they can be used, if needed, in
xhtml5 documents? I think this might be a valuable choice, because it
seems to me RDFa attributes can be used to address cases where
metadata must stay as close as possible to the corresponding data, but
a mistake in a piece of markup may trigger the adoption agency or
foster parenting algorithms, possibly causing a separation between
metadata and content, and thus possibly breaking the reliability of the
gathered information. From this perspective, a parser stopping on the
very first error might give quicker feedback than one rearranging
misnested elements as far as is reasonably possible (which does not
affect, and indeed improves, content presentation and users' "direct"
experience, but possibly causes side-effects with metadata).
Also, if the above is true, using namespaced and prefixed attributes
instead of ones lying in the same namespace shared both by html5 and by
xhtml5 (in theory) might prevent the use of such metadata in a document
whose parsing rules might lead to possible side-effects.
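As a rough illustration of the "stop on the very first error" behaviour,
the following Python sketch reads namespaced attributes with a strict
xml parser. The namespace URI is purely hypothetical (no such namespace
has been defined for RDFa attributes), and the function names and sample
markup are assumptions of mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used for illustration only.
RDFA_NS = "http://example.org/rdfa#"


def rdfa_attributes(xml_text):
    """Return [(tag, {local_name: value})] for attributes in RDFA_NS.

    ET.fromstring raises ET.ParseError at the first well-formedness
    error, so misnested markup never yields (possibly misplaced) metadata.
    """
    root = ET.fromstring(xml_text)
    marker = "{" + RDFA_NS + "}"  # ElementTree spells names as {uri}local
    found = []
    for element in root.iter():
        attrs = {name[len(marker):]: value
                 for name, value in element.attrib.items()
                 if name.startswith(marker)}
        if attrs:
            found.append((element.tag, attrs))
    return found


def try_parse(xml_text):
    """Return the extracted attributes, or None if the document is rejected."""
    try:
        return rdfa_attributes(xml_text)
    except ET.ParseError:
        return None


well_formed = ('<div xmlns:rdfa="http://example.org/rdfa#" rdfa:about="#me">'
               '<span rdfa:property="foaf:name">Alex</span></div>')

# Same data, but with misnested <b>/<i> elements.
misnested = ('<div xmlns:rdfa="http://example.org/rdfa#" rdfa:about="#me">'
             '<b><i rdfa:property="foaf:name">Alex</b></i></div>')

print(try_parse(well_formed))
print(try_parse(misnested))
```

The misnested document is refused as a whole, instead of being repaired
in a way that might detach the metadata from its content.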
> Such results have been used to suggest that poorly implemented
> features should be dropped, but this hypothetical case suggests to me
> that the argument is wrong, and that if in the face of reasons why the
> data would be bad people use them, one might expect better usage by
> formalising the status of such features and getting decent
Generally speaking, I think reasoning in terms of "poor implementation"
vs "rare usage" is quite like a dog chasing its own tail, because
poorly implemented features are necessarily rarely used, and rarely
used features can't convince UA developers to implement them (in
general). But if a feature is widely needed, several hacks may arise,
thus providing evidence of a global problem to be solved in a certain
manner by implementing a certain, well-conceived feature.
As far as I've understood it, that's the main guideline for changing
the current specification, which is moving on the basis of a
bullet-tracing evolution (perhaps weighted on the need for completely
new features, as a balance between the need for innovation and that for
backward compatibility), rather than a "cathedral-wise" definition of
what is or can be a useful feature to be implemented. For this reason,
I think that mapping RDFa attributes onto data-rdfa-* attributes, to
experiment with a convergence between RDFa attributes and html5
specific features, might be a starting point to get RDFa attributes
both specified and widely supported by implementations (either as
they're defined in the W3C Recommendation, or in the form of
data-rdfa-*, hence dealt with differently from other data-* attributes,
for backward compatibility with such early implementations - a slightly
different (or somehow prefixed) name shouldn't be much of a problem, as
long as the name is not a problem per se (e.g. it is not prone to
clashes) and allows a one-to-one correspondence).
However, if a custom/small-scale solution met wide support and deep
integration into major browsers, misuses and abuses (which a proper
formalisation couldn't prevent) might become widespread, thus making
the disadvantages (appear or be) greater than the advantages, if
measured on a wider scale (the same as the implementation). Therefore,
I think a good starting point can consist of partly introducing support
on top of existing features (in the case of RDFa, either through
well-groomed, custom data-* attributes in html5, or by defining a
proper namespace with a proper prefix for xhtml5), without requiring a
deep integration of a processor for the new feature, but instead
letting it be a (custom) plugin/extension, or an api for a (custom) web
application needing it. A person just wishing to access some content,
without caring about metadata and its reliability, could simply visit a
page, while an organisation wishing to interchange RDFa-modelled data
with another one can run a separate processor (possibly a webapp based
on a browser built-in API, or a plugin, to create a suitable interface
for queries) to extract and merge information.
>>>> What is the cost of having different data use specialised formats?
>>> If the data model, or a part of it, is not explicit as in RDF but is
>>> implicit in code made to treat it (as is the case with using scripts
>>> to process things stored in arbitrarily named data-* attributes, and
>>> is also the case in using undocumented or semi-documented XML
>>> formats), it requires people to understand the code as well as the
>>> data model in order to use the data. In a corporate situation where
>>> hundreds or tens of thousands of people are required to work with
>>> the same data, this makes the data model very fragile.
>> I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds
>> (xml) properties and attributes (in the form of curies) to RDF
>> concepts, modelling a certain kind of relationships, whereas it
>> relies on external schemata to define such properties. Any
>> undocumented or semi-documented XML formats may lead to misuses and,
>> thus, to unreliably modelled data.
>> I think the same applies to data-* attributes, because _they_
>> describe data (and data semantics) in a custom model and thus _they_
>> need to be documented for others to be able to manipulate them; the
>> use of a custom script rather than a built-in parser does not change
>> much from this point of view.
> RDFa binds data to RDF. RDF provides a well-known schema language with
> machine-processable definition of vocabularies, and how to merge
> information between them. In other words, if you get the underlying
> model for your data right enough, people will be able to use it
> without needing to know what you do.
> Naturally not everyone will get their data model right, and naturally
> not all information will be reliable anyway. However, it would seem to
> me that making it harder to merge the data in the first place does not
> assist in determining whether it is useful. On the other hand, certain
> forms of RDF data such as POWDER, FOAF, Dublin Core and the like have
> been very carefully modelled, and are relatively well-known and
> re-used in other data models. Making it easy to parse this data and
> merge it, according to the existing well-developed models seems valuable.
I admit I'm no expert in the use of RDF, so I have a few questions.
Specifically, I can guess the advantages of using the same
(carefully modelled and well-known) vocabularies; but when two
organizations develop their own vocabularies, similar yet different, to
model the same kind of information, is merging the data enough? Can a
processor give more than a collection of triples, to be then
interpreted based on knowledge of the vocabularies used?
I mean, I assume my tools can extract RDF(a) data from any
document, but my query interface is based on my own vocabulary: when I
merge information from an external vocabulary, do I need to translate
one vocabulary into the other (or at least to modify the query backend,
so that certain curies are recognized as representing the same concepts
- e.g. to tell my software that 'foaf:name' and 'ex:someone' are
equivalent, for my purposes)? If so, merging data might be the minor
part of the work I need to do, compared with non-RDF(a) metadata (that
is, I'd have tools to extract and merge data anyway, and once I had
translated external metadata into my format, I could use my own tools
to merge data), especially if the same model is used both by my
organization and by an external one (which would make the translation
easier).
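To show what I mean by the translation step, here is a small Python
sketch with plain tuples standing in for triples. The data, the subject
identifiers and the equivalence table are all invented for the example;
only the pairing of 'foaf:name' with 'ex:someone' comes from the
question above, and nothing here relies on a real RDF library.

```python
# Triples as (subject, predicate, object) tuples; all values are made up.
mine = {("#alex", "ex:someone", "Alex")}
theirs = {("#alex", "foaf:name", "Alex"),
          ("#chaals", "foaf:name", "Charles")}

# Hand-written table telling my software which external curies stand for
# concepts I already have in my own vocabulary (an assumption of mine).
EQUIVALENT = {"foaf:name": "ex:someone"}


def normalise(triples, mapping):
    """Rewrite predicates found in the mapping, leave the others untouched."""
    return {(s, mapping.get(p, p), o) for s, p, o in triples}


def merge(*graphs, mapping=EQUIVALENT):
    """Union of all graphs after translating predicates to my vocabulary."""
    merged = set()
    for graph in graphs:
        merged |= normalise(graph, mapping)
    return merged


def query(triples, predicate):
    """Return sorted (subject, object) pairs for one predicate."""
    return sorted((s, o) for s, p, o in triples if p == predicate)


plain_union = mine | theirs      # merging alone: three unrelated triples
translated = merge(mine, theirs)  # merging plus translation: two triples

print(query(plain_union, "ex:someone"))
print(query(translated, "ex:someone"))
```

With the plain union, a query phrased in my own vocabulary only sees my
own triple; it finds the external data only after the translation step,
which is the extra work I am asking about.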
Thus, I'm thinking the most valuable benefit of using RDF/RDFa is the
certainty that both parties are using the very same data model, despite
the possible use of different vocabularies -- it seems to me that the
concept of triples consisting of a subject, a predicate and an object is
somehow similar to a many-to-many association in a database, whereas one
might prefer a one-to-many approach - though the former might be a
natural choice to model data which are usually sparse, as in a document
>>> Ian wrote:
>>>> For search engines, I am not convinced. Google's experience is that
>>>> natural language processing of the actual information seen by the
>>>> actual end user is far, far more reliable than any source of metadata.
>>>> Thus from Google's perspective, investing in RDFa seems like a poorer
>>>> investment than investing in natural language processing.
>>> Indeed. But Google is something of an edge case, since they can
>>> afford to run a huge organisation with massive computer power and
>>> many engineers to address a problem where a "near-enough" solution
>>> brings them the users who are in turn the product they sell to
>>> advertisers. There are many other use cases where a small group of
>>> people want a way to reliably search trusted data.
>> I think the point with general purpose search engines is another one:
>> natural language processing, while expensive, gives a far more
>> accurate solution than RDFa and/or any other kind of metadata can
>> bring to a problem where data must never need to be trusted (and
>> where, instead, a data processor must be able to determine the data's
>> level of trust without any external aid).
> No, I don't think so. Google searches based on analysis of the open
> web are *not* generally more reliable than faceted searches over a
> reliable dataset, and in some instances are less reliable.
> The point is that only a few people can afford to invest in being a
> general-purpose search engine, whereas many can afford to run a
> metadata-based search system over a chosen dataset, that responds to
> their needs (and doesn't require either publishing their data, or
> paying Google to index it).
My point is that the possible assumptions about dataset reliability are
the borderline between two scales. One is wide-scale data
extraction/classification, which is the main problem solved by a
general purpose search engine, and where the best default assumption is
that dataset reliability is uncertain. The other is (very) small-scale
data modelling, where a direct and immediate evaluation of dataset
reliability is possible and easy to do, so that a custom search engine
could reliably be based on such metadata. I think no comparison is
possible between the two scales, and thus no generalization is possible
when trying to guess whether metadata can do more good than harm.
Instead each case should be analysed separately, and everyone should
agree on which context (possibly both) is the best one for RDFa to be
used in, to understand what's the best way to implement it and whether
it's worth introducing in html5 -- as far as I can tell, both of us
agree that small-scale is the main context.
But perhaps some edge cases should be considered to draw a better
picture. For instance, one such case might be a browser making use of
metadata to search for a resource in its local history, or within a web
page and related/linked pages (to a certain degree and level of depth).
Its scale would be small with respect to the effective number of
scanned resources, but wide with respect to the potential number of
sources for those resources: a browser implementing a metadata
extraction and merging engine, and a query interface to look for
gleaned information, would deal with a *limited number* of
*heterogeneous sources* at a given time.
Once major browsers provided (and exposed by default) such a
functionality, a growing number of users would (try and) use it, and
thus a growing number of sites would experiment with metadata. At the
beginning everything might work fine, since only honest sites would
experiment with honest metadata (such as wikis, for instance), but once
the number of sites and users making use of metadata reached a
threshold, spammers would start including spam metadata in their sites
(with otherwise trustworthy content) and in other sites through
advertisements. Such a scenario might lead to a bad balance between
benefits and disadvantages for the average user, thus pushing (some)
browser vendors to limit or even wholly drop native support, and I
guess this is not a desirable outcome for the Semantic Web industry.
That is, choosing a proper level of integration for RDF(a) support into
a web browser might divide success from failure. I don't know what the
best possible level is, but I guess the deepest may be the worst. Thus
starting from external support through plugins, or scripts to be
embedded in a webapp, working on top of other features, might work
fine and lead to better, native support by all vendors, yet limited to
an API for custom applications -- whereas any changes to html to include
RDFa attributes would be fully meaningful if leading to full support
and exposed features to make use of metadata, which I don't think is
much of a benefit for the great majority of (home) users.