[whatwg] Link rot is not dangerous

Fri May 15 21:31:15 PDT 2009

2009/5/15 Laurens Holst <laurens.nospam at grauw.nl>:
> Tab Atkins Jr. wrote:
>>
>> Assume a page that uses both foaf and another vocab that subclasses
>> many foaf properties.  Given working lookups for both, the rdf parser
>> can determine that two entries with different properties are really
>> 'the same', and hopefully act on that knowledge.
>>
>> If the second vocab 404s, that information is lost.  The parser will
>> then treat any use of that second vocab completely separately from the
>> foaf, losing valuable semantic information.
>>
>
> If the subclass-vocabulary is public, then it is most likely already well
> taken care of by the owner and also archived in several places, and thus
> hard to get lost. If the subclass-vocabulary is one custom-built for a
> specific site, then it is likely already stored in the same location.
>
> But even if you had RDF data without ontology, it is still far from useless.
> In fact, I’d say most RDF consumers today do not really do any kind of
> reasoning, which is what you primarily need an ontology for, especially not
> the large consumers. Without ontology you can still determine types, query
> their properties whose names are often self-explanatory, compare resources
> for equality, etc.
>
> Knowledge of the ontology will be embedded in documentation and existing
> software that consumes the data. Let me remark that when you end up in this
> scenario, you still basically have the same as what microformats have to work
> with. And if need be, you could even manually construct a schema.
>
> But yes, if everything goes awry, then data can get lost. That is the nature
> of the web. It is like, if snap.com goes out of business, all sites using
> those annoying popups will cease to show them (hurray!). A question you
> could pose is, if ‘the web’ allowed the data to get lost, whether that data
> is really important anyway.
>
> Maybe it would ease your mind if people set up a bunch of servers which
> spider the web of data for ontology schemas, archive them and provide a
> querying mechanism? If such a thing does not exist already.
>
> Either way, I guess the basic idea is that dereferenceability of RDF
> URIs is a convenient bonus, not a necessity; RDF can work completely
> offline. There is no requirement that ontologies must be retrieved from the
> ontology’s URIs or that there must be an ontology at all.

Believe me, Laurens, *I* know this.  I know that public vocabs will be
publicly known and consumable, and private vocabs don't need to be
(because the few people using them know them and can consume them).
But the automated discoverability of RDF has been touted as a major
reason why RDFa specifically has to be supported in HTML5 (certainly
not the only major reason, but it's been harped on plenty), and link
rot *does* significantly affect that, *especially* for the small
vocabs that aren't likely to be widely reproduced.  It's a common
thing that *will* happen, as Philip's data shows, and as anyone
familiar with web history is aware.  The web rots over time, no
matter what you do, and there's no way to form canonical identifiers
that will stand the test of time.
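
To make the subclassing scenario from the quoted exchange concrete,
here is a rough Python sketch.  Everything in it is made up for
illustration: the example.org vocab, the fetch_schema helper, and the
in-memory SCHEMAS table standing in for the network.  No real RDF
library is involved, and collapsing a subproperty onto its parent is a
simplification of what an actual reasoner would do.

```python
# Real FOAF property URI, plus a hypothetical small vocab that refines it.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
EX_VOCAB = "http://example.org/vocab#"        # hypothetical vocab URI
EX_FULLNAME = EX_VOCAB + "fullName"           # hypothetical property

# Stand-in for what a lookup of each vocab URI would return:
# a mapping of property -> the property it is a subproperty of.
SCHEMAS = {
    EX_VOCAB: {EX_FULLNAME: FOAF_NAME},
}


def fetch_schema(vocab_uri, reachable):
    """Return the subproperty mapping, or nothing if the lookup 404s."""
    if vocab_uri not in reachable:
        return {}  # link rot: the schema is simply gone
    return SCHEMAS.get(vocab_uri, {})


def normalize(triples, sub_property_of):
    """Rewrite each property to its known parent, then deduplicate."""
    return {(s, sub_property_of.get(p, p), o) for s, p, o in triples}


triples = [
    ("#alice", FOAF_NAME, "Alice"),
    ("#alice", EX_FULLNAME, "Alice"),
]

# Lookup works: both statements are recognized as 'the same'.
with_schema = normalize(triples, fetch_schema(EX_VOCAB, reachable={EX_VOCAB}))

# Lookup 404s: the second vocab is treated as unrelated to foaf.
without_schema = normalize(triples, fetch_schema(EX_VOCAB, reachable=set()))

print(len(with_schema), len(without_schema))
```

With the schema available the two statements collapse into one; once
the lookup fails they stay separate, which is the information loss
being discussed.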

Automated discovery is a point in RDF's favor.  It's probably not a
*downside*, after all (though there were some negative scenarios
brought up concerning this a few months ago, such as a domain falling
into new hands who maliciously modify the schema).  But I think it's a
very minor point, and the fact that few if any major consumers of RDF
actually use this ability supports this thought.  There are few if any
in-the-wild use cases for this sort of ability, which means that it is
very low priority when determining what solution will be specced.

Once you remove discovery as a strong requirement, then you remove the
need for long URLs, and that removes the need for CURIEs, or any
other form of prefixing.  You still want to uniquify your identifiers
to avoid accidental clashes, but that's not that hard, nor is it
absolutely necessary.  The system can be robust and usable even with a
bit of potential ambiguity if small authors design their private
vocabs badly.  As a bonus, everything gets simpler.  Essentially it
devolves into something relatively close to Ian's microdata proposal,
perhaps with datatype added in (though I do question how necessary
that is, given a half-intelligent parser can recognize things as
numbers or dates).
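
As a rough illustration of the datatype point, here is a minimal
Python sketch of a parser guessing a type from a plain string.  The
guess_type name and the order of the checks are my own assumptions,
not part of any proposal, and a heuristic like this can misfire (a
bare "2009" comes back as a number, not a year).

```python
from datetime import date


def guess_type(text):
    """Best-effort guess: int, then float, then ISO date, else string."""
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        pass
    try:
        # Accepts ISO 8601 calendar dates such as "2009-05-15".
        return date.fromisoformat(text)
    except ValueError:
        pass
    return text  # nothing matched; leave it as a plain string


for value in ["42", "3.5", "2009-05-15", "Tab"]:
    print(repr(value), "->", repr(guess_type(value)))
```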

~TJ