[whatwg] Annotating structured data that HTML has no semantics for
Ian Hickson
ian at hixie.ch
Sun May 10 03:32:34 PDT 2009
One of the more elaborate use cases I collected from the e-mails sent in
over the past few months was the following:
USE CASE: Annotate structured data that HTML has no semantics for, and
which nobody has annotated before, and may never again, for private use or
use in a small self-contained community.
SCENARIOS:
* A group of users want to mark up their iguana collections so that they
can write a script that collates all their collections and presents
them in a uniform fashion.
* A scholar and teacher wants other scholars (and potentially students)
to be able to easily extract information about what he teaches to add
it to their custom applications.
* The list of specifications produced by W3C, for example, and various
lists of translations, are produced by scraping source pages and
outputting the result. This is brittle. It would be easier if the data
was unambiguously obtainable from the source pages. This is a custom
set of properties, specific to this community.
* Chaals wants to make a list of the people who have translated W3C
specifications or other documents, and then use this to search for
people who are familiar with a given technology at least at some
level, and happen to speak one or more languages of interest.
* Chaals wants to have a reputation manager that can determine which of
the many emails sent to the WHATWG list might be "more than usually
valuable", and would like to seed this reputation manager from
information gathered from the same source as the scraper that
generates the W3C's TR/ page.
* A user wants to write a script that finds the price of a book from an
Amazon page.
* Todd sells an HTML-based content management system, where all
documents are processed and edited as HTML, sent from one editor to
another, and eventually published and indexed. He would like to build
up the editorial metadata used by the system within the HTML documents
themselves, so that it is easier to manage and less likely to be lost.
* Tim wants to make a knowledge base seeded from statements made in
Spanish and English, e.g. from people writing down their thoughts
about George W. Bush and George H.W. Bush, and has either convinced
the people making the statements that they should use a common
language-neutral machine-readable vocabulary to describe their
thoughts, or has convinced some other people to come in after them and
process the thoughts manually to get them into a computer-readable
form.
REQUIREMENTS:
* Vocabularies can be developed in a manner that won't clash with future
more widely-used vocabularies, so that those future vocabularies can
later be used in a page making use of private vocabularies without
making the earlier annotations ambiguous.
* Using the data should not involve learning a plethora of new APIs,
formats, or vocabularies (today it is possible, e.g., to get the price
of an Amazon product, but it requires learning a new API; similarly
it's possible to get information from sites consistently using 'class'
values in a documented way, but doing so requires learning a new
vocabulary).
* Shouldn't require the consumer to write XSLT or server-side code to
process the annotated data.
* Machine-readable annotations shouldn't be on a separate page than
human-readable annotations.
* The information should be convertible into a dedicated form (RDF,
JSON, XML) in a consistent manner, so that tools that use this
information separate from the pages on which it is found have a
standard way of conveying the information.
* Should be possible for different parts of an item's data to be given
in different parts of the page, for example two items described in the
same paragraph. ("The two lamps are A and B. The first is $20, the
second $30. The first is 5W, the second 7W.")
* It should be possible to define globally-unique names, but the syntax
should be optimised for a set of predefined vocabularies.
* Adding this data to a page should be easy.
* The syntax for adding this data should encourage the data to remain
accurate when the page is changed.
* The syntax should be resilient to intentional copy-and-paste
authoring: people copying data into the page from a page that already
has data should not have to know about any declarations far from the
data.
* The syntax should be resilient to unintentional copy-and-paste
authoring: people copying markup from the page who do not know about
these features should not inadvertently mark up their page with
inapplicable data.
* Any additional markup or data used to allow the machine to understand
the actual information shouldn't be redundantly repeated (e.g. on each
cell of a table, when setting it on the column is possible).
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.
* Creating a custom vocabulary should be relatively easy.
* Distributed vocabulary development should be possible; it
should not require coordination through a centralised system.
* It should be possible to publish and re-use custom
vocabularies.
Some of the scenarios for this use case required substantial additions to
the spec, so before I go through the scenarios, I will describe how I
ended up putting what I did into the spec.
Let's base this on a variant of the first scenario:
* A group of users want to mark up their iguana collections so that they
can write a script that collates all their collections and presents
them in a uniform fashion.
I don't know anything about iguanas, but let's pretend it's about cats,
which should be basically the same thing from a markup perspective. Let's
say someone in the group has the following text on http://example.org/:
<section>
<h1>Hedral</h1>
<p>Hedral is a male american domestic shorthair, with a fluffy black
fur with white paws and belly.</p>
<img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
class="photo">
</section>
...and suppose that they want to make it so that the following information
can be extracted from this part of this page:
Cat name: "Hedral"
Description: "Hedral is a male american domestic shorthair, with a
fluffy black fur with white paws and belly."
Image: "http://example.org/hedral.jpeg"
Let's presume there are several authors each with their own distinct and
differently-structured pages who all want to cooperate so that a single
script can be used to amalgamate the information stored in their pages to
print a table summarising this information about their cats.
The simplest solution would be for them to define a basic vocabulary using
class names and just have them extract the data based on these class
names, a la Microformats. The markup above might become:
<section>
<h1 class="name">Hedral</h1>
<p class="desc">Hedral is a male american domestic shorthair, with a
fluffy black fur with white paws and belly.</p>
<img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
class="photo img">
</section>
This works well; a script can be written to just rip out the relevant
data. Unfortunately it has a few problems:
- it is likely that the class names will clash with other classes used by
other people, which would make it hard to handle situations where
different communities want to work together.
- there is no way to group information about each cat together on a page
with multiple cats.
- there is no way for a parser to know which classes are properties of
cats and which are just for styling (e.g. 'photo' used in this
example).
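To make the third problem concrete, here is a minimal Python sketch of a
consumer that collects class names from the markup above. It is only an
illustration: it uses the standard library's html.parser, and the
ClassCollector name is made up for this example. Nothing in what it
collects distinguishes the styling class 'photo' from the data classes.

```python
from html.parser import HTMLParser

PAGE = """
<section>
 <h1 class="name">Hedral</h1>
 <p class="desc">Hedral is a male american domestic shorthair, with a
 fluffy black fur with white paws and belly.</p>
 <img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
      class="photo img">
</section>
"""


class ClassCollector(HTMLParser):
    """Records every (tag, class name) pair in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute is a space-separated list of names.
                for cls in value.split():
                    self.seen.append((tag, cls))


collector = ClassCollector()
collector.feed(PAGE)

# Without outside knowledge of the vocabulary, a consumer cannot tell
# which of these names carry data and which exist only for styling.
for tag, cls in collector.seen:
    print(tag, cls)
```

A consumer has to hard-code the list of "real" property names, which is
exactly the per-vocabulary knowledge the class-based approach requires.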
The first problem is easy to solve: we can just use more unique class
names, e.g. com.damowmow.name, com.damowmow.desc, and com.damowmow.img:
<section>
<h1 class="com.damowmow.name">Hedral</h1>
<p class="com.damowmow.desc">Hedral is a male american domestic
shorthair, with a fluffy black fur with white paws and belly.</p>
<img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
class="photo com.damowmow.img">
</section>
The second is harder to solve. We could introduce a class name that says
"this is a cat", much like the way Microformats do, but then parsers would
need to know about the "root class names" (to use the Microformats term)
to know where to begin collecting data.
The third problem is somewhat of a blocker, and is the most common
complaint I heard from people about Microformats.
Another solution we could consider is RDFa:
<section typeof="d:cat" xmlns:d="http://damowmow.com/">
<h1 property="d:name">Hedral</h1>
<p property="d:desc">Hedral is a male american domestic shorthair,
with a fluffy black fur with white paws and belly.</p>
<img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
class="photo" rel="d:img">
</section>
This unfortunately also has a number of problems.
- it uses prefixes, which most authors simply do not understand, and
which many implementors end up getting wrong (e.g. SearchMonkey
hard-coded certain prefixes in its first implementation, Google's
handling of RDF blocks for license declarations is all done with
regular expressions instead of actually parsing the namespaces, etc).
Even if implemented right, namespaces still lead to flaky
copy-and-paste behaviour.
- it sometimes uses rel="" and sometimes uses property="" and it's hard
to know when to use one or the other.
- it introduces much more power than is necessary to solve this problem.
We can fix the first problem by using URLs instead, and using the
xmlns:http trick to keep things working in RDFa processors:
<section typeof="http://damowmow.com/cat" xmlns:http="http:">
<h1 property="http://damowmow.com/name">Hedral</h1>
<p property="http://damowmow.com/desc">Hedral is a male american
domestic shorthair, with a fluffy black fur with white paws and
belly.</p>
<img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
class="photo" rel="http://damowmow.com/img">
</section>
This, though, is quite ugly. An alternative is to go back to the non-URI
class names we had above. This doesn't break compatibility with the RDFa
processors, because when there is no colon in the property="" or rel=""
attributes, the RDFa processors just ignore the values (this is the "no
prefix" mapping of CURIEs):
<section typeof="com.damowmow.cat">
<h1 property="com.damowmow.name">Hedral</h1>
<p property="com.damowmow.desc">Hedral is a male american
domestic shorthair, with a fluffy black fur with white paws and
belly.</p>
<img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
class="photo" rel="com.damowmow.img">
</section>
At this point though we might as well change rel="" to property="" and
just say that on <img> it takes the value from src="", because that's
going to be a lot more understandable, and won't affect RDFa processors
(which at this point are ignoring the values):
<section typeof="com.damowmow.cat">
<h1 property="com.damowmow.name">Hedral</h1>
<p property="com.damowmow.desc">Hedral is a male american
domestic shorthair, with a fluffy black fur with white paws and
belly.</p>
<img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
class="photo" property="com.damowmow.img">
</section>
This has us past the first two problems listed above for RDFa, but we
still have the third, if we were to adopt the rest of RDFa -- way too much
power to solve this problem. So instead, let's assume we're sticking with
just these two attributes, typeof="" and property="", in the "prefix-less"
CURIE space. This gives us a very small subset of RDFa to start from.
Let's see if this works well for other ways of marking up cat data:
<body typeof="com.damowmow.cat">
<p>I love my cats. My oldest cat is <span
property="com.damowmow.name">Silver</span>. <span
property="com.damowmow.desc">Silver is <span
property="com.damowmow.age">11</span> years old and refuses to eat
alone, always waiting for either Yellow or Blue to eat with
him.</span></p>
</body>
This seems fine. This example shows overlapping properties, so that data
can be extracted from within the value of another.
At this point, I would like to introduce another example, also taken from
the scenarios above, to examine if this works well in other scenarios:
* The list of specifications produced by W3C, for example, and various
lists of translations, are produced by scraping source pages and
outputting the result. This is brittle. It would be easier if the data
was unambiguously obtainable from the source pages. This is a custom
set of properties, specific to this community.
Let's presume that each spec has the following structure:
Name:
URL:
Status:
Level:
Publication Date:
Comments Deadline:
Working Group:
This might be encoded as:
<div typeof="org.w3.spec">
<h1 property="org.w3.name">HTML5</h1>
<a property="org.w3.url" href="/TR/html5">Current Version</a>
<div property="org.w3.status">
<p>Level: <span property="org.w3.level">WD</span>
<p>Date: <span property="org.w3.pubdate">03/02/2009</span>
<p>Deadline: <span property="org.w3.deadline">02/03/2009</span>
</div>
<p>Working Group: <span property="org.w3.wg">HTMLWG</span>
</div>
There is a problem here. The original structure has "status" as a property
with three sub-properties, but according to the "minimal RDFa" convention
we've come up with so far, the markup here has "status" as containing a
long string and the other properties are at the top level, like this:
Name: "HTML5"
URL: "/TR/html5"
Status: "Level: WD Date: 03/02/2009 Deadline: 02/03/2009"
Level: "WD"
Publication Date: "03/02/2009"
Comments Deadline: "02/03/2009"
Working Group: "HTMLWG"
What we need is a way to say that a property is really a new set of
name-value pairs. If we look at the current markup, we are using typeof=""
for this, but typeof="" expects a type as its value, whereas here we
don't need an explicit type. We could just say typeof="" can be used
without a type:
<div typeof="org.w3.spec">
<h1 property="org.w3.name">HTML5</h1>
<a property="org.w3.url" href="/TR/html5">Current Version</a>
<div property="org.w3.status" typeof>
<p>Level: <span property="org.w3.level">WD</span>
<p>Date: <span property="org.w3.pubdate">03/02/2009</span>
<p>Deadline: <span property="org.w3.deadline">02/03/2009</span>
</div>
<p>Working Group: <span property="org.w3.wg">HTMLWG</span>
</div>
But that doesn't make much sense and isn't allowed in RDFa anyway.
Let's rename "typeof" to "item" instead:
<div item="org.w3.spec">
<h1 property="org.w3.name">HTML5</h1>
<a property="org.w3.url" href="/TR/html5">Current Version</a>
<div property="org.w3.status" item>
<p>Level: <span property="org.w3.level">WD</span>
<p>Date: <span property="org.w3.pubdate">03/02/2009</span>
<p>Deadline: <span property="org.w3.deadline">02/03/2009</span>
</div>
<p>Working Group: <span property="org.w3.wg">HTMLWG</span>
</div>
Now the result makes more sense:
Name: "HTML5"
URL: "/TR/html5"
Status:
Level: "WD"
Publication Date: "03/02/2009"
Comments Deadline: "02/03/2009"
Working Group: "HTMLWG"
While we're at it, we should use the <time> element for encoding those
dates, so that we don't have to work out if the deadline is before the
publication date or if these are weird US-format dates:
<div item="org.w3.spec">
<h1 property="org.w3.name">HTML5</h1>
<a property="org.w3.url" href="/TR/html5">Current Version</a>
<div property="org.w3.status" item>
<p>Level: <span property="org.w3.level">WD</span>
<p>Date: <time property="org.w3.pubdate" datetime="2009-02-03">03/02/2009</time>
<p>Deadline: <time property="org.w3.deadline" datetime="2009-03-02">02/03/2009</time>
</div>
<p>Working Group: <span property="org.w3.wg">HTMLWG</span>
</div>
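The ambiguity that datetime="" removes can be shown with a short Python
sketch (illustrative only; the variable names are made up). The same
visible text parses to two different dates depending on the convention
assumed, whereas the ISO form in the attribute has one reading.

```python
from datetime import date, datetime

visible_text = "03/02/2009"

# Two plausible readings of the visible text.
day_first = datetime.strptime(visible_text, "%d/%m/%Y").date()
month_first = datetime.strptime(visible_text, "%m/%d/%Y").date()

# The machine-readable value from datetime="" has only one reading.
from_attribute = date.fromisoformat("2009-02-03")

print(day_first.isoformat())
print(month_first.isoformat())
print(from_attribute == day_first)
```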
(So, <img> gets its content from src="", <time> gets its content from
datetime="", and everything else gets its content from the element's own
contents, serialised to text.)
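As a rough sketch of how a consumer might apply these rules, the
following Python uses the standard library's html.parser. It is an
illustration under several assumptions, not the algorithm in the spec:
the ItemExtractor class and the output shape are invented here, the
org.w3.logo property is made up to exercise the <img> rule, <meta> is
read from content="", and the sample markup uses explicit end tags
because html.parser does not infer omitted ones.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "meta", "br", "hr", "input", "link"}

PAGE = """
<div item="org.w3.spec">
 <h1 property="org.w3.name">HTML5</h1>
 <div property="org.w3.status" item>
  <p>Level: <span property="org.w3.level">WD</span></p>
  <p>Date: <time property="org.w3.pubdate"
                 datetime="2009-02-03">03/02/2009</time></p>
 </div>
 <img property="org.w3.logo" src="logo.png" alt="">
</div>
"""


class ItemExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []    # top-level items found so far
        self._items = []   # items whose element is still open
        self._open = []    # open elements: (tag, capture, is_item)

    @staticmethod
    def _add(item, name, value):
        item["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        owner = self._items[-1] if self._items else None
        new_item = None
        if "item" in attrs:
            # A bare item attribute yields None: an untyped item.
            new_item = {"type": attrs["item"], "properties": {}}
        name = attrs.get("property")
        capture = None
        if name is not None and owner is not None:
            if new_item is not None:
                self._add(owner, name, new_item)       # nested item
            elif tag == "img":
                self._add(owner, name, attrs.get("src", ""))
            elif tag == "time" and "datetime" in attrs:
                self._add(owner, name, attrs["datetime"])
            elif tag == "meta":
                self._add(owner, name, attrs.get("content", ""))
            else:
                capture = (owner, name, [])            # text content
        elif new_item is not None:
            self.items.append(new_item)
        if tag not in VOID_TAGS:
            if new_item is not None:
                self._items.append(new_item)
            self._open.append((tag, capture, new_item is not None))

    def handle_data(self, data):
        # Text counts towards every enclosing property being captured.
        for _tag, capture, _is_item in self._open:
            if capture is not None:
                capture[2].append(data)

    def handle_endtag(self, tag):
        if not any(entry[0] == tag for entry in self._open):
            return  # stray end tag, ignore
        while self._open:
            open_tag, capture, is_item = self._open.pop()
            if capture is not None:
                owner, name, parts = capture
                self._add(owner, name, " ".join("".join(parts).split()))
            if is_item:
                self._items.pop()
            if open_tag == tag:
                break


extractor = ItemExtractor()
extractor.feed(PAGE)
spec = extractor.items[0]
print(spec["properties"]["org.w3.status"][0]["properties"])
```

Here the status property holds a nested, untyped item rather than a
long string, matching the structure shown earlier.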
One of the requirements is the following:
* Should be possible for different parts of an item's data to be given
in different parts of the page, for example two items described in the
same paragraph. ("The two lamps are A and B. The first is $20, the
second $30. The first is 5W, the second 7W.")
Here we have two "items" (to use the terminology that the attribute name
above implies), but they are mixed together.
<p>The two lamps are A and B. The first is $20, the
second $30. The first is 5W, the second 7W.</p>
The proposal so far doesn't handle this:
<p item>The two lamps are <span
property="com.example.name">A</span> and <span ???>B</span>. The
first is <span property="com.example.price">$20</span>, the second
<span ???>$30</span>. The first is <span
property="com.example.power">5W</span>, the second <span
???>7W</span>.</p>
What do we put for the "???"s?
In RDFa we would have to give each of these items a name using about=""
and then link them that way, but this has two problems: these items aren't
named, nor is there an obvious name to give them, and it makes it hard to
identify a single element responsible for introducing an item, which would
make DOM interfaces to this somewhat more complex.
Microformats aren't especially useful here for inspiration as they do not
solve this problem at all.
Instead, let us try using the regular "IDREF" functionality that HTML uses
in a variety of other places, like <label for="">. For this we'll need a
new attribute, but unfortunately we can't use about="" (which would be the
obvious name to use), because that would conflict with RDFa, so instead
we'll use subject="":
<p>The two lamps are <span item id=a><span
property="com.example.name">A</span></span> and <span item
id=b><span property="com.example.name">B</span></span>. The first
is <span subject="a" property="com.example.price">$20</span>, the
second <span subject="b"
property="com.example.price">$30</span>. The first is <span
subject="a" property="com.example.power">5W</span>, the second
<span subject="b" property="com.example.power">7W</span>.</p>
This is somewhat verbose, but I think this example represents pretty much
the worst case scenario -- in most cases, most of the content would be in
one place with just the odd other property elsewhere on the page.
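A minimal Python sketch of how a consumer could resolve subject=""
follows. It is an assumption-laden illustration, not the spec's
processing model: SubjectResolver is a made-up name, only items with an
id are tracked, every element is assumed to have an explicit end tag,
and subject references are resolved after parsing so that an id may
appear later in the page than the properties pointing at it.

```python
from html.parser import HTMLParser

PAGE = """<p>The two lamps are <span item id=a><span
 property="com.example.name">A</span></span> and <span item
 id=b><span property="com.example.name">B</span></span>. The first
 is <span subject="a" property="com.example.price">$20</span>, the
 second <span subject="b"
 property="com.example.price">$30</span>. The first is <span
 subject="a" property="com.example.power">5W</span>, the second
 <span subject="b" property="com.example.power">7W</span>.</p>"""


class SubjectResolver(HTMLParser):
    def __init__(self):
        super().__init__()
        self.by_id = {}     # id -> properties of the item with that id
        self.pending = []   # (subject id, name, value), resolved later
        self._open = []     # open elements: (item or None, capture or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        item = None
        if "item" in attrs:
            item = {}
            if attrs.get("id"):
                self.by_id[attrs["id"]] = item
        capture = None
        if "property" in attrs:
            # An explicit subject wins; otherwise the property belongs
            # to the nearest enclosing item, if there is one.
            target = attrs.get("subject")
            if target is None:
                target = next(
                    (i for i, _ in reversed(self._open) if i is not None),
                    None,
                )
            capture = (target, attrs["property"], [])
        self._open.append((item, capture))

    def handle_data(self, data):
        for _item, capture in self._open:
            if capture is not None:
                capture[2].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        _item, capture = self._open.pop()
        if capture is None:
            return
        target, name, parts = capture
        value = " ".join("".join(parts).split())
        if isinstance(target, str):
            self.pending.append((target, name, value))
        elif target is not None:
            target.setdefault(name, []).append(value)

    def resolve(self):
        """Attach the properties that named their item via subject=""."""
        for subject, name, value in self.pending:
            item = self.by_id.get(subject)
            if item is not None:
                item.setdefault(name, []).append(value)
        return self.by_id


resolver = SubjectResolver()
resolver.feed(PAGE)
lamps = resolver.resolve()
print(lamps["a"])
print(lamps["b"])
```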
With this proposal in mind, let's now go through the scenarios.
* A group of users want to mark up their iguana collections so that they
can write a script that collates all their collections and presents
them in a uniform fashion.
Assuming iguanas aren't too different from cats, here's an example of what
this could look like:
Page 1:
<section item="com.damowmow.cat">
<h1 property="com.damowmow.name">Hedral</h1>
<p property="com.damowmow.desc">Hedral is a male american domestic
shorthair, with a fluffy black fur with white paws and belly.</p>
<img property="com.damowmow.img" src="hedral.jpeg" alt="" title="Hedral, age 18 months">
</section>
Page 2:
<body item="com.damowmow.cat">
<p>I love my cats. My oldest cat is <span
property="com.damowmow.name">Silver</span>. <span
property="com.damowmow.desc">Silver is <span
property="com.damowmow.age">11</span> years old and refuses to eat
alone, always waiting for either Yellow or Blue to eat with
him.</span></p>
</body>
Page 3:
<h2>My Cats</h2>
<dl>
<dt>Schrödinger
<dd item="com.damowmow.cat">
<meta property="com.damowmow.name" content="Schrödinger">
<meta property="com.damowmow.age" content="9">
<p property="com.damowmow.desc">Orange male.
<dt>Erwin
<dd item="com.damowmow.cat">
<meta property="com.damowmow.name" content="Lord Erwin">
<meta property="com.damowmow.age" content="3">
<p property="com.damowmow.desc">Siamese color-point.
<img property="com.damowmow.img" alt="" src="/images/erwin.jpeg">
</dl>
(Note the <meta>s in the last example -- since sometimes the information
isn't visible, rather than requiring that people put it in and hide it
with display:none, which has a rather poor accessibility story, I figured
we could just allow <meta> anywhere, if it has a property="" attribute.)
Assuming a Perl library HTML::Microdata with the following API (which
could be written on top of html5lib, for instance):
# $doc is an HTML document instance
my $microdata = new HTML::Microdata($doc);
my @items = $microdata->items(); # returns a list of items
my @x = $microdata->items('x'); # returns a list of items with that type
my $properties = $item->properties(); # returns a hashref of name-[value] pairs
...then the following could generate a table of all the cats:
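use CGI qw(escapeHTML); # assumption: escapeHTML() comes from CGI.pm; any HTML-escaping helper would do
my @cats;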
# @docs is a list of pre-parsed document instances
foreach my $doc (@docs) {
my $data = new HTML::Microdata($doc);
push(@cats, $data->items('com.damowmow.cat'));
}
print "<table><thead><tr><th>Name<th>Age<th>Image<th>Description<tbody>";
foreach my $cat (@cats) {
my $name = '—';
if (exists $cat->properties->{'com.damowmow.name'}) {
$name = escapeHTML($cat->properties->{'com.damowmow.name'}->[0]);
}
my $age = '';
if (exists $cat->properties->{'com.damowmow.age'}) {
$age = escapeHTML($cat->properties->{'com.damowmow.age'}->[0]);
}
my $image = '';
if (exists $cat->properties->{'com.damowmow.img'}) {
my $src = escapeHTML($cat->properties->{'com.damowmow.img'}->[0]);
$image = "<img src='$src' alt=''>";
}
my $description = '';
if (exists $cat->properties->{'com.damowmow.desc'}) {
$description = escapeHTML($cat->properties->{'com.damowmow.desc'}->[0]);
}
print "<tr><td>$name<td>$age<td>$image<td>$description";
}
print "</table>";
This seems pretty simple to use.
* A scholar and teacher wants other scholars (and potentially students)
to be able to easily extract information about what he teaches to add
it to their custom applications.
* The list of specifications produced by W3C, for example, and various
lists of translations, are produced by scraping source pages and
outputting the result. This is brittle. It would be easier if the data
was unambiguously obtainable from the source pages. This is a custom
set of properties, specific to this community.
These seem to be more or less the same.
* Chaals wants to make a list of the people who have translated W3C
specifications or other documents, and then use this to search for
people who are familiar with a given technology at least at some
level, and happen to speak one or more languages of interest.
It seems that, given a tool that can parse this microdata format from HTML
pages, it would be reasonably straightforward to write a tool to search
the data for particular patterns.
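As a hedged illustration of such a search, the Python sketch below
filters already-extracted items by type and by required property
values. Everything specific in it is invented for the example: the
org.example.* vocabulary, the sample people, the find_items helper, and
the in-memory shape (a type plus a mapping from property names to lists
of values).

```python
# Made-up sample data standing in for items extracted from several pages.
ITEMS = [
    {
        "type": "org.example.translator",
        "properties": {
            "org.example.name": ["Ana"],
            "org.example.language": ["es", "en"],
            "org.example.spec": ["HTML5"],
        },
    },
    {
        "type": "org.example.translator",
        "properties": {
            "org.example.name": ["Ben"],
            "org.example.language": ["fr"],
            "org.example.spec": ["HTML5", "CSS2"],
        },
    },
]


def find_items(items, item_type, required):
    """Return items of item_type having every required name/value pair."""
    matches = []
    for item in items:
        if item["type"] != item_type:
            continue
        props = item["properties"]
        if all(value in props.get(name, [])
               for name, value in required.items()):
            matches.append(item)
    return matches


# People who translated HTML5 and list Spanish among their languages.
found = find_items(
    ITEMS,
    "org.example.translator",
    {"org.example.language": "es", "org.example.spec": "HTML5"},
)
names = [item["properties"]["org.example.name"][0] for item in found]
print(names)
```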
* Chaals wants to have a reputation manager that can determine which of
the many emails sent to the WHATWG list might be "more than usually
valuable", and would like to seed this reputation manager from
information gathered from the same source as the scraper that
generates the W3C's TR/ page.
The reputation manager and e-mail processing is out of scope here, but the
microdata format would address the problem of scraping the data in a less
brittle manner than would be required today.
* A user wants to write a script that finds the price of a book from an
Amazon page.
This wouldn't be helped by the microdata solution proposed above, since
Amazon apparently do not wish to expose the data in this fashion, but
instead the Amazon API can be used for this purpose, so there is an
alternative solution here.
* Todd sells an HTML-based content management system, where all
documents are processed and edited as HTML, sent from one editor to
another, and eventually published and indexed. He would like to build
up the editorial metadata used by the system within the HTML documents
themselves, so that it is easier to manage and less likely to be lost.
It's unclear exactly what metadata this would involve, but the microdata
solution described above seems like it would work for this.
* Tim wants to make a knowledge base seeded from statements made in
Spanish and English, e.g. from people writing down their thoughts
about George W. Bush and George H.W. Bush, and has either convinced
the people making the statements that they should use a common
language-neutral machine-readable vocabulary to describe their
thoughts, or has convinced some other people to come in after them and
process the thoughts manually to get them into a computer-readable
form.
The knowledge base aspect of this is out of scope for HTML, but if someone
is able to convince people to mark up sentences in HTML, the microdata
solution above would, as far as I can tell, provide sufficient flexibility
to mark up such features.
The requirements were as follows:
* Vocabularies can be developed in a manner that won't clash with future
more widely-used vocabularies, so that those future vocabularies can
later be used in a page making use of private vocabularies without
making the earlier annotations ambiguous.
The use of reversed domain labels makes this possible without needing
prefixes and without making the page unreadable, so this seems satisfied.
* Using the data should not involve learning a plethora of new APIs,
formats, or vocabularies (today it is possible, e.g., to get the price
of an Amazon product, but it requires learning a new API; similarly
it's possible to get information from sites consistently using 'class'
values in a documented way, but doing so requires learning a new
vocabulary).
This need isn't met. There would still be a plethora of vocabularies to
learn. I don't see how to satisfy this requirement short of defining one
very comprehensive vocabulary and requiring that only that vocabulary be
used, but that seems like an impractical solution.
* Shouldn't require the consumer to write XSLT or server-side code to
process the annotated data.
This requirement isn't met. While XSLT itself isn't necessarily needed,
there is still typically a need for code to process the parsed annotations
to do something with them.
This requirement was later clarified to mean that it should be possible to
write a single parser that isn't continually maintained in the way that
the Microformats parsers have to be as new vocabularies are added. That
_is_ handled by this new syntax, so hopefully that will be enough.
* Machine-readable annotations shouldn't be on a separate page than
human-readable annotations.
This requirement is met.
* The information should be convertible into a dedicated form (RDF,
JSON, XML) in a consistent manner, so that tools that use this
information separate from the pages on which it is found have a
standard way of conveying the information.
I've included a section on converting microdata blobs to JSON and a
section on converting HTML documents, including microdata blobs, to RDF,
so that there is a consistent representation of HTML documents in those
formats. I haven't defined an XML form of the microdata syntax, since HTML
can be expressed as XML already, and I'm not sure that a dedicated format
for microdata would be particularly wieldy.
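To give a feel for what a consistent JSON form could look like, here is
a small Python sketch that serialises the W3C spec example. The JSON
shape used here (an "items" list, each item having a "type" and a
"properties" map of name to list of values) and the to_json helper are
assumptions made for illustration; they are not copied from the
conversion algorithm in the spec.

```python
import json

# The spec example in an assumed in-memory shape. An untyped nested
# item has type None, which JSON represents as null.
spec = {
    "type": "org.w3.spec",
    "properties": {
        "org.w3.name": ["HTML5"],
        "org.w3.url": ["/TR/html5"],
        "org.w3.status": [{
            "type": None,
            "properties": {
                "org.w3.level": ["WD"],
                "org.w3.pubdate": ["2009-02-03"],
                "org.w3.deadline": ["2009-03-02"],
            },
        }],
        "org.w3.wg": ["HTMLWG"],
    },
}


def to_json(items):
    """Serialise a list of items; sorted keys keep the output stable."""
    return json.dumps({"items": items}, indent=2, sort_keys=True)


text = to_json([spec])
round_tripped = json.loads(text)
print(text)
```

Because every property value is a list, a tool reading this form does
not need to special-case properties that occur more than once.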
* Should be possible for different parts of an item's data to be given
in different parts of the page, for example two items described in the
same paragraph. ("The two lamps are A and B. The first is $20, the
second $30. The first is 5W, the second 7W.")
This was discussed above.
* It should be possible to define globally-unique names, but the syntax
should be optimised for a set of predefined vocabularies.
The syntax allows URIs and encourages use of Java-like identifiers, but
some of the use cases that I haven't handled yet are about dedicated
vocabularies for particular purposes, so I'll return to this soon.
* Adding this data to a page should be easy.
Based on the examples I've written above, as well as a number of others
I've written while developing this, it seems that this is a pretty easy
syntax to use. It is easier than Microformats for people who already use
"class" heavily, and it is easier than RDFa (it's a loose subset).
* The syntax for adding this data should encourage the data to remain
accurate when the page is changed.
The syntax encourages the use of inline metadata for this very purpose,
while allowing hidden metadata for the cases where trying to shoe-horn
microdata into visible form is counterproductive.
* The syntax should be resilient to intentional copy-and-paste
authoring: people copying data into the page from a page that already
has data should not have to know about any declarations far from the
data.
Since microdata doesn't use prefixes, it's quite resilient to copy/paste.
* The syntax should be resilient to unintentional copy-and-paste
authoring: people copying markup from the page who do not know about
these features should not inadvertently mark up their page with
inapplicable data.
This isn't really met. It's very easy to copy an element with a
property="" or item="" attribute and not realise that more is being said
by the markup than the author realises. I'm not sure how to address this.
This problem is also present in RDFa and Microformats, but my discussions
with people who have worked with and on those technologies did not reveal
any obvious solutions that didn't involve significantly reducing ease of
authoring.
* Any additional markup or data used to allow the machine to understand
the actual information shouldn't be redundantly repeated (e.g. on each
cell of a table, when setting it on the column is possible).
This isn't met at all with the current proposal. Unfortunately the only
general solutions I could come up with that would allow this were
selector-based, and in practice authors are still having trouble
understanding how to use Selectors even with CSS.
* Parsing rules should be unambiguous.
The parsing rules for microdata are clearly set out in the spec.
* Should not require changes to HTML5 parsing rules.
The HTML5 parsing rules are untouched.
* Creating a custom vocabulary should be relatively easy.
Creating a good vocabulary is always a lot of work, but in general the
microdata syntax doesn't make it harder, at least. I haven't yet addressed
the issue of validating the vocabulary, which is one of the use cases that
I haven't yet dealt with.
* Distributed vocabulary development should be possible; it
should not require coordination through a centralised system.
This is handled through the use of URIs or domain-label-based identifiers.
* It should be possible to publish and re-use custom
vocabularies.
Nothing in the syntax seems to preclude this.
In conclusion:
To address this use case and its scenarios, I've added to HTML5 a simple
syntax (three new attributes) based on RDFa. It doesn't have the full
power of RDF, because that didn't seem to be necessary to address the use
cases. It doesn't really have anything in common with Microformats; I
didn't find the Microformats syntax to be very convenient. (This was also
the experience with eRDF.)
I expect the syntax will need adjustments over the coming weeks to address
issues that I overlooked. I look forward to such feedback.
A number of further use cases remain to be examined, including one
with scenarios regarding validating custom vocabularies and allowing
editors to provide help with custom vocabularies, and including
several regarding specific vocabularies such as events, contact
information, and bibliographic information. I will send further e-mail
next week as I address them.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'