[whatwg] Microdata feedback

Ian Hickson ian at hixie.ch
Tue May 31 16:17:07 PDT 2011

On Tue, 15 Mar 2011, Hay (Husky) wrote:
> Consider, for example, a list that contains custom data that needs to be 
> displayed using Javascript. In most cases, the data-* attributes are a 
> nice way to embed non-visual data to be read out later, but that doesn't 
> work for hierarchical structures.

You can use nested HTML elements with data-* attributes. For example, the 
JSON value {a:{b:'x',y:'z'}} could be represented as:

   <div data-a>
    <div data-b='x'></div>
    <div data-y='z'></div>
   </div>
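
As a rough illustration of reading such markup back into a script object, 
the sketch below walks an element's dataset and its children. The name 
readData, and the convention that an empty data-* value marks a nested 
object, are assumptions made for this example; neither comes from the spec.

```javascript
// Hypothetical helper: turn an element with data-* attributes into a plain object.
// Assumed convention (not from the spec): an attribute with an empty value,
// such as data-a, means "this key holds an object built from the child elements".
function readData(el) {
  const out = {};
  for (const [key, value] of Object.entries(el.dataset)) {
    if (value === "") {
      // Empty value: merge whatever the children describe into a nested object.
      const nested = {};
      for (const child of Array.from(el.children)) {
        Object.assign(nested, readData(child));
      }
      out[key] = nested;
    } else {
      // Non-empty value: keep it as a string, exactly as the attribute holds it.
      out[key] = value;
    }
  }
  return out;
}
```

In a page this might be called as readData(document.querySelector('[data-a]')) 
on markup that mirrors the JSON value above.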

> 1) Microdata. This could work, but only if the data should be displayed 
> as well. If the data should be processed (and for example, be shown in 
> another part of the page) this doesn't work really well. You could then 
> hide the parent element with CSS, but that's pretty clunky.

You can use microdata with <link> and <meta> if you don't want to actually 
show the information.

Hiding information in a page (whether it's in microdata, data-*, <script> 
data blocks, or anything else) is generally frowned upon as the data tends 
to end up less accurate than if it's visible, but that's something you 
have to take under advisement.

> 2) data-* attributes with JSON data. Would work, but pretty ugly and not 
> very expressive and readable.


> 3) 'Data blocks' in a <script> element, as described in the spec. You 
> can, for example, use a <script> element with an 'application/xml' or 
> 'application/json' type, which will not be processed by a browser, and 
> use Javascript to do the parsing.


> Data blocks seem to be the current most usable solution, but i get the 
> feeling that it's not quite expressive enough. <script> tags should 
> contain scripts, and XML and JSON are not scripting languages. 

<script> elements can contain either scripts, or data blocks. Both are 
allowed and equally valid.

> Semantically, 'xml islands' as used in older versions of IE seem like an 
> elegant solution but are not supported in other browsers.

I'm not sure what you mean by "semantically" here.

> Something like a <data> tag, with a required 'type' attribute that can 
> contain a mime type that could indicate to browsers what type of content 
> follows might be an interesting solution, but i'm not sure what the 
> implications might be. It would be cool if you do a <data 
> type="application/json"> and get an automatic Javascript object when 
> getting the element with document.getElementById.

That's what <script> data blocks are (except for the automatic JS object 
part, but that's easy to add with a script library).
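
A minimal sketch of what such a library function might look like follows. 
The name readDataBlock is made up for illustration, and the function 
assumes the block's type is exactly application/json.

```javascript
// Hypothetical helper: parse a <script type="application/json"> data block.
// Takes the element itself, so the caller decides how to look it up.
function readDataBlock(scriptEl) {
  const type = (scriptEl.type || "").trim().toLowerCase();
  if (type !== "application/json") {
    throw new TypeError("not a JSON data block: " + type);
  }
  // The browser does not execute a script element with this type,
  // so textContent is simply the raw JSON source text.
  return JSON.parse(scriptEl.textContent);
}
```

With a block such as <script type="application/json" id="settings">, where 
"settings" is an example id, readDataBlock(document.getElementById('settings')) 
would return the parsed object.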

On Mon, 21 Mar 2011, Philip Jägenstedt wrote:
> On http://foolip.org/microdatajs/live/#json I have a "Download it!" 
> function which uses data: URLs to save JSON generated by JavaScript. The 
> only real limitation with this approach is that one cannot suggest a 
> file name, so in Opera the suggested file name is "default".
> Are there other ways to save script-generated data that don't involve 
> bouncing against a server? If not, any ideas how one might fix this with 
> data: URLs? The Content-Disposition header solves this problem, but 
> can't be applied here.

I recommend extending the data: spec to have a filename="" parameter.
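
For context, a data: URL of the kind Philip describes can be built in a 
few lines. This is only a sketch: jsonDataUrl is a made-up name, and it 
deliberately includes no filename parameter, since that is a proposal and 
not something the data: scheme currently defines.

```javascript
// Build a data: URL carrying JSON text.
// encodeURIComponent percent-encodes characters that are not safe in a URL,
// including the comma, so the first comma always separates header from payload.
function jsonDataUrl(value) {
  const json = JSON.stringify(value);
  return "data:application/json;charset=utf-8," + encodeURIComponent(json);
}
```

Navigating to such a URL lets the user save the data, but nothing in the 
URL can suggest a file name, which is the limitation under discussion.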

On Mon, 21 Mar 2011, Boris Zbarsky wrote:
> For your particular use case, a "disposition" attribute on <a> or 
> something would also work; I'm pretty sure that this has come up before 
> as well.

Is it a common enough use case to warrant an extension to HTML?

On Fri, 15 Apr 2011, Justin Karneges wrote:
> I'm desiring a way to markup "mentions of a person" semantically within 
> HTML, for use in an open standard.  Think of a more rich form of the 
> @person convention used on Twitter and elsewhere:
> <p>@justin I totally agree</p>
> My first thought was to use a data-* attribute.  For example:
> <p><a href="http://example.org/justin/" 
> data-mention-id="acct:justin@example.org" 
> data-mention-context="reply">justin</a> I totally agree.</p>
> However, the HTML specification says custom data attributes are only to 
> be used privately.  So, I am not sure if it is appropriate to create a 
> public standard whereby independent developers are encouraged to utilize 
> a common data-* attribute.
> Another way is to use Microdata, though I seem to have to hack it a bit 
> to have hidden values:
> <p><a href="http://example.org/justin/" itemscope 
> itemtype="http://example.org/itemtypes/mention" 
> itemid="acct:justin@example.org"><span itemprop="context:reply"/><span 
> itemprop="name">justin</span></a> I totally agree.</p>

I'm not sure exactly what you're trying to do there (the markup seems to 
be invalid and I'm not sure what you intended), but there's no need to 
hack anything to have hidden data in microdata, just use <meta> or <link>.

On Mon, 18 Apr 2011, Justin Karneges wrote:
> Yes, this is meant to be processed by machines, as part of a data 
> exchange protocol.  It is not browser-specific.  For example, this kind 
> of HTML formatting may find its way into an Atom feed, or even an XMPP 
> message.  It is not expected that this format would be shoved directly 
> to a browser for render (although, if someone does that, ideally it 
> should degrade gracefully, hence the use of <a> around the name).
> Here are two things I'd expect apps to do:
>   1) Render the mentions in a special way.  For example, our application 
> shows the mentioned name inside of a colored, button-looking box with an 
> icon image based on the domain of the person being mentioned.  This kind 
> of presentation-level detail would not be encoded in the HTML itself.
>   2) Keep display names up to date.  In the event that a user changes 
> his/her name, but the account id is not changed, future replays or 
> retransmissions of this HTML may contain different name text (the 
> 'justin' part in my example).  For example, an aggregator may track name 
> changes, and update its cached HTML accordingly rather than holding onto 
> stale names.
> Regarding #2, it may also be useful for servers that persist this data 
> to do so without saving any name text at all (imagine the 'a' element in 
> the earlier example having no cdata).  Whenever the HTML blob is 
> extracted from the db, it would need to be stamped with the name of the 
> mentioned user before sending out to a client.
> So I take it that using data-* for this is not recommended?

For this kind of thing, I'd recommend a Microformat or microdata.

On Mon, 18 Apr 2011, Justin Karneges wrote:
> Now we have this:
> <p>
>   <span itemscope itemtype="http://data-vocabulary.org/Person" 
> itemid="acct:justin@example.org">
>     <meta itemprop="context" content="reply"/>
>     <a itemprop="url" href="http://example.org/justin/"><span 
> itemprop="name">justin</span></a>
>   </span> I totally agree.
> </p>

Yup, that works.

On Tue, 26 Apr 2011, Benjamin Hawkes-Lewis wrote:
> So the extractable data is: "the United Nations is the source of the 
> quotation 'We the Peoples of the United Nations determined to save 
> succeeding generations from the scourge of war, which twice in our 
> lifetime has brought untold sorrow to mankind...'"?
> Microdata (like microformats) is supposed to encourage visible data 
> rather than hidden metadata. Both your markup examples seem to express 
> authorship of the quotation through hidden metadata. A visible data 
> approach might look something like:
>   <blockquote
>     cite="http://www.un.org/en/documents/charter/preamble.shtml"
>     itemscope
>     itemtype="http://tei-vocabulary.example.com/cit"
>     itemprop="quote">
>       We the Peoples of the
>       <span itemprop="bibl">
>         <span itemprop="author"
>               itemtype="http://tei-vocabulary.example.com/orgName">

This itemtype="" is invalid, for what it's worth. (itemtype="" only makes 
sense on an element with itemscope="".)

>           United Nations
>         </span>
>       </span> determined to save succeeding
>       generations from the scourge of war, which twice in our lifetime has
>       brought untold sorrow to mankind...
>   </blockquote>
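
The itemtype="" problem noted above is easy to check for mechanically. The 
sketch below is a hypothetical lint helper, not part of any validator; the 
function name is made up for illustration.

```javascript
// Hypothetical lint helper: return the elements that carry itemtype but not itemscope.
// Accepts any iterable of elements, e.g. document.querySelectorAll("[itemtype]").
function itemtypeWithoutItemscope(elements) {
  return Array.from(elements).filter(
    (el) => el.hasAttribute("itemtype") && !el.hasAttribute("itemscope")
  );
}
```

Run against the markup quoted above, this would be expected to flag the 
inner span with itemprop="author" and leave the blockquote alone.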

On Wed, 27 Apr 2011, Brett Zamir wrote:
> Thanks for the references. While this may be relevant for the likes of 
> blogs and other documents whose requirements for semantic density is 
> limited enough to allow such reshaping for practical effect and whose 
> content is reshapeable by the content creator (as opposed to 
> republishing of already completed books), for more semantically dense 
> content, such as the types of classical documents marked up by TEI, it 
> is simply not possible to expose text for each bit of semantic 
> information or to generate new text to meet that need. And of course, 
> even with microformats/microdata as it is now, the semantic content 
> itself is not necessarily exposed just because text is visible on the 
> page.
> The issue of discoverability is I think more related to how it will be 
> consumed or may be consumed. And even if some pieces of information are 
> less discoverable, it does not mean that they have no value. For such 
> rich documents, a lot of attention is being paid to these texts since 
> they are themselves considered important enough to be worth the time.
> If the Declaration of Independence of the United States was marked up 
> with hidden information about prior emendations, their likely reasons, 
> etc., or about suspected authors of particular passages, or the United 
> Nations Declaration of Human Rights were marked up to indicate which 
> countries have expressed reservations (qualifications) about which 
> rights, while a browsing application or query tool ought to be able 
> (optionally) expose this hidden information, there is no automatic need 
> for the markup to be polluted with extra (hidden) (and especially 
> URI-based or other non-textual) tags when an attribute would suffice.
> For things that are truly important, there may be a great deal of care 
> put into building up many layers which are meant to be peeled away, and 
> it is worth allowing some of that information (particularly the 
> non-textual information, e.g., the conditions of authorship, publisher, 
> etc.), especially which the original publication did not expose, to be 
> still selectively revealed to queries and deeper browsing.
> If a site like Wikisource (the online library sister project of 
> Wikipedia's) would be able to offer such officially sanctioned semantic 
> attributes, classic texts could become enhanced in this way over time, 
> with the wiki exposing the hidden semantic information, which indeed may 
> not be as important as the visible text, but with queries by interested 
> users, any problems in encoding could be discovered just as well.
> While I know most hip web authors and developers are minimalists, can't 
> we all just get along? Can't those of us interested in such richness, 
> and with a view to progressively enhancing documents into the far 
> future, also be welcomed into the web?

I'm not really sure what you're proposing here. If you are advocating for 
a change to the HTML specification, could you elaborate on that aspect of 
your idea?

On Thu, 26 May 2011, Tab Atkins Jr. wrote:
> On Thu, May 26, 2011 at 12:02 PM, Guha <guha at google.com> wrote:
> > We are trying to simplify statement of a fairly common thing that crops up
> > with microdata
> >
> > E.g.,
> >
> > Consider the block:
> > 1) <div itemscope itemtype="http://schema.org/Book">
> >      <span itemprop="name">The Catcher in the Rye</span> -
> >     by <span itemprop="hasAuthor">J.D. Salinger</span>
> >   </div>
> >
> > Now, the site wants to use the wikipedia (or freebase) entry for Salinger,
> > just to be clear and wants the value of the  hasAuthor property to be an
> > item with that ID.
> > I believe the following says that:
> >
> > 2) <div itemscope itemtype="http://schema.org/Book">
> >      <span itemprop="name">The Catcher in the Rye</span> -
> >   by <a href="http://en.wikipedia.org/wiki/J._D._Salinger"
> >        itemprop="hasAuthor">J.D. Salinger</a>
> >   </div>

That says that the "hasAuthor" is the URL 
"http://en.wikipedia.org/wiki/J._D._Salinger". The vocabulary could 
further define that this URL is the itemid="" of an item that, if found, 
contains information about that "hasAuthor"; that's up to the person 
specifying the vocabulary.
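
To make the rule concrete, here is a simplified sketch of how a microdata 
consumer picks a property's value from the element carrying itemprop="". 
It covers only a handful of element types; the name propertyValue is made 
up, and a real implementation would also resolve relative URLs and treat 
an element with itemscope="" as a nested item.

```javascript
// Simplified sketch of microdata property-value selection.
// URL-carrying elements yield their URL attribute, <meta> yields content,
// and everything else falls back to the element's text.
function propertyValue(el) {
  const tag = el.tagName.toLowerCase();
  if (tag === "meta") {
    return el.getAttribute("content") || "";
  }
  if (tag === "a" || tag === "area" || tag === "link") {
    return el.getAttribute("href") || "";
  }
  if (["audio", "embed", "iframe", "img", "source", "video"].includes(tag)) {
    return el.getAttribute("src") || "";
  }
  return el.textContent;
}
```

This is why putting itemprop="hasAuthor" on an <a> gives the URL as the 
value, while putting it on a <span> gives the text "J.D. Salinger".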

> > Often, the site does not want to link out to the wikipedia (or other
> > canonical url) page, but only specify
> > it in the microdata. This can be done by:
> >
> > 3) <div itemscope itemtype="http://schema.org/Book">
> >     <span itemprop="name">The Catcher in the Rye</span> -
> >   by <span itemscope
> >        itemid="http://en.wikipedia.org/wiki/J._D._Salinger"
> >        itemprop="hasAuthor">J.D. Salinger</span>
> >  </div>

itemid="" currently only makes sense in the context of a vocabulary (given 
by itemtype=""), and currently only the vocabulary of the item itself can 
define it, not the vocabulary of the "parent" item (as in this example). 

But it's unnecessary to go to these lengths. To get the same effect as 
above but with the link hidden, one can just use <link>, as Tab suggested:

> The correct way to solve this case is with markup like this:
> <div itemscope itemtype="http://schema.org/Book">
>   <span itemprop="name">The Catcher in the Rye</span> -
>   by J.D. Salinger
>   <link itemprop="hasAuthor" href="http://en.wikipedia.org/wiki/J._D._Salinger">
> </div>


Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
