[whatwg] Helping people searching for content filtered by license

Tue Jul 7 16:20:46 PDT 2009

On Tue, 9 Jun 2009, Jeff Walden wrote:
> On 8.6.09 17:33, Ian Hickson wrote:
> > >
> > > - Search engines shouldn't be the gatekeeper when it comes to "valid"
> > >   and "invalid" licenses. New licenses shouldn't be discouraged as
> > >   they're vital to keep up with ever changing laws around the world. I
> > >   don't want to wait around for search engines to decide that supporting
> > >   a particular license is in their best interest.
> > 
> > New licenses absolutely need to be discouraged. License proliferation is a
> > huge problem. Any solution we come up with here must absolutely be
> > designed in such a way as to make introducing a new license have a high
> > cost, because each new license causes further fragmentation of the
> > creative world.
> 
> Notwithstanding that I agree that the world has a surfeit of licenses and
> generally gains little from new ones:
> 
> What core principles for the technical design of HTML and the web dictate that
> license proliferation is a problem?

The "Do the right thing" principle. :-)

> Are there reasons to think that HTML5's implicit encouragement or 
> discouragement of recognizing (or having the capability to recognize) 
> many licenses would be effective at discouraging license creation, or 
> more effective than more targeted actions in a separate venue to combat 
> creation of new licenses?

I don't think it would be more effective than more targeted actions, but I 
do think that if HTML5 makes it easy to introduce new licenses, this 
will encourage new licenses to exist.

> Maybe I'm the only person who thinks it (I'd like to hope I'm merely the 
> only person to say it, unless I've missed its mention in the past), but 
> this feels like mission creep to me.

I don't think it's mission creep; it just helps us figure out what our 
design for a feature should be.

On Wed, 10 Jun 2009, Eduard Pascual wrote:
> On Fri, May 8, 2009 at 9:57 PM, Ian Hickson <ian at hixie.ch> wrote:
> >
> > This has some implications:
> >
> > - Each unit of content (recipe in this case) must have its own
> >   independent page at a distinct URL. This is actually good practice
> >   anyway today for making content discoverable from search engines, 
> >   and it is compatible with what people already do, so this seems 
> >   fine.
> 
> This is, on a wide range of cases, entirely impossible: while it might 
> work, and maybe it's even good practice, for contents that can be 
> represented on the web as a HTML document, it is not achievable for many 
> other formats.

Fair enough. Let's look at the use case again then:

   http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019668.html

...but adding two requirements; first the one from you:

 * It should be possible to find works that do not have their own pages, 
   e.g. images in a library, or software.

...and next a requirement that falls out of this, which isn't needed when 
resources have their own page, but is needed when multiple resources 
share a page:

 * Each resource needs to have the following information:
    - The URL to the resource, for identification purposes
    - The identifier of the license that applies to the resource
    - The name of the resource, for display in search results

We might also want to expose other data; the name of the author for 
instance would likely be useful.
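
As a rough illustration of the per-resource information listed above, here 
is a minimal Python sketch of the record a consumer might keep for each 
work. The class name, field names, and example.org URLs are all 
hypothetical; nothing here is defined by the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkRecord:
    """One licensed resource, possibly sharing a page with others (hypothetical shape)."""
    url: str                      # URL of the resource, for identification
    license: str                  # identifier (URL) of the license that applies
    name: str                     # name of the resource, for display in search results
    author: Optional[str] = None  # optional extra data, e.g. the author's name


# Two resources on the same page, each under its own (made-up) license.
records = [
    WorkRecord("http://example.org/a.png", "http://example.org/licenses/x",
               "Photo A", author="Alice"),
    WorkRecord("http://example.org/b.png", "http://example.org/licenses/y",
               "Photo B"),
]
```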

Let's look at some ideas for how to address this. In the following markup 
snippets, I've used the following shortcuts:

  S = URL of the resource
  L = URL of the license
  B = license-required boilerplate text (typically name of license)
  T = name of resource
  A = name of author

Here are some ideas, along with brief commentary:

 * Idea 1: New attributes

    <img alt="" src="S" author="A" title="T" license="L">
    B

    - doesn't extend very well
    - lots of hidden metadata

 * Idea 2: Extending <figure>

    <figure>
     <img alt="" src="S">
     <legend>
      <cite>T</cite>
      <credit>A</credit>
      <small><a href="L" rel="fig-license">B</a></small>
     </legend>
    </figure>

    - doesn't extend very well
    - very likely to be ambiguous (e.g. legend could use <cite> to refer to other works)

 * Idea 3: RDFa and ccREL

    <div about="S" xmlns:http="http:">
     <img alt="" src="S">
     <span property="http://purl.org/dc/elements/1.1/title">T</span>
     <span property="http://creativecommons.org/ns#attributionName">A</span>
     <a rel="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
    </div>

    - brittle: duplicates the "S" URL
    - a bit verbose

    <div typeof="" xmlns:http="http:">
     <span rel="http://www.w3.org/2002/07/owl#sameAs"><img alt="" src="S"></span>
     <!-- (this "works" though it relies on owl:sameAs inferencing) -->
     <span property="http://purl.org/dc/elements/1.1/title">T</span>
     <span property="http://creativecommons.org/ns#attributionName">A</span>
     <a rel="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
    </div>

    - relies on advanced RDF features, likely to be misimplemented or confused
    - a bit verbose

    <object data="S" xmlns:http="http:" alt="">
     <span property="http://purl.org/dc/elements/1.1/title">T</span>
     <span property="http://creativecommons.org/ns#attributionName">A</span>
     <a rel="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
    </object>

    - the contents of <object> are supposed to be fallback
    - a bit verbose

 * Idea 4: Microdata and ccREL

    <div item>
     <img itemprop="about" alt="" src="S">
     <cite itemprop="http://purl.org/dc/elements/1.1/title">T</cite>
     <span itemprop="http://creativecommons.org/ns#attributionName">A</span>
     <a itemprop="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
    </div>

    - a bit verbose

 * Idea 5: Microdata with new vocabulary

    <div item="work">
     <img itemprop="about" alt="" src="S">
     <cite itemprop="title">T</cite>
     <span itemprop="author">A</span>
     <a itemprop="license" href="L">B</a>
    </div>

    <figure item="work">
     <img itemprop="about" alt="" src="S">
     <legend>
      <cite itemprop="title">T</cite>
      <span itemprop="author">A</span>
      <small><a itemprop="license" href="L">B</a></small>
     </legend>
    </figure>

    - This could be defined as mapping to the previous one in the RDF
      conversion, so we could maintain the ccREL compatibility even
      with this.

 * Idea 6: EXIF

    - Requires editing the resource
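
To make the Idea 5 markup more concrete, here is a minimal Python sketch of 
how a consumer might pull the properties out of a page. It assumes the 
draft syntax used in this message (an "item" attribute on the container and 
"itemprop" on descendants) and is not the spec's extraction algorithm: it 
ignores nested items, takes the value of <img> and <a> from src and href, 
and takes the text content for everything else. The class name, function 
name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class WorkExtractor(HTMLParser):
    """Collects one dict per element that carries an `item` attribute."""

    URL_ATTRS = {"img": "src", "a": "href"}  # elements whose value is a URL
    VOID = {"img", "br", "hr", "input", "link", "meta"}  # no end tag

    def __init__(self):
        super().__init__()
        self.items = []        # extracted items, in document order
        self.depth = 0         # open-element depth inside the current item
        self.text_prop = None  # property currently collecting text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Outside any item: only an element with `item` is interesting.
            if "item" in attrs:
                self.items.append({"type": attrs["item"]})
                self.depth = 1
            return
        if tag not in self.VOID:
            self.depth += 1
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag in self.URL_ATTRS:
            self.items[-1][prop] = attrs.get(self.URL_ATTRS[tag])
        else:
            self.items[-1][prop] = ""
            self.text_prop = prop

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in self.VOID:
            return
        self.depth -= 1
        self.text_prop = None  # simplification: text ends at the first end tag

    def handle_data(self, data):
        if self.depth and self.text_prop:
            self.items[-1][self.text_prop] += data


def extract_works(html):
    parser = WorkExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


PAGE = """
<figure item="work">
 <img itemprop="about" alt="" src="http://example.org/pie.jpg">
 <legend>
  <cite itemprop="title">Apple Pie</cite>
  <span itemprop="author">Alice</span>
  <small><a itemprop="license" href="http://example.org/licenses/example">Example License</a></small>
 </legend>
</figure>
"""

works = extract_works(PAGE)
```

Here "works" holds a single dict with the keys type, about, title, author, 
and license, which is the per-resource information the requirement above 
asks for.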

Of these, the simplest is to just extend <img> (and <video>, and 
<article>), but it suffers from the hidden metadata problem.

The next simplest seems to be the microdata vocabulary, with a defined 
mapping to ccREL.
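
A mapping of this kind could be as small as a lookup table. The Python 
sketch below reuses the property URIs from the Idea 4 snippet; the table 
name, function name, and sample data are hypothetical, and the mapping the 
spec actually defines may differ in detail.

```python
# Short property names from Idea 5 mapped to the URIs used in Idea 4.
CCREL_PROPERTY = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "author": "http://creativecommons.org/ns#attributionName",
    "license": "http://www.w3.org/1999/xhtml/vocab#license",
}


def to_triples(work):
    """Turn one work (short property names) into (subject, predicate, object) triples.

    The "about" property supplies the subject; unknown properties are skipped.
    """
    subject = work["about"]
    return [
        (subject, CCREL_PROPERTY[name], value)
        for name, value in work.items()
        if name in CCREL_PROPERTY
    ]


# Hypothetical work, as a consumer might have extracted it from a page.
work = {
    "about": "http://example.org/pie.jpg",
    "title": "Apple Pie",
    "author": "Alice",
    "license": "http://example.org/licenses/example",
}
triples = to_triples(work)
```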

Let's see how it handles the scenarios and requirements:

   SCENARIOS:

     * If a user is looking for recipes of pies to reproduce on his
       blog, he might want to exclude from his results any recipes
       that are not available under a license allowing non-commercial
       reproduction.

A search engine with knowledge of which licenses allow non-commercial 
reproduction can restrict results to only matching resources.
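
A minimal Python sketch of that filtering step follows. The table of 
license permissions, the function name, and the sample results are 
assumptions made for illustration; a real search engine would maintain its 
own list. Treating an unrecognised license as not allowing reproduction is 
also an assumption of this sketch.

```python
# Hypothetical knowledge base: license URL -> allows non-commercial reproduction?
ALLOWS_NONCOMMERCIAL_REPRODUCTION = {
    "http://creativecommons.org/licenses/by/3.0/": True,
    "http://creativecommons.org/licenses/by-nc/3.0/": True,
    "http://example.org/licenses/all-rights-reserved": False,
}


def filter_results(results):
    """Keep only results whose license is known to allow non-commercial reproduction."""
    return [
        r for r in results
        if ALLOWS_NONCOMMERCIAL_REPRODUCTION.get(r["license"], False)
    ]


# Made-up search results for "pie recipe".
results = [
    {"title": "Apple Pie",
     "license": "http://creativecommons.org/licenses/by-nc/3.0/"},
    {"title": "Cherry Pie",
     "license": "http://example.org/licenses/all-rights-reserved"},
    {"title": "Pecan Pie",
     "license": "http://example.org/licenses/brand-new-license"},  # unknown to the tool
]
reusable = filter_results(results)
```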

     * Lucy wants to publish her papers online. She includes an
       abstract of each one in a page, but because they are under
       different copyright rules, she needs to clarify what the rules
       are. A harvester such as the Open Access project can actually
       collect and index some of them with no problem, but may not be
       allowed to index others. Meanwhile, a human finds it more
       useful to see the abstracts on a page than have to guess from a
       bunch of titles whether to look at each abstract.

This is now possible without new pages, if the harvester knows which 
licenses allow collection.

     * There are mapping organisations and data producers and people
       who take photos, and each may place different policies. Being
       able to keep that policy information helps people with further
       mashups avoiding violating a policy. For example, if
       GreatMaps.com has a public domain policy on their maps,
       CoolFotos.org has a policy that you can use data other than
       images for non-commercial purposes, and Johan Ichikawa has a
       photo there of my brother's cafe, which he has licensed as
       "must pay money", then it would be reasonable for me to copy
       the map and put it in a brochure for the cafe, but not to copy
       the data and photo from CoolFotos. On the other hand, if I am
       producing a non-commercial guide to cafes in Melbourne, I can
       add the map and the location of the cafe photo, but not the
       photo itself.

This isn't affected; it's a legal issue, not a technological one. So long 
as the licenses are clearly stated, as they presumably must be (for 
example, the MIT license requires the copyright text to follow the text 
even as it is copied, the Creative Commons licenses require the license or 
its URL to be published with any reproductions, etc), there is no need for 
any markup.

     * Tara runs a video sharing web site for people who want
       licensing information to be included with their videos. When
       Paul wants to blog about a video, he can paste a fragment of
       HTML provided by Tara directly into his blog. The video is then
       available inline in his blog, along with any licensing
       information about the video.

This is now possible in a straightforward manner.

     * Fred's browser can tell him what license a particular video on
       a site he is reading has been released under, and advise him on
       what the associated permissions and restrictions are (can he
       redistribute this work for commercial purposes, can he
       distribute a modified version of this work, how should he
       assign credit to the original author, what jurisdiction the
       license assumes, whether the license allows the work to be
       embedded into a work that uses content under various other
       licenses, etc).

This is now possible, assuming that the browser knows the license.

     * Flickr has images that are CC-licensed, but the pages
       themselves are not.

This is already handled by the rules for rel=license, but the license of 
the image itself can be even more clearly given with the microdata.

     * Blogs may wish to reuse CC-licensed images without licensing
       the whole blog as CC, but while still including attribution and
       license information (which may be required by the licenses in
       question).

This is also possible.

Let's look at the requirements:

   REQUIREMENTS:

     * Content on a page might be covered by a different license than
       other content on the same page.

Same as the scenario above.

     * When licensing a subpart of the page, existing implementations
       must not just assume that the license applies to the whole page
       rather than just part of it.

Existing implementations will see nothing, so that is handled easily. 
ccREL implementations will immediately support the new vocabulary once the 
mapping of HTML to RDF described in the spec is implemented (less than 
24 hours' work, apparently).

     * License proliferation should be discouraged.

License proliferation is not really _discouraged_, but it isn't 
encouraged, since new licenses wouldn't be known by the various tools.

     * License information should be able to survive from one site to
       another as the data is transferred.

This is now possible, assuming the microdata is also transferred.

     * Expressing copyright licensing terms should be easy for content
       creators, publishers, and redistributors to provide.

It's as easy as writing a license has always been, I guess.

     * It should be more convenient for the users (and tools) to find
       and evaluate copyright statements and licenses than it is
       today.

I guess it's easier than rel=license, insofar as rel=license required new 
pages, and this doesn't. I don't know if that's what was meant, though. 
The actual processing model is more complex now.

     * Shouldn't require the consumer to write XSLT or server-side
       code to process the license information.

Some code is necessary to make a search engine index license microdata, 
but that seems inevitable and, in practice, is a small amount of code 
relative to the rest of the code involved in an indexing project.

     * Machine-readable licensing information shouldn't be on a
       separate page than human-readable licensing information.

It's on the same page.

     * There should not be ambiguous legal implications.

There are, as far as I can tell, no legal implications at all.

     * Parsing rules should be unambiguous.

Microdata parsing rules are unambiguous.

     * Should not require changes to HTML5 parsing rules.

HTML5 parsing isn't affected.

     * It should be possible to find works that do not have their own
       pages, e.g. images in a library, or software.

This is now handled.

     * Each resource needs to have the following information:
        - The URL to the resource, for identification purposes
        - The identifier of the license that applies to the resource
        - The name of the resource, for display in search results

This is now handled.

I've added this vocabulary to the spec.

On Wed, 10 Jun 2009, Tab Atkins Jr. wrote:
> 
> I think it's fundamentally rare to have a bunch of resources that (a) 
> *only* exist grouped together on a single page, and (b) need different 
> licenses.

Apparently you and I are the only ones who believe this. :-(

(In addition to some of the e-mails on-list about this, I've also received 
a lot of off-list feedback to the same effect as Eduard's feedback.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'