[whatwg] Helping people searching for content filtered by license

Sat May 9 21:11:22 PDT 2009

Ian Hickson wrote:
> The scenarios described above fall into three categories: searching for 
> content, publishing content, and obtaining legal advice.

Ian, these use cases and your responses to them need to be put up on a
wiki somewhere - not many people have the time to read through 20 use
cases each with numerous scenarios/requirements and argumentation.

It's going to be very difficult for anybody other than you to piece
together what has been addressed and what has not been addressed.

In addition, transparency and ease of access during this process should
be a requirement. Transparency cannot happen unless you post all of the
information that you have to some permanent location.

Both Shelley Powers and I have offered to help do this. I strongly
suggest you take us up on the offer, as we want to be thorough and
careful as we go through these use cases. Most importantly, we want to
start generating preliminary documentation for HTML5's microdata facilities.

> First, I will examine the search scenario:
> 
>      * If a user is looking for recipes of pies to reproduce on his blog, he
>        might want to exclude from his results any recipes that are not
>        available under a license allowing non-commercial reproduction.
> 
> This is technically possible today. The rel="license" link type allows 
> authors to specify the license that applies to the main content on a page, 
> in this case recipes, search engines can be programmed with the most 
> common licenses, and the user can tell the search engine what 
> characteristics he wants ("compatible with GPLv2", "no advertising 
> clause", "doesn't have patent implications", "allows redistribution to 
> countries on the US blacklist").
> 
> This has some implications:
> 
>  - Each unit of content (recipe in this case) must have its own 
>    independent page at a distinct URL. This is actually good practice 
>    anyway today for making content discoverable from search engines, and 
>    it is compatible with what people already do, so this seems fine.
> 
>  - New licenses are discouraged, as they would not be automatically 
>    supported by search engines. This is needed by one of the requirements:
> 
>      * License proliferation should be discouraged.
> 
> This solution is already deployed on such sites as Flickr, and already 
> supported on search engines such as Google.

This isn't a solution to the problem because of the following issues:

- Search engines shouldn't be the gatekeepers when it comes to "valid"
  and "invalid" licenses. New licenses shouldn't be discouraged, as
  they're vital to keep up with ever-changing laws around the world. I
  don't want to wait around for search engines to decide that supporting
  a particular license is in their best interest.

- You're forcing recipe page authors to publish their documentation in a
  very specific way (one recipe per page). That seems like a very
  arbitrary limitation. What if I have a goat cheese recipe
  (attribution, non-commercial) and a jam recipe
  (attribution, share-alike) that pairs very well with the cheese?
  Personally, I'd want to put both recipes on the same page, but in
  HTML5 rel="license" only applies to the "main content" of the page.
  Why is HTML5 forcing me to publish one recipe per page if I want to
  ensure that my license information is machine-readable?
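
To make the second point concrete, here is a rough sketch of per-item
licensing on a single page. It assumes RDFa-style markup in which an
"about" attribute names the item and a nested rel="license" link applies
to that item. The page, the fragment identifiers and the helper names are
all made up for illustration, and the scanner is a simplification, not a
conforming RDFa processor (it does not resolve URLs or handle void
elements such as <br>).

```python
from html.parser import HTMLParser

# Hypothetical page: two recipes on one page, each with its own license.
PAGE = """
<div about="#goat-cheese">
  <h2>Goat cheese</h2>
  <a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/">BY-NC</a>
</div>
<div about="#jam">
  <h2>Jam</h2>
  <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">BY-SA</a>
</div>
"""


class LicenseScanner(HTMLParser):
    """Collects rel="license" links, keyed by the nearest enclosing "about"."""

    def __init__(self):
        super().__init__()
        self.subjects = [""]  # stack of subjects; "" stands for the page itself
        self.licenses = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element without "about" inherits the subject of its parent.
        subject = attrs.get("about", self.subjects[-1])
        self.subjects.append(subject)
        if "license" in (attrs.get("rel") or "").split():
            self.licenses[subject] = attrs.get("href")

    def handle_endtag(self, tag):
        # Assumes well-nested markup with no void elements.
        if len(self.subjects) > 1:
            self.subjects.pop()


def licenses_by_item(html):
    """Return a dict mapping each item on the page to its license URL."""
    scanner = LicenseScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.licenses
```

Calling licenses_by_item(PAGE) gives one license per recipe rather than
one license for the whole page, which is the distinction that a
page-level rel="license" cannot express.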

> Next, I will look at the content publishing scenarios:
> 
>      * Lucy wants to publish her papers online. She includes an abstract of
>        each one in a page, but because they are under different copyright
>        rules, she needs to clarify what the rules are. A harvester such as
>        the Open Access project can actually collect and index some of them
>        with no problem, but may not be allowed to index others. Meanwhile, a
>        human finds it more useful to see the abstracts on a page than have to
>        guess from a bunch of titles whether to look at each abstract.
> 
> This really boils down to two points:
> 
>  - Being able to include the license of various items on a page for humans 
>    to read.
> 
>  - Being able to control what harvesters (spiders) index.
> 
> Being able to include a license on a page is easy: you just include the 
> license name and a link to the license. Since this is for the user in this 
> case, there is no need for any special markup.

Including a link to a license is very different from making that license
automatically machine-readable. Expressing, to the spider, what rights
it has on that particular page is not as simple as linking to a license.
Sure, you can hard-code certain rights and associate them with certain
license URLs, but that's a hack, not a good general-purpose solution.
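
As a sketch of what that hard-coding looks like in practice: the table
below is hypothetical, and the right names ("reproduce",
"commercial-use", "derivatives") are labels I made up for illustration,
not terms from any license or specification.

```python
# Hypothetical hard-coded table of the kind a spider would have to carry.
# The right names are illustrative labels, not a legal summary.
KNOWN_LICENSES = {
    "http://creativecommons.org/licenses/by/2.0/":
        {"reproduce", "commercial-use", "derivatives"},
    "http://creativecommons.org/licenses/by-nc/2.0/":
        {"reproduce", "derivatives"},
}


def rights_for(license_url):
    """Return the hard-coded rights for a license URL, or None if unknown."""
    return KNOWN_LICENSES.get(license_url)
```

Any license that is not in the table, or even a different URL for a
license that is (for example one ending in "deed.en"), comes back as
None, so the spider learns nothing from the link.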

> Controlling harvesters is a separate problem. This is actually a 
> well-understood problem space with a number of very well-understood 
> solutions. For site-wide control, there is robots.txt, which can target 
> individual spiders (as in this case). On a page-by-page basis, there is 
> the <meta> element's "noindex" value.

Not all authors have access to the robots.txt file (in fact, the
majority of researchers I know use a CMS and do not have any access to
robots.txt). Nor can all researchers edit the HEAD of the document to
add a <meta> element. So, this isn't a viable general-purpose solution.

> Thus this particular scenario doesn't require any new features.
> 
>      * There are mapping organisations and data producers and people who take
>        photos, and each may place different policies. Being able to keep that
>        policy information helps people with further mashups avoiding
>        violating a policy. For example, if GreatMaps.com has a public domain
>        policy on their maps, CoolFotos.org has a policy that you can use data
>        other than images for non-commercial purposes, and Johan Ichikawa has
>        a photo there of my brother's cafe, which he has licensed as "must pay
>        money", then it would be reasonable for me to copy the map and put it
>        in a brochure for the cafe, but not to copy the data and photo from
>        CoolFotos. On the other hand, if I am producing a non-commercial guide
>        to cafes in Melbourne, I can add the map and the location of the cafe
>        photo, but not the photo itself.
> 
> This doesn't seem to require any technological solution at all; it seems 
> to be purely a legal issue. So long as the licenses are clearly stated, as 
> they presumably must be (for example, the MIT license requires the 
> copyright text to follow the text even as it is copied, the Creative 
> Commons licenses require the license or its URL to be published with any 
> reproductions, etc), there is no need for any markup.

Read the scenario more closely: its goal is to "help
people with further mashups [to avoid] violating a policy". This means
that all of the objects in the scenario must be tagged with licensing
information of some kind. As explained above, HTML5 doesn't
currently have that ability via rel="license". A technological
solution could tell people, when they download the image, what rights
are associated with it. The operating system could even
associate the licensing information with the image when it is stored to
the local file system.

This is not just a legal issue - there are ways that this information
could help people comply with legal policies. I find Flickr's
"Attribution Creative Commons licenses" search feature incredibly useful
- it would be even more useful if all documents on the web had that
information associated with them.

>      * Tara runs a video sharing web site for people who want licensing
>        information to be included with their videos. When Paul wants to blog
>        about a video, he can paste a fragment of HTML provided by Tara
>        directly into his blog. The video is then available inline in his
>        blog, along with any licensing information about the video.
> 
> (Really? A video sharing site dedicated to people who want licensing 
> information to be included with their videos? That's a pretty specific 
> audience, wow.)

It's not specific at all - video rights clearinghouses are a big
business in news, documentaries, television and blockbuster movies.
There are hundreds of companies just like this one:

http://www.thoughtequity.com/

> This can be done with HTML5 today. For example, here is the markup you 
> could include to allow someone to embed a video on their site while 
> including the copyright or license information:
> 
>    <figure>
>     <video src="http://example.com/videodata/sJf-ulirNRk" controls>
>      <a href="http://video.example.com/watch?v=sJf-ulirNRk">Watch</a>
>     </video>
>     <legend>
>      Pillar post surgery, starting to heal.
>      <small>&copy; copyright 2008 Pillar. All Rights Reserved.</small>
>     </legend>
>    </figure>

That's not machine-readable. Where in the HTML5 spec does it say that
"license information for videos will be placed in a SMALL element"? Even
if that were the case, how would a machine be able to extract the rights
associated with the license?

>      * Flickr has images that are CC-licensed, but the pages themselves are
>        not.

Flickr already uses RDFa to express the rights for the images, which is
impossible to do in HTML5:

http://www.flickr.com/photos/vegaseddie/3339507278/

More specifically, these are a few of the RDFa triples embedded in that
page that contain information pertinent to the license (in Turtle
syntax):

<http://www.flickr.com/photos/vegaseddie/3339507278/>
     cc:attributionURL <http://www.flickr.com/photos/vegaseddie/> ;
     cc:license <http://creativecommons.org/licenses/by/2.0/deed.en> ;
     dc:creator <http://www.flickr.com/photos/vegaseddie/> ;
     dc:title "Baboon" ;
     xhv:license <http://creativecommons.org/licenses/by/2.0/deed.en> .
<http://www.flickr.com/photos/vegaseddie/> foaf:name "Paolo Camera" .

(Triples extracted via pyRdfa: http://www.w3.org/2007/08/pyRdfa/)
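
To show why this counts as machine-readable, here is a minimal sketch
that holds those triples as plain tuples and builds an attribution line
from them. The helper names are hypothetical and the prefixed names are
kept as opaque strings; a real consumer would use an RDF library and
full URIs.

```python
PHOTO = "http://www.flickr.com/photos/vegaseddie/3339507278/"
OWNER = "http://www.flickr.com/photos/vegaseddie/"
CC_BY = "http://creativecommons.org/licenses/by/2.0/deed.en"

# The triples listed above, as (subject, predicate, object) tuples.
TRIPLES = [
    (PHOTO, "cc:attributionURL", OWNER),
    (PHOTO, "cc:license", CC_BY),
    (PHOTO, "dc:creator", OWNER),
    (PHOTO, "dc:title", "Baboon"),
    (PHOTO, "xhv:license", CC_BY),
    (OWNER, "foaf:name", "Paolo Camera"),
]


def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def attribution_line(photo):
    """Build a human-readable credit from the machine-readable triples."""
    title = objects(photo, "dc:title")[0]
    creator = objects(photo, "dc:creator")[0]
    name = objects(creator, "foaf:name")[0]
    license_url = objects(photo, "cc:license")[0]
    return f'"{title}" by {name} ({creator}), licensed under {license_url}'
```

The point is that attribution_line(PHOTO) needs no knowledge of Flickr's
page layout: the title, the creator's name and the license all come
straight from the data.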

> I've clarified the HTML5 spec's definition of rel=license and included an 
> example showing a page based on what Flickr is doing.

Just changing rel="license" is not going to be sufficient. You will have
to express the same information, as shown above, in HTML5 if you would
like to support the same functionality that Flickr supports today.

This e-mail only covers the first third of this single use case - as you
can see, there are still many issues to resolve. I'll go through the
rest of this use case e-mail when I have some more free time... hope
this helps. :)

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: A Collaborative Distribution Model for Music
http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/