[whatwg] Helping people seaching for content filtered by license

Ian Hickson ian at hixie.ch
Fri May 8 12:57:31 PDT 2009


One of the use cases I collected from the e-mails sent in over the past 
few months was the following:

   USE CASE: Help people searching for content to find content covered by
   licenses that suit their needs.

   SCENARIOS:
     * If a user is looking for recipes of pies to reproduce on his blog, he
       might want to exclude from his results any recipes that are not
       available under a license allowing non-commercial reproduction.
     * Lucy wants to publish her papers online. She includes an abstract of
       each one in a page, but because they are under different copyright
       rules, she needs to clarify what the rules are. A harvester such as
       the Open Access project can actually collect and index some of them
       with no problem, but may not be allowed to index others. Meanwhile, a
       human finds it more useful to see the abstracts on a page than have to
       guess from a bunch of titles whether to look at each abstract.
     * There are mapping organisations and data producers and people who take
       photos, and each may place different policies. Being able to keep that
       policy information helps people with further mashups avoiding
       violating a policy. For example, if GreatMaps.com has a public domain
       policy on their maps, CoolFotos.org has a policy that you can use data
       other than images for non-commercial purposes, and Johan Ichikawa has
       a photo there of my brother's cafe, which he has licensed as "must pay
       money", then it would be reasonable for me to copy the map and put it
       in a brochure for the cafe, but not to copy the data and photo from
       CoolFotos. On the other hand, if I am producing a non-commercial guide
       to cafes in Melbourne, I can add the map and the location of the cafe
       photo, but not the photo itself.
     * Tara runs a video sharing web site for people who want licensing
       information to be included with their videos. When Paul wants to blog
       about a video, he can paste a fragment of HTML provided by Tara
       directly into his blog. The video is then available inline in his
       blog, along with any licensing information about the video.
     * Fred's browser can tell him what license a particular video on a site
       he is reading has been released under, and advise him on what the
       associated permissions and restrictions are (can he redistribute this
       work for commercial purposes, can he distribute a modified version of
       this work, how should he assign credit to the original author, what
       jurisdiction the license assumes, whether the license allows the work
       to be embedded into a work that uses content under various other
       licenses, etc).
     * Flickr has images that are CC-licensed, but the pages themselves are
       not.
     * Blogs may wish to reuse CC-licensed images without licensing the whole
       blog as CC, but while still including attribution and license
       information (which may be required by the licenses in question).

   REQUIREMENTS:
     * Content on a page might be covered by a different license than other
       content on the same page.
     * When licensing a subpart of the page, existing implementations must
       not just assume that the license applies to the whole page rather than
       just part of it.
     * License proliferation should be discouraged.
     * License information should be able to survive from one site to another
       as the data is transfered.
     * Expressing copyright licensing terms should be easy for content
       creators, publishers, and redistributors to provide.
     * It should be more convenient for the users (and tools) to find and
       evaluate copyright statements and licenses than it is today.
     * Shouldn't require the consumer to write XSLT or server-side code to
       process the license information.
     * Machine-readable licensing information shouldn't be on a separate page
       than human-readable licensing information.
     * There should not be ambiguous legal implications.
     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.


The scenarios described above fall into three categories: searching for 
content, publishing content, and obtaining legal advice.

First, I will examine the search scenario:

     * If a user is looking for recipes of pies to reproduce on his blog, he
       might want to exclude from his results any recipes that are not
       available under a license allowing non-commercial reproduction.

This is technically possible today. The rel="license" link type allows 
authors to specify the license that applies to the main content on a page, 
in this case recipes, search engines can be programmed with the most 
common licenses, and the user can tell the search engine what 
characteristics he wants ("compatible with GPLv2", "no advertising 
clause", "doesn't have patent implications", "allows redistribution to 
countries on the US blacklist").

This has some implications:

 - Each unit of content (recipe in this case) must have its own 
   independent page at a distinct URL. This is actually good practice 
   anyway today for making content discoverable from search engines, and 
   it is compatible with what people already do, so this seems fine.

 - New licenses are discouraged, as they would not be automatically 
   supported by search engines. This is needed by one of the requirements:

     * License proliferation should be discouraged.

This solution is already deployed on such sites as Flickr, and already 
supported on search engines such as Google.



Next, I will look at the content publishing scenarios:

     * Lucy wants to publish her papers online. She includes an abstract of
       each one in a page, but because they are under different copyright
       rules, she needs to clarify what the rules are. A harvester such as
       the Open Access project can actually collect and index some of them
       with no problem, but may not be allowed to index others. Meanwhile, a
       human finds it more useful to see the abstracts on a page than have to
       guess from a bunch of titles whether to look at each abstract.

This really boils down to two points:

 - Being able to include the license of various items on a page for humans 
   to read.

 - Being able to control what harvesters (spiders) index.

Being able to include a license on a page is easy: you just include the 
license name and a link to the license. Since this is for the user in this 
case, there is no need for any special markup.

Controlling harvesters is a separate problem. This is actually a 
well-understood problem space with a number of very well-understood 
solutions. For site-wide control, there is robots.txt, which can target 
individual spiders (as in this case). On a page-by-page basis, there is 
the <meta> element's "noindex" value.

Thus this particular scenario doesn't require any new features.


     * There are mapping organisations and data producers and people who take
       photos, and each may place different policies. Being able to keep that
       policy information helps people with further mashups avoiding
       violating a policy. For example, if GreatMaps.com has a public domain
       policy on their maps, CoolFotos.org has a policy that you can use data
       other than images for non-commercial purposes, and Johan Ichikawa has
       a photo there of my brother's cafe, which he has licensed as "must pay
       money", then it would be reasonable for me to copy the map and put it
       in a brochure for the cafe, but not to copy the data and photo from
       CoolFotos. On the other hand, if I am producing a non-commercial guide
       to cafes in Melbourne, I can add the map and the location of the cafe
       photo, but not the photo itself.

This doesn't seem to require any technological solution at all; it seems 
to be purely a legal issue. So long as the licenses are clearly stated, as 
they presumably must be (for example, the MIT license requires the 
copyright text to follow the text even as it is copied, the Creative 
Commons licenses require the license or its URL to be published with any 
reproductions, etc), there is no need for any markup.


     * Tara runs a video sharing web site for people who want licensing
       information to be included with their videos. When Paul wants to blog
       about a video, he can paste a fragment of HTML provided by Tara
       directly into his blog. The video is then available inline in his
       blog, along with any licensing information about the video.

(Really? A video sharing site dedicated to people who want licensing 
information to be included with their videos? That's a pretty specific 
audience, wow.)

This can be done with HTML5 today. For example, here is the markup you 
could include to allow someone to embed a video on their site while 
including the copyright or license information:

   <figure>
    <video src="http://example.com/videodata/sJf-ulirNRk" controls>
     <a href="http://video.example.com/watch?v=sJf-ulirNRk">Watch</a>
    </video>
    <legend>
     Pillar post surgery, starting to heal.
     <small>&copy; copyright 2008 Pillar. All Rights Reserved.</small>
    </legend>
   </figure>


     * Flickr has images that are CC-licensed, but the pages themselves are
       not.

I've clarified the HTML5 spec's definition of rel=license and included an 
example showing a page based on what Flickr is doing.

For search, this has the right result (when a search page returns an 
arbitrary Flickr page, it's because they are looking for the image, which 
is what the rel=license link Flickr gives is for), and it is clear to the 
user (they see the license information clearly, and there's no confusion 
that it might apply to the rest of the page).


     * Blogs may wish to reuse CC-licensed images without licensing the whole
       blog as CC, but while still including attribution and license
       information (which may be required by the licenses in question).

The example above shows this for a movie, but it works as well for a 
photo:

   <figure>
     <img src="http://nearimpossible.com/DSCF0070-1-tm.jpg" alt="">
     <legend>
      Picture by Bob.
      <small><a href="http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode">Creative 
      Commons Attribution-Noncommercial-Share Alike 2.5 Generic License</a></small>
     </legend>
   </figure>

Admittedly, if this scenario is taken in the context of the first 
scenario, meaning that Bob wants this image to be discoverable through 
search, but doesn't want to include it on a page of its own, then extra 
syntax to mark this particular image up would be useful.

However, in my research I found very few such cases. In every case where I 
found multiple media items on a single page with no dedicated page, either 
every item was licensed identically and was the main content of the page, 
or each item had its own separate page, or the items were licensed under 
the same license as the page. In all three of these cases, rel=license 
already solves the problem today. This discourages people from using 
multiple licenses, of course, but that's actually a good thing, as it 
discourages license proliferation.



     * Fred's browser can tell him what license a particular video on a site
       he is reading has been released under, and advise him on what the
       associated permissions and restrictions are (can he redistribute this
       work for commercial purposes, can he distribute a modified version of
       this work, how should he assign credit to the original author, what
       jurisdiction the license assumes, whether the license allows the work
       to be embedded into a work that uses content under various other
       licenses, etc).

Advising a user on the legal implications of a license is something that 
needs trained professionals, but given a particular license, advice could 
be provided in canned form. So it seems like this is already possible, the 
user just has to select a license from a list of licenses. A user agent 
could pre-select a license based on the value of the page's rel=license 
link(s), or based on scanning the page for mention of a license, too.




I will now examine how these solutions fit the requirements:

     * Content on a page might be covered by a different license than other
       content on the same page.

For the search case, this is handled by separating that content so that 
each page that is to be discoverable via a license filter is discoverable 
at independent URLs. This appears to fit the existing practice.


     * When licensing a subpart of the page, existing implementations must
       not just assume that the license applies to the whole page rather than
       just part of it.

This is resolved by not having a mechanism for machine-readably licensing 
just part of a page, and instead putting such content on its own page, 
which leads to a better experience anyway from a search perspective.


     * License proliferation should be discouraged.

I've mentioned several ways in which this is achieved.


     * License information should be able to survive from one site to another
       as the data is transfered.

This is clearly possible if the license information is mere text, and 
indeed some of the licenses already require this anyway.


     * Expressing copyright licensing terms should be easy for content
       creators, publishers, and redistributors to provide.

It is simple to express license terms, since no special syntax beyond the 
legally-required verbiage is necessary.


     * It should be more convenient for the users (and tools) to find and
       evaluate copyright statements and licenses than it is today.

This requirement is not met. I do not know of any way to improve matters 
beyond what is available today, unfortunately. This would probably require 
lesiglative simplifications of copyright law, which, while probably quite 
desireable, are somewhat out of scope of HTML5.


     * Shouldn't require the consumer to write XSLT or server-side code to
       process the license information.

No XSLT is necessary for consuming rel="license". Some code is necessary 
to make a search engine to index rel="license" data, but that seems 
inevitable and, in practice, is a small amount of code relative to the 
rest of the code involved in an indexing project.


     * Machine-readable licensing information shouldn't be on a separate page
       than human-readable licensing information.

Since the only machine-readable licensing information that HTML5 provides 
for is rel="license", and that can only be included as part of a link on 
the page with the license statement, this requirement is met.


     * There should not be ambiguous legal implications.

This requirement is met only insofar as rel="license" doesn't make any 
firm legal statements; the page is required to be unambiguous about what 
is considered the main content. In practice I think we can see that sites 
like Flickr have found that this is possible. Beyond that, since the only 
legal implications here are those that would be present without any markup 
at all (i.e. copyright statements, etc), the legal implications of HTML5's 
features seem minimal.


     * Parsing rules should be unambiguous.

The rules for parsing rel="" values are well-understood and clear, I 
believe.


     * Should not require changes to HTML5 parsing rules.

No parsing rules changed during the addressing of these use cases.



In conclusion, it seems most of these use cases are already handled by the 
current text in the spec and do not show a need for a more elaborate 
scheme. The rel="license" feature in particular handles search adequately, 
and is already deployed both in consumers and generators. Two areas where 
we could add more syntax-level support would be in licensing subparts 
explicitly, and in providing machine-readable licenses. The former seems 
like an obvious need but actual deployed content doesn't seem to need it, 
since most individually licensed works exist on pages of their own 
already, or are covered by the same license as other works on the same 
page. The latter is a dangerous area to get into, as licenses have very 
specific legal wording. We could never make the machine-readable text have 
legal standing, and thus people couldn't rely on it to draw conclusions 
anyway. Furthermore, it would encourage license proliferation which is 
already a problem.


A number of further use cases remain to be examined, including some more 
specifically looking at attribution rather than licensing. I will send 
further e-mail next week as I address them.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


More information about the whatwg mailing list