[whatwg] Helping people searching for content filtered by license

Mon Jun 8 17:33:07 PDT 2009

On Sun, 10 May 2009, Manu Sporny wrote:
> Ian Hickson wrote:
> > The scenarios described above fall into three categories: searching 
> > for content, publishing content, and obtaining legal advice.
> 
> Ian, these use cases and your responses to them need to be put up on a 
> wiki somewhere - not many people have the time to read through 20 use 
> cases each with numerous scenarios/requirements and argumentation.

If anyone has the time and inclination to do that, please be my guest -- I 
unfortunately don't have the time to both deal with the feedback and 
maintain documentation on how each decision was made. If anyone would like 
to volunteer to do that, though, let me know, I'd be more than happy to 
help set this kind of thing up. It would indeed be very useful. (In 
practice, pretty much everything in the WHATWG is done through people 
volunteering to do it, whether that be running the blog, the forums, 
editing the spec, reviewing it, etc.)

> Both Shelley Powers and I have offered to help do this, I strongly 
> suggest you take us up on the offer as we want to be thorough and 
> careful as we go through these use cases. Most importantly, we want to 
> start generating preliminary documentation for HTML5's microdata 
> facilities.

If there's anything I can do to help you do it, please let me know. You 
should feel free and empowered to do anything that you believe is 
necessary here, whether adding pages to the WHATWG wiki, to the blog, 
writing separate sites, or whatever you find useful. Your help here would 
be more than welcome. Please don't let me stand in your way.

> > First, I will examine the search scenario:
> > 
> >      * If a user is looking for recipes of pies to reproduce on his blog, he
> >        might want to exclude from his results any recipes that are not
> >        available under a license allowing non-commercial reproduction.
> > 
> > This is technically possible today. The rel="license" link type allows 
> > authors to specify the license that applies to the main content on a page 
> > (in this case, recipes); search engines can be programmed with the most 
> > common licenses; and the user can tell the search engine what 
> > characteristics he wants ("compatible with GPLv2", "no advertising 
> > clause", "doesn't have patent implications", "allows redistribution to 
> > countries on the US blacklist").
> > 
> > This has some implications:
> > 
> >  - Each unit of content (recipe in this case) must have its own 
> >    independent page at a distinct URL. This is actually good practice 
> >    anyway today for making content discoverable from search engines, and 
> >    it is compatible with what people already do, so this seems fine.
> > 
> >  - New licenses are discouraged, as they would not be automatically 
> >    supported by search engines. This is needed by one of the requirements:
> > 
> >      * License proliferation should be discouraged.
> > 
> > This solution is already deployed on such sites as Flickr, and already 
> > supported on search engines such as Google.
> 
> This isn't a solution to the problem because of the following issues:
> 
> - Search engines shouldn't be the gatekeeper when it comes to "valid"
>   and "invalid" licenses. New licenses shouldn't be discouraged as
>   they're vital to keep up with ever changing laws around the world. I
>   don't want to wait around for search engines to decide that supporting
>   a particular license is in their best interest.

New licenses absolutely need to be discouraged. License proliferation is a 
huge problem. Any solution we come up with here must absolutely be 
designed in such a way as to make introducing a new license have a high 
cost, because each new license causes further fragmentation of the 
creative world.
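
For illustration, the search scenario quoted above can be sketched as 
follows. This is only a minimal sketch, not how any real search engine 
works: the KNOWN_LICENSES table, its characteristic names, and the 
function names are all assumptions made up for the example. It reads 
rel="license" links from a page and answers a query only for licenses the 
engine already knows about, so a page under an unrecognized license 
never matches.

```python
from html.parser import HTMLParser

# Hypothetical table of licenses the engine has been "programmed with".
# The characteristic flags are illustrative simplifications, not legal advice.
KNOWN_LICENSES = {
    "http://creativecommons.org/licenses/by-nc/3.0/": {
        "noncommercial_reproduction": True,
        "commercial_use": False,
    },
    "http://creativecommons.org/licenses/by-sa/3.0/": {
        "noncommercial_reproduction": True,
        "commercial_use": True,
    },
}


class LicenseLinkParser(HTMLParser):
    """Collects the href of every <a>, <area> or <link> with rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area", "link"):
            return
        attributes = dict(attrs)
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "license" in rel_tokens and attributes.get("href"):
            self.licenses.append(attributes["href"])


def page_licenses(html):
    """Return the license URLs declared by a page, in document order."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.licenses


def page_matches(html, characteristic):
    """True only if the page declares a known license with the characteristic."""
    return any(
        KNOWN_LICENSES.get(url, {}).get(characteristic, False)
        for url in page_licenses(html)
    )
```

In this sketch a page linking to a license that is missing from the table 
is simply excluded from filtered results, which is the cost of a new 
license that the paragraph above argues for.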

> - You're forcing recipe page authors to publish their documentation in a
>   very specific way (one recipe per page). That seems like a very
>   arbitrary limitation. What if I have a goat cheese recipe
>   (attribution, non-commercial) that has a jam recipe
>   (attribution, share-alike) that pairs very well with the cheese.
>   Personally, I'd want to put both recipes on the same page and
>   rel="license" has a limitation that it only applies to the "main
>   content" on the page in HTML5. Why is HTML5 forcing me to publish one
>   recipe per page if I want to ensure that my license information is
>   machine-readable?

You have the cause and effect backwards. As far as I can tell, it is 
extremely common for licensed works to already have their own page. If 
this is correct, then we don't have to support multiple individually 
licensed works on a single page. All the recipe sites I looked at had one 
page per recipe (e.g. allrecipes.com); the same is true of photos (e.g. 
Flickr).

If there are substantial examples where this isn't the case, then this 
would argue for providing more complex per-resource or per-section 
licensing; are there such examples?

> > Next, I will look at the content publishing scenarios:
> > 
> >      * Lucy wants to publish her papers online. She includes an abstract of
> >        each one in a page, but because they are under different copyright
> >        rules, she needs to clarify what the rules are. A harvester such as
> >        the Open Access project can actually collect and index some of them
> >        with no problem, but may not be allowed to index others. Meanwhile, a
> >        human finds it more useful to see the abstracts on a page than have to
> >        guess from a bunch of titles whether to look at each abstract.
> > 
> > This really boils down to two points:
> > 
> >  - Being able to include the license of various items on a page for humans 
> >    to read.
> > 
> >  - Being able to control what harvesters (spiders) index.
> > 
> > Being able to include a license on a page is easy: you just include the 
> > license name and a link to the license. Since this is for the user in this 
> > case, there is no need for any special markup.
> 
> Including a link to a license is very different from making that license
> automatically machine-readable.

For this scenario, we don't need to make the license machine-readable. 
Only humans need to read it.

> Expressing, to the spider, what rights it has on that particular page is 
> not as simple as linking to a license. Sure, you can hard code certain 
> rights and associate them with certain license URLs - but that's a hack, 
> it's not a good general purpose solution.

I don't understand the relevance of the spider to the license in this 
scenario.

> > Controlling harvesters is a separate problem. This is actually a 
> > well-understood problem space with a number of very well-understood 
> > solutions. For site-wide control, there is robots.txt, which can target 
> > individual spiders (as in this case). On a page-by-page basis, there is 
> > the <meta> element's "noindex" value.
> 
> Not all authors have access to the robots.txt file (in fact, the
> majority of researchers I know use a CMS and do not have any access to
> robots.txt). Not all researchers have access to the HEAD of the document
> via <meta> either. So, this isn't a viable general purpose solution.

One could equally argue that not all researchers have access to the markup 
at all. It is a fundamental assumption for the purpose of writing the 
HTML5 spec that authors have the ability to write HTML.
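
As a rough sketch of the two harvester controls mentioned in the quoted 
text (robots.txt targeting an individual spider, and the per-page <meta> 
"noindex" value), the following shows how a well-behaved spider might 
check both. The spider name "ExampleHarvester", the URLs, and the helper 
names are hypothetical; the robots.txt check uses Python's standard 
urllib.robotparser module.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named spider is kept out of one directory,
# while every other spider may fetch everything.
ROBOTS_TXT = """\
User-agent: ExampleHarvester
Disallow: /papers/restricted/

User-agent: *
Disallow:
"""


def may_fetch(robots_txt, user_agent, url):
    """Site-wide control: ask the robots.txt rules whether url may be fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Looks for <meta name="robots" content="... noindex ...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # content is a comma-separated list of directives.
        content = attributes.get("content") or ""
        directives = [d.strip().lower() for d in content.split(",")]
        if "noindex" in directives:
            self.noindex = True


def page_is_noindex(html):
    """Page-by-page control: True if the page asks not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

A spider following both conventions would index a page only when 
may_fetch() returns True for its own user-agent and page_is_noindex() 
returns False for the fetched markup.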

> >      * There are mapping organisations and data producers and people who take
> >        photos, and each may place different policies. Being able to keep that
> >        policy information helps people with further mashups avoiding
> >        violating a policy. For example, if GreatMaps.com has a public domain
> >        policy on their maps, CoolFotos.org has a policy that you can use data
> >        other than images for non-commercial purposes, and Johan Ichikawa has
> >        a photo there of my brother's cafe, which he has licensed as "must pay
> >        money", then it would be reasonable for me to copy the map and put it
> >        in a brochure for the cafe, but not to copy the data and photo from
> >        CoolFotos. On the other hand, if I am producing a non-commercial guide
> >        to cafes in Melbourne, I can add the map and the location of the cafe
> >        photo, but not the photo itself.
> > 
> > This doesn't seem to require any technological solution at all; it 
> > seems to be purely a legal issue. So long as the licenses are clearly 
> > stated, as they presumably must be (for example, the MIT license 
> > requires the copyright text to follow the text even as it is copied, 
> > the Creative Commons licenses require the license or its URL to be 
> > published with any reproductions, etc), there is no need for any 
> > markup.
> 
> Read the scenario more closely - the goal of the scenario is to "help 
> people with further mashups [to avoid] violating a policy". This means 
> that all of the objects in the scenario must be tagged with licensing 
> information of some kind.

I don't see how you go from the former to the latter. Why does helping 
authors require machine-readable tagging? Furthermore, nothing in that 
scenario is actually talking about computers helping anything. It's about 
authors not violating licenses, something which doesn't require any markup 
at all and is in fact completely independent of the technology used.

We should not jump to solving problems with technology when that isn't 
necessary.

> >      * Tara runs a video sharing web site for people who want licensing
> >        information to be included with their videos. When Paul wants to blog
> >        about a video, he can paste a fragment of HTML provided by Tara
> >        directly into his blog. The video is then available inline in his
> >        blog, along with any licensing information about the video.
> > 
> > This can be done with HTML5 today. For example, here is the markup you 
> > could include to allow someone to embed a video on their site while 
> > including the copyright or license information:
> > 
> >    <figure>
> >     <video src="http://example.com/videodata/sJf-ulirNRk" controls>
> >      <a href="http://video.example.com/watch?v=sJf-ulirNRk">Watch</a>
> >     </video>
> >     <legend>
> >      Pillar post surgery, starting to heal.
> >      <small>© copyright 2008 Pillar. All Rights Reserved.</small>
> >     </legend>
> >    </figure>
> 
> That's not machine readable - where in the HTML5 spec does it say that 
> "license information for videos will be placed in a SMALL element".

Nowhere in this scenario is there any reason to want the data to be 
machine-readable, as far as I can tell.

> Even if that were the case, how would a machine be able to extract the 
> rights associated with the license?

It could not, but that's academic, as the scenario doesn't describe any 
machines extracting anything.

> > I've clarified the HTML5 spec's definition of rel=license and included 
> > an example showing a page based on what Flickr is doing.
> 
> Just changing rel="license" is not going to be sufficient.

Sufficient for what? It seems sufficient to address all the scenarios that I 
collected for this particular use case. If there are scenarios that I 
missed, please describe them.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'