[whatwg] Helping people searching for content filtered by license
Ian Hickson
ian at hixie.ch
Tue Jul 7 16:20:46 PDT 2009
On Tue, 9 Jun 2009, Jeff Walden wrote:
> On 8.6.09 17:33, Ian Hickson wrote:
> > >
> > > - Search engines shouldn't be the gatekeeper when it comes to "valid"
> > > and "invalid" licenses. New licenses shouldn't be discouraged as
> > > they're vital to keep up with ever changing laws around the world. I
> > > don't want to wait around for search engines to decide that supporting
> > > a particular license is in their best interest.
> >
> > New licenses absolutely need to be discouraged. License proliferation is a
> > huge problem. Any solution we come up with here must absolutely be
> > designed in such a way as to make introducing a new license have a high
> > cost, because each new license causes further fragmentation of the
> > creative world.
>
> Notwithstanding that I agree that the world has a surfeit of licenses and
> generally gains little from new ones:
>
> What core principles for the technical design of HTML and the web dictate that
> license proliferation is a problem?
The "Do the right thing" principle. :-)
> Are there reasons to think that HTML5's implicit encouragement or
> discouragement of recognizing (or having the capability to recognize)
> many licenses would be effective at discouraging license creation, or
> more effective than more targeted actions in a separate venue to combat
> creation of new licenses?
I don't think it would be more effective than more targeted actions, but I
do think that if HTML5 makes it easy to introduce new licenses, this will
encourage new licenses to exist.
> Maybe I'm the only person who thinks it (I'd like to hope I'm merely the
> only person to say it, unless I've missed its mention in the past), but
> this feels like mission creep to me.
I don't think it's mission creep; it just helps in figuring out what our
design for a feature should be.
On Wed, 10 Jun 2009, Eduard Pascual wrote:
> On Fri, May 8, 2009 at 9:57 PM, Ian Hickson <ian at hixie.ch> wrote:
> >
> > This has some implications:
> >
> > - Each unit of content (recipe in this case) must have its own
> > independent page at a distinct URL. This is actually good practice
> > anyway today for making content discoverable from search engines,
> > and it is compatible with what people already do, so this seems
> > fine.
>
> This is, in a wide range of cases, entirely impossible: while it might
> work, and maybe it's even good practice, for contents that can be
> represented on the web as an HTML document, it is not achievable for many
> other formats.
Fair enough. Let's look at the use case again then:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019668.html
...but adding two requirements; first the one from you:
* It should be possible to find works that do not have their own pages,
e.g. images in a library, or software.
...and next a requirement that falls out of this, which isn't needed when
resources have their own page, but is needed when multiple resources
share a page:
* Each resource needs to have the following information:
- The URL to the resource, for identification purposes
- The identifier of the license that applies to the resource
- The name of the resource, for display in search results
We might also want to expose other data; the name of the author for
instance would likely be useful.
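To make the per-resource requirement concrete, here is a minimal Python
sketch of the record a tool would need to keep for each resource. The class
and field names (Work, url, license, title, author) and the is_complete
helper are illustrative assumptions, not anything defined by the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Work:
    """One licensed resource found on a page (hypothetical record type)."""
    url: str                      # URL of the resource, for identification
    license: str                  # identifier (URL) of the applicable license
    title: str                    # name of the resource, for search results
    author: Optional[str] = None  # optional extra data, e.g. the author's name


def is_complete(work: Work) -> bool:
    """Return True when the three required fields are all non-empty."""
    return all((work.url, work.license, work.title))
```

The author field is optional here because the requirement above only lists
the URL, the license identifier, and the name as mandatory.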
Let's look at some ideas for how to address this. In the following markup
snippets, I've used the following shortcuts:
S = URL of the resource
L = URL of the license
B = license-required boilerplate text (typically name of license)
T = name of resource
A = name of author
Here are some ideas, along with brief commentary:
* Idea 1: New attributes
<img alt="" src="S" author="A" title="T" license="L">
B
- doesn't extend very well
- lots of hidden metadata
* Idea 2: Extending <figure>
<figure>
<img alt="" src="S">
<legend>
<cite>T</cite>
<credit>A</credit>
<small><a href="L" rel="fig-license">B</a></small>
</legend>
</figure>
- doesn't extend very well
- very likely to be ambiguous (e.g. legend could use <cite> to refer to other works)
* Idea 3: RDFa and ccREL
<div about="S" xmlns:http="http:">
<img alt="" src="S">
<span property="http://purl.org/dc/elements/1.1/title">T</span>
<span property="http://creativecommons.org/ns#attributionName">A</span>
<a rel="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
</div>
- brittle: duplicates the "S" URL
- a bit verbose
<div typeof="" xmlns:http="http:">
<span rel="http://www.w3.org/2002/07/owl#sameAs"><img alt="" src="S"></span>
<!-- (this "works" though it relies on owl:sameAs inferencing) -->
<span property="http://purl.org/dc/elements/1.1/title">T</span>
<span property="http://creativecommons.org/ns#attributionName">A</span>
<a rel="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
</div>
- relies on advanced RDF features, likely to be misimplemented or confused
- a bit verbose
<object data="S" xmlns:http="http:" alt="">
<span property="http://purl.org/dc/elements/1.1/title">T</span>
<span property="http://creativecommons.org/ns#attributionName">A</span>
<a rel="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
</object>
- the contents of <object> are supposed to be fallback
- a bit verbose
* Idea 4: Microdata and ccREL
<div item>
<img itemprop="about" alt="" src="S">
<cite itemprop="http://purl.org/dc/elements/1.1/title">T</cite>
<span itemprop="http://creativecommons.org/ns#attributionName">A</span>
<a itemprop="http://www.w3.org/1999/xhtml/vocab#license" href="L">B</a>
</div>
- a bit verbose
* Idea 5: Microdata with new vocabulary
<div item="work">
<img itemprop="about" alt="" src="S">
<cite itemprop="title">T</cite>
<span itemprop="author">A</span>
<a itemprop="license" href="L">B</a>
</div>
<figure item="work">
<img itemprop="about" alt="" src="S">
<legend>
<cite itemprop="title">T</cite>
<span itemprop="author">A</span>
<small><a itemprop="license" href="L">B</a></small>
</legend>
</figure>
- This vocabulary could be defined as mapping to the ccREL properties
used in Idea 4 in the RDF conversion, so we could maintain ccREL
compatibility even with the shorter names.
* Idea 6: EXIF
- Requires editing the resource
Of these, the simplest is to just extend <img> (and <video>, and
<article>), but it suffers from the hidden metadata problem.
The next simplest seems to be the microdata vocabulary, with a defined
mapping to ccREL.
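As an illustration of how little a consumer needs for the microdata
vocabulary, here is a minimal Python sketch that pulls the work items out
of markup shaped like the Idea 5 snippets. It assumes the draft item="work"
and itemprop attributes exactly as written above, handles only flat items
(no nested items, no markup nested inside a text property), and is not the
microdata algorithm from the spec; WorkExtractor and extract_works are
hypothetical names.

```python
from html.parser import HTMLParser

VOID = {"img", "br", "hr", "input", "meta", "link"}  # elements with no end tag
URL_ATTRS = {"img": "src", "a": "href"}  # elements whose property value is a URL


class WorkExtractor(HTMLParser):
    """Collects one dict per element carrying item="work" (simplified)."""

    def __init__(self):
        super().__init__()
        self.works = []         # finished items
        self._item = None       # item currently being read, if any
        self._depth = 0         # open-element depth inside the current item
        self._text_prop = None  # property whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._item is None:
            if attrs.get("item") == "work":
                self._item, self._depth = {}, 1
            return
        if tag not in VOID:
            self._depth += 1
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag in URL_ATTRS:
            # URL-valued property: take the src/href, ignore the text.
            self._item[prop] = attrs.get(URL_ATTRS[tag], "")
        else:
            # Text-valued property: collect the element's text content.
            self._item[prop] = ""
            self._text_prop = prop

    def handle_data(self, data):
        if self._item is not None and self._text_prop is not None:
            self._item[self._text_prop] += data

    def handle_endtag(self, tag):
        if self._item is None or tag in VOID:
            return
        self._text_prop = None
        self._depth -= 1
        if self._depth == 0:  # the item's own element has closed
            self.works.append(self._item)
            self._item = None


def extract_works(html):
    """Return the list of work items found in an HTML string."""
    parser = WorkExtractor()
    parser.feed(html)
    parser.close()
    return parser.works
```

Fed either the <div item="work"> or the <figure item="work"> snippet, this
yields a single dict with the keys about, title, author and license; the
boilerplate text B inside the license link is ignored because the link's
value is taken from its href.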
Let's see how it handles the scenarios and requirements:
SCENARIOS:
* If a user is looking for recipes of pies to reproduce on his
blog, he might want to exclude from his results any recipes
that are not available under a license allowing non-commercial
reproduction.
A search engine with knowledge of which licenses allow non-commercial
reproduction can restrict results to only matching resources.
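A minimal sketch of that filtering step, in Python. The table of licenses
is an assumption for illustration (two Creative Commons license URLs that
permit non-commercial reproduction); a real search engine would maintain
its own list, and the function names are hypothetical. Works are assumed to
be dicts with a "license" key holding the license URL.

```python
# Licenses this hypothetical tool knows to allow non-commercial reproduction.
NONCOMMERCIAL_REPRODUCTION_OK = {
    "http://creativecommons.org/licenses/by/3.0/",
    "http://creativecommons.org/licenses/by-nc/3.0/",
}


def allows_noncommercial_reproduction(license_url):
    """True only for licenses the tool recognizes; unknown ones are rejected."""
    return license_url in NONCOMMERCIAL_REPRODUCTION_OK


def filter_results(works):
    """Keep only works whose stated license permits non-commercial reproduction."""
    return [
        work for work in works
        if allows_noncommercial_reproduction(work.get("license", ""))
    ]
```

Note that a work under a license the tool has never heard of is simply
dropped from the results, whatever that license actually permits.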
* Lucy wants to publish her papers online. She includes an
abstract of each one in a page, but because they are under
different copyright rules, she needs to clarify what the rules
are. A harvester such as the Open Access project can actually
collect and index some of them with no problem, but may not be
allowed to index others. Meanwhile, a human finds it more
useful to see the abstracts on a page than have to guess from a
bunch of titles whether to look at each abstract.
This is now possible without new pages, if the harvester knows which
licenses allow collection.
* There are mapping organisations and data producers and people
who take photos, and each may place different policies. Being
able to keep that policy information helps people with further
mashups avoiding violating a policy. For example, if
GreatMaps.com has a public domain policy on their maps,
CoolFotos.org has a policy that you can use data other than
images for non-commercial purposes, and Johan Ichikawa has a
photo there of my brother's cafe, which he has licensed as
"must pay money", then it would be reasonable for me to copy
the map and put it in a brochure for the cafe, but not to copy
the data and photo from CoolFotos. On the other hand, if I am
producing a non-commercial guide to cafes in Melbourne, I can
add the map and the location of the cafe photo, but not the
photo itself.
This isn't affected; it's a legal issue, not a technological one. So long
as the licenses are clearly stated, as they presumably must be (for
example, the MIT license requires the copyright text to follow the text
even as it is copied, the Creative Commons licenses require the license or
its URL to be published with any reproductions, etc), there is no need for
any markup.
* Tara runs a video sharing web site for people who want
licensing information to be included with their videos. When
Paul wants to blog about a video, he can paste a fragment of
HTML provided by Tara directly into his blog. The video is then
available inline in his blog, along with any licensing
information about the video.
This is now possible in a straightforward manner.
* Fred's browser can tell him what license a particular video on
a site he is reading has been released under, and advise him on
what the associated permissions and restrictions are (can he
redistribute this work for commercial purposes, can he
distribute a modified version of this work, how should he
assign credit to the original author, what jurisdiction the
license assumes, whether the license allows the work to be
embedded into a work that uses content under various other
licenses, etc).
This is now possible, assuming that the browser knows the license.
* Flickr has images that are CC-licensed, but the pages
themselves are not.
This is already handled by the rules for rel=license, but the license of
the image itself can be even more clearly given with the microdata.
* Blogs may wish to reuse CC-licensed images without licensing
the whole blog as CC, but while still including attribution and
license information (which may be required by the licenses in
question).
This is also possible.
Let's look at the requirements:
REQUIREMENTS:
* Content on a page might be covered by a different license than
other content on the same page.
Same as the scenario above.
* When licensing a subpart of the page, existing implementations
must not just assume that the license applies to the whole page
rather than just part of it.
Existing implementations will see nothing, so that is handled easily.
ccREL implementations will immediately support the new vocabulary once the
mapping of HTML to RDF described in the spec is implemented (less than
24 hours' work, apparently).
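To give a rough idea of what such a mapping involves, here is a minimal
Python sketch that turns one work item into ccREL-style triples. It simply
pairs the short property names from Idea 5 with the property URIs used in
Idea 4 and treats the "about" value as the subject; it is an assumption
about the shape of the conversion, not the algorithm in the spec, and
to_triples is a hypothetical name.

```python
# Short vocabulary name -> property URI, taken from the Idea 4 markup.
CCREL_PROPERTY = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "author": "http://creativecommons.org/ns#attributionName",
    "license": "http://www.w3.org/1999/xhtml/vocab#license",
}


def to_triples(work):
    """Return (subject, predicate, object) tuples for one work item.

    `work` is assumed to be a dict with an "about" key (the resource URL)
    plus any of the short property names above; other keys are ignored.
    """
    subject = work["about"]
    return [
        (subject, CCREL_PROPERTY[name], value)
        for name, value in work.items()
        if name in CCREL_PROPERTY
    ]
```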
* License proliferation should be discouraged.
License proliferation is not really _discouraged_, but it isn't
encouraged, since new licenses wouldn't be known by the various tools.
* License information should be able to survive from one site to
another as the data is transferred.
This is now possible, assuming the microdata is also transferred.
* Expressing copyright licensing terms should be easy for content
creators, publishers, and redistributors to provide.
It's as easy as writing a license has always been, I guess.
* It should be more convenient for the users (and tools) to find
and evaluate copyright statements and licenses than it is
today.
I guess it's easier than rel=license, insofar as rel=license required new
pages, and this doesn't. I don't know if that's what was meant, though.
The actual processing model is more complex now.
* Shouldn't require the consumer to write XSLT or server-side
code to process the license information.
Some code is necessary to make a search engine index license microdata,
but that seems inevitable and, in practice, is a small amount of code
relative to the rest of the code involved in an indexing project.
* Machine-readable licensing information shouldn't be on a
separate page than human-readable licensing information.
It's on the same page.
* There should not be ambiguous legal implications.
There are, as far as I can tell, no legal implications at all.
* Parsing rules should be unambiguous.
Microdata parsing rules are unambiguous.
* Should not require changes to HTML5 parsing rules.
HTML5 parsing isn't affected.
* It should be possible to find works that do not have their own
pages, e.g. images in a library, or software.
This is now handled.
* Each resource needs to have the following information:
- The URL to the resource, for identification purposes
- The identifier of the license that applies to the resource
- The name of the resource, for display in search results
This is now handled.
I've added this vocabulary to the spec.
On Wed, 10 Jun 2009, Tab Atkins Jr. wrote:
>
> I think it's fundamentally rare to have a bunch of resources that (a)
> *only* exist grouped together on a single page, and (b) need different
> licenses.
Apparently you and I are the only ones who believe this. :-(
(In addition to some of the e-mails on-list about this, I've also received
a lot of off-list feedback to the same effect as Eduard's feedback.)
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'