[whatwg] re-thinking "cue ranges"

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Tue Jul 15 15:26:45 PDT 2008


Hi Philip, Dave, all,

I agree with Philip and Dave that we need a simple way to include the
cue ranges concept into video for video authors.

As one of the authors of Annodex, I have been meaning to look over the
HTML5 video element for a while and examine how its details work -
sorry for my late contribution.

In Annodex we created a simple XML markup language called CMML (for
Continuous Media Markup Language) that turns time-continuous data
such as audio and video into Web-style documents, with the ability to
define temporal segments (or events or cues or clips - call them
whatever you prefer), attach a description and metadata to them, attach
an outgoing hyperlink to them, and address these segments directly
through URLs. If this feels almost like a web page, then that's
exactly what we intended to achieve.
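
To give a feel for it, here is a rough sketch of a CMML document
(simplified and written from memory, so element and attribute details
may differ slightly from the actual CMML specification):

  <?xml version="1.0" encoding="UTF-8"?>
  <cmml>
    <head>
      <title>Cool Stuff</title>
      <meta name="author" content="Jane Doe"/>
    </head>
    <!-- one clip element per temporal segment (cue) -->
    <clip id="inTheBathroom" start="npt:12.5" end="npt:45.0">
      <a href="http://example.com/more-info.html">Related page</a>
      <desc>What happens in the bathroom scene.</desc>
      <meta name="keywords" content="bathroom, comedy"/>
    </clip>
  </cmml>

Each clip can then be addressed through a URL and carries its own
description, metadata and outgoing link.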

In addition to this author-controlled creation of cue ranges, we also
allowed for the creation of temporal hyperlinks, which would point
directly to a time-defined (dynamic) segment inside a video. This is
now being examined more closely in the new W3C Media Fragments Working
Group http://www.w3.org/2008/01/media-fragments-wg.html .

But I digress...

Taking the definition of cue ranges out of HTML and putting it into
the media content itself, while providing a similarly simple markup
language to create the segmentation, makes a lot of sense. Above
everything else, it reduces the complexity of the HTML specification
and puts the definition of the segmentation into the hands of the
person who would create it: the video content author.

But you want to stay flexible with the segmentation since it may be
needed in multiple representations:

* you may want to have it "burnt" into the video such that every copy
of the video retains the segmentation created by the author - for
this case we created a representation of CMML that is a binary
interleave of the original video file and the CMML, temporally
multiplexed such that the right cues are aligned with the video data
they refer to. The multiplexing is done to allow live streaming of
such cues with the video within one network connection. This is what
we called an Annodex stream (annotated and indexed video).

* you may want to keep your cues and associated data in a database and
only create the CMML and/or the Annodex stream upon a user request.
This is similar to the dynamic creation of a Web page from a database.

* or you may indeed want to keep one or more cue range segmentations
in separate CMML files alongside the original video file, to make the
cues and annotations for a video available separately from the video
(e.g. for use by a search engine crawler). Imagine Google could index
deeply inside a video because the cues and annotations of the video
are made available in a standard crawlable format.

In such a scenario, all you need to add to the video element is a set
of JavaScript API calls that can directly make use of the information
defined in the CMML file, as demonstrated in this video:
http://au.youtube.com/watch?v=LbWb1dkvm0s
The code for this demo is available here:
http://svn.annodex.net/browser_plugin/trunk/test/test.html .

Notice how the problem of addressing cues has been taken completely
out of the JavaScript API - all we do in JavaScript is address time
offsets. The semantics of the time offsets are stored in the
annotations, which can be retrieved using their own JavaScript API
call.
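
Conceptually, the page script ends up looking something like the
sketch below. The names here are illustrative only (they are not the
actual API of the Annodex browser plugin or of the demo page):

  // Illustrative sketch - method and property names are hypothetical.
  var video = document.getElementById('video');

  // The page script only ever deals in plain time offsets...
  function jumpTo(seconds) {
    video.currentTime = seconds;
    video.play();
  }

  // ...while the meaning of those offsets lives in the CMML
  // annotations, retrieved through their own call.
  var clips = video.getAnnotations();                // hypothetical
  for (var i = 0; i < clips.length; i++) {
    addChapterButton(clips[i].desc, clips[i].start); // hypothetical helper
  }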

Cheers,
Silvia.


On Sat, Jul 12, 2008 at 4:00 PM, Philip Jägenstedt <philipj at opera.com> wrote:
> Just to add some of my thought on cue ranges.
>
> Like Dave, I am not terribly enthusiastic about the current cue ranges
> spec, which strikes me as adding a fair amount of complexity and yet
> doesn't solve the basic use case in a straightforward manner.
>
> If I were a content author and looked at the available options to
> display subtitles, I would probably simply add a timeupdate event
> listener and use e.target.currentTime to decide on an action to take.
> While lexical closures are fun and useful, depending on them isn't
> terribly nice to those who don't have experience with functional
> programming (you can use ECMAScript without realizing that it's a
> functional language, so it doesn't count).
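>
> Roughly something like this (an untested sketch; subtitleDiv stands
> for whatever element holds the subtitle text):
>
>   var v = document.getElementsByTagName('video')[0];
>   v.addEventListener('timeupdate', function (e) {
>     var t = e.target.currentTime;
>     // show or hide the subtitle based on the playback position
>     if (t >= 10 && t < 15) {
>       subtitleDiv.textContent = 'First line of dialogue';
>     } else {
>       subtitleDiv.textContent = '';
>     }
>   }, false);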
>
> I agree that proper events make a lot of sense here instead of
> callbacks. We could use some new event -- CueEvent maybe -- which would
> actually include the start and stop times and a reference to the target
> HTMLMediaElement. I might suggest a modified addCueRange that takes a
> data argument, which is also passed along in the event object.
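>
> Something along these lines, perhaps (entirely hypothetical, just to
> illustrate the shape; the event name and helper are made up):
>
>   // hypothetical variant of addCueRange carrying a data argument
>   v.addCueRange('chapters', 0, 60, false, { title: 'Intro' });
>
>   v.addEventListener('cueenter', function (e) {
>     // e.start / e.end would be the range boundaries,
>     // e.data the object handed to addCueRange above
>     showChapterTitle(e.data.title);   // hypothetical helper
>   }, false);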
>
> If we support external annotations we need some open format for this
> which all browsers can support. I'm not very familiar with SMIL, but it
> looks like a Swiss army knife. Perhaps http://www.annodex.net/ is also
> worth a closer look: "CMML is a HTML-like markup language for
> time-continuous data such as audio/video." Then there's the new
> http://www.w3.org/2008/01/media-annotations-wg.html which has a
> relevant-sounding name, but I'm not sure it really is.
>
> Finally, has any browser implemented the current cue ranges API yet? If
> not, it's not too late to come up with something that we can all feel a
> bit happier about.
>
> Philip
>
> On Wed, 2008-07-09 at 10:37 +0200, Dave Singer wrote:
>> OK, some comments back on the cue range design.  Sorry for the
>> summer-vacation-induced delay in response!
>>
>>
>> At 1:00  +0000 12/06/08, Ian Hickson wrote:
>> >  > In the current HTML5 draft cue ranges are available using a DOM API.
>> >>
>> >>  This way of doing ranges is less than ideal.
>> >>
>> >>  First of all, it is hard to use. The ranges must be added by script,
>> >>  can't be supplied with the media, and the callbacks are awkward to
>> >>  handle. The only way to identify the range a received callback applies
>> >>  to is by creating not one but two separate functions for each range: one
>> >>  for enter, one for exit. While creating functions on-demand is easy in
>> >>  JavaScript it does fall under advanced techniques that most authors will
>> >>  be unfamiliar with.
>> >
>> >One of the features proposed for the next version of the video API is
>> >chapter markers and other embedded timed metadata, with corresponding
>> >callbacks for authors to hook into. Would that resolve the problem you
>> >mention?
>>
>> It may be that if we can define a way to embed cue-range-generating
>> meta-data in the media resource, with an abstract 'api' to get it
>> out, we'd deal with the "only add by script" issue here, yes.  The
>> others, not so much.
>>
>> Using elements makes ranges identifiable, traversable and modifiable
>> by using familiar APIs and concepts. However it is true that there
>> are other ways to get some of the same functionality. Unless the
>> elements have some non-scripting functionality (like linking) the
>> case is perhaps not totally compelling. Instantiating ranges from
>> custom markup using script is a possibility.
>>
>> Overall, we remain concerned that typically it is the media author
>> who would define what the ranges are, not really the page or
>> particularly the script author.  Media authors tend not to be happy
>> writing scripts.
>>
>> >
>> >>  This kind of feature is also not available in all languages that might
>> >>  provide access to the DOM API.
>> >
>> >JavaScript is really the only concern from HTML5's point of view; if other
>> >languages become relevant, they should get specially-crafted APIs for
>> >them when it comes to this kind of issue.
>>
>> The problem is that the current API more or less requires use of
>> closures and currying except for trivial cases. We don't think that
>> is good API design even for languages that have them.  Perhaps at the
>> very least a cookie could be passed?
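>>
>> For example, with the current draft an author ends up writing
>> something like this (signature quoted from memory, so details may
>> be off; showCaption/hideCaption are just placeholder helpers):
>>
>>    function makeEnterHandler(name) {
>>      return function () { showCaption(name); };   // one closure...
>>    }
>>    function makeExitHandler(name) {
>>      return function () { hideCaption(name); };   // ...per range, twice
>>    }
>>    v.addCueRange('captions', 0, 5, false,
>>                  makeEnterHandler('intro'), makeExitHandler('intro'));
>>    v.addCueRange('captions', 5, 10, false,
>>                  makeEnterHandler('scene1'), makeExitHandler('scene1'));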
>>
>> >
>> >>  Secondly this mechanism is not very powerful. You can't do anything else
>> >>  with the ranges besides receiving callbacks and removing them. You can't
>> >>  modify them. They are not visible to scripts or CSS. You can't link to
>> >>  them. You can't link out from them.
>> >
>> >I'm not sure what it would really mean to link to or from a range, unless
>> >you turned the entire video into a link, in which case you can just wrap
>> >the <video> in an <a href=""> element for the duration of the range, using
>> >script.
>>
>> Linking into a cue-range would be using its beginning or end as a
>> seek point, or its duration as a restricted view of the media ("only
>> show me cue-range called InTheBathroom").  Linking out of a cue-range
>> would be establishing a click-through URL that would be dispatched
>> directly if the user clicked on the media during that range
>> (dispatched without script).  We agree that neither of these should
>> be in scope now, but it would be nice to have a framework that could
>> be extended to cover these, in future.
>>
>> >  > Thirdly, a script is somewhat strange place to define the ranges. A set
>> >>  of ranges usually relates closely to some particular piece of media
>> >>  content. The same set of ranges rarely makes much sense in the context
>> >>  of some other content. It seems that ranges should be defined or
>> >>  supplied along with the media content.
>> >
>> >For in-band data, callbacks for chapter markers as mentioned earlier seem
>> >like the best solution.
>> >
>> >For out-of-band data, if the ranges are just intended to trigger script, I
>> >don't think we gain much from providing a way to mark up ranges semi-
>> >declaratively as opposed to just having HTML-based media players define
>> >their own range markup and have them implement it using this API. It
>> >wouldn't be especially hard.
>>
>> This seems to conflict with the answer (1) above, doesn't it?
>>
>> >  > Fourth, this kind of callback API is pretty strange creature in the HTML
>> >>  specification. The only other callback APIs are things like setTimeout()
>> >>  and the new SQL API which don't have associated elements. Events are the
>> >>  callback mechanism for everything else.
>> >
>> >Events use callbacks themselves, so it's not that unusual.
>> >
>> >I don't really think events would be a good interface for this.
>> >Consistency is good, but if one can come up with a better API, it's better
>> >to use that than just be consistent for the sake of it.
>>
>> It does seem strange that events are the right tool in the spatial
>> domain (mouse enter/exit) but not in the temporal domain.  Yet the
>> basic semantic of the English word "event", let alone the web meaning,
>> is pretty well exactly matched by what is happening here -- crossing a
>> temporal boundary!  Events are well-known, and design uniformity
>> suggests that they be used, if nothing else.
>>
>> >  > In SMIL the equivalent concept is the <area> element which is
>> >used like this:
>> >  > <video src="http://www.example.org/CoolStuff">
>> >>             <area id="area1" begin="0s" end="5s"/>
>> >>             <area id="area2" begin="5s" end="10s"/>
>> >>  </video>
>> >>
>> >>  This kind of approach has several advantages.
>> >>  * Ranges are defined as part of the document, in the context of a particular
>> >>  media stream.
>> >
>> >I'm not sure why that is an advantage in the context of HTML.
>>
>> Because it is declarative and 'close to' (or maybe later, even
>> within) the media resource.
>>
>> >  > * This uses events, a more flexible and more appropriate callback
>> >mechanism.
>> >
>> >I don't really see why the flexibility of events is useful here, and I
>> >don't see why it's more appropriate.
>>
>> But we ask the opposite: why is it compelling not to fit into the
>> normal way of doing things?
>>
>> >  > * The callbacks have a JavaScript object associated with them, namely a DOM
>> >>  element, which carries information about the range.
>> >
>> >That's useful, yes. Should we include some data with the callback?
>>
>> Yes, if we cannot agree on this proposal, then some sort of cookie or
>> ID should be associated with a cue range (a string name of the range,
>> for example).
>>
>> >We
>> >could include the class name, the start time, and the end time. Having
>> >said that, it's easy to use currying here to hook callbacks that know what
>> >they're expecting.
>>
>> Currying is pretty advanced;  we're already concerned about using
>> scripting at all!
>>
>> >  > We would like to suggest a <timerange> element that can be used as a
>> >>  child of the <video> and <audio> elements.
>> >
>> >It's not clear to me that this is solving any problems worth solving.
>>
>> Well, we think we should first evaluate the two ways of doing this,
>> and then give weight, if appropriate, to the 'first written' way
>> (yours).  We're technically still in WD so we should, if possible,
>> prefer the better solution.
>>
>> Let's look at a few comparison axes:
>>
>> Declarative or established by script?  We prefer declarative, as we
>> think the most likely definers of what the cue-ranges are (as opposed
>> to how they are handled) are the media authors, not the page authors.
>>
>> Events or callbacks?  Since we see these as the temporal equivalent
>> of the spatial mouse events, we see events as the most natural
>> analog.  They also have event identifiers, making it much easier to
>> have separate handlers for different ranges or events.
>>
>> Provide a framework for talking about time-ranges for other purposes
>> such as linking in or out?  Yes, annotated ranges like ours do
>> provide such a basis.
>>
>> Makes the DTD and HTML5 spec. more complex?  Yes, we agree that this
>> introduces another element into the spec., with all that implies.
>>
>>
>> * *
>>
>> Here are some more general ideas (not all meshed together):
>>
>> * stating that the abstract interface to a media resource includes
>> finding its 'cue ranges', and inserting them automatically, and the
>> definers of a media resource type (e.g. MPEG for MP4) can define
>> something like "property X maps to HTML5 cue ranges in the following
>> way" would be OK.  But I think again, then, that they have to be
>> annotational, so that they can have an ID and make an event....
>>
>> * adding a cookie/rangeID to the current API would help...
>>
>> * adding an attribute to <source> called "annotations" which could
>> point at a variety of types, including at an XML file (to be defined)
>> which contains meta-data, cue-range definitions etc., as if they were
>> part of the media, would help move this out of the HTML5 but still
>> provide a uniform interface...
>>
>> example
>>           <source src="myMovie.mp4" annotations="myMovie-tags.xml" />
>>
>>
>> then if the annotations should be got from the media resource itself,
>> the notation
>>           <source src="myMovie.mp4" annotations="myMovie.mp4" />
>> could be used, and
>>               <source src="myMovie.mp4"  />
>> would be equivalent.
>>
>> we could even use
>>           <source src="myMovie.mp4" annotations="" />
>> to explicitly defeat the retrieval of annotations.
>>
>>
>> (Such an "annotations" href might also help with associating metadata
>> with media resources, particularly when the same metadata should be
>> associated with a set of sources that differ in bitrate, codec, etc.).
> --
> Philip Jägenstedt
> Opera Software
>
>

