[whatwg] re-thinking "cue ranges"

Tue Jul 22 02:58:25 PDT 2008

On Wed, 9 Jul 2008, Dave Singer wrote:
> > 
> > One of the features proposed for the next version of the video API is 
> > chapter markers and other embedded timed metadata, with corresponding 
> > callbacks for authors to hook into. Would that resolve the problem you 
> > mention?
> 
> It may be that if we can define a way to embed cue-range-generating 
> meta-data in the media resource, with an abstract 'api' to get it out, 
> we'd deal with the "only add by script" issue here, yes.

Ok.

> Overall, we remain concerned that typically it is the media author who 
> would define what the ranges are, not really the page or particularly 
> the script author.  Media authors tend not to be happy writing scripts.

I totally agree, but that's what the in-media annotations, and future APIs 
that deal with them, are for.

> > JavaScript is really the only concern from HTML5's point of view; if 
> > other languages become relevant, they should get specially-crafted 
> > APIs for them when it comes to this kind of issue.
> 
> The problem is that the current API more or less requires use of 
> closures and currying except for trivial cases. We don't think that is 
> good API design even for languages that have them.  Perhaps at the very 
> least a cookie could be passed?

Done.

> > > Secondly this mechanism is not very powerful. You can't do anything 
> > > else with the ranges besides receiving callbacks and removing them. 
> > > You can't modify them. They are not visible to scripts or CSS. You 
> > > can't link to them. You can't link out from them.
> > 
> > I'm not sure what it would really mean to link to or from a range, 
> > unless you turned the entire video into a link, in which case you can 
> > just wrap the <video> in an <a href=""> element for the duration of 
> > the range, using script.
> 
> Linking into a cue-range would be using its beginning or end as a seek 
> point, or its duration as a restricted view of the media ("only show me 
> cue-range called InTheBathroom").  Linking out of a cue-range would be 
> establishing a click-through URL that would be dispatched directly if 
> the user clicked on the media during that range (dispatched without 
> script).  We agree that neither of these should be in scope now, but it 
> would be nice to have a framework that could be extended to cover these, 
> in future.

Jumping into a point of video is supported using other aspects of the API 
(setting 'currentTime'); looping a certain part similarly has a dedicated 
API ('loopStart' etc). I don't know that we'd ever want to use the cue 
ranges for those purposes. I don't really understand the use cases.

> > > Thirdly, a script is somewhat strange place to define the ranges. A 
> > > set of ranges usually relates closely to some particular piece of 
> > > media content. The same set of ranges rarely makes much sense in the 
> > > context of some other content. It seems that ranges should be 
> > > defined or supplied along with the media content.
> > 
> > For in-band data, callbacks for chapter markers as mentioned earlier 
> > seem like the best solution.
> > 
> > For out-of-band data, if the ranges are just intended to trigger 
> > script, I don't think we gain much from providing a way to mark up 
> > ranges semi- declaratively as opposed to just having HTML-based media 
> > players define their own range markup and have them implement it using 
> > this API. It wouldn't be especially hard.
> 
> This seems to conflict with the answer (1) above, doesn't it?

How so?

> > > Fourth, this kind of callback API is pretty strange creature in the 
> > > HTML specification. The only other callback APIs are things like 
> > > setTimeout() and the new SQL API which don't have associated 
> > > elements. Events are the callback mechanism for everything else.
> > 
> > Events use callbacks themselves, so it's not that unusual.
> > 
> > I don't really think events would be a good interface for this. 
> > Consistency is good, but if one can come up with a better API, it's 
> > better to use that than just be consistent for the sake of it.
> 
> It does seem strange that events are right in the spatial domain (mouse 
> enter/exit), but not in the temporal domain.  Yet the basic semantic of 
> the english word "event", let alone the web meaning, is pretty well 
> exactly matched by what is happening here -- crossing a temporal 
> boundary!  Events are well-known and design uniformity suggests that 
> they be used, if nothing else.

An event is fired whenever a cue range is entered or exited (timeupdate), 
but I really don't think events are appropriate for the cue ranges 
themselves. To start with, it would decouple the range registration from 
the event registration. It would also mean losing the ability to register 
event listeners for cue ranges of a particular class rather than all of 
them. I'm also not sure we really want the whole capture/bubble mechanism 
for these callbacks, not to mention the ability for one callback to cancel 
another one, etc. Events just seem like a very blunt and heavy weapon for 
this task.

> > > In SMIL the equivalent concept is the <area> element which is used 
> > > like this:
> > >
> > > <video src="http://www.example.org/CoolStuff">
> > >             <area id="area1" begin="0s" end="5s"/>
> > >             <area id="area2" begin="5s" end="10s"/>
> > >  </video>
> > > 
> > > This kind of approach has several advantages.
> > > * Ranges are defined as part of the document, in the context of a 
> > > particular media stream.
> > 
> > I'm not sure why that is an advantage in the context of HTML.
> 
> Because it is declarative and 'close to' (or maybe later, even within) 
> the media resource.

If it's within, then it's not in the HTML, it's in the video stream, in 
which case the future APIs for in-band timed metadata are the solution we 
should use.

I don't see how being declarative and "close to" the media resource is an 
advantage in this case. Take, for example, YouTube. It seems much more 
likely that out-of-band timing metadata would be sent as a separate 
resource fetched using XMLHttpRequest than that the data would be included 
inline with the HTML markup.

> > > * This uses events, a more flexible and more appropriate callback 
> > > mechanism.
> > 
> > I don't really see why the flexibility of events is useful here, and I 
> > don't see why it's more appropriate.
> 
> But we ask the opposite: why is it compelling not to fit into the normal 
> way of [truncated in original]

See above for a list of ways in which events are inappropriate here.

> > > * The callbacks have a JavaScript object associated with them, 
> > > namely a DOM element, which carries information about the range.
> > 
> > That's useful, yes. Should we include some data with the callback?
> 
> Yes, if we cannot agree on this proposal, then some sort of cookie or ID 
> should be associated with a cue range (a string name of the range, for 
> example).

I've added an ID (arbitrary string) that is passed to the callback.
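
To illustrate why the ID helps, here is a rough sketch of one shared 
handler serving several ranges without any closures or currying. 
FakeMedia and advanceTo() are made up for this example so that it runs 
outside a browser; they only mimic the addCueRange() argument order used 
elsewhere in this mail and are not part of any spec. The half-open 
[start, end) treatment of ranges is likewise an assumption.

```javascript
// Hypothetical stand-in for a media element; NOT the real HTMLMediaElement.
function FakeMedia() {
  this.ranges = [];
}

// Same argument order as the addCueRange() calls in this thread:
// (className, id, start, end, pauseOnExit, enterCallback, exitCallback)
FakeMedia.prototype.addCueRange = function (className, id, start, end,
                                            pauseOnExit, onEnter, onExit) {
  this.ranges.push({ id: id, start: Number(start), end: Number(end),
                     onEnter: onEnter, onExit: onExit, active: false });
};

// Hypothetical helper: pretend playback reached `time` and fire callbacks.
// Exit callbacks run before enter callbacks; each receives the range's ID.
FakeMedia.prototype.advanceTo = function (time) {
  function contains(r) { return time >= r.start && time < r.end; }
  this.ranges.forEach(function (r) {
    if (r.active && !contains(r)) {
      r.active = false;
      if (r.onExit) r.onExit(r.id);
    }
  });
  this.ranges.forEach(function (r) {
    if (!r.active && contains(r)) {
      r.active = true;
      if (r.onEnter) r.onEnter(r.id);
    }
  });
};

// One plain handler for every range; the ID says which range was entered.
var titles = { intro: 'Hello!', body: 'World!', outro: 'How are you?' };
var log = [];
function slideEnter(id) {
  log.push('show ' + titles[id]);
}

var media = new FakeMedia();
media.addCueRange('slide', 'intro', 0, 10, false, slideEnter, null);
media.addCueRange('slide', 'body', 10, 20, false, slideEnter, null);
media.addCueRange('slide', 'outro', 20, 60, false, slideEnter, null);

media.advanceTo(5);
media.advanceTo(12);
media.advanceTo(25);
console.log(log.join('\n'));
```

The point is only that slideEnter() needs no per-range state captured at 
registration time; everything it needs arrives as its argument.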

> > We could include the class name, the start time, and the end time. 
> > Having said that, it's easy to use currying here to hook callbacks 
> > that know what they're expecting.
> 
> Currying is pretty advanced;  we're already concerned about using 
> scripting at all!

You need to use scripting anyway to make anything happen with the cue 
range. I don't see why scripting to register the range as well is any kind 
of a burden.

> > > We would like to suggest a <timerange> element that can be used as a 
> > > child of the <video> and <audio> elements.
> > 
> > It's not clear to me that this is solving any problems worth solving.
> 
> Well, we think we should first evaluate the two ways of doing this, and 
> then give weight, if appropriate, to the 'first written' way (yours).  
> We're technically still in WD so we should, if possible, prefer the 
> better solution.

Agreed.

> Let's look at a few comparison axes:
> 
> Declarative or established by script?  We prefer declarative, as we 
> think the most likely definers of what the cue-ranges are (as opposed to 
> how they are handled) are the media authors, not the page authors.

Declarative mechanisms make sense and are generally to be preferred if one 
can express everything statically. However, the whole point of the cue 
range API is to provide hooks for doing things dynamically that are not 
provided for normally, such as synchronising a video with a slide deck.

> Events or callbacks?  Since we see these as the temporal equivalent of 
> the spatial mouse events, we see events as the most natural analog.  

See above for why I disagree that events make sense here.

> They also have event identifiers, making it much easier to have separate 
> handlers for different ranges or events.

This is no longer a differentiating feature.

> Provide a framework for talking about time-ranges for other purposes 
> such as linking in or out?  Yes, annotated ranges like ours do provide 
> such a basis.

We already have those abilities. It's unclear why we would need a new 
mechanism to do this or why such a new mechanism would be a good idea.

The real key is: what are the use cases you are envisaging? It seems to me 
that we must have a different set of ideas about what this API is for.

> * adding an attribute to <source> called "annotations" which could point 
> at a variety of types, including at an XML file (to be defined) which 
> contains meta-data, cue-range definitions etc., as if they were part of 
> the media, would help move this out of the HTML5 but still provide a 
> uniform interface...

We might want to look at this in a future version, but I think we should 
get the basic API nailed down before inventing new formats!

On Sat, 12 Jul 2008, Philip Jägenstedt wrote:
> 
> Like Dave, I am not terribly enthusiastic about the current cue ranges 
> spec, which strikes me as adding a fair amount of complexity and yet 
> doesn't solve the basic use case in a straightforward manner.

What are the use cases you think are basic? It's unclear to me what isn't 
being solved. Here's one use case, a slide deck:

   <video src="talk.video" controls></video>
   <div id="slides"></div>
   <script>
     var deck; // list of <slide> elements, filled in once slides.xml loads
     var video = document.getElementsByTagName('video')[0];
     var slides = document.getElementById('slides');
     var x = new XMLHttpRequest();
     x.onreadystatechange = loadSlides;
     x.open("GET", "slides.xml");
     x.send();
     function loadSlides(e) {
       if (x.readyState == 4) {
         deck = x.responseXML.getElementsByTagName('slide');
         // register one cue range per slide, using the slide index as the ID
         for (var i = 0; i < deck.length; i += 1) {
           video.addCueRange('slide', i,
                             deck[i].getAttribute('start'),
                             deck[i].getAttribute('end'),
                             false, slideEnter, null);
         }
       }
     }
     // called with the ID of the cue range that was just entered
     function slideEnter(slide) {
       slides.innerHTML = deck[slide].getAttribute('html');
     }
   </script>

...this would work with a slides.xml file that looked like:

   <slides>
    <slide start="0" end="10" html="Hello!"/>
    <slide start="10" end="20" html="&lt;em&gt;World!&lt;/em&gt;"/>
    <slide start="20" end="60" html="How are you?"/>
   </slides>

This is obviously a simple example, but it doesn't seem like excessive 
complexity. Indeed, the actual cue-range-specific parts are a tiny part of 
the whole thing here; the XHR boilerplate is more lines of code.

> If I were a content author and looked at the available options to 
> display subtitles, I would probably simply add a timeupdate event 
> listener and use e.target.currentTime to decide on an action to take. 
> While lexical closures are fun and useful, depending on them isn't 
> terribly nice to those who don't have experience with functional 
> programming (you can use ECMAScript without realizing that it's a 
> functional language, so it doesn't count).
> 
> I agree that proper events make a lot of sense here instead of 
> callbacks. We could use some new event -- CueEvent maybe -- which would 
> actually include the start and stop times and a reference to the target 
> HTMLMediaElement. I might suggest a modified addCueRange which takes a 
> data argument which is also passed along in the event object.

Does the identifier argument address this sufficiently?
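
For comparison, the timeupdate approach mentioned in the quoted text might 
look roughly like the sketch below. The names findActiveRange, 
onTimeUpdate, ranges and shown are made up for illustration, the range 
data is hard-coded rather than loaded from anywhere, and treating ranges 
as half-open [start, end) intervals is an assumption. The event objects 
are simulated with plain objects so the sketch runs outside a browser.

```javascript
// Hypothetical range data; in a page this could come from a file like slides.xml.
var ranges = [
  { start: 0, end: 10 },
  { start: 10, end: 20 },
  { start: 20, end: 60 }
];

// Return the index of the first range containing `time`, or -1 if none does.
function findActiveRange(list, time) {
  for (var i = 0; i < list.length; i += 1) {
    if (time >= list[i].start && time < list[i].end) {
      return i;
    }
  }
  return -1;
}

var lastIndex = -1; // index of the range that was active on the previous event
var shown = [];     // record of every change, kept only for demonstration

// Event handler: re-scan the ranges and react only when the active one changes.
function onTimeUpdate(e) {
  var index = findActiveRange(ranges, e.target.currentTime);
  if (index !== lastIndex) {
    lastIndex = index;
    shown.push(index);
  }
}

// In a page this would be hooked up with something like:
//   video.addEventListener('timeupdate', onTimeUpdate, false);
// Here a few events are simulated instead.
[0, 4, 9.5, 10, 15, 20, 61].forEach(function (t) {
  onTimeUpdate({ target: { currentTime: t } });
});
console.log(shown.join(','));
```

Note that with this approach the script does the range bookkeeping itself 
on every timeupdate, and how promptly a boundary is noticed depends on how 
often the user agent fires that event.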

> If we support external annotations we need some open format for this 
> which all browsers can support. I'm not very familiar with SMIL, but it 
> looks like a Swiss army knife. Perhaps http://www.annodex.net/ is also 
> worth a closer look: "CMML is a HTML-like markup language for 
> time-continuous data such as audio/video." Then there's the new 
> http://www.w3.org/2008/01/media-annotations-wg.html which has a 
> relevant-sounding name, but I'm not sure it really is.

I'd rather wait until we have more experience in shipping browsers with this 
before adding more functionality like that.

On Wed, 16 Jul 2008, Silvia Pfeiffer wrote:
>
> [...] In such a scenario, all you need to do in the video element is the 
> creation of a set of javascript API calls that can directly make use of 
> the information defined in the CMML file, like is demonstrated in this 
> video: http://au.youtube.com/watch?v=LbWb1dkvm0s The code for this demo 
> is available here: 
> http://svn.annodex.net/browser_plugin/trunk/test/test.html .
>
> Notice how the problem of addressing cues has been taken totally out of 
> the javascript API - all we do in javascript is address time offsets. 
> The semantics of the time offsets is stored in the annotations, which 
> can be retrieved using their own javascript API call.

That's basically what we have now.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'