[whatwg] Cue points in media elements

Ralph Giles giles at xiph.org
Mon Apr 30 16:15:07 PDT 2007

Thanks for adding to the discussion. We're very interested in 
implementing support for presentations as well, so it's good
to hear from someone with experience. 

Since we work on streaming media formats, I always assumed things would 
have to be broken up by the server and the various components streamed 
separately to a browser, and I hadn't noticed the cue point support 
until you pointed it out.

Some comments and questions below...

On Sun, Apr 29, 2007 at 03:14:27AM -0400, Brian Campbell wrote:

> in our language, you might see something like this:
>   (movie "Foo.mov" :name 'movie)
>   (wait @movie (tc 2 3))
>   (show @bullet-1)
>   (wait @movie)
>   (show @bullet-2)
> If the user skips to the end of the media clip, that simply causes  
> all WAITs on that  media clip to return instantly. If they skip  
> forward in the media clip, without ending it, all WAITs before that  
> point will return instantly.
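
If I follow the semantics, a rough JavaScript analogue of that
wait/show sequence (modelling only the wait list, not a real media
element; all names here are illustrative) would be:

```javascript
// Hypothetical analogue of the wait/show example above. fireCuePoint
// timing is simulated by calling advanceTo() as playback progresses.
const pending = [];                       // waits not yet satisfied

function waitFor(time, action) {          // register a wait at `time`
  pending.push({ time, action });
}

// Called when playback reaches (or seeks past) position `pos`;
// every wait at or before `pos` returns "instantly", as described.
function advanceTo(pos) {
  for (const w of pending.filter(w => w.time <= pos)) w.action();
  // keep only waits still in the future
  for (let i = pending.length - 1; i >= 0; i--)
    if (pending[i].time <= pos) pending.splice(i, 1);
}

const shown = [];
waitFor(2.1, () => shown.push('bullet-1'));  // ~ (wait @movie (tc 2 3))
waitFor(5.0, () => shown.push('bullet-2'));  // ~ (wait @movie)
advanceTo(3.0);                              // user seeks to 3s
```

Seeking forward just calls advanceTo() with the new position, which
is why all earlier WAITs return instantly.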

How does this work if, for example, the user seeks forward and then
back to an earlier position? Are some of the 'show's undone, or do
they not track backward seeks in the media playback? Is the essential
requirement of your system that all the shows be called in sequence
to build up a display state, or only that the last trigger before the
current playback point has fired? Isn't a seek slow if it fires a
bunch of intermediate animations?

Does your system support live streaming as well? That complicates the
design somewhat, since the presentation's media updates arrive
dynamically.

Anyway I think you could implement your system with the currently 
proposed interface by checking the current playback position and 
clearing a separate list of waits inside your timeupdate callback.
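
Concretely, a minimal sketch of that approach (the timeupdate event
and currentTime attribute are from the draft; everything else is
illustrative):

```javascript
// Sketch: clearing a list of waits from a timeupdate handler.
// `video` is assumed to expose currentTime and fire 'timeupdate'
// events as in the current proposal.
function makeWaitList(video) {
  const waits = [];
  video.addEventListener('timeupdate', () => {
    const pos = video.currentTime;
    // fire every wait at or before the new position, earliest
    // first, which also handles seeks that jump past a wait
    const due = waits.filter(w => w.time <= pos)
                     .sort((a, b) => a.time - b.time);
    for (const w of due) {
      waits.splice(waits.indexOf(w), 1);
      w.callback();
    }
  });
  // waitFor(t, cb): invoke cb once playback passes time t
  return function waitFor(time, callback) {
    waits.push({ time, callback });
  };
}
```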

> This is a nice system, but I can't see how even as simple a system as  
> this could be implemented given the current specification of cue  
> points. The problem is that the callbacks execute "when the current  
> playback position of a media element reaches" the cue point. It seems  
> unclear to me what "reaching" a particular time means.

I agree this should be clarified. The appropriate interpretation
should be when the current playback position reaches the frame
corresponding to the cue point, but digital media has quantized
frames, while the cue points are floating point numbers. One option
would be to trigger all cue point callbacks between the last current
playback position and the current one (including during seeks); that
would do what you want as long as you aren't seeking backward. I'd be
more in favor of triggering any cue point callbacks that lie between
the current playback position and the current playback position of
the next frame (audio frame for <audio/> and video frame for
<video/>, I guess). That means more bookkeeping to implement your
system, but is less surprising in other cases.

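As a sketch, the selection rule I'm proposing would look something
like this (names are illustrative; frameDuration would come from the
decoder):

```javascript
// Fire exactly the cue points that fall in the half-open interval
// [currentPos, currentPos + frameDuration), i.e. those the playback
// position will pass before the next frame is presented.
function cuePointsToFire(cuePoints, currentPos, frameDuration) {
  return cuePoints.filter(
    t => t >= currentPos && t < currentPos + frameDuration
  );
}
```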
>                                                           If video  
> playback freezes for a second, and so misses a cue point, is that  
> considered to have been "reached"?

As I read it, cue points are relative to the current playback
position, which does not advance if the stream buffer underruns, but
which could jump forward if playback restarts after a gap, as might
happen if the connection drops, or in an RTP stream. My proposal
above would need to be amended to handle that case, and also the
decoder dropping frames... finding the right language here is hard.

> In the current spec, all that is  
> provided for is controls to turn closed captions on or off. What  
> would be much better is a way to enable the video element to send  
> caption events, which include the text of the current caption, and  
> can be used to display those captions in a way that fits the design  
> of the content better.

I really like this idea. It would also be nice if, for example, the 
closed caption text were available through the DOM so it could be
presented elsewhere, searched locally, and so on. But what about things 
like album art, which might be embedded in an audio stream? Should that 
be accessible? Should a video element expose a set of known cue points 
embedded in the file? 

A more abstract interface than just 'caption events' is necessary.
Here are some use cases worth considering:

* A media file has embedded textual metadata such as title, author,
and copyright license, which the designer would like to access for
display elsewhere in the page, or to use to alter the displayed user
interface. This is pretty essential for parity with Flash-based
internet radio players.

* A media file has embedded non-textual metadata, such as an album
cover image, which the designer would like to access for display
elsewhere in the page.

* The designer wants to access closed caption or subtitle text
through the DOM as it becomes available, for display elsewhere in the
page.

* There are points in the media file where the embedded metadata 
changes. These points cannot be retrieved without scanning the file, 
which is expensive over the network, and may not be possible in general 
if the stream is a live feed. Nevertheless, the designer wants to be 
notified when the associated metadata changes so other elements can be 
updated. This is in fact the normal case for HTTP-streamed internet
radio with either Ogg Vorbis or MP3.

* The media file natively contains an embedded set of cue points, which 
the designer wants to access to display a hyperlinked table of contents 
for the media file. This is possible with a CMML substream in Ogg
Theora, or with chapter tables in QuickTime and AVI files.

* The media file records a presentation, including still images of the 
slides at the times they were presented. These are distinct from the 
"album art" for the stream, which is a photo of the speaker.

All of these can be handled by special server-side components and AJAX, 
for example, so the main question is whether the media elements should 
expose this sort of data through the DOM.
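
To be concrete about that last question, here is a purely
hypothetical sketch of the kind of notification surface these use
cases imply; none of these names exist in the current draft:

```javascript
// Hypothetical only: a tiny dispatcher modelling a 'metadatachange'
// notification, as an internet radio stream switches tracks
// mid-playback.
function createMetadataSource() {
  const handlers = [];
  let metadata = {};
  return {
    // page script subscribes, e.g. to update a "now playing" panel
    onMetadataChange(fn) { handlers.push(fn); },
    // called by the (hypothetical) decoder when new tags arrive
    update(newMetadata) {
      metadata = { ...metadata, ...newMetadata };
      handlers.forEach(fn => fn(metadata));
    },
    get metadata() { return metadata; },
  };
}
```

The same shape would cover the caption, album-art, and chapter-table
cases: the element exposes the current value through the DOM and
fires an event when it changes.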
