[whatwg] Cue points in media elements
giles at xiph.org
Mon Apr 30 16:15:07 PDT 2007
Thanks for adding to the discussion. We're very interested in
implementing support for presentations as well, so it's good
to hear from someone with experience.
Since we work on streaming media formats, I always assumed things would
have to be broken up by the server and the various components streamed
separately to a browser, and I hadn't noticed the cue point support
until you pointed it out.
Some comments and questions below...
On Sun, Apr 29, 2007 at 03:14:27AM -0400, Brian Campbell wrote:
> in our language, you might see something like this:
> (movie "Foo.mov" :name 'movie)
> (wait @movie (tc 2 3))
> (show @bullet-1)
> (wait @movie)
> (show @bullet-2)
> If the user skips to the end of the media clip, that simply causes
> all WAITs on that media clip to return instantly. If they skip
> forward in the media clip, without ending it, all WAITs before that
> point will return instantly.
How does this work if, for example, the user seeks forward, and then
back to an earlier position? Would some of the 'show's be undone, or do
they not seek backward with the media playback? Is the essential
requirement of your system that all the shows be called in sequence
to build up a display state, or only that the last trigger before the
current playback point has fired? Isn't this slow if a seek fires a
bunch of intermediate animations?
Does your system support live streaming as well? That complicates the
design somewhat, since the presentation's media updates appear dynamically.
Anyway I think you could implement your system with the currently
proposed interface by checking the current playback position and
clearing a separate list of waits inside your timeupdate callback.
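To make that concrete, here is a minimal sketch of such a wait list. The
`timeupdate` and `ended` events are from the draft spec; `makeWaitList`
and its methods are names I've made up for illustration:

```javascript
// Sketch of the WAIT mechanism built on a single timeupdate handler
// rather than per-point cue callbacks. Only the event names are from
// the draft spec; the wait-list shape is invented for illustration.
function makeWaitList() {
  const pending = []; // { time, callback } entries, unordered

  return {
    // Register a callback to fire once playback reaches `time` seconds.
    wait(time, callback) {
      pending.push({ time, callback });
    },
    // Call from the media element's timeupdate event with its
    // currentTime; fires every wait at or before that position.
    update(currentTime) {
      for (let i = pending.length - 1; i >= 0; i--) {
        if (pending[i].time <= currentTime) {
          const { callback } = pending.splice(i, 1)[0];
          callback();
        }
      }
    },
    // Call from the ended event: every outstanding wait returns.
    end() {
      while (pending.length) pending.pop().callback();
    },
  };
}
```

Wiring it up would then be something like
`video.addEventListener('timeupdate', () => waits.update(video.currentTime))`
and `video.addEventListener('ended', () => waits.end())`.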
> This is a nice system, but I can't see how even as simple a system as
> this could be implemented given the current specification of cue
> points. The problem is that the callbacks execute "when the current
> playback position of a media element reaches" the cue point. It seems
> unclear to me what "reaching" a particular time means.
I agree this should be clarified. The appropriate interpretation is
that a cue point fires when the current playback position reaches the
frame corresponding to it, but digital media has quantized frames,
while the cue points are floating point numbers. Triggering all cue
point callbacks between the last current playback position and the
current one (including during seeks) would be one option, and would do
what you want as long as you aren't seeking backward. I'd be more in
favor of triggering any cue point callbacks that lie between the
current playback position and the current playback position of the
next frame (the next audio frame for <audio/> and the next video frame
for <video/>, I guess). That means more bookkeeping to implement your
system, but is less surprising in other cases.
> If video
> playback freezes for a second, and so misses a cue point, is that
> considered to have been "reached"?
As I read it, cue points are relative to the current playback
position, which does not advance if the stream buffer underruns, but
which would jump if playback restarts after a gap, as might happen if
the connection drops, or in an RTP stream. My proposal above would
need to be amended to handle that case, and the decoder dropping
frames... finding the right language here is hard.
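The frame-interval rule I'm proposing can be stated as a small pure
function (the function name and arguments are mine, not the spec's):

```javascript
// Sketch of the proposed triggering rule: fire every cue point in the
// half-open interval [currentPosition, nextFramePosition), so each
// floating-point cue time lands on exactly one quantized frame during
// normal playback. Names are illustrative, not from the draft spec.
function cuePointsToFire(cuePoints, currentPosition, nextFramePosition) {
  // cuePoints: array of times in seconds
  return cuePoints.filter(
    (t) => t >= currentPosition && t < nextFramePosition
  );
}
```

At 25 fps, for example, a cue point at 2.03 s would fire while decoding
the frame that covers [2.00, 2.04).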
> In the current spec, all that is
> provided for is controls to turn closed captions on or off. What
> would be much better is a way to enable the video element to send
> caption events, which include the text of the current caption, and
> can be used to display those captions in a way that fits the design
> of the content better.
I really like this idea. It would also be nice if, for example, the
closed caption text were available through the DOM so it could be
presented elsewhere, searched locally, and so on. But what about things
like album art, which might be embedded in an audio stream? Should that
be accessible? Should a video element expose a set of known cue points
embedded in the file?
A more abstract interface than just 'caption events' is necessary. Here
are some use cases worth considering:
* A media file has embedded textual metadata like title, author,
copyright license, that the designer would like to access for associated
display elsewhere in the page, or to alter the displayed user interface
based on the metadata. This is pretty essential for parity with
flash-based internet radio players.
* A media file has embedded non-textual metadata like an album cover
image, that the designer would like to access for display elsewhere in
the page.
* The designer wants to access closed caption or subtitle text
through the DOM as it becomes available, for display elsewhere in the
page.
* There are points in the media file where the embedded metadata
changes. These points cannot be retrieved without scanning the file,
which is expensive over the network, and may not be possible in general
if the stream is a live feed. Nevertheless, the designer wants to be
notified when the associated metadata changes so other elements can be
updated. This is in fact the normal case for HTTP-streamed internet
radio with either Ogg Vorbis or MP3.
* The media file natively contains an embedded set of cue points, which
the designer wants to access to display a hyperlinked table of contents
for the media file. This is possible with a CMML substream in Ogg
Theora, or with chapter tables in Quicktime and AVI files.
* The media file records a presentation, including still images of the
slides at the times they were presented. These are distinct from the
"album art" for the stream, which is a photo of the speaker.
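One purely hypothetical shape for the metadata-change cases, sketched
here only to make the discussion concrete (the 'metadatachange' event
name and the `metadata` field are invented; nothing like this is in
the draft spec):

```javascript
// Hypothetical: a media element that fires a 'metadatachange' event
// whenever the embedded stream metadata (title, artist, etc.) changes,
// e.g. at track boundaries in an Icecast-style radio stream. Modeled
// with a stand-in class; only the listener pattern is standard DOM.
class FakeMediaElement {
  constructor() {
    this.listeners = {};
  }
  addEventListener(type, fn) {
    (this.listeners[type] ||= []).push(fn);
  }
  // Called by the (imaginary) decoder when a new metadata block arrives.
  _metadataChanged(metadata) {
    for (const fn of this.listeners['metadatachange'] || []) {
      fn({ type: 'metadatachange', metadata });
    }
  }
}
```

A page could then mirror the stream metadata into the document with
something like
`audio.addEventListener('metadatachange', e => titleNode.textContent = e.metadata.title)`.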
All of these can be handled by special server-side components and AJAX,
for example, so the main question is whether the media elements should
expose this sort of data through the DOM.