[whatwg] Cue points in media elements

Sun Apr 29 00:14:27 PDT 2007

I'm a developer of a custom engine for interactive multimedia, and  
I've recently noticed the work WHATWG has been doing on adding  
<video> and <audio> elements to HTML. I'm very glad to see these  
being proposed for addition to HTML, because if they (and several  
other features) are done right, it means that there may be a chance  
for us to stop using a custom engine, and use an off-the-shelf HTML  
engine, putting our development focus on our authoring tools instead.  
My hope is that eventually, if these features get enough penetration,
we can put our content up on the web directly, rather than having to
distribute the runtime software with it.

I've taken a look at the current specification for media elements,  
and on the whole, it looks like it would meet our needs. We are  
currently using VP3, and a combination of MP3 and Vorbis audio, for  
our codecs, so having Ogg Theora (based on VP3) and Ogg Vorbis as a  
baseline would be completely fine with us, and much preferable to the  
patent issues and licensing fees we'd need to deal with if we used  
MPEG4.

For the sort of content that we produce, cue points are incredibly  
important. Most of our content consists of a video or voiceover  
playing while bullet points appear, animations play, and graphics are  
revealed, all in sync with the video. We have a very simple system  
for doing cue points, that is extremely easy for the content authors  
to write and is robust for paused media, media that is skipped to the  
end, etc. We simply have a blocking call, WAIT, that waits until a
specified media element reaches a given point or its end. For instance,
in our language, you might see something like this:

   (movie "Foo.mov" :name 'movie)
   (wait @movie (tc 2 3))
   (show @bullet-1)
   (wait @movie)
   (show @bullet-2)

If the user skips to the end of the media clip, that simply causes  
all WAITs on that media clip to return instantly. If they skip
forward in the media clip, without ending it, all WAITs before that  
point will return instantly. If the user pauses the media clip, all  
WAITs on the media clip will block until it is playing again.
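
To make these rules concrete, here is a rough sketch in TypeScript of the condition under which a WAIT returns. It is only a model of the behaviour described above, not our engine's actual code; the names ClipState and waitShouldReturn are made up for illustration, and times are assumed to be plain seconds.

```typescript
// Hypothetical model of a media clip's state; not part of any real API.
interface ClipState {
  position: number; // current playback position, in seconds
  duration: number; // total length of the clip, in seconds
  paused: boolean;  // true while the user has paused playback
}

// Returns true when a WAIT on `clip` should return.
// `target` is the cue time in seconds; omit it to wait for the end of the clip.
function waitShouldReturn(clip: ClipState, target?: number): boolean {
  const ended = clip.position >= clip.duration;
  if (ended) return true;                 // skipping to the end releases every WAIT
  if (clip.paused) return false;          // a paused clip keeps its WAITs blocked
  if (target === undefined) return false; // waiting for the end, which has not come yet
  return clip.position >= target;         // "passed", not merely "exactly reached"
}
```

The important design choice is the last comparison: a WAIT is released whenever the position is at or beyond the target, so it does not matter how the position got there.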

This is a nice system, but I can't see how even as simple a system as  
this could be implemented given the current specification of cue  
points. The problem is that the callbacks execute "when the current  
playback position of a media element reaches" the cue point. It seems  
unclear to me what "reaching" a particular time means. If video  
playback freezes for a second, and so misses a cue point, is that
cue point considered to have been "reached"? Is there any way to
guarantee that a cue point's callback will be executed as long as
playback has passed that cue point? With a lot of bookkeeping and the
"timeupdate" event along with the cue points, you may be able to keep  
track of the current time in the movie well enough to deal with the  
user skipping forward, pausing, and the video stalling and restarting  
due to running out of buffer. This doesn't address, as far as I can  
tell, issues like the thread displaying the video pausing for  
whatever reason and so skipping forward after it resumes, which may  
cause cue points to be lost, and which isn't specified to send a  
"timeupdate" event.

Basically, what is necessary is a way to specify that a cue point  
should always be fired as long as playback has passed a certain time,  
not just if it "reaches" a particular time. This would prevent us  
from having to do a lot of bookkeeping to make sure that cue points  
haven't been missed, and make everything simpler and less fragile.
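
As an illustration of the "passed" semantics I am asking for, here is a minimal sketch in TypeScript. It is not a proposed API; the class name PassedCueTracker and its methods are hypothetical, and it assumes that something (the browser, ideally) calls update() with the latest playback position, in seconds, whenever that position changes.

```typescript
type CueCallback = () => void;

interface Cue {
  time: number;          // cue time in seconds
  callback: CueCallback; // what to run once playback has passed `time`
  fired: boolean;        // each cue fires at most once in this sketch
}

// Hypothetical helper: fires every cue whose time has been passed,
// even if playback jumped over the exact cue time.
class PassedCueTracker {
  private cues: Cue[] = [];

  addCue(time: number, callback: CueCallback): void {
    this.cues.push({ time, callback, fired: false });
    this.cues.sort((a, b) => a.time - b.time); // keep cues in time order
  }

  // Fires all not-yet-fired cues at or before `position`, in time order.
  // Returns how many cues fired during this call.
  update(position: number): number {
    let firedNow = 0;
    for (const cue of this.cues) {
      if (cue.time > position) break; // later cues have not been passed yet
      if (!cue.fired) {
        cue.fired = true;
        cue.callback();
        firedNow++;
      }
    }
    return firedNow;
  }
}

// Usage: playback stalls after 1 second and resumes at 6 seconds.
const demoLog: string[] = [];
const demoTracker = new PassedCueTracker();
demoTracker.addCue(5, () => { demoLog.push("bullet-2"); });
demoTracker.addCue(2, () => { demoLog.push("bullet-1"); });
demoTracker.update(1); // nothing has been passed yet
demoTracker.update(6); // both cues were skipped over, so both fire now
demoTracker.update(6); // already fired, so nothing fires again
// demoLog is now ["bullet-1", "bullet-2"]
```

This is exactly the bookkeeping I would rather not have to write by hand, and it only works if update() is reliably called after every stall or jump, which is the part the current text does not seem to guarantee.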

We're also greatly interested in making our content accessible, to  
meet Section 508 requirements. For now, we are focusing on captioning  
for the deaf. We have voiceovers on some screens with no associated  
video, video that appears in various places on the screen, and the  
occasional sound effects. Because there is not a consistent video  
location, nor is there even a frame for voiceovers to appear in, we  
don't display the captions directly over the video, but instead send  
events to the current screen, which is responsible for catching the  
events and displaying them in a location appropriate for that screen,  
usually a standard location. The current spec provides only
controls to turn closed captions on or off. What
would be much better is a way to enable the video element to send  
caption events, which include the text of the current caption, and  
can be used to display those captions in a way that fits the design  
of the content better.
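
A rough sketch of what I mean by caption events, again in TypeScript. Everything here is an assumption made for illustration: the Caption shape, the CaptionDispatcher class, and the listener signature are not taken from the spec or from our engine, and the caption text is placeholder data.

```typescript
interface Caption {
  start: number; // seconds at which the caption appears
  end: number;   // seconds at which the caption disappears
  text: string;  // the caption text handed to the page
}

// Receives the text of the current caption, or null when there is none.
type CaptionListener = (text: string | null) => void;

// Hypothetical dispatcher: notifies the listener whenever the active caption changes.
class CaptionDispatcher {
  private captions: Caption[];
  private listener: CaptionListener;
  private current: Caption | null = null;

  constructor(captions: Caption[], listener: CaptionListener) {
    this.captions = captions;
    this.listener = listener;
  }

  // Call with the latest playback position, in seconds.
  update(position: number): void {
    let active: Caption | null = null;
    for (const caption of this.captions) {
      if (position >= caption.start && position < caption.end) {
        active = caption;
        break;
      }
    }
    if (active !== this.current) {
      this.current = active;
      this.listener(active ? active.text : null);
    }
  }
}

// Usage: the screen decides where and how to display each caption.
const shownCaptions: Array<string | null> = [];
const demoDispatcher = new CaptionDispatcher(
  [
    { start: 0, end: 2, text: "Hello" },
    { start: 3, end: 5, text: "World" },
  ],
  (text) => { shownCaptions.push(text); }
);
demoDispatcher.update(1);   // "Hello" becomes active
demoDispatcher.update(1.5); // still "Hello", so no new event
demoDispatcher.update(2.5); // gap between captions, listener gets null
demoDispatcher.update(4);   // "World" becomes active
// shownCaptions is now ["Hello", null, "World"]
```

The point is that the listener, not the video element, owns the presentation, so a screen with a voiceover and no video frame can still show captions in its usual location.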

I hope these comments make sense; let me know if you have any  
questions or suggestions.

Thanks,
Brian Campbell
Interactive Media Lab, Dartmouth College
http://iml.dartmouth.edu