[whatwg] Cue points in media elements
ian at hixie.ch
Thu Oct 18 17:41:23 PDT 2007
On Sun, 29 Apr 2007, Brian Campbell wrote:
> The problem is that the callbacks execute "when the current playback
> position of a media element reaches" the cue point. It seems unclear to
> me what "reaching" a particular time means. If video playback freezes
> for a second, and so misses a cue point, is that considered to have been
> "reached"? Is there any way that you can guarantee that a cue point will
> be executed as long as video has passed a particular cue point? With a
> lot of bookkeeping and the "timeupdate" event along with the cue points,
> you may be able to keep track of the current time in the movie well
> enough to deal with the user skipping forward, pausing, and the video
> stalling and restarting due to running out of buffer. This doesn't
> address, as far as I can tell, issues like the thread displaying the
> video pausing for whatever reason and so skipping forward after it
> resumes, which may cause cue points to be lost, and which isn't
> specified to send a "timeupdate" event.
I've defined what "reaching" a particular time means. I have explicitly
made it invoke the callbacks for times that might get skipped due to
missing frames during normal playback. I have also made it _not_ fire the
callbacks for times in between the old and new positions when seeking.
> Basically, what is necessary is a way to specify that a cue point should
> always be fired as long as playback has passed a certain time, not just
> if it "reaches" a particular time. This would prevent us from having to
> do a lot of bookkeeping to make sure that cue points haven't been
> missed, and make everything simpler and less fragile.
You can use the "timeupdate" event for this -- it fires whenever a cue
point is hit, and whenever the timeline is seeked (even implicitly by the
user agent).
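The bookkeeping Brian describes can be kept quite small. A minimal sketch (the `firePassedCues` helper and the `{time, callback}` cue shape are illustrative, not anything in the spec):

```javascript
// Sketch only: fire every cue whose time lies in (lastTime, currentTime],
// so a cue is not missed even if playback jumps past it between events.
// `cues` is an array of { time, callback } objects (a hypothetical shape).
function firePassedCues(cues, lastTime, currentTime) {
  const fired = [];
  for (const cue of cues) {
    if (cue.time > lastTime && cue.time <= currentTime) {
      cue.callback(cue.time);
      fired.push(cue.time);
    }
  }
  return fired;
}

// Wiring it to a media element would look roughly like:
//   let last = 0;
//   video.addEventListener("timeupdate", () => {
//     firePassedCues(cues, last, video.currentTime);
//     last = video.currentTime;
//   });
```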
> For now, we are focusing on captioning for the deaf. We have voiceovers
> on some screens with no associated video, video that appears in various
> places on the screen, and the occasional sound effects. Because there is
> not a consistent video location, nor is there even a frame for
> voiceovers to appear in, we don't display the captions directly over the
> video, but instead send events to the current screen, which is
> responsible for catching the events and displaying them in a location
> appropriate for that screen, usually a standard location. In the current
> spec, all that is provided for is controls to turn closed captions on or
> off. What would be much better is a way to enable the video element to
> send caption events, which include the text of the current caption, and
> can be used to display those captions in a way that fits the design of
> the content better.
I've added this to the list for version 2 features. I'm interested in
seeing what the requirements are for captions before we go ahead and spec
them in too much detail. Implementation feedback will be helpful here.
Thanks for your feedback!
On Mon, 30 Apr 2007, Ralph Giles wrote:
> I'd be more in favor of triggering any cue point callbacks that lie
> between the current playback position and the current playback position
> of the next frame (audio frame for <audio/> and video frame for <video/>
> I guess). That means more bookkeeping to implement your system, but is
> less surprising in other cases.
Could you elaborate on this? Right now the system triggers cue points up
to the current displayed frame, and some cue points between the current
frame and the next frame, if the gap between frames is long enough that
the current time updates more often than the frame rate.
> As I read it, cue points are relative to the current playback position,
> which does not advance if the stream buffer underruns, but it would if
> playback restarts after a gap, as might happen if the connection drops,
> or in an RTP stream. My proposal above would need to be amended to
> handle that case, and the decoder dropping frames...finding the right
> language here is hard.
Does the new text work for this?
> A more abstract interface is necessary than just 'caption events'. Here
> are some use cases worth considering:
> * A media file has embedded textual metadata like title, author,
> copyright license, that the designer would like to access for associated
> display elsewhere in the page, or to alter the displayed user interface
> based on the metadata. This is pretty essential for parity with
> flash-based internet radio players.
> * The designer wants to access closed captioned or subtitle text
> through the DOM as it becomes available for display elsewhere in the
> page.
> * There are points in the media file where the embedded metadata
> changes. These points cannot be retrieved without scanning the file,
> which is expensive over the network, and may not be possible in general
> if the stream is a live feed. Nevertheless, the designer wants to be
> notified when the associated metadata changes so other elements can be
> updated. This is in fact the normal case for http streaming internet
> radio with either Ogg Vorbis or mp3.
> * The media file natively contains an embedded set of cue points, which
> the designer wants to access to display a hyperlinked table of contents
> for the media file. This is possible with a CMML substream in Ogg
> Theora, or with chapter tables in Quicktime and AVI files.
These are already on the v2 feature request list.
> * A media file has embedded non-textual metadata like an album cover
> image, that the designer would like to access for display elsewhere in
> the page.
> * The media file records a presentation, including still images of the
> slides at the times they were presented. These are distinct from the
> "album art" for the stream, which is a photo of the speaker.
Are these common? These seem like very edge-case features.
On Tue, 1 May 2007, Kevin Calhoun wrote:
> I believe that a cue point is "reached" if its time is traversed during
> playback.
That's what I made the spec say.
On Tue, 1 May 2007, Brian Campbell wrote:
> What does "traversed" mean in terms of (a) seeking across the cue point
Not traversed, per the new spec text.
> (b) playing in reverse (rewinding)
Ooh, hadn't thought of that one. Right now cue points are only reached
when going forward. Should that change to work both ways? Anyone have an
opinion on that?
> and (c) the media stalling and restarting at a later point in the stream?
That doesn't happen in the current model. If you stall, you pause.
On Wed, 2 May 2007, Dave Singer wrote:
> I would say that playing (at any rate and in any direction) is a
> continuous function, and therefore cue points are triggered, when
> playing, whenever two samples of the time straddle the cue point (where
> "straddle" includes one of the samples being at the cue point).
Backwards too? Wouldn't cue points usually be designed with an assumption
that the video is going forward? (e.g. if you have a cue point for "show"
and one for "hide", when going backwards you'd have to reverse them... it
seems most authors wouldn't think of that, resulting in a bunch of errors
when going backwards, if we were to actually fire the cue points.)
> Seeking is discontinuous, and therefore cue points are triggered only if
> a seek results in landing on the cue point, if not playing. If playing,
> then the usual rules apply.
> Frame dropping, stalling, and so on, are aspects of the playback
> behavior and nothing to do with the logical model of cues laid on a
> timeline.
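Dave's straddle rule, together with his seek rule, can be sketched as a pure function. This is a reading of the proposal, not spec text, and the names are illustrative:

```javascript
// Sketch of the model under discussion.
// During playback: a cue fires when two successive time samples straddle
// it (inclusive of a sample landing exactly on the cue).
// During a seek: a cue fires only if the seek lands exactly on it.
function cuesTriggered(cueTimes, prev, curr, isSeek) {
  if (isSeek) {
    return cueTimes.filter(t => t === curr);
  }
  const lo = Math.min(prev, curr);
  const hi = Math.max(prev, curr);
  // Symmetric in direction; whether backwards playback should fire cues
  // at all is the open question in the thread. A cue exactly at `prev`
  // is excluded because it already fired on the previous sample.
  return cueTimes.filter(t => t >= lo && t <= hi && t !== prev);
}
```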
On Wed, 2 May 2007, Kevin Calhoun wrote:
> A discontinuous jump will result in a timeupdate notification, which
> among other things is supposed to enable scripts to issue notifications
> of interesting times that are traversed not during playback but while
> seeking.
On Tue, 1 May 2007, Brian Campbell wrote:
> We don't expose arbitrary seeking controls to our users; just
> play/pause, skip forward & back one card (which resets all state to a
> known value) and skip past the current video/audio (which just causes
> all waits on that media element to return instantly).
With <video> you can't stop your users from seeking; the user agent will
allow arbitrary control over the video.
> Actually, that brings up another point, which is a bit more speculative.
> It may be nice to have a way to register a callback that will be called
> at animation rates (at least 15 frames/second or so) that is called with
> the current play time of a media element. This would allow you to keep
> animations in sync with video, even if the video might stall briefly, or
> seek forward or backward for whatever reason. We haven't implemented
> this in our current system (as I said, it still has the bug that
> animations still take their full time to play even when you skip video),
> but it may be helpful for this sort of thing.
Yeah, that would be interesting in a future version, I think.
> I agree, it would be possible, but from my current reading of the spec
> it sounds like some cue points might be missed until quite a bit later
> (since timeupdate isn't guaranteed to be called every time anything
> discontinuous happens with the media).
When does timeupdate not fire? My intention is that timeupdate fires
whenever something discontinuous happens, but I may have missed some
cases. Could you elaborate?
> My instinct is to avoid trying to make a more general interface if
> possible. There are endless types of access you can build to information
> in underlying media elements, and I think it would put a large burden on
> implementors if they had to support accessing all of those types of
> information. Accessibility is one of the most important concerns in
> HTML, however, so I think that having special case support for
> accessibility without providing all of the other features would be an
> acceptable tradeoff.
I agree, especially at this early stage.
On Tue, 1 May 2007, ddailey wrote:
> I know SMIL seems funky to some people, but I do really love it! It is
> so way cool! So far as I know it doesn't do quite what you're talking
> about here, but it does similar stuff including non-linear distortions
> of timing elements and the like.
That seems somewhat excessive for what we want. :-)
> It's declarative (though I don't think it's Turing complete -- wager of
> virtual beans proposed) and its syntax is worthy of emulation in that
> classical "ontology recapitulates philology" sort of sense. It is so
> much a W3C standard that it has six or eight or twelve standards devoted
> to it.
I'm somewhat wary of taking on something quite as big as to have "six or
eight or twelve standards devoted to it" for what is, for HTML, just a
small feature as part of a much bigger language. I'm also wary of
technologies whose Turing completeness people are ready to make wagers
about.
On Tue, 1 May 2007, Billy Wong wrote:
> In order to capture this kind of situation, with flexibility in mind, I
> think the concept of "cue points" may be changed to "cue periods"...
> Method names:
> addEnterCuePeriod(time1, time2, callback)
> removeEnterCuePeriod(time1, time2, callback)
> addLeaveCuePeriod(time1, time2, callback)
> removeLeaveCuePeriod(time1, time2, callback)
> The callback function mentioned by addEnterCuePeriod will be invoked
> once when the video enters the period of time bounded by time1 and time2.
> How the video gets to a frame between time1 and time2 doesn't matter.
> i.e. the callback function may be invoked by a normally playing video
> reaching time1, a video being fast-forwarded / wound back into the period
> between time1 & time2, or a particular timing between time1 & time2 of
> the video being directly sought.
> The mechanism of LeaveCuePeriod is similar, but this time the callback
> is invoked when the video leaves the specified cue period. (Or should
> this pair of methods be left out?)
> With these four methods, one can not only achieve the "bullet point"
> effect, but also video captions appearance and disappearance.
That's quite an interesting idea. I think it's separate from cue points,
but I've added it to the v2 ideas list.
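For what it's worth, the enter/leave semantics Billy proposes reduce to a simple transition check. A sketch under assumed names, not a proposed API:

```javascript
// Sketch: detect enter/leave transitions for a cue period [t1, t2],
// given the previous and current playback positions. How the position
// moved (normal playback, fast-forward, seek) deliberately doesn't
// matter; only whether a sampled position falls inside the period.
function periodTransition(t1, t2, prev, curr) {
  const wasIn = prev >= t1 && prev <= t2;
  const isIn = curr >= t1 && curr <= t2;
  if (!wasIn && isIn) return "enter";
  if (wasIn && !isIn) return "leave";
  return null; // no transition (including a jump clear over the period)
}
```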
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'