[whatwg] How to handle multitrack media resources in HTML

Ian Hickson ian at hixie.ch
Wed Apr 20 17:22:01 PDT 2011

On Sun, 10 Apr 2011, Silvia Pfeiffer wrote:
> On Fri, Apr 8, 2011 at 4:54 PM, Ian Hickson <ian at hixie.ch> wrote:
> >
> > What is a "main media resource"?
> >
> > e.g. consider youtubedoubler.com; what is the main resource?
> >
> > Or similarly, when watching the director's commentary track on a 
> > movie, is the commentary the main track, or the movie?
> >
> I don't think youtubedoubler.com is the main use case here. In the 
> youtubedoubler.com use case, you have two independent videos that make 
> sense by themselves, but are only coupled together by their timeline.
> The cases that I listed above, audio descriptions, sign language video, 
> and dubbed audio tracks, make no sense by themselves. They are produced 
> with a clear reference to one specific video and its details and could 
> be delivered either as in-band tracks or as external files. From a 
> developer and user point of view - and in analogy to the track element - 
> it makes no sense to regard them as independent media resources. They 
> all refer to a "main" resource - the original video.

I don't know which is the "main use case"; I wouldn't be surprised if 
sites like youtubedoubler.com had as many if not more viewers than those 
with sign language videos. In any case, we have to handle both.

My point was just that there isn't a well-defined "main media resource".

> However, there are more similarities between audio, video and text 
> tracks than one might think.
> For example, it is possible to want to have multiple video tracks and 
> multiple text tracks rendered on top of a single video rendering area, 
> and they may all be explicitly positioned just like positioned captions 
> and they may all need to avoid each other. So, it could make sense to 
> include them all in a single rendering approach.

One could say the same about <div>. It seems like a bit of a superficial 
similarity.

Similarities between audio and video tracks and text tracks are only 
really interesting here if they're not also similarities that apply to 
other even more unrelated things.

> Another example is that you may have an audio track with different 
> captions to the captions of a related video element. Since the audio 
> track has no visual display, its captions are not rendered, but the 
> video's captions are rendered. Now, how are you going to make its 
> captions available to the video's display area when the linked audio 
> track is activated?

Do you have a concrete example of this? I'm not sure I really follow.

> Some things will inherently be harder by taking the approach of separate 
> video and audio elements rather than the track approach.

I don't really see how this particular example relates to the issue of 
audio/video tracks being treated similarly or differently than text 
tracks. I agree that the described behaviour might need some tweaks to 
handle properly, but I don't think those tweaks would involve making the 
handling of audio/video tracks and text tracks more similar to each other.

> > On Mon, 28 Mar 2011, Silvia Pfeiffer wrote:
> >>
> >> We haven't allowed caption tracks to start with a different 
> >> startTimeOffset than the video, nor are we allowing to give them a 
> >> different playbackRate to the video.
> >
> > It's relatively easy to do it for text tracks: you just take a text 
> > track and recreate it with different timings (something that can be 
> > done in a few lines of JavaScript given the API we expose). So there's 
> > no need for it to be explicit.
> >
> > For synchronising <video> and <audio>, we should expose multiple 
> > tracks starting at different offsets because it is easy to achieve yet 
> > provides numerous opportunities for authors. For example, it's not 
> > uncommon to want to compare two movies which have similar moments; 
> > showing such similarities would require either video editing or, if we 
> > allowed offsets, could be done merely by pointing to two movie files 
> > with appropriate offsets.
>
> It is not any more difficult to change the startTime of a video element 
> in JavaScript than it is to change the start time of a track resource.
> Also, I believe that your use case can more easily be satisfied with 
> temporal media fragment URIs, which not just get the offset, but the 
> section from start to end that people are comparing.

I don't follow.

However, note that at the moment the MediaController feature doesn't 
support arbitrary offsets of audio/video tracks.
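
For the text track case mentioned above (recreating a track with 
different timings), a minimal sketch follows. It assumes cues are plain 
objects with startTime, endTime and text fields; the shiftCues name and 
the object shape are illustrative only and not part of any API. In a page, 
the shifted cues would then be added to a newly created text track.

```javascript
// Hypothetical helper: returns a new cue list shifted by `offset` seconds.
// Cues are plain objects here, standing in for real text track cues.
function shiftCues(cues, offset) {
  return cues
    .map(function (cue) {
      return {
        // Clamp so a cue that straddles time zero starts at zero.
        startTime: Math.max(0, cue.startTime + offset),
        endTime: cue.endTime + offset,
        text: cue.text
      };
    })
    // Drop cues that would end at or before time zero after shifting.
    .filter(function (cue) { return cue.endTime > 0; });
}
```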

> >> Tracks in a multitrack resource (no matter if in-band or external 
> >> files) are rather tightly authored to cover the exact same timeline 
> >> in my experience.
> >
> > Sure. But it would be silly to only support one use case when with 
> > minimal effort we could support a vastly greater number of use cases, 
> > including many we have not yet considered.
> >
> > This is one of those situations where not supporting something 
> > actually requires more API complexity than supporting it. We are 
> > rarely faced with such an opportunity.
>
> I don't want to solve use cases that we haven't thought about yet.

I don't want to add features to solve use cases that we haven't thought 
about yet, but it seems pointless to actively prevent use cases we haven't 
thought about yet when supporting them would be free.

(Now in the case of arbitrary offsets, it was indicated that it wouldn't 
be free for implementations, which is why I changed this. But that's a 
separate argument.)

> I want to solve the particular use cases that we are faced with which 
> are concretely audio descriptions, sign language video, and dubbed audio 
> tracks, which are tightly linked to a main resource (i.e. the one that 
> they are describing). The youtubedoubler use case is actually a 
> different one, where we only need to make sure that the elements march 
> to the same clock. They could, however, march in different directions, 
> or be offset, where the offset could be changed interactively, and 
> all sorts of other interactive mixing examples (sort-of what a DJ does). 
> I think there is a big difference between the needs of a mixer or 
> editor, and the need of tightly linked multitrack.

Yes. I think the youtubedoubler.com case is likely to end up getting more 
usage, though. I think we should handle all of the above cases, but I 
certainly wouldn't describe the sign language video case as the main use 
case. It's an important use case morally, and so we should handle it, but 
it is probably not an important one numerically, and so we should not 
exclude others that we can address at the same time.

> > For <video> tracks I don't understand how we could do it in practice, 
> > since the UA has no way to know what the page author intends in terms 
> > of video element positioning. I guess we could just have the video 
> > tracks positioned the way that the video stream says they should be 
> > positioned and not allow the tracks to be repositioned. Is that 
> > desirable? What happens if a video with a known position is enabled 
> > while a full-frame video is enabled, and then the full-frame video is 
> > disabled? Should the smaller one fill the whole frame? Remain its 
> > size? These questions and others like them are why I've left this 
> > unsupported for now.
>
> I assume you are talking about in-band video tracks. Might it be 
> possible to create a CSS pseudo-selector that can move the displayed 
> video tracks to other positions on-screen?

I don't see how. Do you have a concrete proposal?

> Why are we looking at audio tracks as being able to have multiple of 
> them active at a time, while video tracks can only have one exclusively 
> active at a time? I don't see why there can't be several video tracks 
> active at the same time, too. This is particularly the case where we 
> have a sign language video overlay.

See above ("For <video> tracks I don't understand...").

On Sun, 10 Apr 2011, Mark Watson wrote:
> In the case of in-band tracks it may still be the case that they are 
> retrieved independently over the network. This could happen two ways:
> - some file formats contain headers which enable precise navigation of 
> the file, for example using HTTP byte ranges, so that the tracks could 
> be retrieved independently. mp4 files would be an example. I don't know 
> that anyone does this, though.
> - in the case of adaptive streaming based on a manifest, the different 
> tracks may be in different files, even though they appear as in-band 
> tracks from an HTML perspective.
> In these cases it *might* make sense to expose separate buffer and 
> network states for the different in-band tracks in just the same way as 
> out-of-band tracks. In fact the distinction between in-band and 
> out-of-band tracks is mainly how you discover them: out-of-band the 
> author is assumed to know about by some means of their own, in-band can 
> be discovered by loading the metadata part of a single initial resource.

From the API perspective, there's just one resource in these cases. I 
don't think it makes sense to start exposing more detailed data.

On Mon, 11 Apr 2011, Jer Noble wrote:
> The use case for the events is the same one as for the convenience 
> property: without a convenience event, authors would have to add event 
> listeners to every slave media element.  So by "imply", I simply meant 
> that if the use case for the first was compelling enough to warrant new 
> API, the second would be warranted as well.

That's not a use case. By use case I mean something concrete like "the 
YouTube player wants to render a single progress bar for buffering" or 

> Let's say, for example, an author wants to change the color of a play 
> button when the media in a media group all reaches the HAVE_ENOUGH_DATA 
> readyState.

That's almost a use case. "An author wants to change the color of a play 
button when all the media in a media group have been buffered enough that 
the user could watch the entire group all the way through uninterrupted" 
would be a use case.

Anyway this is now possible with the new events.
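
As a rough illustration of the per-element check being discussed, here is 
a sketch that tests whether every element in a group has reached 
HAVE_ENOUGH_DATA. The groupHasEnoughData name is made up for this example, 
and the elements are assumed to be any objects exposing a numeric 
readyState, as media elements do. Note that HAVE_ENOUGH_DATA is the UA's 
estimate, not a guarantee of uninterrupted playback.

```javascript
// Numeric value of HTMLMediaElement.HAVE_ENOUGH_DATA.
var HAVE_ENOUGH_DATA = 4;

// Hypothetical helper: true only if the group is non-empty and every
// element in it reports readyState HAVE_ENOUGH_DATA.
function groupHasEnoughData(mediaElements) {
  return mediaElements.length > 0 &&
    mediaElements.every(function (m) {
      return m.readyState >= HAVE_ENOUGH_DATA;
    });
}
```

A page would call this from a listener on each element and recolor the 
play button when it first returns true.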

> >> Again, this would be just a convenience for authors, as this 
> >> information is already available in other forms and could be 
> >> relatively easily calculated on-the-fly in scripts.  But UAs are 
> >> likely going to have to do these calculations anyway to support things 
> >> like autoplay, so adding explicit support for them in API form would 
> >> not (imho) be unduly burdensome.
> > 
> > Autoplay is handled without having to do these calculations, as far as 
> > I can tell. I don't see any reason why the UA would need to do these 
> > calculations actually. If there are compelling use cases, though, I'm 
> > happy to add such accessors.
>
> Well, how exactly is autoplay handled in a media group?  Does the entire 
> media group start playing when the first media element in a group with 
> its autoplay attribute set reaches HAVE_ENOUGH_DATA?

MediaControllers just always autoplay. They wait for the slaves that 
themselves are marked autoplay to load before they start playing. The 
other slaves just don't play until manually started.
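
One possible reading of that behaviour, as a sketch only: it assumes that 
"load" means reaching readyState HAVE_ENOUGH_DATA, which the description 
above does not actually pin down, and slavesToAutoplay is a made-up name, 
not a MediaController method.

```javascript
// Assumption: "loaded" is taken to mean readyState HAVE_ENOUGH_DATA (4).
var HAVE_ENOUGH_DATA = 4;

// Hypothetical helper: returns the slaves that should start playing,
// or null while the controller still has to wait. Slaves are any
// objects with a boolean `autoplay` and a numeric `readyState`.
function slavesToAutoplay(slaves) {
  var marked = slaves.filter(function (s) { return s.autoplay; });
  var allLoaded = marked.every(function (s) {
    return s.readyState >= HAVE_ENOUGH_DATA;
  });
  // Slaves not marked autoplay are never returned here; they stay
  // paused until started manually.
  return allLoaded ? marked : null;
}
```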

On Mon, 18 Apr 2011, Jeroen Wijering wrote:
> The parallel would be fetching / decoding the tracks but not showing 
> them to the display (video) or speakers (audio). I agree that, 
> implementation wise, this is much less useful than having an "active but 
> hidden" state for text tracks. However, some people might want to 
> manipulate hidden tracks with the audio data API, much like hidden text 
> tracks can be manipulated with JavaScript.

Having an image or audio or video loaded and decoded but not rendered is 
quite different from having text tracks actually fire events along with 
the timeline, updating the active cues list, etc.

> >> Text tracks are discontinuous units of potentially overlapping 
> >> textual data with position information and other metadata that can be 
> >> styled with CSS and can be mutated from script.
> >> 
> >> Audio and video tracks are continuous streams of immutable media 
> >> data.
> >
> > Video and audio tracks do not necessarily produce continuous output - 
> > it is perfectly legal to have "gaps" in either, e.g. segments that do 
> > not render. Both audio and video tracks can have metadata that affect 
> > their rendering: an audio track has a volume metadata that attenuates 
> > its contribution to the overall mix-down, and a video track has a matrix 
> > that controls its rendering. The only thing preventing us from styling 
> > a video track with CSS is the lack of definition.
> Yes, and the same (lack of definition) goes for JavaScript manipulation. 
> It'd be great if we had the tools for manipulating video and audio 
> tracks (extract/insert frames, move audio snippets around). It would 
> make A/V editing - or more creative uses - really easy in HTML5.

That's a use case we should investigate in due course, but I think it's 
probably a bit early to go there.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'