[whatwg] How to handle multitrack media resources in HTML

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Sun Apr 10 00:44:35 PDT 2011

On Fri, Apr 8, 2011 at 4:54 PM, Ian Hickson <ian at hixie.ch> wrote:
> On Thu, 10 Feb 2011, Silvia Pfeiffer wrote:
>> One particular issue that hasn't had much discussion here yet is the
>> issue of how to deal with multitrack media resources or media resources
>> that have associated synchronized audio and video resources. I'm
>> concretely referring to such things as audio descriptions, sign language
>> video, and dubbed audio tracks.
>> We require an API that can expose such extra tracks to the user and to
>> JavaScript. This should be independent of whether the tracks are
>> actually inside the media resource or are given as separate resources,
> I think there's a big difference between multiple tracks inside one
> resource and multiple tracks spread amongst multiple resources: in the
> former case, one would need a single set of network state APIs (load
> algorithm, ready state, network state, dimensions, buffering state, etc),
> whereas in the second case we'd need N sets of these APIs, one for each
> media resource.
> Given that the current mechanism for exposing the load state of a media
> resource is a media element (<video>, <audio>), I think it makes sense to
> reuse these elements for loading each media resource even in a multitrack
> scenario. Thus I do not necessarily agree that exposing extra tracks
> should be done in a way that is independent of whether the tracks are
> in-band or out-of-band.
>> but should be linked to the main media resource through markup.
> What is a "main media resource"?
> e.g. consider youtubedoubler.com; what is the main resource?
> Or similarly, when watching the director's commentary track on a movie, is
> the commentary the main track, or the movie?
>> I am bringing this up now because solutions may have an influence on the
>> inner workings of TimedTrack and the <track> element, so before we have
>> any implementations of <track>, we should be very certain that we are
>> happy with the way in which it works - in particular that <track>
>> continues to stay an empty element.
> I don't really see why this would be related to text tracks. Those have
> their own status framework, and interact directly with a media element.
> Looking again at the youtubedoubler.com example, one could envisage both
> sides having text tracks. They wouldn't be joint tracks.

I don't think youtubedoubler.com is the main use case here. In the
youtubedoubler.com use case, you have two independent videos that make
sense by themselves, but are only coupled together by their timeline.
The cases that I listed above, audio descriptions, sign language
video, and dubbed audio tracks, make no sense by themselves. They are
produced with a clear reference to one specific video and its details
and could be delivered either as in-band tracks or as external files.
From a developer and user point of view - and in analogy to the track
element - it makes no sense to regard them as independent media
resources. They all refer to a "main" resource - the original video.

> On Mon, 14 Feb 2011, Jeroen Wijering wrote:
>> In terms of solutions, I lean much towards the manifest approach. The
>> other approaches are options that each add more elements to HTML5,
>> which:
>> * Won't work for situations outside of HTML5.
>> * Postpone, and perhaps clash with, the addition of manifests.
> Manifests, and indeed any solution that relies on a single media element,
> would make it very difficult to render multiple video tracks independently
> (e.g. side by side vs picture-in-picture). That's not to say that
> manifests shouldn't work, but I think we'd need another solution as well.
>> *) The CSS styling issue can be fixed by making a conceptual change to
>> CSS and text tracks. Instead of styling text tracks, a single "text
>> rendering area" for each video element can be exposed and styled. Any
>> text tracks that are enabled push data in it, which is automatically
>> styled according to the video.textStyle/etc rules.
> This wouldn't work well with positioned captions.
>> *) Discoverability is indeed an issue, but this can be fixed by defining
>> a common track API for signalling and enabling/disabling tracks:
>> {{{
>> interface Track {
>>   readonly attribute DOMString kind;
>>   readonly attribute DOMString label;
>>   readonly attribute DOMString language;
>>   const unsigned short OFF = 0;
>>   const unsigned short HIDDEN = 1;
>>   const unsigned short SHOWING = 2;
>>   attribute unsigned short mode;
>> };
>> interface HTMLMediaElement : HTMLElement {
>>   [...]
>>   readonly attribute Track[] tracks;
>> };
>> }}}
> There's a big difference between text tracks, audio tracks, and video
> tracks. While it makes sense, for instance, to have text tracks enabled
> but not showing, it makes no sense to do that with audio tracks.
> Similarly, video tracks need their own display area, but text tracks need
> a video track's display area. A single video area can display one video
> (multiple overlapping videos being achieved by multiple playback areas),
> but multiple audio and text tracks can be mixed together without any
> difficulty (mixing in one audio channel, or positioning over one video
> display area, respectively).
> So I'm not sure a single "tracks" API makes sense.

We have experimented with the idea of a single "tracks" API at a
recent F2F of the W3C HTML accessibility task force and it does indeed
become very complex because of the replication of states between the
tracks and the MediaElement, see
. We're basically re-introducing everything for the track that we
already have for the MediaElement.

However, there are more similarities between audio, video and text
tracks than one might think.

For example, it is possible to want to have multiple video tracks and
multiple text tracks rendered on top of a single video rendering area,
and they may all be explicitly positioned just like positioned
captions and they may all need to avoid each other. So, it could make
sense to include them all in a single rendering approach.

Another example is that you may have an audio track with different
captions to the captions of a related video element. Since the audio
track has no visual display, its captions are not rendered, but the
video's captions are rendered. Now, how are you going to make its
captions available to the video's display area when the linked audio
track is activated? Some things will inherently be harder by taking
the approach of separate video and audio elements rather than the
track approach.

> On Mon, 28 Mar 2011, Silvia Pfeiffer wrote:
>> We haven't allowed caption tracks to start with a different
>> startTimeOffset than the video, nor do we allow them to have a
>> different playbackRate from the video.
> It's relatively easy to do it for text tracks: you just take a text track
> and recreate it with different timings (something that can be done in a
> few lines of JavaScript given the API we expose). So there's no need for
> it to be explicit.
> For synchronising <video> and <audio>, we should expose multiple tracks
> starting at different offsets because it is easy to achieve yet provides
> numerous opportunities for authors. For example, it's not uncommon to want
> to compare two movies which have similar moments; showing such
> similarities would require either video editing or, if we allowed offsets,
> could be done merely by pointing to two movie files with appropriate
> offsets.

It is not any more difficult to change the startTime of a video
element in JavaScript than it is to change the start time of a track.

Also, I believe that your use case can more easily be satisfied with
temporal media fragment URIs, which give you not just the offset but
the whole section, from start to end, that people are comparing.
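
A rough sketch of the fragment idea, assuming the "#t=start,end" form
of temporal media fragments with times in seconds. The helper name
withTemporalFragment and the file names are made up for illustration:

```javascript
// Hypothetical helper: append a temporal media fragment (#t=start,end)
// to a media URL. Name and example URLs are illustrative only.
function withTemporalFragment(url, startSeconds, endSeconds) {
  if (!(startSeconds >= 0) || !(endSeconds > startSeconds)) {
    throw new RangeError("expected 0 <= start < end");
  }
  // Drop any existing fragment before appending the temporal one.
  const base = url.split("#")[0];
  return base + "#t=" + startSeconds + "," + endSeconds;
}

// Two clips to compare, each limited to the section of interest.
const clipA = withTemporalFragment("movieA.webm", 60, 75);
const clipB = withTemporalFragment("movieB.webm", 125, 140);
console.log(clipA, clipB);
```

Each URL then carries both the offset and the end point, so the
comparison needs no extra offset attribute on the elements.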

>> Tracks in a multitrack resource (no matter if in-band or external files)
>> are rather tightly authored to cover the exact same timeline in my
>> experience.
> Sure. But it would be silly to only support one use case when with minimal
> effort we could support a vastly greater number of use cases, including
> many we have not yet considered.
> This is one of those situations where not supporting something actually
> requires more API complexity than supporting it. We are rarely faced with
> such an opportunity.

I don't want to solve use cases that we haven't thought about yet. I
want to solve the particular use cases that we are faced with which
are concretely audio descriptions, sign language video, and dubbed
audio tracks, which are tightly linked to a main resource (i.e. the
one that they are describing). The youtubedoubler use case is actually
a different one, where we only need to make sure that the elements
march to the same clock. They could, however, march in different
directions, or be offset, where the offset could be changed
interactively, and all sorts of other interactive mixing examples
(sort-of what a DJ does). I think there is a big difference between
the needs of a mixer or editor and the needs of tightly linked
multitrack resources.

In tightly linked multitrack, there is the concept of a single entity
and all the elements follow the same current time, playback rate and
direction, seeking, and looping behaviour. The individual tracks don't
break out from the group. This has the advantage that they are all
predictable and can be displayed together. They could even be
displayed with the same controls that cover them all.

However, I can see that the current controller proposal is already
including most of the sync behaviour, in particular a common
currentTime, duration, paused state, playback rate, muted and volume,
so I think we have already moved to a more tightly linked model. I can
see reasons for not going any tighter (as a track-based approach
would), because that would replicate MediaElement attributes to
different track slaves, while the controller-based approach replicates
them only to a single master.

If the controller now defined a rendering area where all the slaves
could be arranged automatically with a single set of controls, that
would be optimal in my eyes. But I can't think of a way in which to do
this elegantly, without meddling with CSS, so I'm happy with the
current approach.

> On Tue, 29 Mar 2011, Silvia Pfeiffer wrote:
>> Independent of the solution that we choose, we have to define what the
>> common timeline is for the combined resource.
> I assume here you mean that we have to expose a "currentTime" for the
> MediaController, and/or use the same clock for all combined tracks.
>> I think we should probably go with the mental model of what it would be
>> when it was really all encapsulated in a single resource. Thus, if a
>> slave resource is longer than the main resource, it actually changes the
>> duration of the combined resource. Thus, we really should have a model
>> for that duration. Shorter is easier to deal with since you can just
>> pretend it is a transparent video or silent audio where it lacks
>> duration.
> I don't think it makes sense to think of a resource's length being
> changed by another resource. I agree that it makes sense to expose the
> overall length, but that doesn't affect the length of individual tracks in
> the group.

I wasn't actually talking about explicit changes to a resource's
durations. I was talking about the visual effect that a shorter
resource would expose.

>> Also, independent of the model, we have to have a common understanding
>> of the currentTime and thus a combined transport bar. By default it
>> makes sense to display that combined transport bar so the user has a
>> means to interact with the multitrack resource.
> I've updated the spec to expose the total duration of the tracks based on
> the currently slaved media elements, and the current position relative to
> that total duration.

Sounds good.
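
To make the timeline model concrete, here is a minimal sketch. It
assumes all slaved elements start at time zero, so the combined
duration is simply that of the longest one. Plain objects stand in for
media elements; combinedDuration and isWithinTrack are hypothetical
names, not spec API:

```javascript
// Sketch only: plain objects stand in for slaved media elements.
function combinedDuration(tracks) {
  // The group lasts as long as its longest member; an empty group is 0.
  return tracks.reduce((longest, t) => Math.max(longest, t.duration), 0);
}

function isWithinTrack(track, groupTime) {
  // Past its own end, a shorter track has nothing left to render
  // (think transparent video or silent audio).
  return groupTime >= 0 && groupTime < track.duration;
}

const tracks = [
  { label: "main video", duration: 120 },
  { label: "audio description", duration: 118 },
  { label: "sign language video", duration: 125 },
];

console.log(combinedDuration(tracks));
console.log(isWithinTrack(tracks[1], 119));
```

A transport bar built on this would span the combined duration, with
shorter tracks simply going quiet near the end.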

> One difficulty with this is how to deal with looping tracks. I'm not sure
> what the right solution is for that. I see several obvious options; there
> are others too:
>  * Have looping tracks shorter than the longest be repeated (a "fill"
>   approach) -- but then how do you deal with the longest track repeating
>   if it's not a multiple of the shorter tracks?
>  * Have looping happen only when all the tracks have reached the end.

Since that's what happens with in-band multitrack resources, I would
expect that to also happen with composed multitrack resources.

>  * Ignore looping when you're synchronised.
> For now I've gone with the last one (ignore looping), mostly because I
> don't see high-priority and compelling use cases for it.

I'm not a big fan of the @loop attribute in general. :-)
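
For what it's worth, the "loop only when all the tracks have reached
the end" option can be sketched in a few lines. This assumes every
track starts at zero and that a track which has finished holds at its
own end until the group wraps; positionsAfter is a hypothetical name:

```javascript
// Sketch: where each track sits after `elapsedSeconds` of looping
// playback, if the group only loops once the longest track has ended.
function positionsAfter(elapsedSeconds, durations) {
  if (durations.length === 0) return [];
  const groupDuration = Math.max(...durations);
  // The group timeline wraps at the end of the longest track.
  const groupTime = elapsedSeconds % groupDuration;
  // Shorter tracks stop at their own end and wait for the wrap.
  return durations.map((d) => Math.min(groupTime, d));
}

console.log(positionsAfter(110, [120, 100]));
console.log(positionsAfter(130, [120, 100]));
```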

>> > But we can fix that in a later version. It's much harder to fix in the
>> > case of one media element being promoted to a different state than the
>> > others, since we already have defined what the media element API
>> > does.)
>> One thing that I would really like to see is a common menu for turning
>> on and off tracks. This is particularly important if you have audio
>> description tracks, so a blind user can immediately find out if such a
>> track is available and activate it.
> The spec supports turning on and off in-band audio tracks from the UA UI;
> this gets exposed in the audioTracks attribute. I've added an event that
> fires on that object when the set of tracks is changed.
> For out-of-band video tracks, the user can pause and play any track that
> has an <audio> element exposing controls, so there's no need for a
> separate menu. However, to make this easier I've changed the spec to say
> that if there is a menu of audio resources, it should also include the
> audio tracks from any other media elements, possibly defaulting to using
> the name given in the "title" track.

That sounds good.

> For <video> tracks I don't understand how we could do it in practice,
> since the UA has no way to know what the page author intends in terms of
> video element positioning. I guess we could just have the video tracks
> positioned the way that the video stream says they should be positioned
> and not allow the tracks to be repositioned. Is that desirable? What
> happens if a video with a known position is enabled while a full-frame
> video is enabled, and then the full-frame video is disabled? Should the
> smaller one fill the whole frame? Remain its size? These questions and
> others like them are why I've left this unsupported for now.

I assume you are talking about in-band video tracks. Might it be
possible to create a CSS pseudo-selector that can move the displayed
video tracks to other positions on-screen?

> On Wed, 30 Mar 2011, Philip Jägenstedt wrote:
>> Having in-band tracks change between being in sync (same offset and
>> rate) and being out of sync (different offset or rate) would be a major
>> head-ache.
> Originally, the tracks could be offset because their .currentTime
> attributes were advanced at a fixed rate, and the MediaController didn't
> have any concept of the currentTime, so just changing the currentTime of
> a media element offset the video by the difference between the old and
> new values.
> Now that we have an overall .duration, it becomes kind of weird that you
> can change the currentTime of each video in turn, and when you change the
> first one, the controller's "duration" changes, and then suddenly when you
> change the last slaved media element's currentTime, the duration changes
> back.
> On the other hand it's even more weird to have a mutable attribute which
> you can change (the .currentTime on each track), but where the change has
> no effect. And it seems just as confusing to have multiple attributes
> (.currentTime on each track) where when you set one, it resets the others.
> But we presumably do want the attribute to return a useful value, since
> that's an easy way to tell where a track is relative to its own starting
> and ending points.
> Let's consider these options explicitly:
>  1. Setting .currentTime on a media element is ignored or throws an
>    exception when there's a controller.
>  2. Setting .currentTime on a media element with a controller changes
>    the .currentTime on the controller.
>  3. Setting .currentTime on a media element shifts the playback position
>    on that controller, shifting the alignment of this track against the
>    other tracks.
> Option 3 seems the most intuitive from a "this API actually does
> something" point of view, but it raises questions of its own:
> First, what happens when two elements that are not synchronised get
> synchronised? Should they keep playing where they are currently playing,
> or should we snap the playback position somehow? Should the order of
> adding tracks matter?
> There is one reason to prefer the otherwise unintuitive snapping behaviour
> (i.e. resetting .currentTime on the track when you join it to other
> tracks), namely that if someone misuses the API by waiting for the tracks
> to load and start playing and only _then_ syncs them together, they'll be
> slightly offset from each other leading to lip sync issues and the like.
> Second, what happens when you change the .currentTime of a track that
> currently is not playing because the .currentTime of the MediaController
> is out of the range of playable content for the track? If we have a model
> where setting .currentTime introduces an offset, then it means you can
> only set an offset while the track's side of that offset is within the
> currently playable region; you couldn't, for example, set an offset of two
> seconds on a track that was a minute shorter than another if the
> MediaController was at that point playing back the content at the end of
> the longer track, since currentTime can't go beyond duration.
> These questions suggest that actually using the API that option 3 would
> give us would be quite frustrating, which is never a good sign.
> Even after staring at a whiteboard for a while trying to work out how to
> make this API work, I really don't see a good way to make currentTime
> across the MediaController and the media elements work.
> For now I've gone with making .currentTime on media elements throw an
> exception when they're controlled by a controller. This is by no means
> ideal, but it has the advantage of "front-loading" all the unexpected
> behaviour: as soon as someone tries to implement a seek bar on top of this
> they'll find the problems, rather than the confusing behaviour being
> pushed to the edge cases.

I like this approach. It also matches what I expect from tightly
linked elements.
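
A minimal model of that behaviour, as I read it: seeking happens on
the controller, and the per-element setter refuses once a controller is
attached. The class names are hypothetical, and the exception type is
an assumption, since nothing above says which exception is thrown:

```javascript
// Sketch only: hypothetical stand-ins, not the real DOM interfaces.
class SketchController {
  constructor() {
    this.currentTime = 0; // the one seekable position for the group
    this.slaves = [];
  }
  add(element) {
    element.controller = this;
    this.slaves.push(element);
  }
}

class SketchMediaElement {
  constructor(duration) {
    this.duration = duration;
    this.controller = null;
    this.ownTime = 0;
  }
  get currentTime() {
    if (this.controller === null) return this.ownTime;
    // Still readable: position relative to this element's own end.
    return Math.min(this.controller.currentTime, this.duration);
  }
  set currentTime(value) {
    if (this.controller !== null) {
      // Assumption: a generic Error stands in for the real exception.
      throw new Error("currentTime is read-only while controlled");
    }
    this.ownTime = value;
  }
}

const controller = new SketchController();
const video = new SketchMediaElement(120);
const description = new SketchMediaElement(100);
controller.add(video);
controller.add(description);

controller.currentTime = 110; // seek the whole group
console.log(video.currentTime, description.currentTime);
```

A seek bar written against the controller keeps working; one written
against an individual element fails on the first seek rather than in
some edge case.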

> Finally, one issue that was raised on IRC but not mentioned in e-mail is
> that the design decision of going with a MediaController object rather
> than overloading HTMLMediaElement to handle both the slave and master
> modes means that one can't reuse existing video UI scripts in a
> multi-track environment.
> I looked at this, but I couldn't find a good way to make such scripts work
> with the master/slave overloading case either. There would always be some
> edge case that breaks, e.g. if some slave track is longer than the master,
> or if the whole thing stalls because of a slave, or if the tracks are
> various infinite streams with slightly different initial offsets, etc.
> I have, however, made sure that the MediaController API is similar enough
> to the media element API so that the adjustments one would have to make
> should be pretty minimal.

On a related topic, I have an issue with the way in which in-band
media tracks are exposed in

Why are we looking at audio tracks as being able to have multiple of
them active at a time, while video tracks can only have one
exclusively active at a time? I don't see why there can't be several
video tracks active at the same time, too. This is particularly the
case where we have a sign language video overlay.

