[whatwg] How to handle multitrack media resources in HTML

Ian Hickson ian at hixie.ch
Mon Apr 11 17:26:37 PDT 2011


On Fri, 8 Apr 2011, Jer Noble wrote:
> On Apr 7, 2011, at 11:54 PM, Ian Hickson wrote:
> >> 
> >> The distinction between a master media element and a master media 
> >> controller is, in my mind, mostly a distinction without a difference.  
> >> However, a welcome addition to the media controller would be 
> >> convenience APIs for the above properties (as well as playbackState, 
> >> networkState, seekable, and buffered).
> > 
> > I'm not sure what networkState would mean in this context. 
> > playbackState, assuming you mean 'paused', is already exposed.
> 
> Sorry, by playbackState, I meant readyState.  And I was suggesting that, 
> much in the same way that you've provided .buffered and .seekable 
> properties which "expose the intersection of the slaved media elements' 
> corresponding ranges", a readyState property could similarly reflect 
> the readyState values of all the slaved media elements. In this case, 
> the MediaController's hypothetical readyState wouldn't flip to 
> HAVE_ENOUGH_DATA until all the constituent media elements' ready states 
> reached at least that value.

So basically it would return the lowest value amongst the slaved 
elements? I guess we could expose such a convenience accessor, but what's 
the use case? It seems easy enough to implement manually in JS, so unless 
there's a compelling case, I'd be reluctant to add it.
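
For instance, something along these lines seems sufficient (an untested 
sketch; it assumes the page can enumerate the slaved elements, here via 
their mediagroup="" attribute):

{{{
// Sketch: the "combined" readyState of a group of slaved media
// elements is just the lowest readyState amongst them.
function combinedReadyState(mediaGroup) {
  var elements = document.querySelectorAll(
      'audio[mediagroup="' + mediaGroup + '"], ' +
      'video[mediagroup="' + mediaGroup + '"]');
  var state = HTMLMediaElement.HAVE_ENOUGH_DATA; // 4
  for (var i = 0; i < elements.length; i++)
    state = Math.min(state, elements[i].readyState);
  return state; // HAVE_NOTHING (0) .. HAVE_ENOUGH_DATA (4)
}
}}}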


> Of course, this would imply that the load events fired by a media 
> element (e.g. loadedmetadata, canplaythrough) were also fired by the 
> MediaController, and I would support this change as well.

I don't see why it would imply that, but certainly we could add events 
like that to the controller. Again though, what's the use case?
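
If a page wants such a notification today it can synthesise it, e.g. 
(again just a sketch, reusing combinedReadyState() from above):

{{{
// Sketch: call back once every slaved element in the group has
// reached HAVE_ENOUGH_DATA -- roughly a group-wide "canplaythrough".
function whenGroupCanPlayThrough(mediaGroup, callback) {
  var elements = document.querySelectorAll(
      'audio[mediagroup="' + mediaGroup + '"], ' +
      'video[mediagroup="' + mediaGroup + '"]');
  function check() {
    if (combinedReadyState(mediaGroup) ==
        HTMLMediaElement.HAVE_ENOUGH_DATA)
      callback();
  }
  for (var i = 0; i < elements.length; i++)
    elements[i].addEventListener('canplaythrough', check, false);
  check(); // the group may already be fully buffered
}
}}}

(A real version would also guard against invoking the callback more than 
once.)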


> Again, this would be just a convenience for authors, as this information 
> is already available in other forms and could be relatively easily 
> calculated on-the-fly in scripts.  But UAs are likely going to have to 
> do these calculations anyway to support things like autoplay, so adding 
> explicit support for them in API form would not (imho) be unduly 
> burdensome.

Autoplay is handled without having to do these calculations, as far as I 
can tell. I don't see any reason why the UA would need to do these 
calculations actually. If there are compelling use cases, though, I'm 
happy to add such accessors.


On Fri, 8 Apr 2011, Eric Winkelman wrote:
> On Friday, April 08, 2011, Ian Hickson wrote:
> > On Thu, 17 Feb 2011, Eric Winkelman wrote:
> > >
> > > MPEG transport streams, as used for commercial TV, will often 
> > > contain multiple types of metadata: content advisories, ad insertion 
> > > opportunities, interactive TV application triggers, etc.  If we were 
> > > getting this information out-of-band we would, as you suggest, know 
> > > how to deal with it.  We would use multiple @kind=metadata tracks, 
> > > with the correct handler associated with each track.  In our case, 
> > > however, this information is all coming in-band.
> > >
> > > There is information within the MPEG transport stream that 
> > > identifies the types of metadata being carried.  This lets the video 
> > > player know, for example, that the stream has a particular track 
> > > with application triggers, and another one with content advisories.  
> > > To be consistent with the out-of-band tracks, we envision the player 
> > > creating separate TimedTrack elements for each type of metadata, and 
> > > adding the associated data as cues.  But there isn't a clear way for 
> > > the player to indicate the type of metadata it's putting into each 
> > > of these TimedTrack cues.
> > >
> > > Which brings us to the mime types.  I have an event handler on the 
> > > <video> tag that fires when the player creates a new metadata track, 
> > > and this handler tries to figure out what to do with the track.  
> > > Without a type on the track, I have to set another handler on the 
> > > track that fires when the player creates a cue, and tries to figure 
> > > out what to do from the cue.  As there is no type on the cue either, 
> > > I have to examine the cue location/text to see if it contains 
> > > metadata I'm able to handle.
> > >
> > > This all works, but it requires event handlers on tracks that may 
> > > have no interest to the application.  On the player side, it depends 
> > > on the player tagging the metadata in a consistent ad-hoc way, as 
> > > well as requiring the player to create separate metadata tracks.  
> > > (We also considered starting the cue's text with a mime type, but 
> > > this has the same basic issues.)
> > 
> > This is an interesting problem.
> > 
> > What is the way that the MPEG streams identify these various metadata 
> > streams? Is it a MIME type? Some other identifier? Is this identifier 
> > separate from the track's label, or is it the track's label?
> 
> The streams contain a Program Map Table (PMT) which contains a list of 
> tuples (program id (PID) and a standard numeric "type") for the 
> program's tracks. This is how the user agent knows about this metadata 
> and what is contained in it. We're envisioning that the combination of 
> transport, e.g. MPEG-2 TS, and PMT "type" would be used by the UA to 
> select a MIME type. We're proposing that this MIME type would be the 
> track's "label". We think it would be better if there were a "type" 
> attribute for the track to use instead of the "label", but using the 
> "label" would work.

It sounds like we need some implementation experience before we can 
specify this fully. I would recommend implementing this with a vendor 
prefix and testing it out with real content. If it works well, it seems 
like something we should add.
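
To make the discussion concrete, here is roughly what I imagine the 
script side of Eric's scheme looking like. This is only a sketch: the 
'addtrack'/'cuechange' event names and the string-valued mode are 
assumptions about the text track API, the MIME type string is made up, 
and handleAdInsertionCue() is a stand-in for whatever the application 
does with the data; a prefixed experiment would presumably use prefixed 
equivalents.

{{{
// Sketch: the UA exposes each in-band metadata stream as a
// kind=metadata text track whose label carries a MIME type derived
// from the transport and the PMT "type".
var video = document.querySelector('video');
video.textTracks.addEventListener('addtrack', function (event) {
  var track = event.track;
  if (track.kind != 'metadata')
    return;
  // Hypothetical type string; the real values would come from the
  // transport/PMT-to-MIME mapping Eric describes.
  if (track.label == 'application/x-ad-insertion') {
    track.mode = 'hidden'; // keep cue events firing, render nothing
    track.addEventListener('cuechange', function () {
      for (var i = 0; i < track.activeCues.length; i++)
        handleAdInsertionCue(track.activeCues[i]); // app-defined
    }, false);
  }
}, false);
}}}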


On Mon, 11 Apr 2011, Jeroen Wijering wrote:
> On Apr 8, 2011, at 8:54 AM, Ian Hickson wrote:
> >> 
> >> *) Discoverability is indeed an issue, but this can be fixed by defining 
> >> a common track API for signalling and enabling/disabling tracks:
> >> 
> >> {{{
> >> interface Track {
> >>  readonly attribute DOMString kind;
> >>  readonly attribute DOMString label;
> >>  readonly attribute DOMString language;
> >> 
> >>  const unsigned short OFF = 0;
> >>  const unsigned short HIDDEN = 1;
> >>  const unsigned short SHOWING = 2;
> >>  attribute unsigned short mode;
> >> };
> >> 
> >> interface HTMLMediaElement : HTMLElement {
> >>  [...]
> >>  readonly attribute Track[] tracks;
> >> };
> >> }}}
> > 
> > There's a big difference between text tracks, audio tracks, and video 
> > tracks. While it makes sense, for instance, to have text tracks 
> > enabled but not showing, it makes no sense to do that with audio 
> > tracks.
> 
> Audio and video tracks require more data, hence it's less desirable to 
> allow them to be enabled but not showing. If data weren't an issue, it 
> would be great if this were possible; it'd allow instant switching 
> between multiple audio dubs, or camera angles.

I think we mean different things by "active" here.

The "hidden" state for a text track is one where the UA isn't rendering 
the track but the UA is still firing all the events and so forth. I don't 
understand what the parallel would be for a video or audio track.
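
Concretely, for a text track (a sketch; the exact mode values are 
whatever we end up specifying):

{{{
// A "hidden" text track: the UA renders nothing, but the track stays
// active -- cue events keep firing and script can use the cues itself.
var video = document.querySelector('video');
var track = video.textTracks[0];
track.mode = 'hidden';
track.addEventListener('cuechange', function () {
  // still invoked as cues become active/inactive while hidden
  console.log(track.activeCues.length + ' active cue(s)');
}, false);
}}}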

Text tracks are discontinuous units of potentially overlapping textual 
data with position information and other metadata that can be styled with 
CSS and can be mutated from script.

Audio and video tracks are continuous streams of immutable media data.

I don't really see what they have in common other than us using the word 
"track" to refer to both of them, and that's mostly just an artefact of 
the language.


> In terms of the data model, I don't believe there are major differences 
> between audio, text or video tracks. They all exist at the same level - 
> one down from the main presentation layer. Toggling versus layering can 
> be an option for all three kinds of tracks.

I really don't see how they can be considered similar at all, let alone 
that they have no major differences.


> For example, multiple video tracks can be mixed together in one media 
> element's display. Think about PiP, perspective side by side (Stevenote 
> style) or a 3D grid (group chat, like Skype). Perhaps this should be 
> supported instead of relying upon multiple video elements, manual 
> positioning and APIs to knit things together.

<div>s can be mixed in the same way, as can <input type=range>, and pretty 
much everything else, but that doesn't mean they're all the same...


On Mon, 11 Apr 2011, Jeroen Wijering wrote:
> On Apr 8, 2011, at 8:54 AM, Ian Hickson wrote:
> 
> >> but should be linked to the main media resource through markup.
> > 
> > What is a "main media resource"?
> > 
> > e.g. consider youtubedoubler.com; what is the main resource?
> > 
> > Or similarly, when watching the director's commentary track on a 
> > movie, is the commentary the main track, or the movie?
> 
> In systems like MPEG TS and DASH, there's the notion of the "system 
> clock". This is the overarching resource to which all audio, meta, text 
> and video tracks are synced. The clock has no video frames or audio 
> samples by itself, it just acts as the wardrobe for all tracks. Perhaps 
> it's worth investigating if this would be useful for media elements?

That's pretty much exactly what the MediaController design does.
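
In MediaController terms (a sketch against the proposed API; the element 
ids are made up, and the details of assigning a controller from script 
are still settling):

{{{
// Sketch: one MediaController is the shared clock that a video track
// and a separately-fetched commentary audio track are slaved to.
var controller = new MediaController();
var movie = document.getElementById('movie');            // <video>
var commentary = document.getElementById('commentary');  // <audio>
movie.controller = controller;
commentary.controller = controller;
controller.play(); // both elements play in lockstep off the one clock
// controller.buffered and controller.seekable expose the intersection
// of the slaved elements' corresponding ranges, as discussed above.
}}}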


On Mon, 11 Apr 2011, Eric Carlson wrote:
> On Apr 10, 2011, at 12:36 PM, Mark Watson wrote:
> 
> > In the case of in-band tracks it may still be the case that they are 
> > retrieved independently over the network. This could happen two ways: 
> > - some file formats contain headers which enable precise navigation of 
> > the file, for example using HTTP byte ranges, so that the tracks could 
> > be retrieved independently. mp4 files would be an example. I don't 
> > know that anyone does this, though.
> 
> QuickTime has supported tracks with external media samples in .mov files 
> for more than 15 years. This type of file is most commonly used during 
> editing, but they are occasionally found on the net.
> 
> 
> > - in the case of adaptive streaming based on a manifest, the different 
> > tracks may be in different files, even though they appear as in-band 
> > tracks from an HTML perspective.
> > 
> > In these cases it *might* make sense to expose separate buffer and 
> > network states for the different in-band tracks in just the same way 
> > as out-of-band tracks.
> 
>   I strongly disagree. Having different track APIs for different 
> container formats will be extremely confusing for developers, and I 
> don't think it will add anything. A UA that chooses to support 
> non-self-contained media files should account for all samples when 
> reporting readyState and networkState.

Agreed.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


