[whatwg] How to handle multitrack media resources in HTML

Jeroen Wijering jeroen at longtailvideo.com
Mon Feb 14 00:13:04 PST 2011

Hello Silvia, all,

First, thanks for the Multitrack wiki page. It is very helpful for those who are not subscribed to the various lists. I have phrased my comments below as feedback on this page:



The use case is spot on; this is an issue that blocks HTML5 video from being chosen over a solution like Flash. An elaborate list of tracks is important, to correctly scope the conditions / resolutions:

1. Tracks targeting device capabilities:
   * Different containers / codecs / profiles
   * Multiview (3D) or surround sound
   * Playback rights and/or decryption possibilities
2. Tracks targeting content customization:
   * Alternate viewing angles or alternate music scores
   * Director's comments or storyboard video
3. Tracks targeting accessibility:
   * Dubbed audio or text subtitles
   * Audio descriptions or closed captions
   * Tracks cleared of cursing / nudity / violence
4. Tracks targeting the interface:
   * Chapter lists, bookmarks, timed annotations, midroll hints ...
   * ... and any other type of scripting cues

Note that I included the HTML5 "text tracks". I believe there are four kinds of tracks, all an inherent part of a media presentation. These types designate the output of the track, not its encoded representation:

* audio (producing sound)
* metadata (producing scripting cues)
* text (producing rendered text)
* video (producing images)

In this taxonomy, the HTML5 "subtitles" and "captions" <track> kinds are text, the "descriptions" kind is audio and the "chapters" and "metadata" kinds are metadata.
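
As a rough sketch, the mapping described above could be written down as a lookup table. The type and function names here are my own illustration and are not part of any spec; only the kind-to-output pairs come from the paragraph above.

```typescript
// The four output types of the taxonomy.
type OutputType = "audio" | "metadata" | "text" | "video";

// The HTML5 <track> kinds named in the text.
type TrackKind = "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";

// Kind-to-output mapping exactly as stated in the paragraph above.
const outputTypeForKind: Record<TrackKind, OutputType> = {
  subtitles: "text",
  captions: "text",
  descriptions: "audio",
  chapters: "metadata",
  metadata: "metadata",
};

// Hypothetical helper: group a list of track kinds by the output they produce.
function groupByOutput(kinds: TrackKind[]): Record<OutputType, TrackKind[]> {
  const groups: Record<OutputType, TrackKind[]> = {
    audio: [],
    metadata: [],
    text: [],
    video: [],
  };
  for (const kind of kinds) {
    groups[outputTypeForKind[kind]].push(kind);
  }
  return groups;
}
```

Note that none of the current <track> kinds lands in the "video" group; that output type would only be filled by tracks such as sign language or alternate angles.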


The requirements are elaborate, but do note that they extend beyond HTML5. Everything that plays back audio/video needs multitrack support:

* Broad- and narrowcasting playback devices of any kind
* Native desktop, mobile and set-top applications/apps
* Devices that play media standalone (media players, picture frames, "airplay")

Also, on e.g. the iPhone and Android devices, playback of video is triggered by HTML5, but subsequently detached from it. Think about the custom fullscreen controls, the obscuring of all HTML, and the events/cues that are deliberately ignored or not sent (such as play() in iOS). I wonder whether this is a temporary state or something that will remain and should be provisioned for.

With this in mind, I think an additional requirement is that there should be a full solution outside the scope of HTML5. HTML5 has unique capabilities like customization of the layout (CSS) and interaction (JavaScript), but it must not be required.


In the side conditions, I'm not sure about the relative volume of audio or the positioning of video. Automation by default might work better and requires no parameters. For audio, blending can be done through a ducking mechanism (like the JW Player does). For video, blending can be done through an alpha channel. At a later stage, an API/heuristics for PIP support and gain control can be added.
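
To illustrate what I mean by ducking, here is a minimal sketch that lowers the main track's gain while a secondary audio track (for instance an audio description) is audible. This is not how any particular player implements it; the ActiveRange shape, the duckedGain and fadeSeconds parameters and the linear ramp are all assumptions made for the example.

```typescript
// A time range (in seconds) during which the secondary audio track is audible.
interface ActiveRange {
  start: number;
  end: number;
}

// Hypothetical parameters: the gain of the main track while ducked (0..1)
// and the time used to fade into and out of the ducked state.
interface DuckingOptions {
  duckedGain: number;
  fadeSeconds: number;
}

// Returns the gain (0..1) to apply to the main audio track at a given time.
function mainTrackGain(time: number, ranges: ActiveRange[], options: DuckingOptions): number {
  let gain = 1;
  for (const range of ranges) {
    // Distance in seconds from the range; 0 while inside it.
    let distance = 0;
    if (time < range.start) distance = range.start - time;
    else if (time > range.end) distance = time - range.end;

    // Linear ramp from ducked (0) to full volume (1) over fadeSeconds.
    const ramp = options.fadeSeconds > 0
      ? Math.min(distance / options.fadeSeconds, 1)
      : (distance > 0 ? 1 : 0);
    const rangeGain = options.duckedGain + (1 - options.duckedGain) * ramp;

    // With several ranges, the lowest gain wins.
    gain = Math.min(gain, rangeGain);
  }
  return gain;
}
```

The point is that the player can derive the gain from the track timing alone, so the author does not need to supply any volume parameters.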


In terms of solutions, I lean strongly towards the manifest approach. The other approaches are options that each add more elements to HTML5, which:

* Won't work for situations outside of HTML5.
* Postpone, and perhaps clash with, the addition of manifests.

Without a manifest, there'll probably be no adaptive streaming, which renders HTML5 video much less useful. At the same time, standardization around manifests (DASH) is largely wrapping up.


Here's some code on the manifest approach. First the HTML5 side:

<video id="v1" poster="video.png" controls>
  <source src="manifest.xml" type="video/mpeg-dash">
</video>

Second the manifest side:

<MPD mediaPresentationDuration="PT645S" type="OnDemand">

  <Group mimeType="video/webm" lang="en">
    <Representation sourceURL="video-1600.webm" />
  </Group>

  <Group mimeType="video/mp4; codecs=avc1.42E00C,mp4a.40.2" lang="en">
    <Representation sourceURL="video-1600.mp4" />
  </Group>

  <Group mimeType="text/vtt" lang="en">
    <Accessibility type="CC" />
    <Representation sourceURL="captions.vtt" />
  </Group>

</MPD>

(I should look more into the accessibility parameters, but there is support for signalling captions, audio descriptions, sign language etc.)

Note that this approach moves the text track outside of HTML5, making it accessible for other clients as well. Both codecs are also in the manifest - this is just one of the device capability selectors of DASH clients.
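
A client-side selection over such a manifest could look roughly like the sketch below. It assumes the manifest has already been parsed into plain objects; the ManifestGroup shape and the function names are hypothetical, and the canPlay callback stands in for whatever capability check the client has (in a browser it could wrap the video element's canPlayType()).

```typescript
// Hypothetical parsed form of one <Group> from the manifest.
interface ManifestGroup {
  mimeType: string;
  lang: string;
  accessibility?: string;
  sources: string[];
}

// Capability check supplied by the client.
type CanPlay = (mimeType: string) => boolean;

// The three groups from the example manifest, written out by hand.
const exampleGroups: ManifestGroup[] = [
  { mimeType: "video/webm", lang: "en", sources: ["video-1600.webm"] },
  { mimeType: "video/mp4; codecs=avc1.42E00C,mp4a.40.2", lang: "en", sources: ["video-1600.mp4"] },
  { mimeType: "text/vtt", lang: "en", accessibility: "CC", sources: ["captions.vtt"] },
];

// Pick the first video group this client says it can play, or null if none.
function selectVideoGroup(groups: ManifestGroup[], canPlay: CanPlay): ManifestGroup | null {
  for (const group of groups) {
    if (group.mimeType.indexOf("video/") === 0 && canPlay(group.mimeType)) {
      return group;
    }
  }
  return null;
}

// List the text groups available in a given language.
function selectTextGroups(groups: ManifestGroup[], lang: string): ManifestGroup[] {
  return groups.filter((group) => group.mimeType.indexOf("text/") === 0 && group.lang === lang);
}
```

A client that only supports MP4 would end up with video-1600.mp4 plus captions.vtt, without the HTML page having to list either of them.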


The two listed disadvantages for the "manifest approach" in the wiki page are lack of CSS and discoverability:

*) The CSS styling issue can be fixed by making a conceptual change to CSS and text tracks. Instead of styling text tracks, a single "text rendering area" for each video element can be exposed and styled. Any text tracks that are enabled push data into it, which is automatically styled according to the video.textStyle/etc rules.

*) Discoverability is indeed an issue, but this can be fixed by defining a common track API for signalling and enabling/disabling tracks:

interface Track {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
  attribute unsigned short mode;
};

interface HTMLMediaElement : HTMLElement {
  readonly attribute Track[] tracks;
};
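
To give an idea of how a script or a player UI might use such an API, here is a small sketch. It mirrors the proposed interface as a TypeScript type (called ProposedTrack here to make clear it is not an existing browser type); the constant and function names are my own, and nothing like this is implemented anywhere yet.

```typescript
// Mode constants with the same values as in the proposed interface.
const TRACK_OFF = 0;
const TRACK_HIDDEN = 1;
const TRACK_SHOWING = 2;

// TypeScript mirror of the proposed Track interface.
interface ProposedTrack {
  readonly kind: string;
  readonly label: string;
  readonly language: string;
  mode: number;
}

// Discoverability: build menu entries for every track the resource exposes.
function describeTracks(tracks: ProposedTrack[]): string[] {
  return tracks.map((track) => `${track.kind}: ${track.label} (${track.language})`);
}

// Show the tracks of one kind in one language and switch off the other
// tracks of that kind. Tracks of other kinds are left untouched.
// Returns the number of tracks that are now showing.
function showOnly(tracks: ProposedTrack[], kind: string, language: string): number {
  let shown = 0;
  for (const track of tracks) {
    if (track.kind !== kind) continue;
    if (track.language === language) {
      track.mode = TRACK_SHOWING;
      shown++;
    } else {
      track.mode = TRACK_OFF;
    }
  }
  return shown;
}
```

With a media element exposing the proposed tracks attribute, a call such as showOnly(video.tracks, "captions", "en") would be all a custom controlbar needs, regardless of whether the tracks came from the manifest or from inside the media resource.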

Kind regards,


On Feb 10, 2011, at 2:07 AM, silviapfeiffer1 at gmail.com wrote:

> Hi all,
> One particular issue that hasn't had much discussion here yet is the
> issue of how to deal with multitrack media resources or media
> resources that have associated synchronized audio and video resources.
> I'm concretely referring to such things as audio descriptions, sign
> language video, and dubbed audio tracks.
> We require an API that can expose such extra tracks to the user and to
> JavaScript. This should be independent of whether the tracks are
> actually inside the media resource or are given as separate resources,
> but should be linked to the main media resource through markup.
> I am bringing this up now because solutions may have an influence on
> the inner workings of TimedTrack and the <track> element, so before we
> have any implementations of <track>, we should be very certain that we
> are happy with the way in which it works - in particular that <track>
> continues to stay an empty element.
> We've had some preliminary discussions about this in the W3C
> Accessibility Task Force and the alternatives that we could think
> about are captured in
> http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_API . This
> may not be the complete list of possible solutions, but it provides
> ideas for the different approaches that can be taken.
> I'd like to see what people's opinions are about them.
> Note there are also discussion threads about this at the W3C both in
> the Accessibility TF [1] and the HTML Working Group [2], but I am
> curious about input from the wider community.
> So check out http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_API
> and share your opinions.
> Cheers,
> Silvia.
> [1] http://lists.w3.org/Archives/Public/public-html-a11y/2011Feb/0057.html
> [2] http://lists.w3.org/Archives/Public/public-html/2011Feb/0205.html
