[whatwg] On implementing videos with multiple tracks in HTML5

Sun May 23 04:03:53 PDT 2010

Hi Carlos,

2010/5/23 Carlos Andrés Solís <csolisr at gmail.com>:
> Hello, I've been writing lately in the WHATWG and WebM mail-lists and would
> like to hear your opinion on the following idea.
>
> Imagine a hypothetical website that delivers videos in multiple languages.
> Like on a DVD, where you can choose your audio and subtitles language. And
> also imagine there is the possibility of downloading a file with the video,
> along with either the chosen audio/sub tracks, or all of them at once. Right
> now, though, there's no way to deliver multiple audio and subtitle streams
> on HTML5 and WebM. Since the latter supports only one audio and one video
> track, with no embedded subtitles, creating a file with multiple tracks is
> impossible, unless using full Matroska instead of WebM - save for the fact
> that the standard proposed is WebM and not Matroska.
> A solution could be to stream the full Matroska with all tracks embedded.
> This, though, would be inefficient, since the user often will select only
> one language to view the video, and there's no way yet to stream only the
> selected tracks to the user. I have thought of two solutions for this:
>
> * Solution 1: Server-side demuxing. The video with all tracks is stored as a
> Matroska file. The server demuxes the file, generates a new one with the
> chosen tracks, and streams only the tracks chosen by the user. When the user
> chooses to download the full video, the full Matroska file is downloaded
> with no overhead. The downside is the server-side demuxing and remuxing;
> fortunately most users only need to choose once. Also, there's the problem
> of having to download the full file instead of a file with only the tracks
> wanted; this could be solved by even more muxing.

For the last 10 years, we have tried to solve many of the media
challenges on servers, making servers increasingly intelligent, and
thereby slow and no longer plain HTTP servers. Much of that happened
in proprietary software, but others tried it with open software, too.
For example, I worked on a project called Annodex, which tried to
make open media resources available on normal HTTP servers with only a
CGI script installed that would remux files in order to serve time
segments of the media resources. Or look at any of the open source
RTSP streaming servers that were created.

We have learnt in the last 10 years that the Web is better served with
a plain HTTP server than with custom media servers and we have started
putting the intelligence into user agents instead. User agents now
know how to do byte range requests to retrieve temporal segments of a
media resource. I believe for certain formats it's even possible to
retrieve tracks through byte range requests only.
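
To make the byte range idea concrete, here is a rough sketch in Python
of how a client could turn a seek time into a Range request against a
plain HTTP server. The seek index, its values and the example URL are
made up for illustration; in a real user agent the mapping from time
to byte offset comes from the container format's own index. The
request is only built, not sent.

```python
import bisect
import urllib.request

# Hypothetical seek index: (start time in seconds, byte offset) pairs,
# sorted by time. Real values would come from the media container.
SEEK_INDEX = [(0.0, 0), (10.0, 250_000), (20.0, 510_000), (30.0, 760_000)]


def byte_offset_for(seconds, index=SEEK_INDEX):
    """Return the byte offset of the last index entry at or before `seconds`."""
    times = [t for t, _ in index]
    i = bisect.bisect_right(times, seconds) - 1
    return index[max(i, 0)][1]


def range_request(url, seconds):
    """Build (but do not send) a request for the resource from `seconds` on."""
    req = urllib.request.Request(url)
    # Open-ended byte range: everything from the offset to the end of the file.
    req.add_header("Range", f"bytes={byte_offset_for(seconds)}-")
    return req


req = range_request("http://example.com/video.webm", 25.0)
print(req.get_header("Range"))
```

A server that honours the range answers with "206 Partial Content" and
only the requested bytes, so no media-specific logic is needed on the
server side.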

In short, the biggest problem with your idea of dynamic muxing on a
server is that it's very CPU intensive and doesn't lead easily to a
scalable server. Also, it leads to specialised media servers in
contrast to just using a simple HTTP server. It's possible, of course,
but it's complex and not general-purpose.

> * Solution 2: User-side muxing. Each track (video, audio, subtitles) is
> stored in standalone files. The server streams the tracks chosen by the
> user, and the web browser muxes them back. When the user chooses to download
> the video, the generation of the file can be done either server-side or
> client-side. This can be very dynamic but will force content providers to
> use extra coding inside of the pages.

Again, we've actually tried this over the last 10 years with SMIL.
However, synchronising audio and video that comes from multiple
servers and therefore has different network delays, different
buffering rates, different congestion times, etc. makes it really
difficult to keep multiple media resources in sync.

You don't actually have to rip audio and video apart to achieve what
you're trying to do. Different Websites are created for different
languages, too. So, I would expect that if your Website is in Spanish,
you will get your video with a Spanish audio track, or when it's in
German, your audio will be German. Each one of these is a media
resource with a single audio and a single video track. Yes, your video
track is replicated on the server between these different resources.
But that's probably easier to handle from a production point of view
anyway.
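
As a small sketch of that one-resource-per-language setup (the file
names and the fallback language are my own assumptions, purely for
illustration), the page only needs to pick the resource that matches
its own language:

```python
# Hypothetical mapping from page language to a complete media resource,
# each with one video track and one audio track in that language.
RESOURCES = {
    "es": "video_es.webm",
    "de": "video_de.webm",
    "en": "video_en.webm",
}


def resource_for(page_lang, default="en"):
    """Pick the media resource for a page language tag such as 'es' or 'es-MX'."""
    primary = page_lang.lower().split("-")[0]
    return RESOURCES.get(primary, RESOURCES[default])


print(resource_for("es-MX"))
print(resource_for("fr"))
```

The server then just serves static files; no muxing or demuxing is
involved at request time.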

The matter with subtitle / caption tracks is then a separate one. You
could embed all of the subtitle tracks in all the media resources to
make sure that when a file is downloaded, it comes with its
alternative subtitle tracks. That's not actually that huge an
overhead, seeing as text tracks take up far less space than the audio
and video data.

Or alternatively you could have the subtitle tracks as extra files.
This is probably the preferred mode of operation and most conformant
with traditional Web principles, seeing as they are text resources and
the best source of information for indexing the content of a media
resource in, e.g. a search engine. Also, such files are much easier to
administer than if they are inside a media resource: they are easier
to produce separately from the media resource and add later, easier to
edit post-publishing, and easier to provide from e.g. a database
rather than as an actual file.

It is this latter approach that the new HTML5 <track> element is
pursuing. In this scenario, the Web browser will indeed synchronise
the text with the media resource for playback. It doesn't need to do
muxing for this, since it only needs to display the media resource and
the text in sync, not actually create a new resource. Whether we want
to take the next step and do an actual muxing on the client for a
downloaded media resource with multiple <track> elements is a question
that needs to be discussed. It is indeed a possibility. But it's not
something I'm worried about, since there are tools available for
muxing that I could use if I really wanted to create such a file after
downloading the individual text tracks.
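
To illustrate why synchronised display needs no muxing, here is a
minimal Python sketch of the idea: given the cues of an external text
track and the current playback time, work out which text to show. The
Cue structure and the sample cues are assumptions of mine for
illustration and are not taken from the <track> specification.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the media resource
    end: float    # seconds; the cue is hidden from this time on
    text: str


def active_cues(cues, current_time):
    """Return the text of every cue that should be visible at current_time."""
    return [c.text for c in cues if c.start <= current_time < c.end]


# Hypothetical cues, as they might be parsed from a separate subtitle file.
cues = [Cue(0.0, 2.5, "Hola"), Cue(2.5, 5.0, "Adios")]

print(active_cues(cues, 1.0))
print(active_cues(cues, 6.0))
```

The media resource itself is never touched; the player only asks for
the current time and renders the matching text on top of the video.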

Yet, I have to say that for the situation where we are actually
dealing with multitrack media resources in HTML5, we still haven't got
an interface available. There is a proposal at
http://www.w3.org/WAI/PF/HTML/wiki/Media_MultitrackAPI, which is still
in the W3C bug tracker and will need to be adapted to work with the
new <track> element before being introduced.

Cheers,
Silvia.