[whatwg] On implementing videos with multiple tracks in HTML5

Thu Aug 19 17:23:45 PDT 2010

On Fri, Aug 20, 2010 at 9:58 AM, Ian Hickson <ian at hixie.ch> wrote:

> On Sat, 22 May 2010, Carlos Andrés Solís wrote:
> >
> > Imagine a hypothetical website that delivers videos in multiple
> > languages. Like on a DVD, where you can choose your audio and subtitles
> > language. And also imagine there is the possibility of downloading a
> > file with the video, along with either the chosen audio/sub tracks, or
> > all of them at once. Right now, though, there's no way to deliver
> > multiple audio and subtitle streams on HTML5 and WebM. Since the latter
> > supports only one audio and one video track, with no embedded subtitles,
> > creating a file with multiple tracks is impossible, unless using full
> > Matroska instead of WebM - save for the fact that the standard proposed
> > is WebM and not Matroska.
> >
> > A solution could be to stream the full Matroska with all tracks
> > embedded. This, though, would be inefficient, since the user often will
> > select only one language to view the video, and there's no way yet to
> > stream only the selected tracks to the user. I have thought of two
> > solutions for this:
> >
> > * Solution 1: Server-side demuxing. The video with all tracks is stored
> > as a Matroska file. The server demuxes the file, generates a new one
> > with the chosen tracks, and streams only the tracks chosen by the user.
> > When the user chooses to download the full video, the full Matroska file
> > is downloaded with no overhead. The downside is the server-side demuxing
> > and remuxing; fortunately most users only need to choose once. Also,
> > there's the problem of having to download the full file instead of a
> > file with only the tracks wanted; this could be solved by even more
> > muxing.
>
> On Sun, 23 May 2010, Silvia Pfeiffer wrote:
> >
> > For the last 10 years, we have tried to solve many of the media
> > challenges on servers, making servers increasingly intelligent, and as a
> > result slow, and no longer real HTTP servers. Much of that happened in
> > proprietary software, but others tried it with open software, too. For
> > example I worked on a project called Annodex which was trying to make
> > open media resources available on normal HTTP servers with only a cgi
> > script installed that would allow remuxing files for serving time
> > segments of the media resources. Or look at any of the open source RTSP
> > streaming servers that were created.
> >
> > We have learnt in the last 10 years that the Web is better served with a
> > plain HTTP server than with custom media servers and we have started
> > putting the intelligence into user agents instead. User agents now know
> > how to do byte range requests to retrieve temporal segments of a media
> > resource. I believe for certain formats it's even possible to retrieve
> > tracks through byte range requests only.
> >
> > In short, the biggest problem with your idea of dynamic muxing on a
> > server is that it's very CPU intensive and doesn't lead easily to a
> > scalable server. Also, it leads to specialised media servers in contrast
> > to just using a simple HTTP server. It's possible, of course, but it's
> > complex and not general-purpose.
>
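As a minimal sketch of the byte range requests mentioned above: the snippet below only builds the request a user agent might send for one segment of a media resource. The URL and byte offsets are hypothetical, and the mapping from a time offset to a byte offset (which would come from the container's index) is assumed and not shown.

```python
from urllib.request import Request


def range_request(url: str, first_byte: int, last_byte: int) -> Request:
    """Build a GET request for an inclusive byte range of a resource."""
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("invalid byte range")
    # HTTP byte ranges are inclusive at both ends.
    return Request(url, headers={"Range": f"bytes={first_byte}-{last_byte}"})


# Hypothetical URL and offsets, for illustration only.
req = range_request("http://example.com/video.webm", 1_000_000, 1_999_999)
print(req.get_header("Range"))
```

A plain HTTP server that supports ranges would answer such a request with 206 Partial Content, so no media-specific logic is needed on the server side.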
> On Mon, 31 May 2010, Lachlan Hunt wrote:
> >
> > WebM, just like Matroska, certainly does support multiple video and
> > audio tracks.  The current limitation is that browser implementations
> > don't yet provide an interface or API for track selection.
> >
> > Whether or not authors would actually do this depends on their use case
> > and what trade offs they're willing to make.  The use cases I'm aware of
> > for multiple tracks include offering stereo and surround sound
> > alternatives, audio descriptions, audio commentaries or multiple
> > languages.
> >
> > The trade off here is in bandwidth usage vs. storage space (or
> > processing time if you're doing dynamic server side muxing). Duplicating
> > the video track in each file, so that each file contains only a single
> > audio track, saves bandwidth for users while increasing storage space.
> > Storing all audio tracks in one multi-track WebM file avoids duplication,
> > while increasing the bandwidth for users downloading tracks they may not
> > need.
> >
> > The latter theoretically allows the user to dynamically switch audio
> > tracks, e.g. to change language or listen to commentary, without having
> > to download a whole new copy of the video.  The former requires the user
> > to choose which tracks they want prior to downloading the appropriate
> > file.
> >
> > If there's only a choice between 2 or maybe 3 tracks, then the extra
> > bandwidth may be insignificant.  If, however, you're offering several
> > alternate languages in both stereo and surround sound, with audio
> > descriptions and director's commentary - the kind of stuff you'll find
> > on many commercial DVDs - then the extra bandwidth wasted by users
> > downloading so many tracks they don't need may not be worth it.
>
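To put rough numbers on the trade-off described above, here is a back-of-the-envelope sketch. The bitrates, track count and duration are made-up example values. It also assumes that a viewer of the multi-track file downloads the whole file (no per-track retrieval), and it ignores container overhead.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    storage_bytes: int   # total bytes kept on the server
    download_bytes: int  # bytes one viewer transfers


def stream_bytes(kbps: int, duration_s: int) -> int:
    """Size of one constant-bitrate stream (kbps = 1000 bits per second)."""
    return kbps * 1000 // 8 * duration_s


def one_file_per_audio_track(video_kbps, audio_kbps, n_audio, duration_s) -> Cost:
    """Video duplicated into n_audio files, each with a single audio track."""
    video = stream_bytes(video_kbps, duration_s)
    audio = stream_bytes(audio_kbps, duration_s)
    return Cost(n_audio * (video + audio), video + audio)


def single_multitrack_file(video_kbps, audio_kbps, n_audio, duration_s) -> Cost:
    """One file holding the video once plus all n_audio audio tracks."""
    video = stream_bytes(video_kbps, duration_s)
    audio = stream_bytes(audio_kbps, duration_s)
    total = video + n_audio * audio
    return Cost(total, total)


# Hypothetical example: 1000 kbps video, 128 kbps audio, 8 tracks, 1 hour.
split = one_file_per_audio_track(1000, 128, 8, 3600)
multi = single_multitrack_file(1000, 128, 8, 3600)
print(split)
print(multi)
```

With these example values the per-track layout stores several times more data on the server, while the multi-track layout makes every viewer download nearly twice as much. The balance shifts as the number of audio tracks changes.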
> On Sat, 22 May 2010, Carlos Andrés Solís wrote:
> >
> > * Solution 2: User-side muxing. Each track (video, audio, subtitles) is
> > stored in standalone files. The server streams the tracks chosen by the
> > user, and the web browser muxes them back. When the user chooses to
> > download the video, the generation of the file can be done either
> > server-side or client-side. This can be very dynamic but will force
> > content providers to use extra coding inside of the pages.
>
> On Sun, 23 May 2010, Silvia Pfeiffer wrote:
> >
> > Again, we've actually tried this over the last 10 years with SMIL.
> > However, synchronising audio and video that comes from multiple
> > servers and therefore has different network delays, different
> > buffering rates, different congestion times, etc. makes it really
> > difficult to keep multiple media resources in sync.
> >
> > You don't actually have to rip audio and video apart to achieve what
> > you're trying to do. Different Websites are created for different
> > languages, too. So, I would expect that if your Website is in Spanish,
> > you will get your video with a Spanish audio track, or when it's in
> > German, your audio will be German. Each one of these is a media
> > resource with a single audio and a single video track. Yes, your video
> > track is replicated on the server between these different resources.
> > But that's probably easier to handle from a production point of view
> > anyway.
>
> Silvia's comments pretty much parallel my own understanding of this
> situation (maybe because much of my understanding comes from Silvia
> educating me on these topics!).
>
> The long and short of it is that it's probably too early to add more
> features along these lines to HTML. As Silvia points out, we haven't even
> solved the comparatively simple problem of localising a Web page. It may
> be that we don't need to.
>
>
> On Sun, 23 May 2010, Silvia Pfeiffer wrote:
> >
> > The matter with subtitle / caption tracks is then a separate one. You
> > could embed all of the subtitle tracks in all the media resources to
> > make sure that when a file is downloaded, it comes with its
> > alternative subtitle tracks. That's not actually that huge an
> > overhead, seeing as text tracks take up the least space compared to
> > the audio and video data.
> >
> > Or alternatively you could have the subtitle tracks as extra files.
> > This is probably the preferred mode of operation and most conformant
> > with traditional Web principles, seeing as they are text resources and
> > the best source of information for indexing the content of a media
> > resource in, e.g. a search engine. Also, such files are much easier to
> > administer than if they are inside a media resource - easier to
> > produce separately from the media resource and add later - easier to
> > edit post-publishing - and easier to provide from e.g. a database
> > rather than as an actual file.
> >
> > It is this latter approach that the new HTML5 <track> element is
> > pursuing. In this scenario, the Web browser will indeed synchronise
> > the text with the media resource for playback. It doesn't need to do
> > muxing for this, since it only needs to display the media resource and
> > the text in sync, not actually create a new resource. Whether we want
> > to take the next step and do an actual muxing on the client for a
> > downloaded media resource with multiple <track> elements is a question
> > that needs to be discussed. It is indeed a possibility. But it's not
> > something I'm worried about, since there are tools available for
> > muxing that I can use if I really wanted to create such a file after
> > downloading the individual text tracks.
>
> Yeah, people definitely want the ability to have external text timed
> tracks.
>
>
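As a rough illustration of why displaying an external text track in sync needs no muxing: the player only has to look up which cue is active at the current playback time. The sketch below is not how any particular browser does it. The `Cue` type and `active_cue` function are hypothetical names, and the cues are assumed to be sorted by start time and non-overlapping.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive
    text: str


def active_cue(cues: List[Cue], t: float) -> Optional[Cue]:
    """Return the cue to display at playback time t, or None."""
    # Rebuilt on every call for brevity; a real player would cache this.
    starts = [c.start for c in cues]
    # Index of the last cue whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < cues[i].end:
        return cues[i]
    return None


# Made-up cues, for illustration only.
cues = [Cue(0.0, 2.5, "Hola"), Cue(3.0, 5.0, "Adiós")]
print(active_cue(cues, 1.0))
print(active_cue(cues, 2.7))
```

The media resource and the text file stay separate throughout. Only the current playback time links them.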
I have taken three issues out of this discussion that I think are still open
for discussion and potentially for definition in the spec:

* How to expose in-band extra audio and video tracks from a multi-track
media resource to the Web browser? I am particularly thinking here about the
use cases Lachlan mentioned: offering stereo and surround sound
alternatives, audio descriptions, audio commentaries or multiple languages,
and would like to add sign language tracks to this list. This is important
to solve now, since it will allow the use of audio descriptions and sign
language, two important accessibility requirements.

* How to associate and expose such extra audio and video tracks that are
provided out-of-band to the Web browser? This is probably a next-version
issue since it's rather difficult to implement in the browser. It would help
meet accessibility needs, but deferring it doesn't stand in the way of
providing audio descriptions and sign language - it would just make them
easier to use.

* Whether to include a multiplexed download functionality in browsers for
media resources, where the browser would do the multiplexing of the active
media resource with all the active text, audio and video tracks? This could
be a context menu function, so there is probably no need to include it in
the HTML5 spec, but it's something that browsers could consider providing.
And since muxing isn't quite as difficult as e.g. decoding video, it could
actually be fairly cheap to implement.

Cheers,
Silvia.