[whatwg] Timed tracks for <video>

Fri Jul 23 06:54:06 PDT 2010

On Jul 23, 2010, at 08:40, Ian Hickson wrote:

> I recently added to the HTML spec a mechanism by which external subtitles 
> and captions can be added to videos in HTML.

Thanks! I like most parts of the new mechanism. I'm commenting just on what I think should be changed, but please don't read this as being overall negative.

> - A set of rules and processing models to hold it all together.

Is it intentional that WebSRT doesn't come with any examples?

> - Keep implementation costs for standalone players low.

I think this should be a non-goal. It seems to me that trying to cater for non-browser user agents or non-Web uses in Web specs leads to bad Web specs. I think by optimizing for standalone players WebSRT falls into one of the common traps for Web specs. I think we should design for the Web (where the rendering is done by browser engines).

> - Use existing technologies where appropriate.
[...]
> - Try as much as possible to have things Just Work.

I think that by specifying a standalone cue text parser, WebSRT fails on these counts compared to reusing the HTML fragment parsing algorithm for parsing cue text. Specifying a new parser for turning HTML-like tags into a tree structure that can be used as the input of a CSS formatter fails to reuse existing technologies where appropriate (though obviously we disagree on what's "appropriate"). Supporting only some tags and failing to support <font color> from existing .srt files fails on "Just Work" in two ways: existing .srt files don't Just Work, and markup that one would expect to work because it looks like HTML doesn't Just Work either.

> I first researched (with some help from various other contributors - 
> thanks!) what kinds of timed tracks were common. The main classes of use 
> cases I tried to handle were plain text subtitles (translations) and 
> captions (transcriptions) with minimal inline formatting and karaoke 
> support, chapter markers so that browsers could provide quick jumps to 
> points in the video, text-driven audio descriptions, and application- 
> specific timed data.

Why karaoke and application-specific data? Those both seem like feature creep compared to the core cases of subtitles and captions.

> If we don't use HTML wholesale, then there's really no reason to use HTML 
> at all. (And using HTML wholesale is not really an option, as you say 
> above.)

I disagree. The most obvious way of reusing existing infrastructure in browsers, the most obvious way of getting support for future syntax changes that support attributes or new tag names and the most obvious way to get error handling that behaves in the way the appearance of the syntax suggests is to reuse the HTML fragment parsing algorithm for parsing the cue text.

> On Thu, 16 Jul 2009, Philip Jägenstedt wrote:
>> 
>> There are already more formats than you could possibly want on the scale 
>> between SRT (dumb text) and complex XML formats like DFXP or USF (used 
>> in Matroska).
> 
> Indeed. I tried to examine all of them, but many had no documentation that 
> I could find. The results are in the wiki page cited above.

Using the WebSRT container to transfer potentially arbitrary HTML has the benefit of scaling down as well as (Web)SRT while also scaling up to pretty much anything (esp. with SVG-in-HTML).

> I've defined some CSS extensions to allow us to use CSS with SRT.

The new CSS pseudos would be unnecessary if each cue formed a DOM by parsing "<!DOCTYPE html>" as HTML (to get a skeleton DOM in the standards mode) and then document.body.innerHTML were set to the cue text.

This way, to style the entire cue, the author would select html or body. There'd be no need for ::cue. Likewise, there'd be no need for the ::cue-part stuff if the voice became the className of either the root or body and the rest of the cue settings were set as attributes (on root or body).
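
To make the per-cue DOM idea concrete, here is a rough sketch. It is only a model: Python's html.parser stands in for the real HTML fragment parsing algorithm (which it is not), and the names Element, cue_to_mini_dom, matches_class and the "data-align" attribute are made up for illustration. In a browser the equivalent would be a skeleton document whose body gets the voice as its class and the cue text via the innerHTML setter.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have children; a tiny, incomplete list (assumption).
VOID = {"br", "hr", "img", "wbr"}


@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element or str


class _CueTextParser(HTMLParser):
    """Naive stand-in for the HTML fragment parser: builds a tree under <body>."""

    def __init__(self, body):
        super().__init__()
        self._stack = [body]

    def handle_starttag(self, tag, attrs):
        el = Element(tag, {k: v or "" for k, v in attrs})
        self._stack[-1].children.append(el)
        if tag not in VOID:
            self._stack.append(el)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].children.append(data)


def cue_to_mini_dom(cue_text, voice, settings):
    """Build html > (head, body); the voice becomes body's class, settings become attributes."""
    body = Element("body", {"class": voice, **settings})
    parser = _CueTextParser(body)
    parser.feed(cue_text)
    parser.close()
    return Element("html", {}, [Element("head"), body])


def matches_class(element, class_name):
    """What a plain class selector such as .narrator would test."""
    return class_name in element.attrs.get("class", "").split()
```

With this shape, an ordinary selector such as body.narrator would style the whole cue, and ordinary descendant selectors would style markup inside it, without any new pseudo-elements.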

>> Further, SRT has no way to specify which language it is written in
> 
> What's the use case?

CJK font selection. Also speech generator language selection if timed text is used to drive synthetic audio description.

>> I actually quite like the general idea behind Silvia's 
>> http://wiki.xiph.org/Timed_Divs_HTML
>> 
>> This is somewhat similar to the <timerange> proposal that David Singer 
>> and Eric Carlson from Apple have brought up a few times.
> 
> I am very reluctant to have such a verbose format be used for such dense 
> data files as captions. It works for HTML because the use of markup is 
> balanced with the text (though it can get pretty borderline, e.g. the HTML 
> spec itself has a high ratio of markup to text). It's not a nice format 
> for more dense data, IMHO.

I agree. Furthermore, the WebSRT container is better suited for multiplexing the same captioning format into the video file, because it doesn't have a root element and it doesn't create the expectation that the entire Timed DIV markup exists in a stylable DOM at any one time.

If Timed DIVs were multiplexed into a video file, the solution would need to support seeking. If content were incrementally appended to one DOM containing the entire captioning file, the DOM could look different based on seeking history. This would make sibling selectors, nth-child, etc. match differently based on seeking history, which would be very bad. Therefore, we'd need to contain each cue in a mini-DOM, just as with WebSRT. To keep the processing model consistent between standalone files and multiplexed captions, the mini-DOMs would need to be used even in the standalone file case. However, since the entire Timed DIV file is markup, the author expectation would be that there's one DOM for everything.
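
A toy model of the seeking problem, in case it helps. This is only an illustration under simplified assumptions: cues are plain identifiers, the "shared DOM" is a list that a cue is appended to the first time playback reaches it, and the function names are made up.

```python
def shared_dom_positions(seek_history):
    """Append each cue to one growing list the first time playback reaches it.

    Returns the 1-based child index of every cue, i.e. what :nth-child()
    would see in a single shared DOM.
    """
    dom = []
    for cue_id in seek_history:
        if cue_id not in dom:
            dom.append(cue_id)
    return {cue_id: index + 1 for index, cue_id in enumerate(dom)}


def mini_dom_positions(seek_history):
    """With one mini-DOM per cue, each cue is always the sole content of its own document."""
    return {cue_id: 1 for cue_id in seek_history}


# Linear playback versus a viewer who jumps to the end first and then back.
linear = shared_dom_positions(["c1", "c2", "c3"])
seeked = shared_dom_positions(["c3", "c1"])
```

In the shared model, "c3" is the third child after linear playback but the first child after the seek, so a selector like :nth-child(1) matches different cues depending on history. In the per-cue model the answer never depends on how the viewer got there.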

Using the WebSRT container for transporting markup snippets makes the expectations set by the appearance of the format match the processing model, which is nice.
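
As a sketch of what "container for markup snippets" means in practice: the container layer only needs to find the timing line of each cue and can hand the rest over as an opaque string of markup. The format below is a simplified, SRT-like assumption (blank-line separated blocks, an optional identifier line, "start --> end" with either "," or "." before the milliseconds, anything after the end time kept as raw settings tokens); it is not the actual WebSRT grammar, and parse_cues is a made-up name.

```python
import re

# hh:mm:ss,mmm --> hh:mm:ss,mmm followed by optional settings (simplified assumption).
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[.,](\d{3})\s+-->\s+"
    r"(\d{2}):(\d{2}):(\d{2})[.,](\d{3})(.*)"
)


def _seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(text):
    """Split a file into cues; the cue text is kept as an uninterpreted markup string."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Skip an optional identifier line before the timing line.
        idx = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if idx is None:
            continue
        match = TIMING.match(lines[idx].strip())
        if not match:
            continue
        groups = match.groups()
        cues.append({
            "start": _seconds(*groups[0:4]),
            "end": _seconds(*groups[4:8]),
            "settings": groups[8].split(),
            "markup": "\n".join(lines[idx + 1:]),
        })
    return cues
```

Nothing in this layer cares whether "markup" is plain text, a couple of <i> tags or something much fancier, which is why the same container scales from dumb text up to arbitrary HTML.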

>> - Not usable outside the web platform (i.e. outside of web browsers).
> 
> The last of these is pretty critical, IMHO.

Even if we enabled full HTML in browsers, the vast, vast majority of WebSRT files wouldn't use fancy markup. HTML parsers are / will become off-the-shelf software, so parsing isn't a problem. For the rendering side, we can sprinkle the "CSS is optional" pixie dust and non-browser apps will be just fine with the vast majority of subtitling or captioning WebSRT files.

> It would also result in some pretty complicated situations, like captions 
> containing <video>s themselves.

If the processing is defined in terms of nested browsing contexts, the task queue and innerHTML setter, the "right" behavior falls out of that.

>> Pros:
>> + Styling using CSS and only CSS.
> 
> We'd need extensions (for timing, to avoid different caption streams 
> overlapping), so I think this would end up being no better than what we've 
> ended up with with WebSRT.

Above I outlined how WebSRT with innerHTML setter parsing for cues doesn't need selector extensions at all.

> WebSRT has classes, if I understand you correctly (search for "voice").

So let's make it so that a class selector matches on voice (see above).

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/