[whatwg] Timed tracks for <video>

Philip Jägenstedt philipj at opera.com
Fri Jul 23 05:39:31 PDT 2010

On Fri, 23 Jul 2010 07:40:57 +0200, Ian Hickson <ian at hixie.ch> wrote:

> I recently added to the HTML spec a mechanism by which external subtitles
> and captions can be added to videos in HTML.

Thanks for all the time spent putting this together. Now it's our turn to
review it to pieces again :)


> On Thu, 16 Jul 2009, Philip Jägenstedt wrote:
>> In my layman opinion both extremes make sense, but anything in between
>> I'm rather skeptical to.
> Is the SRT variant described in the spec extreme enough to make sense?

It's hard to quantify, so I'll simply criticize the things I don't like below.

Generally, given <track kind=metadata>, you have the tools to go as  
extreme as you want with scripts, SVG and CSS, at the cost of only working  
inside a web browser.

>> As far as I can tell no browser wants to implement the addCueRange API
>> (removing this should be the topic of a separate mail), so we really
>> need to re-think this part and I think that timed text plays an
>> important part here.
> The addCueRange() API has been removed and replaced with a feature based
> on the subtitle mechanism.

Without having reviewed this in detail, I'm pretty happy with how it  
turned out. I'm not a fan of pauseOnExit, though, mostly because it seems  
non-trivial to implement. Since it is last in the argument list of  
TimedTrackCue, it will be easy to just ignore when implementing. I still  
don't think the use cases for it are enough to motivate the implementation.

> On Fri, 31 Jul 2009, Silvia Pfeiffer wrote:
>> * It is unclear, which of the given alternative text tracks in different
>> languages should be displayed by default when loading an <itext>
>> resource. A @default attribute has been added to the <itext> elements to
>> allow for the Web content author to tell the browser which <itext>
>> tracks he/she expects to be displayed by default. If the Web author does
>> not specify such tracks, the display depends on the user agent (UA -
>> generally the Web browser): for accessibility reasons, there should be a
>> field that allows users to always turn display of certain <itext>
>> categories on. Further, the UA is set to a default language and it is
>> this default language that should be used to select which <itext> track
>> should be displayed.
> It's not clear to me that we need a way to do this; by default presumably
> tracks would all be off unless the user wants them, in which case the
> user's preferences are paramount. That's what I've specced currently.
> However, it's easy to override this from script.

It seems to me that this is much like <video autoplay> in that if we don't  
provide a markup solution, everyone will use scripts and it will be more  
difficult for the UA to override with user prefs.

> On Fri, 31 Jul 2009, Philip Jägenstedt wrote:
>> * Security. What restrictions should apply for cross-origin loading?
> Currently the files have to be same-origin. My plan is to wait for CORS to
> be well established and then use it for timed tracks, video files, images
> on <canvas>, text/event-stream resources, etc.

If I'm interpreting the track fetch algorithm correctly, same-origin is
strictly enforced and a cross-origin fetch is treated as a network error.
This is different from e.g. <img> and <video>, but it seems to make things
simpler, so I'm fine with that. It also ensures that JavaScript fallback
handling of <track> won't fail just because XHR is blocked cross-origin.
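
A minimal sketch of the same-origin test as I read it, not spec text. The
helper name isSameOrigin and the URLs in the usage note are made up for
illustration; only the standard URL API is assumed.

```javascript
// Hypothetical helper: true when the track URL shares scheme, host and port
// with the document that references it.
function isSameOrigin(trackUrl, documentUrl) {
  // Resolve a relative track URL against the document URL first.
  const track = new URL(trackUrl, documentUrl);
  const doc = new URL(documentUrl);
  return track.origin === doc.origin;
}
```

With this, "subs/en.srt" referenced from "http://example.com/video.html"
passes, while "http://other.example/en.srt" would be the network-error case.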

>> * Complexity. There is no limit to the complexity one could argue for
>> (bouncing ball multi-color karaoke with fan translations/annotations
>> anyone?). We should accept that some use cases will require creative use
>> of scripts/SVG/etc and not even try to solve them up-front. Draw a line
>> and stick to it.
> Agreed. Hopefully you agree with where I drew the line! :-)

Actually, I think both karaoke (in-cue timestamps) and ruby are  
borderline, but it depends on how difficult it is to implement.

One thing in particular to note about karaoke is that even with in-cue  
timestamps, CSS still won't be enough to get the typical effect of  
"wiping" individual characters from one style to another, since the  
smallest unit you can style is a single character. To get that effect  
you'd have to render the characters in two styles and then cut them  
together (with <canvas> or just clipping <div>s). Arguably, this is a  
presentation issue that could be fixed without changing WebSRT.
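
To make the clipping idea concrete, here is a rough sketch of the arithmetic
only, assuming the wipe moves linearly between two in-cue timestamps. The
function names and the pixel width parameter are hypothetical; the actual
drawing (two stacked <div>s or a <canvas>) is left out.

```javascript
// Fraction (0..1) of a text run to show in the "sung" style, assuming a
// linear wipe between the in-cue timestamps start and end (in seconds).
function wipeFraction(currentTime, start, end) {
  // Degenerate range: the run flips style all at once.
  if (end <= start) return currentTime >= end ? 1 : 0;
  const raw = (currentTime - start) / (end - start);
  return Math.min(1, Math.max(0, raw));
}

// Width in pixels of the clipping box holding the "sung" copy of the text,
// which is laid over the "unsung" copy.
function clipWidth(currentTime, start, end, textWidthPx) {
  return wipeFraction(currentTime, start, end) * textWidthPx;
}
```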

> On Thu, 15 Apr 2010, Silvia Pfeiffer wrote:
>> Further, SRT has no way to specify which language it is written in
> What's the use case?

As hints for font selection and speech synthesis.

> On Thu, 15 Apr 2010, Philip Jägenstedt wrote:
>> While I don't favor TTML, I also don't think that extending SRT is a
>> great way forward, mostly because I don't see how to specify the
>> language (which sometimes helps font selection),
> That's done in the <track> element. It can't be in the file, since you
> need to know it before downloading the file (otherwise you'd have to
> download all the files to update the UI).

Good enough. Multi-lingual subtitles are still a problem, but also very  
rare (the only ones I've seen are those that I wrote myself).

>> apply document-wide styling,
> I just used the document's own styles.
>> reference external style sheets,
> I just did that from the document.
>> use webfonts, etc...
> Since the styles come from the document, the fonts come from there too.

I'm happy with how this turned out.

>> I actually quite like the general idea behind Silvia's
>> http://wiki.xiph.org/Timed_Divs_HTML
>> This is somewhat similar to the <timerange> proposal that David Singer
>> and Eric Carlson from Apple have brought up a few times.
> I am very reluctant to have such a verbose format be used for such dense
> data files as captions. It works for HTML because the use of markup is
> balanced with the text (though it can get pretty borderline, e.g. the  
> spec itself has a high ratio of markup to text). It's not a nice format
> for more dense data, IMHO.
>> No matter the syntax, the idea is basically to allow marking up certain
>> parts of HTML as being relevant for certain time ranges. A CSS
>> pseudo-selector matches the elements which are currently active, based
>> on the current time of the media.
>> So, the external subtitle file could simply be HTML, [...]
>> Cons:
>> - Politics.
>> - New format for subtitle authors and tools.
>> - Not usable outside the web platform (i.e. outside of web browsers).
> The last of these is pretty critical, IMHO.
> It would also result in some pretty complicated situations, like captions
> containing <video>s themselves.


>> Pros:
>> + Styling using CSS and only CSS.
> We'd need extensions (for timing, to avoid different caption streams
> overlapping), so I think this would end up being no better than what
> we've ended up with with WebSRT.

Agreed, this has been mostly solved, although the positioning of  
individual cues is still not controlled by CSS but rather by e.g. L:50%. I  
don't understand yet how this will interact with e.g. text-align,  
font-size, etc from CSS.

>> + Well known format to web authors and tools.
> SRT is pretty well-known in the subtitling community.


>> + High reuse of existing implementations.
> I think the incremental cost of implementing WebSRT is pretty minimal; I
> tried to make it possible for a browser to reuse all the CSS
> infrastructure, for instance.


>> + You could author CSS to make the HTML document read as a transcript
>> when opened directly.
> That isn't a use case I considered. Is it a use case we should address?

It could be useful, but I think a transcript could just as well be  
generated from the WebSRT input on the server-side or using scripts.
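
As a rough sketch of the script approach, assuming plain SRT-style input
(optional identifier line, a timing line, then text lines, with cues
separated by blank lines). The function name is made up, and this is not the
WebSRT parsing algorithm from the spec, just enough to show the idea.

```javascript
// Hypothetical helper: turn SRT-style cue text into a plain transcript,
// one line per cue.
function srtToTranscript(srtText) {
  // Matches "hh:mm:ss,mmm --> hh:mm:ss,mmm" (comma or dot before the millis).
  const timing = /^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}/;
  const lines = [];
  // Normalize line endings, then split into cues on blank lines.
  for (const block of srtText.replace(/\r\n?/g, "\n").split(/\n{2,}/)) {
    const rows = block.split("\n");
    const timingIndex = rows.findIndex((row) => timing.test(row));
    if (timingIndex === -1) continue; // not a cue, skip it
    const text = rows
      .slice(timingIndex + 1)
      .map((row) => row.replace(/<[^>]*>/g, "").trim()) // drop inline tags
      .filter((row) => row.length > 0)
      .join(" ");
    if (text) lines.push(text);
  }
  return lines.join("\n");
}
```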

>> + <timerange> reusable for page-embedded timed markup, which was the
>> original idea.
> I didn't end up addressing this use case. I think if we do this we should
> seriously consider how it interacts with SMIL/SVG. I also think it's
> something we should look at in conjunction with synchronising multiple
> <video> or <audio> elements, e.g. to do audio descriptions, dubbing,
> sign-language video overlays, split-screen video, etc.

This is pretty well covered by <track kind=metadata> and by registering  
cues with callbacks.
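
A rough sketch of what I mean by callbacks on cues, written as a standalone
dispatcher rather than against the real API. The cue shape (startTime,
endTime, onenter, onexit) and the name createCueDispatcher are assumptions
for illustration only.

```javascript
// Hypothetical cue shape: { startTime, endTime, onenter, onexit }.
// Returns a function to call with each new playback time; it fires
// onenter/onexit when a cue becomes active or stops being active.
function createCueDispatcher(cues) {
  const active = new Set();
  return function update(currentTime) {
    for (const cue of cues) {
      const isActive =
        currentTime >= cue.startTime && currentTime < cue.endTime;
      if (isActive && !active.has(cue)) {
        active.add(cue);
        if (cue.onenter) cue.onenter(cue);
      } else if (!isActive && active.has(cue)) {
        active.delete(cue);
        if (cue.onexit) cue.onexit(cue);
      }
    }
  };
}
```

In a page, update would be called from whatever reports the current playback
position of the media element.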

>> I'm also confused about the removal of the chapter tracks. These are
>> also time-aligned text files and again look very similar to SRT.
> I've also included support for chapters. Currently this support is not
> really fully fleshed out; in particular it's not defined how a UA should
> get chapter names out of the WebSRT file. I would like implementation
> feedback on this topic -- what do browser vendors envisage exposing in
> their UI when it comes to chapters? Just markers in the timeline? A
> dropdown of times? Chapter titles? Styled, unstyled?

A sorted list of chapters in a context menu at minimum, with the name of  
the chapter and probably the time where it starts. More fancy things on  
the timeline would be cool, but given that it's going to look completely  
different in all browsers and not be stylable I wonder if there's much  
point to it.
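
For the minimum case, something like the following sketch would do. The cue
shape (startTime in seconds plus the chapter name in text) and the function
names are assumptions; how the name is extracted from the WebSRT file is
exactly the part that isn't defined yet.

```javascript
// Format seconds as m:ss, or h:mm:ss once past the hour.
function formatTime(totalSeconds) {
  const s = Math.floor(totalSeconds);
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;
  const pad = (n) => String(n).padStart(2, "0");
  return hours > 0
    ? `${hours}:${pad(minutes)}:${pad(seconds)}`
    : `${minutes}:${pad(seconds)}`;
}

// Hypothetical cue shape: { startTime: seconds, text: chapter name }.
// Returns menu labels sorted by start time, without mutating the input.
function chapterMenu(cues) {
  return [...cues]
    .sort((a, b) => a.startTime - b.startTime)
    .map((cue) => `${formatTime(cue.startTime)} ${cue.text}`);
}
```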

Finally, random feedback:


The distinction between subtitles and captions isn't terribly clear.

It says that subtitles are translations, but plain transcriptions without  
cues for the hard of hearing would also be subtitles.

How does one categorize translations that are for the HoH?

In my opinion the fact that something is in the same language as what is  
spoken isn't something the UA can do anything useful with, and could be  
deduced from the language of the audio track anyway. Therefore, the  
language need not be part of the categorization. Instead, simply say that  
captions are subtitles that also have cues like [doorbell rings] for the  
hard of hearing.

Alternatively, might it not be better to simply use the voice "sound" for  
this and let the default stylesheet hide those cues? When writing  
subtitles I don't want the maintenance overhead of 2 different versions  
that differ only by the inclusion of [doorbell rings] and similar.  
Honestly, it's more likely that I just wouldn't bother with accessibility  
for the HoH at all. If I could add it with <sound>doorbell rings, it's far  
more likely I would do that, as long as it isn't rendered by default. This
is my preferred solution, together with keeping only one of kind=subtitles
and kind=captions. Enabling the HoH-cues could then be a global preference in
the browser, or done from the context menu of individual videos.
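
In script terms the behaviour I have in mind is roughly the sketch below.
The cue shape (a voice name plus text) and the showSoundCues flag are made up
for illustration; in a browser this would more likely be a default stylesheet
rule plus a user preference.

```javascript
// Hypothetical cue shape: { voice: string or null, text: string }.
// Cues voiced as "sound" are dropped unless the user has opted in.
function visibleCues(cues, showSoundCues = false) {
  return cues.filter((cue) => showSoundCues || cue.voice !== "sound");
}
```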

If we must have both kind=subtitles and kind=captions, then I'd suggest  
making the default subtitles, as that is without a doubt the most common  
kind of timed text. Making captions the default only means that most timed  
text will be mislabeled as being appropriate for the HoH when it is not.

Philip Jägenstedt
Core Developer
Opera Software
