[whatwg] Timed tracks: feedback compendium

Philip Jägenstedt philipj at opera.com
Mon Jan 3 07:57:50 PST 2011

On Sat, 25 Dec 2010 07:39:02 +0100, Ian Hickson <ian at hixie.ch> wrote:

>  + I've made <track> have a feature whereby a track can be enabled by
>    default so that users who would otherwise not have any tracks enabled
>    will get it enabled (without overriding the preferences of users who
>    would have some other track enabled by default).

For the lazy, it's <track default>, see  

>    text/vtt respectively.
>  + I've added a magic string that is required on the format to make it
>    recognisable in environments with no or unreliable type labeling.

Is there a reason it's "WEBVTT FILE" instead of just "WEBVTT"? "FILE"  
seems redundant and like unnecessary typing to me.
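
As a rough illustration of what the magic string buys a consumer with no reliable type labeling, here is a minimal sketch of a signature check. The function name has_signature is hypothetical, and skipping a UTF-8 byte order mark is an assumption of this sketch rather than something stated in the thread. The signature is a parameter because the exact string ("WEBVTT FILE" or just "WEBVTT") is what is being questioned here.

```python
def has_signature(data: bytes, signature: bytes = b"WEBVTT FILE") -> bool:
    """Return True if the first line of `data` is exactly `signature`."""
    # Assumption of this sketch: tolerate a leading UTF-8 byte order mark.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # Take everything up to the first LF and drop a trailing CR, if any.
    first_line = data.split(b"\n", 1)[0].rstrip(b"\r")
    return first_line == signature
```

A legacy SRT file, which starts with a cue id such as "1", would fail this check regardless of which of the two strings is chosen.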

> On Wed, 8 Sep 2010, Sam Dutton wrote:
>> >>
>> >> Also -- is trackgroup out of the spec?
>> >
>> > What is trackgroup?
>> I'd seen this in the Media TextAssociations documentation:
>> http://www.w3.org/WAI/PF/HTML/wiki/Media_TextAssociations#Examples
> The feature is mostly there, it's just expressed differently (it's done
> in a way similar to how <link> works: you specify all the relevant
> attributes on each <track>).
> On Wed, 8 Sep 2010, Philip Jägenstedt wrote:
>> In the discussion on public-html-a11y <trackgroup> was suggested to
>> group together mutually exclusive tracks, so that enabling one
>> automatically disables the others in the same trackgroup.
>> I guess it's up to the UA how to enable and disable <track>s now, but
>> the only option is making them all mutually exclusive (as existing
>> players do) or a weird kind of context menu where it's possible to
>> enable and disable tracks completely independently. Neither option is
>> great, but as a user I would almost certainly prefer all tracks being
>> mutually exclusive and requiring scripts to enable several at once.
> It's not clear to me what the use case is for having multiple groups of
> mutually exclusive tracks.
> The intent of the spec as written was that a browser would by default
> just have a list of all the subtitle and caption tracks (the latter with
> suitable icons next to them, e.g. the [CC] icon in US locales), and the
> user would pick one (or none) from the list. One could easily imagine a
> UA allowing the user to enable multiple tracks by having the user
> ctrl-click a menu item, though, or some similar solution, much like with
> the commonly seen select box UI.

In the vast majority of cases, all tracks are intended to be mutually  
exclusive, such as English+English HoH or subtitles in different  
languages. No media player UI (hardware or software) that I have ever used  
allows enabling multiple tracks at once. Without any kind of hint about  
which tracks make sense to enable together, I can't see desktop Opera  
allowing multiple tracks (of the same kind) to be enabled via the main UI.

>> > On Fri, 6 Aug 2010, Philip Jägenstedt wrote:
>> > >
>> > > I'm not particularly fond of the current voice markup, mainly for 2
>> > > reasons:
>> > >
>> > > First, a cue can only have 1 voice, which makes it impossible to
>> > > style cues spoken/sung simultaneously by 2 or more voices. There's a
>> > > karaoke example of this in
>> > >  
>> <http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_voices>
>> >
>> > That's just two cues.
>> I'm not sure what you're saying. The male singer's cues are in blue, the
>> female singer's are in red and the part sung together is in green. Are
>> you saying that the last cue should be made into two cues, or something
>> else?
> I would just have the three be labeled as three different voices. (I
> thought you were referring to two people saying two different things on
> the screen at the same time, which would be two cues.)
>> > > I would prefer if voices could be mixed, as such:
>> > >
>> > > 00:01.000 --> 00:02.000
>> > > <1> Speaker 1
>> > >
>> > > 00:03.000 --> 00:04.000
>> > > <2> Speaker 2
>> > >
>> > > 00:05.000 --> 00:06.000
>> > > <1><2> Speaker 1+2
>> >
>> > What's the use case?
>> To use a different style for the cues that are sung together, so that
>> you know when it's your turn to sing.
> It's not clear whether multiple voices is really necessary. Can't you
> just do (using the new syntax):
>
>  00:01.000 --> 00:02.000
>  <v Bob> Speaker 1
>
>  00:03.000 --> 00:04.000
>  <v Jim> Speaker 2
>
>  00:05.000 --> 00:06.000
>  <v Bob and Jim> Speaker 1+2
>
> ...where "Bob and Jim" is a third name?

Sure, one could, but the new syntax/parsing also allows <v Bob><v Jim>  
Speaker 1+2, which is what I requested.

Using this syntax, I would expect some confusion when you omit the closing  
</v>, when it's *not* a cue spoken by two voices at the same time, such as:

<v Jim>- Boo!
<v Bob>- Gah!

Gah! is spoken by both Jim and Bob, but that was likely not intended. If  
this causes confusion, we should make validators warn about multiple  
voices with no closing </v>.
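
To make the pitfall concrete, here is a minimal sketch of how nested voices accumulate when </v> is omitted. It is not the spec's parsing algorithm: it only assumes that <v Name> opens a voice and </v> closes the most recently opened one, and the function name voices_per_text is hypothetical.

```python
import re

# Matches <v Name> (opening) and </v> (closing); other tags are not handled.
VOICE_TAG = re.compile(r"<(/?)v(?:\s+([^>]*))?>")

def voices_per_text(cue_text):
    """Return a list of (text, active_voices) pairs for one cue."""
    result = []
    stack = []  # voices currently open, outermost first
    pos = 0
    for match in VOICE_TAG.finditer(cue_text):
        text = cue_text[pos:match.start()].strip()
        if text:
            result.append((text, list(stack)))
        if match.group(1):          # closing tag: drop the innermost voice
            if stack:
                stack.pop()
        else:                       # opening tag: push the new voice
            stack.append((match.group(2) or "").strip())
        pos = match.end()
    tail = cue_text[pos:].strip()
    if tail:
        result.append((tail, list(stack)))
    return result
```

Under these assumptions, the cue above attributes "- Gah!" to both Jim and Bob, while adding </v> after each line attributes it to Bob alone. A validator could flag any text run with more than one active voice where no </v> appears in the cue.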

>> > On Tue, 24 Aug 2010, Philip Jägenstedt wrote:
>> > >
>> > > Here's the SRT research I promised:
>> > > http://blog.foolip.org/2010/08/20/srt-research/
>> >
>> > Awesome! Thanks for this.
>> >
>> > Addressing points in the same order:
>> >
>> > - charset: resolved by introducing a charset override.
>> Oh well, that's better than sniffing the encoding or trusting
>> Content-Type I guess.
> Based on the additional research you and others provided, I've removed
> the charset="" attribute again, and given up on the idea of supporting
> legacy content unmodified.

One charset to rule them all, excellent!

>> > - blank lines not separating cues: I couldn't find a client that
>> >   supported missing the blank line, so I didn't support that. It's a
>> >   small number of files, and a small number of cues within those
>> >   files, I presume, so I'm not too worried.
>> Indeed, I couldn't find one either, the players I tested instead
>> rendered the timing line and following cue text together with the
>> previous cue, just like a WebSRT implementation would. What we could do
>> to slightly improve the situation is to make --> invalid in the cue
>> text, so that validators could warn about this. That would require
>> adding a &gt; escape for >, so I'm not sure it's worth it. Perhaps
>> validators could warn about it regardless of the spec.
> Certainly if the rest of the line matches a timing line, a warning
> wouldn't be a bad idea, but I don't know that it should be invalid.

OK, this is something for validators, I agree.
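
A validator warning of this kind could look roughly like the following sketch. The timestamp pattern is an assumption based on the examples in this thread (optional hours, two-digit minutes and seconds, three fractional digits, comma or dot as separator), and the name suspicious_lines is hypothetical.

```python
import re

# Assumed timestamp shape: [hours:]mm:ss.fff, also accepting "," as separator.
TIMESTAMP = r"(?:\d+:)?\d{2}:\d{2}[.,]\d{3}"
TIMING_LINE = re.compile(rf"^\s*{TIMESTAMP}\s*-->\s*{TIMESTAMP}")

def suspicious_lines(cue_text_lines):
    """Return indices of cue text lines that look like a timing line.

    Such a line most likely means the blank line separating two cues
    is missing, so a validator might warn without treating it as an error.
    """
    return [
        index
        for index, line in enumerate(cue_text_lines)
        if TIMING_LINE.match(line)
    ]
```

A bare "-->" inside ordinary text does not trigger the warning, which keeps it valid in cue text as suggested above.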

>> For captions and subtitles
>> it's less common, but rendering it underneath the video rather than on
>> top of it is not uncommon, e.g.
>> http://nihseniorhealth.gov/video/promo_qt300.html or
> Conceptually, that's in the video area, it's just that the video isn't
> centered vertically. I suppose we could allow UAs to do that pretty
> easily, if it's commonly desired.

It's already possible to align the video to the top of its content box  
using <http://dev.w3.org/csswg/css3-images/#object-position>:

video { object-position: center top }

(This is already supported in Opera, but prefixed: -o-object-position)

>> > On Tue, 24 Aug 2010, Silvia Pfeiffer wrote:
>> > >
>> > > I believe [SVG etc] will be [added to WebSRT]. But since we are only
>> > > looking at the ways in which captions and subtitles are used
>> > > currently, we haven't accepted this as an important use case, which
>> > > is fair enough. I am considering likely future use though, which is
>> > > always hard to argue.
>> >
>> > In all my research for subtitles, I found very few cases of anything
>> > like this. Even DVDs, whose subtitle tracks are just hardcoded bitmap
>> > images, don't do anything fancy with them... just plain text and
>> > italics, generally. Why haven't people started doing fancy stuff with
>> > subtitles in all the years that we've had TVs? It's not like they
>> > can't do it.
>> SVG on the TV? All that was possible was teletext type graphics and
>> indeed, people did a lot of graphics there, e.g.
>> http://www.google.com.au/images?q=teletext .
> Not in subtitles though.

Note that in Sweden captioning for the HoH is delivered via the teletext  
system, which would allow ASCII-art to be displayed. Still, I've never  
seen it. The only case of graphics being used in "subtitles" I can  
remember ever seeing is the DVD of  
<http://en.wikipedia.org/wiki/Cat_Soup>, where the subtitle system is  
(ab)used to overlay some graphics.

> On Fri, 10 Sep 2010, Philip Jägenstedt wrote:
>> Not being convinced we need anything more than simple key-value headers
>> in a header, I still looked at the options for comments:
>> Making any line with a --> in it be a comment would hide a lot of broken
>> cues from validators, so I think we shouldn't do this.
>> ; appears at the beginning of lines in 15/10000 files and most don't
>> look like they're intended as comments.
>> # appears at the beginning of lines in 244/10000 files and most don't
>> look like they're intended as comments.
>> /* only appears in 3/10000 files, so CSS-style comments might work, but
>> they do add some complexity.
>> // appears at the beginning of lines in 5/10000 files and most look like
>> they *are* intended as comments or are garbage, so it should work.
>> (data from OpenSubtitles sample)
> Thanks for the data.
> If we're going to change the spec to require a magic header, then the
> legacy content is of limited importance at this point, so I guess it
> doesn't really matter. We can add any kind of backwards-compatible new
> syntax in the future.

If we ever want comments, we need to add support in the parser before any  
content accidentally uses the syntax, in other words pretty soon now.
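
For anyone wanting to repeat this kind of survey on their own sample, here is a minimal sketch of the counting. It is not the script used for the OpenSubtitles numbers quoted above; the function name and the choice to count only line-initial occurrences for every prefix are assumptions.

```python
from collections import Counter

# Candidate comment prefixes discussed in the quoted mail.
CANDIDATES = (";", "#", "/*", "//")

def files_with_prefix(files):
    """Count how many files contain at least one line starting with each prefix.

    `files` is an iterable of file contents, already decoded to str.
    """
    counts = Counter()
    for text in files:
        lines = text.splitlines()
        for prefix in CANDIDATES:
            if any(line.startswith(prefix) for line in lines):
                counts[prefix] += 1
    return counts
```

Each file is counted at most once per prefix, which matches the "N/10000 files" style of the figures above.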

> On Mon, 13 Sep 2010, Philip Jägenstedt wrote:
>> OK, I think we mostly agree. Any default will sometimes be wrong, so to
>> not have to choose between subtitles and captions, I'd still really
>> prefer if specific HoH-tags like <sound> can be shown or hidden
>> depending on user preference. I think that would lead to more content
>> actually being written for HoH users, as it doesn't require
>> maintaining 2 different files.
> My main concern with this is that it's not been done before. Given the
> apparent simplicity of the feature, it seems that there must be some
> reason for this. One reason might be that it's never come up before.
> Another reason might be that captions and subtitles are different in
> subtle ways that make this inappropriate in practice. I think before we
> add this feature, we should try to understand the history here.
> Note that even if we were able to point to a single file for both
> subtitles and captions, we'd still have to specify them separately in the
> markup, so that the UI could correctly identify the two options without
> having to download and parse the two tracks first. So I'm not sure it
> would actually help with the problem in question (the default kind).

Agreed, some experimentation in this area would be useful before spec'ing  

> On Tue, 14 Sep 2010, Anne van Kesteren wrote:
>> Apart from text/plain I cannot think of a "web" text format that does
>> not have comments.
> But what's the use case? Is it really useful to have comments in a
> subtitle file?

Being able to put licensing/contact information at the top of the file  
would be useful, just as it is in JavaScript/CSS.

> On Fri, 22 Oct 2010, Simon Pieters wrote:
>> > It can still be inspired by it though so we don't have to change much.
>> > I'd be curious to hear what other things you'd clean up given the
>> > chance.
>> WebSRT has a number of quirks to be compatible with SRT, like supporting
>> both comma and dot as decimal separators, the weird parsing of
>> timestamps, etc.
> I've cleaned the timestamp parsing up. I didn't see others.

I consider the cue id line (the line preceding the timing line) to be  
cruft carried over from SRT. Now that we have both classes and the  
possibility of getting a cue by index, why do we need it?

I suggest making it invalid and having the parser ignore it. Add it back  
only if people actually ask for it.

> On Tue, 5 Oct 2010, Philip Jägenstedt wrote:
>> At the Open Subtitles Design Summit, there was some discussion about
>> captioning for the HoH. I've already put this input into a related bug
>> [2], but to summarize: The default rendering for the voices syntax
>> should probably be to prefix the text cue with the name of the speaker,
>> not to do anything funny with colors or positioning. What's less clear
>> is if it's annoying to always prefix with the speaker, or if it should
>> be done only to disambiguate.
>> [2] http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320
> The bug is about the syntax, not the rendering.
> The syntax suggestion makes sense, and I've updated the spec accordingly.
> I'm not sure what to do with the rendering. Could you elaborate on why we
> would show the speaker names? Surely it would be better to let authors
> explicitly put in the speaker names in the text of the cue if that's what
> they want.

Having watched some more captions since the previous mail, I've noticed  
that when the name is prefixed, it isn't always prefixed on all cues,  
sometimes only when it's needed to avoid ambiguity. I agree that no  
special default style is needed here; if one wants it, it's now easy to do  
with CSS.

>> * Note that certain ale5000-SRT syntax is not part of WebSRT, so that
>> one doesn't have to debug the parsing algorithm to learn that.
> Could you suggest some text for this? It's unclear to me exactly what
> would be helpful here. In fact, noting that the language is no longer
> called "SRT", is this still necessary?

Changing the name is quite enough to avoid confusion about this.

>> There was also some discussion about metadata. Language is sometimes
>> necessary for the font engine to pick the right glyph.
> Could you elaborate on this? My assumption was that we'd just use CSS,
> which doesn't rely on language for this.

It's not in any spec that I'm aware of, but some browsers (including  
Opera) pick different glyphs depending on the language of the text, which  
really helps when rendering CJK when you have several CJK fonts on the  
system. Browsers will already know the language from <track srclang>, so  
this would be for external players.

>> License is also an often requested piece of metadata.
> That would get solved if we allowed comments, which is how it's solved in
> JS (CSS and HTML don't usually get licensed for some reason). I've punted
> on this for now, so as to not make too many parsing changes at once.

Again, comments need to be added before there's any content accidentally  
using the syntax.

>> Finally, some things I think are broken in the current WebSRT parser:
>> * Parsing of timestamps is more liberal than it needs to be. In
>> particular, treating the part after the decimal separator as an integer
>> and dividing by 1000 leads to 00:00:00.1 being interpreted as 0.001
>> seconds, which is weird. This is what e.g. VLC does, but if we need to
>> add a header we could just as well change this to make it more sane.
>> Alternatively, if we want to really align with C implementations using
>> scanf, we should also handle negative numbers (00:01:-5,000 means 55
>> seconds), octal and hexadecimal.
> Fixed.

Allowing the hours to be any number of digits made it more complex than  
necessary though, see  
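
To illustrate the timestamp issue quoted above, here is a minimal sketch contrasting the two readings of the fractional part, plus the scanf-style arithmetic behind the negative-seconds example. The function names are hypothetical and none of this is the spec's algorithm.

```python
def fraction_as_integer(fraction_digits):
    """Legacy reading: treat the digits as integer milliseconds."""
    return int(fraction_digits) / 1000

def fraction_as_decimal(fraction_digits):
    """Decimal reading: treat the digits as the fractional part of a second."""
    return float("0." + fraction_digits)

def scanf_style_seconds(hours, minutes, seconds, millis):
    """Mimic a parser that reads each field as a signed integer."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

# "00:00:00.1": the two readings disagree for fewer than three digits.
assert fraction_as_integer("1") == 0.001
assert fraction_as_decimal("1") == 0.1
# With exactly three digits they agree.
assert fraction_as_integer("100") == fraction_as_decimal("100")
# "00:01:-5,000" comes out as 55 seconds under signed-integer parsing.
assert scanf_style_seconds("00", "01", "-5", "000") == 55.0
```

Requiring exactly three fractional digits in the syntax sidesteps the disagreement entirely.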

>> * The "bad cue" handling is stricter than it should be. After collecting
>> an id, the next line must be a timestamp line. Otherwise, we skip
>> everything until a blank line, so in the following the parser would jump
>> to "bad cue" on line "2" and skip the whole cue.
>> 1
>> 2
>> 00:00:00.000 --> 00:00:01.000
>> Bla
>> This doesn't match what most existing SRT parsers do, as they simply
>> look for timing lines and ignore everything else. If we really need to
>> collect the id instead of ignoring it like everyone else, this should be
>> more robust, so that a valid timing line always begins a new cue.
>> Personally, I'd prefer if it is simply ignored and that we use some form
>> of in-cue markup for styling hooks.
> The IDs are useful for referencing cues from script, so I haven't removed
> them. I've also left the parsing as is for when neither the first nor
> second line is a timing line, since that gives us a lot of headroom for
> future extensions (we can do anything so long as the second line doesn't
> start with a timestamp and "-->" and another timestamp).

In the case of feeding future extensions to current parsers, it's way  
better fallback behavior to simply ignore the unrecognized second line  
than to discard the entire cue. The current behavior seems unnecessarily  
strict and makes the parser more complicated than it needs to be. My  
preference is to just ignore anything preceding the timing line, but even if  
we must have IDs it can still be made simpler and more robust than what is  
currently spec'ed.
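
The simpler behavior I have in mind can be sketched as follows: any valid timing line begins a new cue, and lines that precede it (ids or unrecognized future syntax) are ignored. This is an illustration, not a proposal for exact spec text; the timestamp pattern and the name parse_cues_leniently are assumptions.

```python
import re

# Assumed timestamp shape: [hours:]mm:ss.fff, also accepting "," as separator.
TS = r"(?:\d+:)?\d{2}:\d{2}[.,]\d{3}"
TIMING = re.compile(rf"^({TS})\s+-->\s+({TS})")

def parse_cues_leniently(text):
    """Return a list of cues as dicts with 'start', 'end' and 'text' keys."""
    cues = []
    current = None  # cue currently collecting text lines, if any
    for line in text.splitlines():
        match = TIMING.match(line)
        if match:
            # A timing line always starts a new cue.
            current = {"start": match.group(1), "end": match.group(2), "text": []}
            cues.append(current)
        elif not line.strip():
            # A blank line ends the current cue.
            current = None
        elif current is not None:
            current["text"].append(line)
        # Otherwise the line precedes a timing line and is ignored.
    return cues
```

With the "1", "2", timing line, "Bla" example quoted above, this yields one cue with the text "Bla" instead of discarding it.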

Philip Jägenstedt
Core Developer
Opera Software
