[whatwg] Timed tracks: feedback compendium

Fri Dec 24 22:39:02 PST 2010

Summary of major changes:

 + I've changed the selector syntax for styling cues in CSS to use full 
   selectors rather than the earlier shallow syntax.
 + I've changed the Voice syntax to allow multiple voices per cue and
   to use names rather than numbers for voices.
 + I've added a <c> cue span construct to which classes can be applied,
   to allow greater styling control.
 + I've also allowed classes to be specified on all other constructs in 
   the timed track language. It's a terser syntax than HTML:
    <c.sfx>Boom</c>
    <v Hippo Hero>Hello Sir. I'm dropping now.</v>
    <v Policeman><b.loud>The road is <i.stress>that</i> way!</b>
    <c.credit.author>Written by Foo</c> <c.credit.editor>Edited by Bar</c>
 + I've made <track> have a feature whereby a track can be enabled by 
   default so that users who would otherwise not have any tracks enabled 
   will get it enabled (without overriding the preferences of users who 
   would have some other track enabled by default).
 + I've added some non-normative text to <track>'s definition of kind=""
   to explain the implications of the attribute.
 + I've renamed WebSRT to WebVTT after receiving feedback from people 
   regarding two separate issues: one, on this list, that the spec isn't 
   really compatible with legacy SRT and that making it fully compatible 
   would likely not be a good direction to go in, so we should have a name 
   that reflects that, and two, in private, that the name "SRT" has 
   negative connotations with certain content companies and that changing 
   the name would be a trivial way of increasing the likely reach of the 
   technology. WebVTT stands for "Web Video Text Track format".
 + I've also updated the extension and MIME type of the format to .vtt and 
   text/vtt respectively.
 + I've added a magic string that is required on the format to make it 
   recognisable in environments with no or unreliable type labeling.
 + I've required that content after the magic string and before the first 
   blank line be ignored, for future extensibility.
 + I've dropped the charset="" attribute on <track> since it was only 
   needed to support legacy SRT files.
 + I've removed aspects of the parser that were only useful for parsing 
   legacy SRT files (specifically, the complex timestamp parsing).
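
To illustrate the last few points, here is a rough sketch of how a consumer might check the magic string and skip the header. It is not spec text: it assumes the magic string is "WEBVTT", and the function name split_header is made up for the example.

```python
def split_header(text, magic="WEBVTT"):
    """Return (header_lines, body); raise ValueError if the magic string is absent.

    `magic` is an assumption for illustration, not taken from this e-mail.
    """
    # Normalise line endings so that a blank line is always "\n\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop a leading byte order mark if one is present.
    if text.startswith("\ufeff"):
        text = text[1:]
    if not text.startswith(magic):
        raise ValueError("missing magic string")
    # Everything before the first blank line is header material to be ignored.
    head, _, body = text.partition("\n\n")
    return head.split("\n"), body


sample = "WEBVTT\nsome future header\n\n00:01.000 --> 00:02.000\nHello\n"
header, body = split_header(sample)
```

Here header holds the magic string line plus anything reserved for future use, and body starts at the first cue.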

On Wed, 24 Nov 2010, Eric Winkelman wrote:
>
> I'm investigating how TimedTracks can be used for in-band-data-tracks 
> within MPEG transport streams (used for cable television).
> 
> In this format, the number and types of in-band-data-tracks can change 
> over time.  So, for example, when the programming switches from a 
> football game to a movie, an alternate language track may appear that 
> wasn't there before.  Later, when the programming changes again, that 
> language track may be removed.
> 
> It's not clear to me how these changes are exposed by the proposed Media 
> Element events.
> 
> The "loadedmetadata" event is used to indicate that the TimedTracks are 
> ready, but it appears that it is only fired before playback begins.  Is 
> this event fired again whenever a new track is discovered?  Is there 
> another event that is intended for this situation?
> 
> Similarly, is there an event that indicates when a track has been 
> removed?  Or is this also handled by the "loadedmetadata" event somehow?

Is the number of text timed tracks in such a situation capped to a 
particular maximum, or is it truly unbounded?

On Fri, 5 Nov 2010, Bruce Lawson wrote:
>
> http://www.whatwg.org/specs/web-apps/current-work/complete/video.html#sourcing-in-band-timed-tracks 
> says to create TimedTrack objects etc for in-band tracks which are then 
> exposed in the API - so captions/subtitles etc that are contained in the 
> media container file are exposed, as well as those tracks pointed to by 
> the <track> element.
> 
> But 
> http://www.whatwg.org/specs/web-apps/current-work/complete/video.html#timed-track-api 
> implies that the array is only of tracks in the track element:
> 
> "media . tracks . length
> 
> Returns the number of timed tracks associated with the media element 
> (e.g. from track elements). This is the number of timed tracks in the 
> media element's list of timed tracks."

I don't understand why you interpret this as implying anything about the 
track element. Are you interpreting "e.g." as "i.e."?

> Suggestion: amend to say "Returns the number of timed tracks associated 
> with the media element (e.g.  from track elements and any in-band track 
> files inside the media container file)" or some such.

I'd rather avoid talking about the in-band ones here, in part because I 
think it's likely to confuse authors at least as much as help them, and in 
part because the terminology around in-band timed tracks is a little 
unclear to me and so I'd rather not talk about them in informative text. :-)

If you disagree, though, let me know. I can find a way to make it work.

On Thu, 7 Oct 2010, James Graham wrote:
> 
> One more from me: the spec is unusually hard to follow here since it 
> makes extensive use of goto for flow control. Could it not be 
> restructured as a state machine or something so it is easier to follow 
> what is going on?

I'm happy to make it easier to read, if you have any concrete suggestions 
for how to do it. In general I have found that structured loops translate 
very poorly to English. I try to keep my "gotos" equivalent to structured 
loops so that they turn into sane code, FWIW.

On Wed, 8 Sep 2010, Sam Dutton wrote:
> >>
> >> Also -- is trackgroup out of the spec?
> > 
> > What is trackgroup?
> 
> I'd seen this in the Media TextAssociations documentation:
> 
> http://www.w3.org/WAI/PF/HTML/wiki/Media_TextAssociations#Examples

The feature is mostly there, it's just expressed differently (it's done in 
a way similar to how <link> works: you specify all the relevant attributes 
on each <track>).

On Wed, 8 Sep 2010, Philip Jägenstedt wrote:
>
> In the discussion on public-html-a11y <trackgroup> was suggested to 
> group together mutually exclusive tracks, so that enabling one 
> automatically disables the others in the same trackgroup.
> 
> I guess it's up to the UA how to enable and disable <track>s now, but 
> the only option is making them all mutually exclusive (as existing 
> players do) or a weird kind of context menu where it's possible to 
> enable and disable tracks completely independently. Neither option is 
> great, but as a user I would almost certainly prefer all tracks being 
> mutually exclusive and requiring scripts to enable several at once.

It's not clear to me what the use case is for having multiple groups of 
mutually exclusive tracks.

The intent of the spec as written was that a browser would by default just 
have a list of all the subtitle and caption tracks (the latter with 
suitable icons next to them, e.g. the [CC] icon in US locales), and the 
user would pick one (or none) from the list. One could easily imagine a UA 
allowing the user to enable multiple tracks by having the user ctrl-click 
a menu item, though, or some similar solution, much like with the commonly 
seen select box UI.

> > On Fri, 6 Aug 2010, Philip Jägenstedt wrote:
> > > 
> > > I'm not particularly fond of the current voice markup, mainly for 2 
> > > reasons:
> > > 
> > > First, a cue can only have 1 voice, which makes it impossible to 
> > > style cues spoken/sung simultaneously by 2 or more voices. There's a 
> > > karaoke example of this in 
> > > <http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_voices>
> > 
> > That's just two cues.
> 
> I'm not sure what you're saying. The male singer's cues are in blue, the 
> female singer's are in red and the part sung together is in green. Are 
> you saying that the last cue should be made into two cues, or something 
> else?

I would just have the three be labeled as three different voices. (I 
thought you were referring to two people saying two different things on 
the screen at the same time, which would be two cues.)

> > > I would prefer if voices could be mixed, as such:
> > > 
> > > 00:01.000 --> 00:02.000
> > > <1> Speaker 1
> > > 
> > > 00:03.000 --> 00:04.000
> > > <2> Speaker 2
> > > 
> > > 00:05.000 --> 00:06.000
> > > <1><2> Speaker 1+2
> > 
> > What's the use case?
> 
> To use a different style for the cues that are sung together, so that 
> you know when it's your turn to sing.

It's not clear whether multiple voices are really necessary. Can't you just 
do (using the new syntax):

 00:01.000 --> 00:02.000
 <v Bob> Speaker 1

 00:03.000 --> 00:04.000
 <v Jim> Speaker 2

 00:05.000 --> 00:06.000
 <v Bob and Jim> Speaker 1+2

...where "Bob and Jim" is a third name?

> > > Second, it makes it impossible to target a smaller part of the cue 
> > > for styling. We have <i> and <b>, but there are also cases where 
> > > part of the cue should be in a different color, see 
> > > <http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_colors>
> > 
> > Well you can always restyle <i> or <b>.
> 
> That would be quite an abuse of <i> and <b> and would give bogus 
> italics/bold text in standalone players.

I'm not sure I'd call it bogus, but yes.

I've added a feature to enable such styling, though (<c> and classes).

> Yes, that would be better than numerical voices IMO. Unless there's a 
> very good reason for making voices always apply to the whole cue, could 
> we not use the same parsing for voices and other tags (i, b, ruby, rt)?

I've changed it to <v speaker name>.
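
With that change a voice is just another tag, so one tokenizer can cover <v>, <c>, <i>, <b> and friends. The following is only a sketch of the idea, not the parser in the spec: the regular expression and the name parse_tags are assumptions made for the example, and it only recognises the tags, not the text between them.

```python
import re

# <name.class1.class2 annotation> or </name>
TAG = re.compile(r"<(/?)([a-z]+)((?:\.[^\s.>]+)*)(?:\s+([^>]*))?>")


def parse_tags(cue_text):
    """Return a list of (is_end_tag, name, classes, annotation) tuples."""
    tags = []
    for match in TAG.finditer(cue_text):
        is_end_tag = match.group(1) == "/"
        name = match.group(2)
        # ".credit.author" becomes ["credit", "author"]
        classes = [c for c in match.group(3).split(".") if c]
        # For <v>, the annotation is the speaker's name.
        annotation = (match.group(4) or "").strip()
        tags.append((is_end_tag, name, classes, annotation))
    return tags


tags = parse_tags("<v Policeman><b.loud>The road is <i.stress>that</i> way!</b>")
```

The first tuple carries "Policeman" as its annotation, which is the value a styling layer could expose as the voice name.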

> Ideally, the CSS extensions 
> (http://wiki.whatwg.org/wiki/Timed_tracks#CSS_extensions) should also 
> work the same for voices and tags, using the normal child selectors 
> would work. Something like video::cue(narrator > i) to style the 
> following cue:
> 
> 00:01.000 --> 00:02.000
> <narrator><i>The story begins
> 
> I'm not sure what constraints CSS syntax puts on the prefix for custom 
> voices, is : safe? Other options might be <@philip> (Twitter style) or 
> <-philip> (vendor prefix style).

This:

  00:01.000 --> 00:02.000
  <v narrator><i>The story begins</i>

...can now be styled as follows:

  ::cue-part([voice="narrator"] > i) { ... }

That is, the name of the speaker is exposed as a "voice" attribute on the 
"v" node. (Thanks to Tab for that idea.)

> > On Tue, 24 Aug 2010, Philip Jägenstedt wrote:
> > > 
> > > Here's the SRT research I promised:
> > > http://blog.foolip.org/2010/08/20/srt-research/
> > 
> > Awesome! Thanks for this.
> > 
> > Addressing points in the same order:
> > 
> > - charset: resolved by introducing a charset override.
> 
> Oh well, that's better than sniffing the encoding or trusting 
> Content-Type I guess.

Based on the additional research you and others provided, I've removed the 
charset="" attribute again, and given up on the idea of supporting legacy 
content unmodified.

> > - blank lines not separating cues: I couldn't find a client that
> >   supported missing the blank line, so I didn't support that. It's a
> >   small number of files, and a small number of cues within those files,
> >   I presume, so I'm not too worried.
> 
> Indeed, I couldn't find one either, the players I tested instead 
> rendered the timing line and following cue text together with the 
> previous cue, just like a WebSRT implementation would. What we could do 
> to slightly improve the situation is to make --> invalid in the cue 
> text, so that validators could warn about this. That would require 
> adding a &gt; escape for >, so I'm not sure it's worth it. Perhaps 
> validators could warn about it regardless of the spec.

Certainly if the rest of the line matches a timing line, a warning 
wouldn't be a bad idea, but I don't know that it should be invalid.
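
A validator check along those lines might look like the following. It is only a sketch: the timestamp pattern (mm:ss.mmm with optional hours) is my assumption of what a timing line looks like, and suspicious_lines is a made-up name.

```python
import re

# Assumed timestamp shape: optional hours, then mm:ss.mmm
TIMESTAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
TIMING_LINE = re.compile(
    r"^\s*" + TIMESTAMP + r"\s+-->\s+" + TIMESTAMP + r"(?:\s.*)?$"
)


def suspicious_lines(cue_text_lines):
    """Return the indexes of cue text lines that look like timing lines.

    Such a line usually means the blank line between two cues is missing.
    """
    return [
        index
        for index, line in enumerate(cue_text_lines)
        if TIMING_LINE.match(line)
    ]


warnings = suspicious_lines(["Hello", "00:03.000 --> 00:04.000", "a --> b"])
```

A bare "-->" inside ordinary text is left alone; only a line that matches a whole timing line is flagged.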

> > - overlapping cues: supporting these is pretty important, so files with
> >   overlapping cues will just have some weird artefacts on playback.
> 
> OK, tools to fix SRT timings already exist, so I guess this is 
> manageable.

This would be something that an SRT-to-VTT converter could support.
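
For instance, such a tool could trim each cue so that it ends no later than the next one starts. A minimal sketch, assuming the cues have already been parsed into (start, end, text) tuples in seconds and sorted by start time (remove_overlaps is a hypothetical name):

```python
def remove_overlaps(cues):
    """Return a copy of `cues` in which no cue outlasts the start of the next.

    `cues` is a list of (start, end, text) tuples sorted by start time.
    """
    fixed = []
    for index, (start, end, text) in enumerate(cues):
        if index + 1 < len(cues):
            next_start = cues[index + 1][0]
            # Clamp the end time so the two cues no longer overlap.
            end = min(end, next_start)
        fixed.append((start, end, text))
    return fixed


fixed = remove_overlaps([(1.0, 3.0, "a"), (2.5, 4.0, "b"), (5.0, 6.0, "c")])
```

Only accidental overlaps in legacy files need this treatment; intentionally overlapping cues should be left as they are.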

> > On Wed, 25 Aug 2010, Philip Jägenstedt wrote:
> > > 
> > > The main reason to care about the MIME type is some kind of "doing 
> > > the right thing" by not letting people get away with misconfigured 
> > > servers. Sometimes I feel it's just a waste of everyone's time 
> > > though, it would generally be less work for both browsers and 
> > > authors to not bother.
> > 
> > Agreed. Not sure what to do for WebSRT though, since there's no good 
> > way to recognise a WebSRT file as opposed to some other format.
> 
> In a <track> context, ignoring Content-Type is certainly the simplest 
> and removes the need to require any specific file extension for local 
> use. Sniffing isn't really an issue since in a top-level context you 
> can't do much of anything interesting with SRT except display it as text 
> (which text/plain would achieve).

Having given up on the goal of strict backwards compatibility, I've added 
a magic string which can help with identification.

On Thu, 9 Sep 2010, Silvia Pfeiffer wrote:
> 
> [...] some text cues will be fairly long and thus certain users cannot 
> read them within the allocated time for the cue. So, making a 
> pauseOnExit() available is a good thing for accessibility.

I would recommend that as a user interface feature; I don't think it makes 
sense to use pauseOnExit() for this.

(Note: The attribution for the next few pages of quotes is incomplete. I 
apologise for not fully citing everyone.)

> > > > On Fri, 31 Jul 2009, Silvia Pfeiffer wrote:
> > > > >
> > > > > * It is unclear, which of the given alternative text tracks in 
> > > > > different languages should be displayed by default when loading 
> > > > > an <itext> resource. A @default attribute has been added to the 
> > > > > <itext> elements to allow for the Web content author to tell the 
> > > > > browser which <itext> tracks he/she expects to be displayed by 
> > > > > default. If the Web author does not specify such tracks, the 
> > > > > display depends on the user agent (UA - generally the Web 
> > > > > browser): for accessibility reasons, there should be a field 
> > > > > that allows users to always turn display of certain <itext> 
> > > > > categories on. Further, the UA is set to a default language and 
> > > > > it is this default language that should be used to select which 
> > > > > <itext> track should be displayed.
> > > >
> > > > It's not clear to me that we need a way to do this; by default 
> > > > presumably tracks would all be off unless the user wants them, in 
> > > > which case the user's preferences are paramount. That's what I've 
> > > > specced currently. However, it's easy to override this from 
> > > > script.
> > >
> > > It seems to me that this is much like <video autoplay> in that if we 
> > > don't provide a markup solution, everyone will use scripts and it 
> > > will be more difficult for the UA to override with user prefs.
> >
> > What would we need for this then? Just a way to say "by the way, in 
> > addition to whatever the user said, also turn this track on"? Or do we 
> > need something to say "by default, override the user's preferences for 
> > this video and instead turn on this track and turn off all others"? Or 
> > something else? It's not clear to me what the use case is where this 
> > would be useful declaratively.
> 
> You have covered all the user requirements and that is good. They should 
> dominate all other settings. But I think we have neglected the authors. 
> What about tracks that the author has defined and wants activated by 
> default for those users that don't have anything else specified in their 
> user requirements? For example, if an author knows that the audio on 
> their video is pretty poor and they want the subtitles to be on by 
> default (because otherwise a user may miss that they are available and 
> they may miss what is going on), then currently they have to activate it 
> with script.

Ah, so not the two options I listed, but instead "if the user's preference 
is to not have any captions showing, then instead, show this caption"? I 
guess that makes sense. I've added such a feature.
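
Roughly, the intended behaviour is as follows. This is a sketch rather than the algorithm in the spec: the Track class, the pick_track function, and the use of a single preferred language as a stand-in for the user's preferences are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:
    kind: str
    language: str
    default: bool = False  # the author asked for this track to be on by default


def pick_track(tracks: List[Track], preferred_language: Optional[str] = None) -> Optional[Track]:
    """Return the track to enable, or None if no track should be enabled."""
    # The user's preference wins whenever it selects a track.
    if preferred_language is not None:
        for track in tracks:
            if track.language == preferred_language:
                return track
    # Otherwise nothing would be enabled, so fall back to the author's default.
    for track in tracks:
        if track.default:
            return track
    return None


tracks = [Track("subtitles", "en", default=True), Track("subtitles", "fr")]
chosen_without_preference = pick_track(tracks)
chosen_with_preference = pick_track(tracks, "fr")
```

The author's default only ever fills the gap where the user would otherwise see nothing; it never displaces a track the user's preferences selected.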

> A user whose preferences are not set will thus see this track. For a 
> user whose preferences are set, the browser will turn on the appropriate 
> tracks additionally or alternatively if there is a more appropriate 
> track in the same language (e.g. a caption track over the default 
> subtitle track). If we do this with script, will it not have the wrong 
> effect and turn off what the browser has selected, so is not actually 
> expressing author preferences, but is doing an author override?

Well there's no "not set" for preferences as far as I can tell -- you 
either prefer a particular track or you prefer no track at all. Either 
way, you have a preference. But yes, script would override that 
preference.

> > > > On Thu, 15 Apr 2010, Silvia Pfeiffer wrote:
> > > > >
> > > > > Further, SRT has no way to specify which language it is written 
> > > > > in
> > > >
> > > > What's the use case?
> > >
> > > As hints for font selection
> >
> > Are independent SRT processors really going to do per-language font 
> > selection? How do they do it today?
> 
> In VLC there is an "Advanced Open File..." option in which you can open a
> subtitle file with the video and set the following parameters:
> * FPS
> * delay
> * font size
> * subtitle alignment
> * subtitle text encoding which chooses the charset.

That would still be possible today. It doesn't seem that a language 
metadata field would help with the above though.

> > > and speech synthesis.
> >
> > Are independent SRT processors really going to do audio descriptions 
> > any time soon? I've only ever seen this in highly experimental 
> > settings.
> 
> Once this is usable in the Web context, accessibility people will jump 
> at this opportunity. It has not been possible before. You should see the 
> excitement I always get from blind people when I demonstrate the 
> Elephants Dream video with text audio descriptions. It will totally take 
> off.

It's not clear to me that language metadata would be of that much help for 
speech synthesis in that kind of scenario. It's like in HTML... people 
don't set it, set it wrong, etc, and at the end of the day, the user's 
likely to only have one or two languages he wants spoken and the UA can 
autodetect between them pretty easily.

For captions themselves, speech-synthesis isn't necessary (just listen to 
the original audio track). For subtitles, mixed language tracks would be 
very rare (since the whole point of subtitles is to translate the text 
into a single language).

> > > [...] the positioning of individual cues is still not controlled by 
> > > CSS but rather by e.g. L:50%.
> >
> > I considered this issue carefully when speccing WebSRT. My conclusion 
> > (after watching a lot more TV than I'm used to) was that in practice 
> > subtitle positioning is not strictly a presentational issue -- that 
> > is, you can't just swap one set of styles for another and have equally 
> > good results, you have to control the positioning on a per-cue basis 
> > regardless of the styling. This is because you have to avoid burnt-in 
> > text, or overlap burnt-in text, or because you need to align text with 
> > a speaker, or show which audio channel the text came from (e.g. for 
> > people talking off camera in a very directional sense), etc.
> 
> I agree. However, what stops us from specifying the positioning in CSS? 
> Why a new mechanism? The output of rendering the cues ends up as a set 
> of CSS boxes anyway.

It would be an abuse of CSS to use it for what is semantic data. The whole 
point of CSS is to provide optional switchable style sheets; if we put 
the positioning in CSS we're saying it's presentational.

> > > Alternatively, might it not be better to simply use the voice 
> > > "sound" for this and let the default stylesheet hide those cues? 
> > > When writing subtitles I don't want the maintenance overhead of 2 
> > > different versions that differ only by the inclusion of [doorbell 
> > > rings] and similar. Honestly, it's more likely that I just wouldn't 
> > > bother with accessibility for the HoH at all. If I could add it with 
> > > <sound>doorbell rings, it's far more likely I would do that, as long 
> > > as it isn't rendered by default. This is my preferred solution, then 
> > > keeping only one of kind=subtitles and kind=captions. Enabling the 
> > > HoH-cues could then be a global preference in the browser, or done 
> > > from the context menu of individual videos.
> >
> > I don't disagree with this, but I fear it might be too radical a step 
> > for the caption-authoring community to take at this point.
> 
> I think we have to get over the notion that the existing subtitling 
> community is our target for this format. In fact, the new subtitling 
> community are all the Web developers out there. They are the ones we 
> should target and for them we should make things easier.

I think that would be a rather arrogant position for us to take. 
Realistically, the people writing subtitles today are going to be a big 
part of the people writing subtitles tomorrow, whether that be in 
hobbyist communities or in commercial environments.

> > > If we must have both kind=subtitles and kind=captions, then I'd 
> > > suggest making the default subtitles, as that is without a doubt the 
> > > most common kind of timed text. Making captions the default only 
> > > means that most timed text will be mislabeled as being appropriate 
> > > for the HoH when it is not.
> >
> > Ok, I've changed the default. However, I'm not fighting this battle if 
> > it comes up again, and will just change it back if people don't defend 
> > having this as the default. (And then change it back again if the 
> > browsers pick "subtitles" in their implementations after all, of 
> > course.)
> >
> > Note that captions aren't just for users that are hard-of-hearing. 
> > Most of the time when I use timed tracks, I want captions, because the 
> > reason I have them enabled is that I have the sound muted.
> 
> Hmm, you both have good points. Maybe we should choose something as the 
> default that is not visible on screen, such as "descriptions"? That 
> would avoid the issue and make it explicit for people who provide 
> captions or subtitles that they have to make a choice.

Seems like it'd be better to have a default that at least some people are 
going to use. :-)

> > > > - Use existing technologies where appropriate.
> > > > [...]
> > > > - Try as much as possible to have things Just Work.
> > >
> > > I think by specifying a standalone cue text parser WebSRT fails on 
> > > these counts compared to reusing the HTML fragment parsing algorithm 
> > > for parsing cue text.
> >
> > HTML parsing is a disaster zone that we should avoid at all costs, 
> > IMHO. I certainly don't think it would make any sense to propagate 
> > that format into anywhere where we don't absolutely have to propagate 
> > it.
> 
> A WebSRT authoring application does not have to create all markup that a 
> HTML fragment parser supports. It would only use what it sees necessary 
> for the use cases that it targets.

It's the parsing ones I'm concerned about. The generating tools will have 
no problem outputting all kinds of complicated stuff without our help.

> Browsers are WebSRT players that will consume the HTML fragments created 
> by such authoring applications. In addition, browsers will also be able 
> to consume richer HTML fragments that were created as time-aligned 
> overlays for video with more fancy styling by Web developers. Something 
> like http://people.mozilla.com/~prouget/demos/vp8/ (you need Firefox for 
> it). Where it says "This movie will eat your planet", you could have 
> fancy timed text.
> 
> Just as much as there is a need for basic captions and subtitles, there 
> is also a need for fancy time-aligned HTML fragments. It would be very 
> strange if, in order to get that working, people would need to use the 
> "metadata" part of the WebSRT spec.

I don't think it would be strange. I think it would be completely 
reasonable for us to not handle those use cases at all, personally. That 
we provide hooks to enable it is a bonus.

> > > > If we don't use HTML wholesale, then there's really no reason to 
> > > > use HTML at all. (And using HTML wholesale is not really an 
> > > > option, as you say above.)
> > >
> > > I disagree. The most obvious way of reusing existing infrastructure 
> > > in browsers, the most obvious way of getting support for future 
> > > syntax changes that support attributes or new tag names and the most 
> > > obvious way to get error handling that behaves in the way the 
> > > appearance of the syntax suggests is to reuse the HTML fragment 
> > > parsing algorithm for parsing the cue text.
> >
> > HTML parsing is one of the most convoluted, quirk-laden, unintuitive 
> > and expensive syntaxes... Its extensibility story is a disaster 
> > (there's so many undocumented and continually evolving constraints 
> > that any addition is massively expensive), its implementation drags 
> > with it all kinds of crazy dependencies on the DOM, event loop 
> > interactions, scripting, and so forth, and it has a highly 
> > inconsistent syntax.
> >
> > I'm not at all convinced reusing it would be "obvious".
> 
> It is obvious to anyone who is not on a standards body. :-)
> 
> But seriously: all the things you mention above are advantages: all this 
> stuff has been solved for HTML and will not have to be solved again if 
> we reuse it.

The problems I raised haven't been solved at all! HTML is still 
convoluted, quirk-laden, unintuitive, expensive, with a bad extension 
story, with undocumented and continually evolving constraints, with crazy 
dependencies, and its syntax is still inconsistent.

> Anything new will inevitably go through a similar development path.

I am not at all convinced that such an outcome is inevitable nor that that 
should mean we should just jump straight to the bad end result.

> I don't see this as the opportunity to re-invent HTML when in fact for 
> anyone out there HTML is working just fine.

It's not working at all for timed tracks, it's working for documents. HTML 
is like a printing press, when what you need here is a telephone. It's 
simply not appropriate.

More important, consider the use cases. The use cases we have for timed 
tracks argue for syntax that makes it easy to set voices and that supports 
basic styling and in-cue timings, but do not argue for scripting, embedded 
plugins, videos, exposing a mutable cue DOM, or any such features. Yet if 
we use HTML, we'd have all of the latter, no in-cue timings, etc. 
Basically HTML simply doesn't match the use cases.

Now if there are other use cases then maybe we should do something 
different than what we have; that's why we studied use cases first. There 
are certainly features that we've explicitly not handled, e.g. the 
YouTube-style interactive annotations, and overlaid advertising. It's not 
clear that HTML would actually be the right way to address these either. 

For example, for advertising it's not just an interactive HTML frame that 
appears over the video for a fixed time -- it's generally a frame that 
appears at a fixed time and then stays until manually dismissed, and when 
dismissed typically gets replaced by a tiny button that brings it back. 
Also it tends to be marked in the timeline; HTML parsing doesn't give us 
that. And there's the whole issue that advertising is generally not 
considered optional by publishers, whereas timed tracks are explicitly 
intended to be removable by the user.

> On Sun, 25 Jul 2010, Silvia Pfeiffer wrote:
> > >
> > > I think if we have a mixed set of .srt files out there, some of 
> > > which are old-style srt files (with line numbers, without WebSRT 
> > > markup) and some are WebSRT files with all the bells and whistles 
> > > and with additional external CSS files, we create such a mess for 
> > > that existing ecosystem that we won't find much love.
> >
> > I'm not sure our goal is to find love here, but in general I would 
> > agree that it would be better to have one format than two. I don't see 
> > why we wouldn't just have one format here though. The idea of WebSRT 
> > is to be sufficiently backwards-compatible that that is possible.
> 
> With "finding love" I referred to your expressed goals:
>  - Keep implementation costs for standalone players low.
>  - Use existing technologies where appropriate.
>  - Try as much as possible to have things Just Work.
> 
> With WebSRT, we will have one label for two different types of files: 
> the old-style SRT files and the new WebSRT files. Just putting a single 
> label on them doesn't mean it is one format, in particular when most old 
> files will not be conformant to the new label and many new files will 
> not play in the software created for the old spec.

Fair enough. I've changed this.

> > > You mention that karaoke and lyrics are supported by WebSRT, so 
> > > could we add them to the track kinds?
> >
> > Why would they need new script kinds? Isn't "subtitles" enough?
> 
> Interesting idea.
> 
> This actually gets back to the issue that I have mentioned before: we are
> actually overloading the meaning of the @kind attribute with many different
> things:
> * what the data is semantically: subtitle, caption, textual description,
> chapters or "metadata" (i.e. "anything")
> * whether the data will be visually displayed
> * how the data will be parsed

Aren't the second and third bullet points the same as the first? I don't 
understand how they're distinct.

kind="" is just setting the kind of track: whether it's something that 
gets put on the video (subtitles, captions), or something that gets 
displayed as a jump list (chapters), etc.

> What if, from a semantic viewpoint, people want to have subtitles or 
> captions always show, but not karaoke or lyrics?

I don't understand the question. Can you elaborate?

> > We could provide an API dedicated to making it easier to render cues 
> > manually if desired (firing an event or callback with the actual cue 
> > for each cue that shows, for example).
> 
> I think that might be a good idea. How would you suggest? Is the 
> oncuechange not sufficient?

TimedTrack.oncuechange is sufficient, just not necessarily the most 
convenient solution we could use. It would be helpful to see exactly how 
people render cues before adding a better feature.

> > > And what if we wanted to render captions underneath a video rather 
> > > than inside video dimensions? Can that be achieved somehow?
> >
> > You'd need to script it, currently. (I didn't see many (any?) cases of 
> > this in my research, so I didn't provide a declarative solution.)
> 
> I've seen it done often on the Web, in particular for descriptions (or 
> timed transcripts) - it won't appear on TV or desktop caption 
> applications though, for obvious reasons.
> 
> For example, the descriptions on TED are rendered into a container that 
> is not overlaid onto the video: e.g. 
> http://www.ted.com/talks/dan_cobley_what_physics_taught_me_about_marketing.html 
> (click the interactive transcript on the right to display it).

As far as I can tell that's not subtitles, that's just a transcript with 
hyperlinks. You wouldn't do that with a subtitle format, it's just a 
series of <a> elements with onclick handlers to move the video's playback 
position. No? It's unclear to me how we could make a feature that 
supported this declaratively without making it so narrow in purpose that 
it would fail to hit the 80% bar.

> Or the interactive transcript on youtube is timed text that is not 
> rendered on top of the video but in a box underneath: e.g. 
> http://www.youtube.com/watch?v=nF3yhZrtLRw

The style seems pretty specific to YouTube, here. I'm not sure how we'd 
make a generic version of this.

In general I would view out-of-frame subtitles the same way as we view 
out-of-frame controls: we provide a default UI, and page authors are 
welcome to script the element to provide a site-specific experience.

> For captions and subtitles 
> it's less common, but rendering it underneath the video rather than on 
> top of it is not uncommon, e.g. 
> http://nihseniorhealth.gov/video/promo_qt300.html or 

Conceptually, that's in the video area, it's just that the video isn't 
centered vertically. I suppose we could allow UAs to do that pretty 
easily, if it's commonly desired.

> http://www.fs.fed.us/greatestgood/film/moviefiles/TheGreatestGood_Tr_C_L.mov 

Same.

> http://www.veotag.com/player/Default.aspx?mode=sample&sid=1&pid={516D49AA-72F4-4DA6-91BA-6D225C2782D8}

I couldn't find subtitles there. There were chapter markers, though, is 
that what you meant? I think it's clear that chapter marker UI is often 
specialised to the point where there'd be no sane way to expose it in 
markup without script, so I'm happy to just leave that up to the authors, 
much like how we are leaving the "play" button up to the author when the 
author wants to do something more fancy than the default.

> > > For linking out of a cue, there is a need to allow having hyperlinks 
> > > in cues. IIUC this is currently only possible by using a HTML-style 
> > > markup in the cue, declaring the cue as kind=metadata and calling 
> > > getCueAsSource() on the cue, then running your own overlays and 
> > > shoving the retrieved text to the innerHTML of that overlay.
> >
> > Having a hyperlink in a cue seems like really bad UI (having any 
> > temporal interactive UI is typically highly inaccessible, and is 
> > generally only considered a good idea in games). If you want to make 
> > the whole video into a link (as Dave suggested in the e-mail above, if 
> > I understood it correctly) then you don't need anything to do with 
> > timed tracks.
> 
> You can always pause the presentation to follow a given hyperlink.

That's pretty bad UI though.

> It's definitely better than having to re-type a URL, which is what is 
> currently happening in many of the timed annotations in YouTube that 
> leave YouTube.

Just put the link below the video.

> I see the need to support hyperlinks in cues as really important for 
> accessibility and usability reasons.

I see the avoidance of hyperlinks in cues as really important for 
accessibility and usability reasons.

> > > While that works, it seems like a lot of hoops to jump through just 
> > > to be able to use a bit of HTML markup - in particular having to run 
> > > your own overlay. Could we introduce a kind=htmlfragment type where 
> > > it is obvious that the text is HTML and that the fragment parser can 
> > > be run automatically and display it through the given display 
> > > mechanisms?
> >
> > I would on the contrary think that that would be something we should 
> > _discourage_, not encourage!
> 
> All that is going to achieve is that we will end up with HTML fragments 
> in metadata type cues and have to deal with them through JavaScript. I'd 
> much prefer we have a defined way of dealing with this situation rather 
> than having it be created inconsistently in JS libraries.

It's unclear to me what use cases would be served by this. All the cases I 
can think of (such as those discussed earlier in this e-mail: advertising, 
annotations, fancy chapter navigation, interactive transcripts...) are all 
things for which there will be script anyway, and for which built-in 
features won't be used. We would just be providing hugely complex features 
for which there is no benefit.

> > > Many existing subtitle formats and similar media-time-aligned text 
> > > formats contain file-wide name-value pairs that explain metadata for 
> > > the complete resource. An example are Lyrics files, e.g.
> > >
> > > On Tue, 20 Apr 2010, Silvia Pfeiffer wrote:
> > > >
> > > > Lyrics (LRC) files typically look like this:
> > > >
> > > > [ti:Can't Buy Me Love]
> > > > [ar:Beatles, The]
> > > > [au:Lennon & McCartney]
> > > > [al:Beatles 1 - 27 #1 Singles]
> > > > [by:Wooden Ghost]
> > > > [re:A2 Media Player V2.2 lrc format]
> > > > [ve:V2.20]
> > > > [00:00.45]Can't <00:00.75>buy <00:00.95>me <00:01.40>love,
> > > > <00:02.60>love<00:03.30>, <00:03.95>love, <00:05.30>love<00:05.60>
> > > > [00:05.70]<00:05.90>Can't <00:06.20>buy <00:06.40>me <00:06.70>love,
> > > > <00:08.00>love<00:08.90>
> > >
> > > You can see that there are title, artist, author, album, related 
> > > content, version and similar metadata information headers on this 
> > > file. Other examples contain copyright information and usage rights 
> > > - important information to understand and deal with when 
> > > distributing media-time-aligned text files on a medium such as the 
> > > Web.
> >
> > I don't really see why we would want to embed this in a timed track. 
> > Even in HTML embedding this kind of information has never taken off. 
> > We would need to have very compelling use cases, implementation 
> > experience, and implementation commitments to move in such a 
> > direction, IMHO.
> 
> Dublin Core has been a huge success.

I think our definitions of "success" are at odds. :-)

> Every archive in the world uses that kind of metadata. I am confused 
> what you mean by metadata in HTML hasn't taken off. I believe it's only 
> search engines that stopped using metadata and only because people 
> started mis-using the system.

Metadata in HTML on the Web is rare and when used is uniformly bogus.
Why would this be different for metadata in captions on the Web?

> Such metadata is also relevant to audio and video, just look at the 
> success of ID3 tags or Vorbis Comment. Similarly, we will need this 
> capability in timed text files.

For audio and video tracks I think there is success (mostly in commercial 
files), but that's out of scope of HTML.

> > > I would think it'd be good to define a standard means of extracting 
> > > plain text out of any type of cue, so it will be possible to hand 
> > > this to e.g. the accessibility API for reading back.
> >
> > Getting the raw data is already possible, unless I misunderstood what 
> > you meant.
> 
> What I meant is to have a getter in TimedTrackCueList that will not 
> return the cue with its specific markup (WebSRT, JSON or HTML fragment), 
> but stripped off any of the special markers. This can be very 
> interesting when wanting to shoot something through to speech 
> recognition or so.

That seems like a generic-purpose feature, not a caption-specific one. We 
should address this at the platform level.
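
For illustration only, here is a rough sketch of what such stripping could 
look like when done in script. It assumes the cue text uses the 
angle-bracket markup shown in the summary at the top of this e-mail 
(<c.sfx>, <v Name>, <b>, <i>, and so on). The function name is 
hypothetical and is not an API from the spec, and the sketch does not 
handle character escapes.

```python
import re

# Hypothetical helper, not part of any spec: anything that looks like a
# <...> tag is dropped, then runs of whitespace are collapsed.
_TAG = re.compile(r"<[^>]*>")


def cue_plain_text(cue_text: str) -> str:
    """Return the cue text with angle-bracket markup removed."""
    without_tags = _TAG.sub("", cue_text)
    return " ".join(without_tags.split())
```

Something this small can live in a script library today, which is part of 
why a caption-specific getter seems unnecessary.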

> > > I think by understanding this and by making this explicit in the 
> > > spec, we can more clearly decide what track kinds are still missing 
> > > and also what we actually need to implement.
> >
> > I'm not sure what to add to make this clearer. Can you elaborate?
> 
> What I meant by this was that in section 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#attr-track-kind 
> where @kind is introduced, there is no mention about the implications 
> of choosing between these @kind attributes. E.g. if I chose a 
> "description", then it will not be visible unless I implement that in 
> JavaScript - that is a pretty big implication that I only found out when 
> I finally got to reading the rendering section. Also, that section does 
> not provide any hint on what type of markup will be expected in the cue 
> text - I think that is also a pretty big implication that should be 
> mentioned in that section.

Ok, I've added some explanatory text.

> > [we can extend the syntax later to add metadata, defaults, etc]
> When I read the following: [...] then that doesn't imply for me that we 
> can add anything in front of the WebSRT cues without breaking the spec, 
> or that we can define cues that are not time ranges around the "-->" 
> sign.

Different "we"s. I meant that a future version of the language could 
support this while being compatible with the first generation of UAs, not 
that authors could do it today.

> > > and we know from image, music and video resources how important it 
> > > is to have the ability to keep such metadata inside the resource.
> >
> > Do we? I thought from image, music, and video we learnt that it didn't 
> > make much difference! :-)
> 
> I think ID3 is very successful, in particular in iTunes, see 
> http://en.wikipedia.org/wiki/ITunes#File_metadata . The vorbiscomment 
> header on Xiph files enjoys a similar popularity. And the huge success 
> of EXIF for images - written by every single digital photo camera and 
> used by every single photo application. They make a huge difference.

EXIF is automated and only contains machine-producible data. Metadata in 
iTunes tracks is probably required by contract, there is a high financial 
motivation for it to be of good quality, and the number of people writing 
it is limited. I haven't studied Xiph files so I can't speak about those.

On the Web, though, images in general don't have metadata, and the Web has 
not suffered for it. (EXIF is used on photographs, not arbitrary Web 
images.) There's not really much music on the Web, so I don't know that we 
can learn much from that, but video on the Web has very little metadata as 
far as I can tell, and seems to have done fine nonetheless.

> > > * there is no style sheet association for a WebSRT resource; this 
> > > can be resolved by having the style sheet linked into the Web page 
> > > where the resource is used with the video, but that's not possible 
> > > when the resource is used by itself. It needs something like a 
> > > <link> to a CSS resource inside the WebSRT file.
> >
> > Do standalone SRT players want to support CSS? If not, it doesn't much 
> > matter.
> 
> Stand-alone SRT players wouldn't want to see any of the WebSRT 
> extensions. Stand-alone WebSRT players - if we define the styling to be 
> in CSS - would probably want to support whatever is in WebSRT - if that 
> includes CSS, then that's it. But any of this is just guesswork until we 
> have implementations.

Then we can wait until we have implementations. If it turns out that they 
provide a way to specify a CSS file, then we should support it.

> > > * there is no magic identifier for a WebSRT resource, i.e. what the 
> > > <wmml> element is for WMML. This makes it almost impossible to 
> > > create a program to tell what file type this is, in particular since 
> > > we have made the line numbers optional. We could use "-->" as an 
> > > indicator, but it's not a good signature.
> >
> > Yeah, that's a problem. I considered adding "WEBSRT" at the start of 
> > every file but we couldn't use it reliably since WebSRT parsers 
> > presumably want to support SRT using the same parser, and that has no 
> > signature.
> 
> I continue to doubt that you can support WebSRT without changing your 
> SRT parser. Thus, you might as well make such a change and make it easy 
> for SRT parsers to identify that it's a WebSRT file to parse and not 
> legacy SRT.

Fair enough.

> > (Note that XML, and anything based on XML, as well as HTML, JS, and 
> > CSS, have no signature either. It's a common problem of text formats.)
> 
> Well, there are typical things to parse at the head of XML files, such as
> processing instructions or
> <!DOCTYPE html>
> <html
> These *are* magic identifiers.

No, they're not. They're unreliable heuristics. A magic identifier for a 
file type is something that is always present in files of that type and 
can be unambiguously used to determine the file type.

> > > * there is no means to identify which parser is required in the cues 
> > > (is it "plain text", "minimal markup", or "anything"?) and therefore 
> > > it is not possible for an application to know how it should parse 
> > > the cues.
> >
> > Timed track cues are not context-free. In standalone players, the user 
> > says to play a particular cue file, so using the "cue text" mode is a 
> > good assumption (why would you give mplayer a metadata cue file to 
> > display?).
> 
> Because it is a .srt file and thus assumed to be supported by mplayer.

I don't understand. Users don't just randomly find SRT files and feed them 
to their players. They find a video and then seek out the appropriate 
subtitle file and hand that to the player.

> > > I can understand that the definition of WebSRT took inspiration from 
> > > SRT for creating a simple format. But realistically most SRT files 
> > > will not be conformant WebSRT files because they are not written in 
> > > UTF-8.
> >
> > I don't think they need to be conforming. They're already published. 
> > Conformance is just a quality assurance tool, it's only relevant for 
> > documents being written in the future.
> 
> Conformance is also a problem if players and other tools do not accept 
> files that are not conformant. I would think Web browser will be highly 
> restrictive in what they accept - otherwise the spec isn't quite so 
> useful and we are starting to do quirks again.

The spec requires specific behaviour in the face of non-conforming 
content, and that behaviour is not to be restrictive.

> > > Right now, there is "plain text", "minimum markup" and "anything" 
> > > allowed in the cues.
> >
> > As far as I can tell there's just two modes -- plain text and text 
> > with WebSRT markup.
> 
> @kind=metadata tracks can have "anything" in them, which is what I 
> regarded as the third type of markup.

That's the same as "plain text".

> > > Seeing as WebSRT is built with the particular purpose of bringing 
> > > time-synchronized text for HTML5 media elements, it makes no sense 
> > > to exclude all the capabilities of HTML.
> >
> > I would on the contrary say that it makes no sense to take on all the 
> > HTML baggage when all we want to do is introduce subtitles to video. 
> 
> We are introducing functionality for text and events that are executed 
> in a time-synchronized manner with media elements - this is broader than 
> just subtitles.

While I'm willing to grant that we're doing a bit more than subtitles, I 
disagree that we're introducing "functionality for text and events that 
are executed in a time-synchronized manner with media elements". That 
describes something like SMIL, which is significantly more complex than 
the narrow set of use cases which this effort is intended to address.

Now it may be that there are use cases whose value we can debate, such as 
advertising, or rich-media-on-rich-media annotations, or declarative 
interactive styled chapter UI, that are not currently considered in-scope, 
but if we want to add new use cases then we should consider them as new 
use cases and figure out from scratch how to address them (which may or 
may not involve reusing the same infrastructure as we are using for 
subtitles). We should not just throw in solutions on the assumption that 
those will address those new use cases.

> > > In the current form, WebSRT only makes limited use of existing CSS. 
> > > I see particularly the following limitations:
> > >
> > > * no use of the positioning functionality is made and instead a new 
> > > means of positioning is introduced; it would be nicer to just have 
> > > this reuse CSS functionality. It would also avoid having to repeat 
> > > the positioning information on every single cue.
> >
> > It doesn't make sense to position cues with CSS, because the position 
> > of cues is an intrinsic part of the cue semantic. Where a cue appears 
> > can change the plot of a show, for example (was it the evil twin who 
> > said something or the good twin?).
> 
> When I say "CSS" I mean the CSS means of providing in-line @style
> information. That is just a different means of providing positioning and
> styling information in a cue.

I don't understand what problem this would solve.

Note that CSS positioning doesn't provide the primitives needed to do cue 
overlap avoidance while still having positioned cues (which the HTML
spec does currently support for timed tracks).

> > > * cue-related metadata ("voice") could be made more generic; why not 
> > > reuse "class"?
> >
> > I don't know what this means. What is "class" and how does it differ 
> > from "voice"?
> 
> I am talking about the @class attribute in use by all HTML elements. It 
> could be used with a <span> to provide voice metadata and it would be 
> more flexible than "voice" because it can be associated with text 
> fragments, not with whole lines of text.

I could see value for a generic class mechanism (so I've added one), but I 
don't see how this relates to voices.

> > > * I noticed that it is not possible to make a language association 
> > > with segments of text and thus it is not possible to have text with 
> > > mixed languages.
> >
> > Are mixed language subtitles common? I don't know that I've ever seen 
> > that.
> 
> I have seen several caption files that have at least two languages, 
> possibly even in the same cue. You even have some at 
> http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA.

That can be done fine with the spec as written. You don't need to annotate 
the language in captions as far as I can tell (see also the discussion 
earlier in this e-mail).

> > * Is it possible to reuse the HTML font systems?
> >
> > What is the HTML font system?
> 
> Basically stuff defined here: 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#fonts-and-colors

I don't understand what it would mean to re-use that.

> > On Sat, 21 Aug 2010, Silvia Pfeiffer wrote:
> > >
> > > It's not just about implementation cost - it's also the problem of 
> > > maintaining another spec that can grow to have eventually all the 
> > > features that HTML5 has and more. Do you really eventually want to 
> > > re-spec and re-implement a whole innerHTML parser plus the extra <t> 
> > > element when we start putting <svg> and <canvas> and all sorts of 
> > > other more complex HTML features into captions? Just because the <t> 
> > > element is making trouble now? Is this really the time to re-invent 
> > > HTML?
> >
> > No, it's not. We should never let subtitles get that crazy.
> 
> Hmm, where have I heard that said before ...
> http://www.ibiblio.org/pioneers/lee.html
> "Berners-Lee was concerned over some of the new directions the Web was
> taking. There were decided differences between his original vision and the
> visions of Andreesen and the Netscape crowd. The Web was designed to be a
> serious medium."
> I think it's a myth to believe one has control over the path a technology
> will take and in which way it will be used.

If we have no control then it doesn't matter what the spec says. If we 
have any control then we should do what we can to not let subtitles get 
crazy. I don't see a contradiction here.

> On Mon, 23 Aug 2010, Philip Jägenstedt wrote:
> > >
> > > I don't expect that SVG, <canvas>, images, etc will ever natively be 
> > > made part of captions. Rather, I would hope that the metadata state 
> > > together with scripts is used. If we think that e.g. images in 
> > > captions are an important use case, then WebSRT is not a good 
> > > solution.
> >
> > Indeed.
> 
> Images in captions will be used, I can guarantee that.

The question isn't whether they'll be used, but whether the use cases for 
them are significant enough that we should support this case natively.

> > > If we allow arbitrary HTML and expect browsers to handle it well, it 
> > > adds some complexity. For example, any videos and images in the cue 
> > > would have to be fully loaded and ready to be decoded by the time 
> > > the cue is to be shown, which I really don't want to implement the 
> > > logic for. Simply having an iframe-like container where the document 
> > > is replaced for each cue wouldn't be enough, rather one would have 
> > > to create one document per cue during parsing and wait for all of 
> > > those to finish loading before beginning playback. I'm not sure, but 
> > > I'm guessing that amounts to significant memory overhead.
> >
> > Quite.
> 
> People will do it with HTML in the metadata and then decode it through 
> JavaScript and throw it at the HTML fragment parser, including all the 
> side effects that may have and that they will have to deal with. I'm 
> sure this will eventually catch up with us. Would it not be better to 
> think about it now and address it - in particular if you are saying that 
> WebSRT is not the right solution for this?

What are the use cases? Are they significant? If so, let's design 
something for them directly.

> > On Tue, 24 Aug 2010, Silvia Pfeiffer wrote:
> > >
> > > I believe [SVG etc] will be [added to WebSRT]. But since we are only 
> > > looking at the ways in which captions and subtitles are used 
> > > currently, we haven't accepted this as an important use case, which 
> > > is fair enough. I am considering likely future use though, which is 
> > > always hard to argue.
> >
> > In all my research for subtitles, I found very few cases of anything 
> > like this. Even DVDs, whose subtitle tracks are just hardcoded bitmap 
> > images, don't do anything fancy with them... just plain text and 
> > italics, generally. Why haven't people started doing fancy stuff with 
> > subtitles in all the years that we've had TVs? It's not like they 
> > can't do it.
> 
> SVG on the TV? All that was possible was teletext type graphics and 
> indeed, people did a lot of graphics there, e.g. 
> http://www.google.com.au/images?q=teletext .

Not in subtitles though.

> > My guess is that the real reason is that when you get so fancy that 
> > you're including graphics and the like, you're no longer doing timed 
> > tracks, you're just doing content, and the right thing to do is to 
> > either burn it in, or consider it a separate construct animated on top 
> > of the video, e.g. an <svg:video> and SMIL.
> 
> There was no authoring format available for such things that anything 
> would support to display.

That's not true. DVDs for example just have bitmap cues, so any graphic 
is possible. Yet much more than 80% of DVD subtitles are just text.

> Even the more complex caption formats were really not supported in any 
> player.

Shouldn't that be an indicator that there isn't actually a need here?

> Putting it on the Web is a game changer. It will be easy to author 
> (plenty of people know how to author HTML and will be able to throw HTML 
> fragments into WebSRT cues) and it will be easy to display (using some 
> JavaScript and the framework we're putting in place).

All the more reason not to encourage it! We should make desirable things 
easy, and undesirable things hard. That's a big part of language design.

> > > Then a *playback application* has the chance to identify them as a 
> > > different format and provide a specific parser for it, instead of 
> > > failing like Totem. They can also decide to extend their existing 
> > > SRT parser to support both WebSRT and SRT. And I also have no issue 
> > > with a user deciding to give a WebSRT file a go by renaming it to 
> > > .srt.
> >
> > I think you think there's more difference between WebSRT and SRT than 
> > there is. In practice, there is less difference between WebSRT and the 
> > equivalent SRT file than there is between two random SRT files today. 
> > The difference between WebSRT and SRT is well within the "error bars" 
> > of what SRT is today.
> 
> A WebSRT file with JSON in the cues is more different to anything that 
> is called .srt today.

Nobody is going to be passing metadata timed tracks to standalone players, 
so I don't see why that's relevant.

> An authoring application that loads a WebSRT file should support all 
> features of WebSRT, even the metadata type and should know what to do 
> with it.

Which is what, exactly? I don't understand what it means to use a metadata 
timed track file outside of a Web browser.

> If such a file is clearly marked as .wsrt, the authoring application has 
> a chance to do the right thing with the file and allow you to continue 
> editing your JSON content in a special interface for it.

A cue editor can just have a user toggle to decide what kind of editor it 
should expose.

On Fri, 10 Sep 2010, Philip Jägenstedt wrote:
> 
> Not being convinced we need anything more than simple key-value headers 
> in a header, I still looked at the options for comments:
> 
> Making any line with a --> in it be a comment would hide a lot of broken 
> cues from validators, so I think we shouldn't do this.
> 
> ; appears at the beginning of lines in 15/10000 files and most don't 
> look like they're intended as comments.
> 
> # appears at the beginning of lines in 244/10000 files and most don't 
> look like they're intended as comments.
> 
> /* only appears in 3/10000 files, so CSS-style comments might work, but 
> does add some complexity
> 
> // appears at the beginning of lines in 5/10000 files and most look like 
> they *are* intended as comments or are garbage, so it should work.
> 
> (data from OpenSubtitles sample)

Thanks for the data.

If we're going to change the spec to require a magic header, then the 
legacy content is of limited importance at this point, so I guess it 
doesn't really matter. We can add any kind of backwards-compatible new 
syntax in the future.

> I often see various credits in the cues themselves. Some are there for 
> ego purposes, but I expect at least some of them would end up in a 
> metadata field if it existed. It's hard to get solid numbers, but after 
> some grepping and manual filtering it seems like around 5% of files have 
> some form of credits matching 'subtitle', 'translat' or 'caption' 
> case-insensitively. I guess that many non-English subtitles have the 
> credits in another language, so the true percentage should be higher.

It's pretty common for subtitles to include visible credits around the 
same time as programme credits. I don't see any reason to believe that 
having metadata for this would be superior. HTML authors for example 
typically put their name in their HTML page rather than using the various 
metadata mechanisms (or use both). Historically in HTML the metadata 
features haven't been especially useful. I don't see why it would be 
different for subtitle data.

On Sat, 11 Sep 2010, Silvia Pfeiffer wrote:
> 
> What I meant was: if I author a text track that is supposed to be 
> visible on screen as the video plays back and if we choose either 
> @kind=subtitle or @kind=caption as the default, then I don't have to 
> really think through about what I authored as it will be displayed on 
> screen. This invites people to not distinguish between whether they 
> authored subtitles or captions, which is a bad thing, because a deaf 
> user may then get tracks with the wrong label and expectations. If, 
> however, we choose as a default something that is not visible on screen, 
> e.g. @kind=description or @kind=metadata, then the author who wants 
> their text track to be visible on screen has to give it a label, i.e. 
> make an explicit choice between @kind=subtitle and @kind=caption. I 
> believe this will lead to more correctly labeled content. I am therefore 
> strongly against default labeling with either subtitle or caption. We 
> could make @kind a required attribute instead as you are saying.

The history of the Web teaches us that if we require that they pick 
between subtitle and caption, they'll just pick at random.

On Mon, 13 Sep 2010, Philip Jägenstedt wrote:
> 
> OK, I think we mostly agree. Any default will sometimes be wrong, so to 
> not have to choose between subtitles and captions, I'd still really 
> prefer if specific HoH-tags like <sound> can be shown or hidden 
> depending on user preference. I think that would lead to more content 
> actually being written for HoH users, as it doesn't require 
> maintaining 2 different files.

My main concern with this is that it's not been done before. Given the 
apparent simplicity of the feature, it seems that there must be some 
reason for this. One reason might be that it's never come up before. 
Another reason might be that captions and subtitles are different in 
subtle ways that make this inappropriate in practice. I think before we 
add this feature, we should try to understand the history here.

Note that even if we were able to point to a single file for both 
subtitles and captions, we'd still have to specify them separately in the 
markup, so that the UI could correctly identify the two options without 
having to download and parse the two tracks first. So I'm not sure it 
would actually help with the problem in question (the default kind).

> Requiring UTF-8 and not requiring UTF-8 both have their downsides. I think 
> that handling charset as an attribute on <track> isn't very difficult, 
> but if there are SRT-incompatible changes for other reasons (e.g. a 
> header) then I think we should go back to always requiring UTF-8.

Done.

> I don't suppose it's a huge problem in practice that errors can't be 
> detected until EOF, but it's certainly not a desirable feature. To 
> maintain some sanity, we probably ought to either require the correct 
> MIME type or require the correct magic bytes. From the <video> MIME type 
> debacle, I think I slightly prefer magic bytes to be checked by the 
> parser.

Done.

On Tue, 14 Sep 2010, Philip Jägenstedt wrote:
> 
> I'd say that the simplest approach is probably requiring the first line to be
> "WebSRT", and then all lines up to the first blank line are defined as the
> header.

Done, though currently anything in the header is ignored and invalid.
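
As a minimal sketch of the rule Philip describes (a magic first line, then 
a header that runs up to the first blank line), something like the 
following would separate the header from the cues. The function name and 
the exact-match comparison are assumptions made for illustration; the 
default magic string is simply the one quoted above, and this is not the 
spec's parsing algorithm.

```python
from typing import List, Tuple


def split_header(text: str, magic: str = "WebSRT") -> Tuple[List[str], str]:
    """Split a timed track file into (header_lines, body).

    Assumes the first line must equal `magic` exactly and that the header
    is every following line up to, but not including, the first blank line.
    """
    lines = text.splitlines()
    if not lines or lines[0] != magic:
        raise ValueError("missing magic first line")

    header: List[str] = []
    index = 1
    while index < len(lines) and lines[index].strip() != "":
        header.append(lines[index])
        index += 1

    # Skip the blank separator line; whatever follows is the cue data.
    body = "\n".join(lines[index + 1:])
    return header, body
```

A parser following the current text would then discard the returned header 
lines, since anything in the header is ignored for now.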

On Tue, 14 Sep 2010, Anne van Kesteren wrote:
> 
> Apart from text/plain I cannot think of a "web" text format that does 
> not have comments.

But what's the use case? Is it really useful to have comments in a 
subtitle file?

On Fri, 22 Oct 2010, Philip Jägenstedt wrote:
> 
> However, UTF-8 does complicate the magic header a bit due to the 
> possibility of a BOM. While it would be nice to forbid the use of a BOM, 
> I expect we'd then see lots of frustration from authors whose editors 
> automatically insert it...

I think having an optional BOM before the magic string is fine. It just 
means there's two magic strings, basically. (There's actually strictly 
eight; see the spec for details. I list them in the MIME type registration 
for completeness.)
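
A sniffer that tolerates the optional BOM could be sketched roughly as 
follows. The byte string b"WEBVTT" used as the default signature is an 
assumption made for this example, as is the function name; the sketch only 
checks the prefix and does not enumerate the exact variants listed in the 
MIME type registration.

```python
import codecs


def looks_like_timed_track(data: bytes, magic: bytes = b"WEBVTT") -> bool:
    """Return True if `data` starts with `magic`, optionally after a UTF-8 BOM.

    The default `magic` value is an assumption for this example.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Drop the three BOM bytes (EF BB BF) before comparing.
        data = data[len(codecs.BOM_UTF8):]
    return data.startswith(magic)
```

Because the check needs only the first few bytes, it works in environments 
where the MIME type label is missing or unreliable.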

On Fri, 22 Oct 2010, Simon Pieters wrote:
> 
> Do you think browsers will support vanilla SRT as well? If yes, then 
> making WebSRT incompatible seems like doing the quirks mode/standards 
> mode mistake again to me (and eventually we'll have to specify vanilla 
> SRT anyway, but are also stuck with yet another format to support).

My assumption is that most browsers will not support legacy SRT. If they 
were to do so, we'd have to spec it first, anyway.

> > It can still be inspired by it though so we don't have to change much. 
> > I'd be curious to hear what other things you'd clean up given the 
> > chance.
> 
> WebSRT has a number of quirks to be compatible with SRT, like supporting 
> both comma and dot as decimal separators, the weird parsing of 
> timestamps, etc.

I've cleaned the timestamp parsing up. I didn't see others.
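
To illustrate the separator quirk Simon mentions (and not the cleaned-up 
algorithm in the spec), a lenient SRT-style parser might accept either a 
comma or a dot before the fractional part. This is only a sketch, assuming 
an hh:mm:ss form followed by exactly three fractional digits; the function 
name is hypothetical.

```python
import re

# Two or more hour digits, two-digit minutes and seconds, then either
# "," (SRT style) or "." followed by three digits of milliseconds.
_TIMESTAMP = re.compile(r"^(\d{2,}):(\d{2}):(\d{2})[.,](\d{3})$")


def parse_timestamp_ms(value: str) -> int:
    """Return the timestamp as a whole number of milliseconds."""
    match = _TIMESTAMP.match(value.strip())
    if match is None:
        raise ValueError(f"not a timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    if minutes > 59 or seconds > 59:
        raise ValueError(f"minutes or seconds out of range: {value!r}")
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
```

Accepting both separators costs one character class here, but every such 
allowance is one more behaviour that all implementations have to agree on.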

On Fri, 22 Oct 2010, Philip Jägenstedt wrote:
> 
> We should just remove charset="" from the spec.

Done.

On Fri, 5 Nov 2010, Silvia Pfeiffer wrote:
> 
> seeing the addition of the <bdi> element into HTML, we probably also 
> need to add that to WebSRT cue level markup to allow bidirectional text 
> formatting. 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-bdi-element

Why can't we just use bidi formatting characters, like in text/plain?

On Fri, 10 Sep 2010, Eric Carlson wrote:
>
> "type"  will definitely be necessary if we use <track> for other media 
> types, eg. for sign language video, descriptive audio, etc.

Those aren't text tracks, so presumably wouldn't use <track>.

> Images are already commonly used in chapter menus.

Could you elaborate on this?

On Wed, 20 Oct 2010, Odin Omdal Hørthe wrote:
> 
> The standards-loving Agency for Public Management and eGovernment here 
> in Norway are getting their eyes up for HTML5 video (like the rest of 
> the world), and are kicking the tires. I've been streaming many 
> conferences with Ogg Theora and using Cortado as fallback for legacy 
> browsers (+Safari).
> 
> Now it has come to a point that we are required to follow the WAI WCAG 
> requirements. So we have to caption the live video streams/broadcasts.
> 
> Given the (not surprising) low support of Timed Tracks for live streams 
> in browsers, I'm at this point going to burn the text into the video to 
> be shown. However, that is no good solution long term. When browsers 
> implement the new startOffsetTime I will be able to send the text via a 
> WebSocket to Javascript and have it synced to the video (along with the 
> slide images).
> 
> However, it would be very nice to be able to send this to the 
> caption-track, and not having to reimplement a user interface for 
> choosing to see captions etc (I guess user agents will have that). Also, 
> I guess there will also be other benefits of streaming directly as a 
> timed track, such as the user agent knowing what it is (so that it can 
> do smart things with it).

The API for timed tracks allows for this.

> Or what other way is there to text such live conferences; or even bring 
> real-time metadata from a live video?

Video formats typically support in-band subtitles, which I would expect is 
what would be used for subtitles for live streams.

On Tue, 5 Oct 2010, Philip Jägenstedt wrote:
> 
> At the Open Subtitles Design Summit, there was some discussion about 
> captioning for the HoH. I've already put this input into a related bug 
> [2], but to summarize: The default rendering for the voices syntax 
> should probably be to prefix the text cue with the name of the speaker, 
> not to do anything funny with colors or positioning. What's less clear 
> is if it's annoying to always prefix with the speaker, or if it should 
> be done only to disambiguate.
> [2] http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320

The bug is about the syntax, not the rendering.

The syntax suggestion makes sense, and I've updated the spec accordingly.

I'm not sure what to do with the rendering. Could you elaborate on why we 
would show the speaker names? Surely it would be better to let authors 
explicitly put in the speaker names in the text of the cue if that's what 
they want.

> At FOMS we had a session on WebSRT which was extremely helpful. It turns 
> out that SRT has more syntax variations than we had thought, kindly 
> pointed out by VLC developer j-b. Even though there is no SRT spec, 
> there is a test suite of sorts that I had never seen before. I'll call 
> SRT which follows the syntax implied by these tests ale5000-SRT. Apart 
> from the HTML-like markup we knew about, ale5000-SRT also has various 
> markup on the form {...} which was borrowed from SSA, as well as \h and 
> \N for "hard space" and line break respectively. Also in the crazy 
> department is that tags which aren't matched with an opening and closing 
> tag should be rendered as plain text. Stray < should also just be 
> displayed as text. VLC actually implements most of this, as does 
> VSFilter, which we should have tested but didn't. It would probably be 
> possible to write a spec for ale5000-SRT, but extensibility would be 
> limited to matched opening and closing tags, which doesn't work for the 
> suggested voices syntax. With this mess, I'd rather not extend 
> ale5000-SRT. I can only agree with Silvia that we should make WebSRT 
> identifiable, so that different parsers can be used.

Based on this, and as discussed earlier in this e-mail, I've made a number 
of changes to the language, including:

> * Add magic bytes to identify WebSRT, maybe "WebSRT". (This will break 
> some existing SRT parsers.)
>
> * Make WebSRT always be UTF-8, since you can't reuse existing SRT files 
> anyway.

> * Note that certain ale5000-SRT syntax is not part of WebSRT, so that 
> one doesn't have to debug the parsing algorithm to learn that.

Could you suggest some text for this? It's unclear to me exactly what 
would be helpful here. In fact, noting that the language is no longer 
called "SRT", is this still necessary?

> Styling hooks were requested. If we only have the predefined tags (i, b, 
> ...) and voices, these will most certainly be abused, e.g. resulting in 
> <i> being used where italics isn't wanted or <v Foo> being used just for 
> styling, breaking the accessibility value it has.

I've added <c>, a cue span that can carry classes, for styling.
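
As a rough illustration of how the class syntax from the summary at the top 
of this e-mail (<c.credit.author>, <b.loud>, <v Hippo Hero>) breaks down, 
here is a small Python sketch. It is not the spec's tokenizer: the function 
name is hypothetical, and it assumes a well-formed start tag with classes 
separated by dots and any annotation (such as a voice name) after the first 
space.

```python
def split_start_tag(tag):
    """Split '<c.credit.author>' or '<v Hippo Hero>' into (name, classes, annotation)."""
    inner = tag.strip()[1:-1]                   # drop the surrounding '<' and '>'
    head, _, annotation = inner.partition(" ")  # annotation follows the first space
    name, *classes = head.split(".")            # 'c.credit.author' -> 'c', ['credit', 'author']
    return name, classes, annotation.strip()
```

A style sheet can then hook onto the classes, leaving <i> and <v> free to 
keep their intended meaning.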

> There was also some discussion about metadata. Language is sometimes 
> necessary for the font engine to pick the right glyph.

Could you elaborate on this? My assumption was that we'd just use CSS, 
which doesn't rely on language for this.

> License is also an often requested piece of metadata.

That would get solved if we allowed comments, which is how it's solved in 
JS (CSS and HTML don't usually get licensed for some reason). I've punted 
on this for now, so as to not make too many parsing changes at once.

> Finally, some things I think are broken in the current WebSRT parser:
> 
> * Parsing of timestamps is more liberal than it needs to be. In 
> particular, treating the part after the decimal separator as an integer 
> and dividing by 1000 leads to 00:00:00.1 being interpreted as 0.001 
> seconds, which is weird. This is what e.g. VLC does, but if we need to 
> add a header we could just as well change this to make it more sane. 
> Alternatively, if we want to really align with C implementations using 
> scanf, we should also handle negative numbers (00:01:-5,000 means 55 
> seconds), octal and hexadecimal.

Fixed.
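
For concreteness, here is a Python sketch contrasting the behaviour 
described above with a stricter parse. The strict pattern is an assumption 
made for this example (optional hours, two-digit minutes and seconds, a dot, 
exactly three fractional digits), not a quotation of the spec's grammar, and 
both function names are made up.

```python
import re

# Assumed strict form: [hh:]mm:ss.ttt with exactly three fractional digits.
STRICT = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})")


def legacy_seconds(text):
    """Mimic the liberal behaviour: fraction read as an integer, then / 1000."""
    clock, _, frac = text.replace(",", ".").partition(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(frac) / 1000


def strict_seconds(text):
    """Return the time in seconds, or None if text is not in the strict form."""
    match = STRICT.fullmatch(text)
    if match is None:
        return None
    hours, minutes, seconds, millis = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000
```

With these definitions, legacy_seconds("00:00:00.1") gives 0.001 and 
legacy_seconds("00:01:-5,000") gives 55, which are the two oddities 
mentioned above; strict_seconds() rejects both inputs.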

> * The current syntax looks like XML or HTML but has very different 
> parsing. Voices like <narrator> don't create nodes at all and for tags 
> like <i> the parser has a whitelist and also special rules for inserting 
> <rt>. Unless there are strong reasons for this, then for simplicity and 
> forward compatibility, I'd much rather have the parser create an actual 
> DOM (not a tree of "WebSRT Node Object") that reflects the input. If we 
> also support attributes then people who actually want to use their 
> (silly) <font color=red> tags can do so with CSS. This could also work 
> as styling hooks. Obviously, a WebSRT parser should create elements in 
> another namespace, we don't want e.g. <img> to work inside cues.

I don't think we want to expose an actual DOM, since then people will just 
do things like put <html:video> elements into the DOM, or try to 
document.write() into it, or the like, which is just as bad as doing HTML 
parsing of cues.

> * The "bad cue" handling is stricter than it should be. After collecting 
> an id, the next line must be a timestamp line. Otherwise, we skip 
> everything until a blank line, so in the following the parser would jump 
> to "bad cue" on line "2" and skip the whole cue.
> 
> 1
> 2
> 00:00:00.000 --> 00:00:01.000
> Bla
> 
> This doesn't match what most existing SRT parsers do, as they simply 
> look for timing lines and ignore everything else. If we really need to 
> collect the id instead of ignoring it like everyone else, this should be 
> more robust, so that a valid timing line always begins a new cue. 
> Personally, I'd prefer if it is simply ignored and that we use some form 
> of in-cue markup for styling hooks.

The IDs are useful for referencing cues from script, so I haven't removed 
them. I've also left the parsing as is for when neither the first nor 
second line is a timing line, since that gives us a lot of headroom for 
future extensions (we can do anything so long as the second line doesn't 
start with a timestamp and "-->" and another timestamp).
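
The rule can be sketched in Python as follows. This is a simplification 
under stated assumptions, not the spec's algorithm: a block is taken to be 
the list of lines between blank lines, the timing pattern is a loose 
stand-in for the real timestamp grammar, and parse_block is a hypothetical 
name.

```python
import re

# Loose stand-in for "a timestamp, then -->, then another timestamp".
TIMING = re.compile(r"\s*[\d:.]+\s*-->\s*[\d:.]+")


def parse_block(lines):
    """Return (cue_id, timing_line, text_lines), or None for a bad cue."""
    if lines and TIMING.match(lines[0]):
        return None, lines[0], lines[1:]      # cue without an id
    if len(lines) > 1 and TIMING.match(lines[1]):
        return lines[0], lines[1], lines[2:]  # first line is the id
    return None  # neither line is a timing line: skip the whole block
```

With the example quoted above (lines "1", "2", then the timing line), 
parse_block() returns None, so the block is skipped and that shape remains 
available for future extensions.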

> * At the beginning of "cue text loop" (step 28) a newline should be 
> collected.

Fixed.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'