[whatwg] Timed tracks: feedback compendium

Ian Hickson ian at hixie.ch
Tue Sep 7 16:19:17 PDT 2010


On Fri, 23 Jul 2010, Sam Dutton wrote:
> >>
> >> The addCueRange() API has been removed and replaced with a feature 
> >> based on the subtitle mechanism. <<
> 
> Do you mean the use of timed track cues?

Yes.


> A couple of minor queries re 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#the-track-element:
> 
> * 'time track' is referred to a couple of times -- it's not clear why 
> this is used instead of 'timed track'

Fixed.


> * 'the WebSRT file must WebSRT file using cue text' -- I guess this 
> should be 'the WebSRT file must be a WebSRT file using cue text'

Fixed.


On Fri, 23 Jul 2010, Philip Jägenstedt wrote:
> 
> I'm not a fan of pauseOnExit, though, mostly because it seems 
> non-trivial to implement. Since it is last in the argument list of 
> TimedTrackCue, it will be easy to just ignore when implementing. I still 
> don't think the use cases for it are enough to motivate the 
> implementation cost.

Really? It seems like automatically pausing video half-way would be a very 
common thing to do; e.g. to play an interstitial ad, or to play a specific 
sound effect in a sound file containing multiple sound effects, or to play 
a video up to the point where the user has to make a choice or has to ask 
to move on to the next slide. There's basically no good way to do this 
kind of thing without this feature.
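
For instance, a script could set this up with something like the following
sketch (assuming the MutableTimedTrack API and the TimedTrackCue
constructor as currently drafted; the exact argument order may differ):

   var track = video.addTrack("metadata", "segments", "en");
   // A cue covering the first segment; pauseOnExit (the last
   // constructor argument) pauses playback at the cue's end time.
   var cue = new TimedTrackCue("segment1", 0.0, 30.0, "", "", true);
   cue.onexit = function () {
     playInterstitialAd(); // hypothetical page function
   };
   track.addCue(cue);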


> > On Fri, 31 Jul 2009, Silvia Pfeiffer wrote:
> > > 
> > > * It is unclear, which of the given alternative text tracks in 
> > > different languages should be displayed by default when loading an 
> > > <itext> resource. A @default attribute has been added to the <itext> 
> > > elements to allow for the Web content author to tell the browser 
> > > which <itext> tracks he/she expects to be displayed by default. If 
> > > the Web author does not specify such tracks, the display depends on 
> > > the user agent (UA - generally the Web browser): for accessibility 
> > > reasons, there should be a field that allows users to always turn 
> > > display of certain <itext> categories on. Further, the UA is set to 
> > > a default language and it is this default language that should be 
> > > used to select which <itext> track should be displayed.
> > 
> > It's not clear to me that we need a way to do this; by default 
> > presumably tracks would all be off unless the user wants them, in 
> > which case the user's preferences are paramount. That's what I've 
> > specced currently. However, it's easy to override this from script.
> 
> It seems to me that this is much like <video autoplay> in that if we 
> don't provide a markup solution, everyone will use scripts and it will 
> be more difficult for the UA to override with user prefs.

What would we need for this then? Just a way to say "by the way, in 
addition to whatever the user said, also turn this track on"? Or do we 
need something to say "by default, override the user's preferences for 
this video and instead turn on this track and turn off all others"? Or 
something else? It's not clear to me what the use case is where this 
would be useful declaratively.


> > On Fri, 31 Jul 2009, Philip Jägenstedt wrote:
> > > 
> > > * Security. What restrictions should apply for cross-origin loading?
> > 
> > Currently the files have to be same-origin. My plan is to wait for 
> > CORS to be well established and then use it for timed tracks, video 
> > files, images on <canvas>, text/event-stream resources, etc.
> 
> If I'm interpreting the track fetch algorithm correctly cross-origin is 
> strictly enforced and treated as a network error. This is different from 
> e.g. <img> and <video>, but it seems to make things simpler, so I'm fine 
> with that. It also ensures that JavaScript fallback handling of <track> 
> won't fail just because of cross-origin in XHR.

Right. The difference between captions and video data is that you can get 
a heck of a lot more data out of a caption file. Similarly, we wouldn't 
expose the captions from cross-origin files to script (oops, I had 
forgotten to block that -- fixed!) without CORS opt-in.


> > > * Complexity. There is no limit to the complexity one could argue 
> > > for (bouncing ball multi-color karaoke with fan 
> > > translations/annotations anyone?). We should accept that some use 
> > > cases will require creative use of scripts/SVG/etc and not even try 
> > > to solve them up-front. Draw a line and stick to it.
> > 
> > Agreed. Hopefully you agree with where I drew the line! :-)
> 
> Actually, I think both karaoke (in-cue timestamps) and ruby are 
> borderline, but it depends on how difficult it is to implement.

FWIW, a lot of the use cases that were found (see the wiki), and that I 
saw over the few weeks that I was doing this, had intracue timing. We 
don't have to have it, obviously, but it doesn't look especially hard to 
do and if we can do it I think it'd be a good win. However, if it is 
indeed more work than it appears to be, then we can totally drop it.


> One thing in particular to note about karaoke is that even with in-cue 
> timestamps, CSS still won't be enough to get the typical effect of 
> "wiping" individual characters from one style to another, since the 
> smallest unit you can style is a single character. To get that effect 
> you'd have to render the characters in two styles and then cut them 
> together (with <canvas> or just clipping <div>s). Arguably, this is a 
> presentation issue that could be fixed without changing WebSRT.

Yes, it seems to me that that is the kind of thing we should add in 
transitions -- making the transition be spatial, so that instead of 
gradually going from red to blue, it snaps from red to blue but does so 
gradually more and more to the right, say.


> > On Thu, 15 Apr 2010, Silvia Pfeiffer wrote:
> > > 
> > > Further, SRT has no way to specify which language it is written in
> > 
> > What's the use case?
> 
> As hints for font selection

Are independent SRT processors really going to do per-language font 
selection? How do they do it today?


> and speech synthesis.

Are independent SRT processors really going to do audio descriptions any 
time soon? I've only ever seen this in highly experimental settings.


> [...] the positioning of individual cues is still not controlled by CSS 
> but rather by e.g. L:50%.

I considered this issue carefully when speccing WebSRT. My conclusion 
(after watching a lot more TV than I'm used to) was that in practice 
subtitle positioning is not strictly a presentational issue -- that is, 
you can't just swap one set of styles for another and have equally good 
results, you have to control the positioning on a per-cue basis regardless 
of the styling. This is because you have to avoid burnt-in text, or 
overlap burnt-in text, or because you need to align text with a speaker, 
or show which audio channel the text came from (e.g. for people talking 
off camera in a very directional sense), etc.


> > > I'm also confused about the removal of the chapter tracks. These are 
> > > also time-aligned text files and again look very similar to SRT.
> > 
> > I've also included support for chapters. Currently this support is not 
> > really fully fleshed out; in particular it's not defined how a UA 
> > should get chapter names out of the WebSRT file. I would like 
> > implementation feedback on this topic -- what do browser vendors 
> > envisage exposing in their UI when it comes to chapters? Just markers 
> > in the timeline? A dropdown of times? Chapter titles? Styled, 
> > unstyled?
> 
> A sorted list of chapters in a context menu at minimum, with the name of 
> the chapter and probably the time where it starts. More fancy things on 
> the timeline would be cool, but given that it's going to look completely 
> different in all browsers and not be stylable I wonder if there's much 
> point to it.

Ok, I've defined that you just grab the raw text to get the chapter name.
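
For scripts that want to build their own chapter menu, that means
something like this sketch (getCueAsSource() as in the current draft;
addChapterMenuItem() is a hypothetical page function):

   var chapters = video.tracks[0]; // assuming track 0 is the chapter track
   for (var i = 0; i < chapters.cues.length; i += 1) {
     // The raw cue text is the chapter name.
     addChapterMenuItem(chapters.cues[i].getCueAsSource(),
                        chapters.cues[i].startTime);
   }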


> Finally, random feedback:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#attr-track-kind
> 
> The distinction between subtitles and captions isn't terribly clear.
> 
> It says that subtitles are translations, but plain transcriptions 
> without cues for the hard of hearing would also be subtitles.
> 
> How does one categorize translations that are for the HoH?

I've tried to clarify this.


> Alternatively, might it not be better to simply use the voice "sound" 
> for this and let the default stylesheet hide those cues? When writing 
> subtitles I don't want the maintenance overhead of 2 different versions 
> that differ only by the inclusion of [doorbell rings] and similar. 
> Honestly, it's more likely that I just wouldn't bother with 
> accessibility for the HoH at all. If I could add it with <sound>doorbell 
> rings, it's far more likely I would do that, as long as it isn't 
> rendered by default. This is my preferred solution, then keeping only 
> one of kind=subtitles and kind=captions. Enabling the HoH-cues could 
> then be a global preference in the browser, or done from the context 
> menu of individual videos.

I don't disagree with this, but I fear it might be too radical a step for 
the caption-authoring community to take at this point.


> If we must have both kind=subtitles and kind=captions, then I'd suggest 
> making the default subtitles, as that is without a doubt the most common 
> kind of timed text. Making captions the default only means that most 
> timed text will be mislabeled as being appropriate for the HoH when it 
> is not.

Ok, I've changed the default. However, I'm not fighting this battle if it 
comes up again, and will just change it back if people don't defend having 
this as the default. (And then change it back again if the browsers pick 
"subtitles" in their implementations after all, of course.)

Note that captions aren't just for users that are hard-of-hearing. Most of 
the time when I use timed tracks, I want captions, because the reason I 
have them enabled is that I have the sound muted.


On Fri, 23 Jul 2010, Sam Dutton wrote:
>
> Is trackgroup out of the spec?

What is trackgroup?


On Fri, 23 Jul 2010, Henri Sivonen wrote:
> 
> > - A set of rules and processing models to hold it all together.
> 
> Is it intentional that WebSRT doesn't come with any examples?

Only insofar as I prefer to not do non-normative material until such time 
as the normative material is somewhat stable. :-)


> > - Keep implementation costs for standalone players low.
> 
> I think this should be a non-goal. It seems to me that trying to cater 
> for non-browser user agents or non-Web uses in Web specs leads to bad 
> Web specs. I think by optimizing for standalone players WebSRT falls 
> into one of the common traps for Web specs. I think we should design for 
> the Web (where the rendering is done by browser engines).

I think that would be somewhat arrogant. :-) We can keep implementation 
costs for standalone players low without making bad specs -- we just have 
to _also_ design for browsers. I think WebSRT does pretty well on this 
front, in fact.


> > - Use existing technologies where appropriate.
> [...]
> > - Try as much as possible to have things Just Work.
> 
> I think by specifying a standalone cue text parser WebSRT fails on these 
> counts compared to reusing the HTML fragment parsing algorithm for 
> parsing cue text.

HTML parsing is a disaster zone that we should avoid at all costs, IMHO. I 
certainly don't think it would make any sense to propagate that format 
into anywhere where we don't absolutely have to propagate it.


> Specifying a new parser for turning HTML-like tags into a tree structure 
> that can be used as the input of a CSS formatter fails to reuse existing 
> technologies where appropriate (though obviously we disagree on what's 
> "appropriate").

I agree with the parenthetical. :-)


> > I first researched (with some help from various other contributors - 
> > thanks!) what kinds of timed tracks were common. The main classes of 
> > use cases I tried to handle were plain text subtitles (translations) 
> > and captions (transcriptions) with minimal inline formatting and 
> > karaoke support, chapter markers so that browsers could provide quick 
> > jumps to points in the video, text-driven audio descriptions, and 
> > application-specific timed data.
> 
> Why karaoke and application-specific data? Those both seem like feature 
> creep compared to the core cases of subtitles and captions.

Karaoke is very cheap and addresses a lot of use cases; application-
specific data is even cheaper and also addresses a lot of other use cases.


> > If we don't use HTML wholesale, then there's really no reason to use 
> > HTML at all. (And using HTML wholesale is not really an option, as you 
> > say above.)
> 
> I disagree. The most obvious way of reusing existing infrastructure in 
> browsers, the most obvious way of getting support for future syntax 
> changes that support attributes or new tag names and the most obvious 
> way to get error handling that behaves in the way the appearance of the 
> syntax suggests is to reuse the HTML fragment parsing algorithm for 
> parsing the cue text.

HTML is one of the most convoluted, quirk-laden, unintuitive, and 
expensive syntaxes to parse... Its extensibility story is a disaster 
(there are so many undocumented and continually evolving constraints that 
any addition is massively expensive), its implementation drags with it all 
kinds of crazy dependencies on the DOM, event loop interactions, 
scripting, and so forth, and it has a highly inconsistent syntax.

I'm not at all convinced reusing it would be "obvious".


> > I've defined some CSS extensions to allow us to use CSS with SRT.
> 
> The new CSS pseudos would be unnecessary if each cue formed a DOM by 
> parsing "<!DOCTYPE html>" as HTML (to get a skeleton DOM in the 
> standards mode) and then document.body.innerHTML were set to the cue 
> text.

You'd still need the past/future pseudos, and a way to jump into the cue 
from the regular DOM.


> > It would also result in some pretty complicated situations, like 
> > captions containing <video>s themselves.
> 
> If the processing is defined in terms of nested browsing contexts, the 
> task queue and innerHTML setter, the "right" behavior falls out of that.

That has not worked out so well for us in the past. (Just ask roc how well 
the behaviour of combining <iframe>s and SVG transforms fell out of 
defining things in terms of previously-implemented constructs.)

On Sun, 25 Jul 2010, Silvia Pfeiffer wrote:
> 
> I think if we have a mixed set of .srt files out there, some of which 
> are old-style srt files (with line numbers, without WebSRT markup) and 
> some are WebSRT files with all the bells and whistles and with 
> additional external CSS files, we create such a mess for that existing 
> ecosystem that we won't find much love.

I'm not sure our goal is to find love here, but in general I would agree 
that it would be better to have one format than two. I don't see why we 
wouldn't just have one format here though. The idea of WebSRT is to be 
sufficiently backwards-compatible that that is possible.


On Mon, 26 Jul 2010, Silvia Pfeiffer wrote:
> > On Thu, 16 Jul 2009, Silvia Pfeiffer wrote:
> >> * the "type" attribute is meant to both identify the mime type of the 
> >> format and the character set used in the file.
> >
> > It's not clear that the former is useful. The latter may be useful; I 
> > haven't supported that yet.
> 
> If the element is to support a single format in a single character set, 
> then there is no need for a MIME type. So, we need to be clear whether 
> we want to restrict our option here for multiple formats.

As specified, the spec supports multiple formats; it just talks about 
WebSRT currently. (If it becomes likely that browsers will have different 
sets of supported formats, we can add a type="" attribute to help browsers 
find the right files without checking each one, but that's not necessary 
unless that becomes a likely problem.)


> >> The character set question is actually a really difficult problem to 
> >> get right, because srt files are created in an appropriate character 
> >> set for the language, but there is no means to store in a srt file 
> >> what character set was used in its creation. That's a really bad 
> >> situation to be in for the Web server, who can then only take an 
> >> educated guess. By giving the ability to the HTML author to specify 
> >> the charset of the srt file with the link, this can be solved.
> >
> > Yeah, if this is a use case people are concerned about, then I agree 
> > that a solution at the markup level makes sense.
> 
> If we really are to use WebSRT because (amongst other reasons) it allows 
> reuse of existing srt files, then we need to introduce a means to 
> provide the charset, since almost none of the srt files in the wild that 
> I have looked at were in UTF-8, but in all sorts of other character 
> sets. Another solution to this problem would be to have WebSRT know what 
> charset their characters are in - then we don't need to add such 
> information to the <track> element. It will still not work with legacy 
> SRT files though.

I've added a charset="" attribute to allow authors to provide the 
character encoding for legacy SRT files. WebSRT files are required to be 
UTF-8, however (legacy SRT files that are not UTF-8 are considered 
non-conforming).
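
For example (a sketch; attribute names as per the current draft):

   <video src="video.webm" controls>
     <track kind="subtitles" src="legacy.srt" srclang="fr"
            charset="windows-1252">
   </video>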


> You mention that karaoke and lyrics are supported by WebSRT, so could we 
> add them to the track kinds?

Why would they need new track kinds? Isn't "subtitles" enough?


> In the proposal at 
> http://www.w3.org/WAI/PF/HTML/wiki/Media_TextAssociations a @media 
> attribute was suggested. The idea is that the @media attribute would 
> contain a media query describing what user environment, e.g. what devices 
> the text track is suitable for. If for example subtitles require a 
> minimum of 30 characters width to be displayed properly, but certain 
> devices cannot support this, the subtitles would be pretty useless on 
> such a device. Seeing as the <source> elements on media elements have 
> that attribute, too, it wouldn't be too difficult to implement the same 
> here.
> 
> Is this a "v2" feature or is it considered to be added?

I think we should probably wait to see how well media="" gets used with 
<source> before adding it to <track>, but I don't have any strong feelings 
on this front.


> (NOTE: there is a typo in section 4.8.10.10.5 when describing 
> MutableTimedTrack - in the green box, addCue() is repeated, but the 
> second one should be called removeCue() ).

Fixed.


> I wonder about the order in which <track> elements, mutable tracks, and 
> in-band TimedTracks are held. 4.8.10.10.1 states the above order (i.e. 
> <track> first, then mutable, then in-band). That <track> comes first 
> makes sense, since it is possible that different browsers choose 
> different media resources which may have different in-band tracks. Thus, 
> at least the numbering across the TimedTracks from <track> elements is 
> consistent. However, the in-band tracks will be available after the 
> media resource has been parsed, while the mutable tracks are 
> script-created and could be created dynamically through user 
> interaction. Does that mean, that the index of the in-band tracks can 
> change during the course of the Web page depending on how many mutable 
> tracks are available at a time?

The order can always change, e.g. if a <track> element is dynamically 
inserted.


> I would probably also more explicitly state that in-band tracks are only 
> chosen out of the media resource that is in @currentSrc.

How could the spec be interpreted otherwise? The only place that invokes 
the "steps to expose a media-resource-specific timed track" algorithm 
happens deep in the processing of the media resource.


> I am concerned about the definition of the TimedTrackCue. 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timedtrackcue
> 
> It has the following IDL attributes:
>   readonly attribute DOMString direction;
>   readonly attribute boolean snapToLines;
>   readonly attribute long linePosition;
>   readonly attribute long textPosition;
>   readonly attribute long size;
>   readonly attribute DOMString alignment;
> 
> All of these are related to CSS attributes and I wonder how that 
> interacts. For example, what if the @direction says "vertical" and the 
> CSS attribute for the cue says direction:rtl; ?

The .direction IDL attribute has more to do with the 'writing-mode' CSS 
property than 'direction', but in any case, the interaction is defined in 
detail in the rendering section.


> I am also confused about the snapToLines and linePosition attributes: 
> IIUC the linePosition is meant to be either a percentage of the video 
> dimensions or a line position relative to the first line of the cue. 
> Does that latter mean an offset from where the first line of the cue 
> should theoretically be?

See the rendering section for the formal definition.


> What is the purpose of it?

It allows people to specifically put cues on specific lines knowing that 
they will not overlap with each other. For example, in a scene with much 
overlapping dialog, you could put all the text of one person on line -4, 
and all the text of another on -2, and provided you only used two lines 
per cue, you'd know that they would be rendered in a consistent manner 
each time, rather than jumping up or down based on what other cues 
happened to be up at the time the cue came up.
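
For example (using the cue settings syntax from the current draft):

   00:00:10,000 --> 00:00:13,000 L:-4
   First speaker, always two lines above the second.

   00:00:11,000 --> 00:00:14,000 L:-2
   Second speaker, always on the same line from the bottom.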


> Does the earlier mean that we can only provide text for video and not 
> for audio, which has no dimensions?

If you want the browser to render cues, you have to use a <video> element 
so that there is somewhere to render them. You can play audio with the 
<video> element, and you can use <audio> and manually render the cues from 
JS if desired.

We could provide an API dedicated to making it easier to render cues 
manually if desired (firing an event or callback with the actual cue for 
each cue that shows, for example).
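
In the meantime, a script can approximate that with the cue events, along
these lines (a sketch, assuming the onenter/onexit handlers and
getCueAsSource() from the current draft, and a hypothetical overlay
element):

   var overlay = document.getElementById("captions");
   var cues = audio.tracks[0].cues; // assuming track 0 is the text track
   for (var i = 0; i < cues.length; i += 1) {
     cues[i].onenter = function (event) {
       overlay.textContent = event.target.getCueAsSource();
     };
     cues[i].onexit = function () {
       overlay.textContent = "";
     };
   }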


> What if we have a lyrics file for a piece of music? Can that not be 
> rendered?

Sure, it works the same as subtitles.


> And what if we wanted to render captions underneath a video rather than 
> inside video dimensions? Can that be achieved somehow?

You'd need to script it, currently. (I didn't see many (any?) cases of 
this in my research, so I didn't provide a declarative solution.)


> In http://www.mail-archive.com/whatwg@lists.whatwg.org/msg10395.html
> Dave Singer wrote:
> > Linking into a cue-range would be using its beginning or end as a seek 
> > point, or its duration as a restricted view of the media ("only show 
> > me cue-range called InTheBathroom"). Linking out of a cue-range would 
> > be establishing a click-through URL that would be dispatched directly 
> > if the user clicked on the media during that range (dispatched without 
> > script).
> 
> I believe in these use cases, too.

My reply in:

   http://www.mail-archive.com/whatwg@lists.whatwg.org/msg10469.html

...still applies.


> It is possible to jump to a cue range through its number in the list in 
> the media element using JavaScript and setting the @currentTime to that 
> cue range's start time. However, it has not yet been defined whether 
> there is a relationship between media fragment URIs and timed tracks. 
> The media fragment URI specification has such URIs defined as e.g. 
> http://example.com/video.ogv#id="InTheBathroom" and cues have a textual 
> identifier, so we can put these two together to enable this. Such URIs 
> will then be able to be used in the @src attribute of a media element 
> and focus the view on that cue, just like temporal media fragments do 
> with a random time range.

I'm not sure I follow. Presumably this is all for in-band timed tracks, in 
which case the HTML spec isn't really involved.


> For linking out of a cue, there is a need to allow having hyperlinks in 
> cues. IIUC this is currently only possible by using a HTML-style markup 
> in the cue, declaring the cue as kind=metadata and calling 
> getCueAsSource() on the cue, then running your own overlays and shoving 
> the retrieved text to the innerHTML of that overlay.

Having a hyperlink in a cue seems like really bad UI (having any temporal 
interactive UI is typically highly inaccessible, and is generally only 
considered a good idea in games). If you want to make the whole video 
into a link (as Dave suggested in the e-mail above, if I understood it 
correctly) then you don't need anything to do with timed tracks.


> While that works, it seems like a lot of hoops to jump through just to 
> be able to use a bit of HTML markup - in particular having to run your 
> own overlay. Could we introduce a kind=htmlfragment type where it is 
> obvious that the text is HTML and that the fragment parser can be run 
> automatically and display it through the given display mechanisms?

I would on the contrary think that that would be something we should 
_discourage_, not encourage!


> Many existing subtitle formats and similar media-time-aligned text 
> formats contain file-wide name-value pairs that explain metadata for the 
> complete resource. An example are Lyrics files, e.g.
> 
> On Tue, 20 Apr 2010, Silvia Pfeiffer wrote:
> >
> > Lyrics (LRC) files typically look like this:
> >
> > [ti:Can't Buy Me Love]
> > [ar:Beatles, The]
> > [au:Lennon & McCartney]
> > [al:Beatles 1 - 27 #1 Singles]
> > [by:Wooden Ghost]
> > [re:A2 Media Player V2.2 lrc format]
> > [ve:V2.20]
> > [00:00.45]Can't <00:00.75>buy <00:00.95>me <00:01.40>love,
> > <00:02.60>love<00:03.30>, <00:03.95>love, <00:05.30>love<00:05.60>
> > [00:05.70]<00:05.90>Can't <00:06.20>buy <00:06.40>me <00:06.70>love,
> > <00:08.00>love<00:08.90>
> 
> You can see that there are title, artist, author, album, related 
> content, version and similar metadata information headers on this file. 
> Other examples contain copyright information and usage rights - 
> important information to understand and deal with when distributing 
> media-time-aligned text files on a medium such as the Web.

I don't really see why we would want to embed this in a timed track. Even 
in HTML, embedding this kind of information has never taken off. We would 
need to have very compelling use cases, implementation experience, and 
implementation commitments to move in such a direction, IMHO.


> I would think it'd be good to define a standard means of extracting 
> plain text out of any type of cue, so it will be possible to hand this 
> to e.g. the accessibility API for reading back.

Getting the raw data is already possible, unless I misunderstood what you 
meant.


> I would actually like to see an interface where the chapter makers can 
> be used for navigation through the media resource, e.g. as you are 
> playing back the media file, you can press SHIFT-rightarrow and 
> SHIFT-leftarrow to navigate back and forth within a track (in 
> particularly within a chapter track). This is particularly important for 
> blind users.

That's up to the browsers.


> > In WebSRT, this would be:
> >
> >  10:00.000 --> 20:00.000
> >  { title: "Chapter 2", description: "Some blah relating to chapter 2", image: "/images/chapter2.png" }
> >
> >  20:00.000 --> 30:00.000
> >  { title: "Chapter 3", description: "Chapter 3 blah", image: "/images/chapter3.png" }
> >
> > (Here I'm assuming that you want to store the data as JSON. For 
> > kind=metadata files, you can put anything you want in the cue so long 
> > as you don't have a blank line in there.)
> 
> I think it is a powerful idea to have a track kind that allows for 
> everything. This provides a platform to put absolutely anything into a 
> time-aligned form for a media resource. The standardisation aspect about 
> it is the means in which the association between the data and the media 
> resource happens, such that at least the cues can be extracted in a 
> standard manner. However, it opens up an issue about parsing and 
> display.
> 
> What would be displayed for such a JSON markup in an overlay?

Nothing. It's for script.


> Also, the parser for the cue data in the case of kind=metadata would not 
> be part of what the browser offers, so somebody using this approach 
> would need to provide their own JSON parser for the data before they can 
> do anything useful with it. Is there a plan to offer existing parser 
> functionality of the Web browser (e.g. RSS parsing, or Firefox's native 
> JSON parser, or the HTML fragment parser) to the user for this kind of 
> data in some way?

That seems unrelated to timed tracks. JSON, HTML, and XML parsing are 
already available to scripts.
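
For example, with the JSON chapter cues from the earlier example, a script
might do something like this (a sketch; showChapterCard() is a
hypothetical page function):

   var track = video.tracks[0]; // assuming track 0 is the metadata track
   for (var i = 0; i < track.cues.length; i += 1) {
     track.cues[i].onenter = function (event) {
       var data = JSON.parse(event.target.getCueAsSource());
       showChapterCard(data.title, data.description, data.image);
     };
   }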


> Since the @mode IDL attribute of an individual TimedTrack can take on 
> the value "showing" for several tracks at a time and all tracks of kind 
> "subtitle" or "caption" will be displayed, it is possible that multiple 
> TimedTracks are displaying cues at the same time. The display mechanism 
> at 14.3.2.1 deals with this, which is really cool. However, I wonder if 
> there is a limit to the number of tracks we want to allow rendering for 
> at the same time

The limit currently is as many as can fit.


> > Currently the files have to be same-origin. My plan is to wait for 
> > CORS to be well established and then use it for timed tracks, video 
> > files, images on <canvas>, text/event-stream resources, etc.
> 
> I would indeed like to see the possibility to re-use tracks from other 
> locations, such that e.g. a video can be published by one site, but 
> another site provides all the subtitles for it.

That is clearly a needed feature. As noted above, as soon as CORS is well 
established, I plan to use it in a number of places in HTML.


> I think it's untenable that we can only render TimedTracks on top of the 
> video viewport (see 
> http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0). 
> There is no means of rendering for audio and no means of rendering 
> outside the video element.

What's the use case?


> The rendering and CSS styling approach with ::cue described in 
> http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0 
> is only defined on WebSRT. That means that there is no styling possible 
> for TimedTracks that come from a different format (assuming we may allow 
> other formats in future).

Styling such formats would be quite possible, it just has to be defined.


> Also, it implies that there is no styling possible for in-band 
> TimedTracks and for MutableTimedTracks.

For in-band timed tracks, the styling is whatever is defined for that 
format of track. For MutableTimedTracks, it's the same as WebSRT.


> I think this is a bit restrictive and would rather we define a mechanism 
> to allow CSS styling of cues that come from any type of TimedTrack, and 
> thus make the CSS styling part independent of the format.

I don't know how to do that.


> Also, the actual CSS properties that are allowed are very restrictive

Yes, I welcome implementation feedback on this so that the list can be 
extended. I just put in some basic properties for now.


> IMO that defeats the reason for using CSS. The argument that all of CSS, 
> including future extensions, will be available to TimedTracks is only 
> half-true: the use of CSS is restricted to the given list here, so it's 
> not making use of all of CSS and its not automatically extensible. I 
> think that's a poor use of the opportunity that CSS poses.

Well, we can't make certain things available (like 'float') without 
significantly complicating the model (arguably without fatally 
complicating the model), so clearly (IMHO) we need some limits. However, 
I'm very happy to keep this list updated over time.


On Tue, 27 Jul 2010, Silvia Pfeiffer wrote:
>
> The @kind attribute is currently serving several purposes. This may be 
> ok, but we need to be aware of it and maybe include a note in its 
> description about it.
> 
> Firstly, the @kind attribute describes semantically what the track is: a 
> subtitle, caption, textual description, chapters or "metadata" (i.e. 
> "anything") track.
> 
> Secondly, the @kind attribute implies whether the track will be 
> displayed: subtitle and caption are rendered (right now just for the 
> video viewport, but I am hoping we can make this more general), chapters 
> are probably rendered with the controls, and textual descriptions and 
> "metadata" are not rendered.
>
> Thirdly, the @kind attribute implies what parser will be used on the 
> cues: subtitle and caption cues are parsed as simple markup, 
> chapters are parsed as just plain text stripped of any markup (so speech 
> synthesizers and braille devices can deal with it), and "metadata" is 
> parsed as arbitrary data for script use only.

These all seem like different facets of the same thing. It's like saying 
that the element name (<ol>, <textarea>, <pre>) affects the semantic, the 
rendering, and the parsing. Or similarly with MIME types.


> I think by understanding this and by making this explicit in the spec, 
> we can more clearly decide what track kinds are still missing and also 
> what we actually need to implement.

I'm not sure what to add to make this clearer. Can you elaborate?


On Sat, 7 Aug 2010, Silvia Pfeiffer wrote:
> 
> I think there's a typo in the description of the TimedTrack mode at 
> http://www.whatwg.org/specs/web-apps/current-work/complete/video.html#timed-track-mode. 
> It says:
> 
> Hidden
> 
> Indicates that the timed track is active, but that the user agent is not 
> actively displaying the cues. If no attempt has yet been made to obtain 
> the track's cues, the user will perform such an attempt momentarily. The 
> user agent is maintaining a list of which cues are active, and events 
> are being fired accordingly.
> 
> But I think it should be "the user *agent* will perform such an attempt 
> momentarily."

Fixed.


On Tue, 27 Jul 2010, Sam Dutton wrote:
> >
> > The addCueRange() API has been removed and replaced with a feature 
> > based on the subtitle mechanism.
>
> I'm not sure what this means -- are you referring to timed track cues?

addCueRange() was an old API in the spec. You can do the same things with 
the new MutableTimedTrack API.
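
For example, an old addCueRange() call maps onto something like this
sketch (constructor argument order as per the current draft):

   var track = video.addTrack("metadata", "ranges", "");
   var cue = new TimedTrackCue("InTheBathroom", 60.0, 120.0, "", "", false);
   cue.onenter = function () { /* the range was entered */ };
   cue.onexit = function () { /* the range was exited */ };
   track.addCue(cue);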


> Couple of minor queries:
> * 'time track' is referred to a couple of times in the spec -- it's not 
> clear why this is used instead of 'timed track'

Fixed.


> * 'the WebSRT file must WebSRT file using cue text' -- I guess this 
> should be 'the WebSRT file must be a WebSRT file using cue text'

Fixed.


> Also -- is trackgroup out of the spec?

What is trackgroup?


On Fri, 6 Aug 2010, Silvia Pfeiffer wrote:
> 
> Note that the subtitling community has traditionally been using the 
> Subrip (srt) or SubViewer (sub) formats as a simple format and 
> SubStation alpha (ssa/ass) as the comprehensive format. Aegisub, the 
> successor of SubStation Alpha, is still the most popular subtitling 
> software and ASS is the currently dominant format. However, even this 
> community is right now developing a new format called AS6. This shows 
> that the subtitling community also hasn't really converged on a "best" 
> format yet.

Also it's worth noting that the SubStation Alpha formats are very 
presentational in nature, and do not follow the HTML school of semantic 
language design at all. That is the main reason I didn't use those formats 
for HTML <video> captions.


> So, given this background and the particular needs that we have with 
> implementing support for a time-synchronized text format in the Web 
> context, it would probably be best to start a new format from a clean 
> slate rather than building it on an existing format.

I don't follow your reasoning here. As you said, SRT is a common subset of 
most of the formats you listed; why would the conclusion not be that we 
should therefore work with SRT?


> In contrast to being flexible about what goes into the cues, WebSRT is 
> completely restrictive and non-extensible in all the content that is 
> outside the cues. In fact, no content other than comments are allowed 
> outside the cues.

None is allowed today, but it would be relatively straightforward to 
introduce metadata before the cues (or even in between the cues). For 
example, we could add defaults:

   *
   DEFAULTS
   L:-1 T:50% A:middle

   00:00:20,000 --> 00:00:24,400
   Altocumulus clouds occur between six thousand

   00:00:24,600 --> 00:00:27,800 
   and twenty thousand feet above ground level.

We could add metadata (here using a different syntax that is similarly 
backwards-compatible with what the spec parser does today):

   @charset --> win-1252
   @language --> en-US

   00:00:20,000 --> 00:00:24,400
   Altocumulus clouds occur between six thousand

   00:00:24,600 --> 00:00:27,800 
   and twenty thousand feet above ground level.

There are a variety of syntaxes we could use. So long as whatever we do is 
backwards compatible with what the first set of deployed parsers do, we're 
fine.
   
Currently comments aren't allowed, but we could add those too (e.g. by 
saying that any block of text that doesn't contain a "-->" is a comment).
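
For example, under that rule this would parse as one comment block
followed by one cue:

   This block contains no arrow, so a parser following
   that rule would skip it as a comment.

   00:00:20,000 --> 00:00:24,400
   Altocumulus clouds occur between six thousand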


> * there is no possibility to add file-wide metadata to WebSRT; things 
> about authoring and usage rights as well as information about the media 
> resource that the file relates to should be kept within the file. Almost 
> all subtitle and caption formats have the possibility for such metadata

This is something we could add if there is a clear use case, but I'm not 
sure that there is. Why does SRT not have it today?

 
> and we know from image, music and video resources how important it is to 
> have the ability to keep such metadata inside the resource.

Do we? I thought from image, music, and video we learnt that it didn't 
make much difference! :-)


> * there is no language specification for a WebSRT resource; while this 
> will not be a problem when used in conjunction with a <track> element, 
> it still is a problem when the resource is used just by itself, in 
> particular as a hint for font selection and speech synthesis.

I didn't find many formats with a language specifier; is it really that 
much of a problem? Again, we can add it if it turns out to be a problem, 
but since our main concern here is the Web, I don't see much point adding 
this complexity to SRT if it isn't complexity that the SRT community 
needs.


> * there is no style sheet association for a WebSRT resource; this can be 
> resolved by having the style sheet linked into the Web page where the 
> resource is used with the video, but that's not possible when the 
> resource is used by itself. It needs something like a <link> to a CSS 
> resource inside the WebSRT file.

Do standalone SRT players want to support CSS? If not, it doesn't much 
matter.


> * there is no magic identifier for a WebSRT resource, i.e. what the 
> <wmml> element is for WMML. This makes it almost impossible to create a 
> program to tell what file type this is, in particular since we have made 
> the line numbers optional. We could use "-->" as an indicator, but it's 
> not a good signature.

Yeah, that's a problem. I considered adding "WEBSRT" at the start of every 
file but we couldn't use it reliably since WebSRT parsers presumably want 
to support SRT using the same parser, and that has no signature.

(Note that XML, and anything based on XML, as well as HTML, JS, and CSS, 
have no signature either. It's a common problem of text formats.)


> * there is no means to identify which parser is required in the cues (is 
> it "plain text", "minimal markup", or "anything"?) and therefore it is 
> not possible for an application to know how it should parse the cues.

Timed track cues are not context-free. In standalone players, the user 
says to play a particular cue file, so using the "cue text" mode is a good 
assumption (why would you give mplayer a metadata cue file to display?). 
Browsers have the <track> context.


> * there is no version number on the format, thus it will be difficult to 
> introduce future changes.

Version numbers are an antipattern in multivendor formats. This is an 
intentional feature, not an unfortunate omission. HTML itself has dropped 
the version number in its format; CSS has never had one. Most programming 
languages don't have one.


> I can understand that the definition of WebSRT took inspiration from SRT 
> for creating a simple format. But realistically most SRT files will not 
> be conformant WebSRT files because they are not written in UTF-8. 

I don't think they need to be conforming. They're already published. 
Conformance is just a quality-assurance tool; it's only relevant for 
documents being written in the future.


> Further, realistically, all WebSRT files that use more than just the 
> plain text markup are not conformant SRT files.

What's a "conformant SRT file"?


> So, let's stop pretending there is compatibility and just call WebSRT a 
> new format.

Compatibility has nothing to do with conformance. It has to do with what 
user agents do. As far as I can tell, WebSRT is backwards-compatible with 
legacy SRT user agents, and legacy SRT files are compatible with WebSRT 
user agents as described by the spec.


> In fact, the subtitling community itself has already expressed their 
> objections to building an extension of SRT, see 
> http://forum.doom9.org/showthread.php?p=1396576 , so we shouldn't try to 
> enforce something that those for whom it was done don't want.

The subtitling community in question is looking for a presentational 
format. I think it is very reasonable to say that SRT is not interesting 
for that purpose. However, a presentational format isn't, as far as I can 
tell, suitable for the Web.


> * the mime type of WebSRT resources should be a different mime type to 
> SRT files, since they are so fundamentally different; e.g. text/websrt

That's what I originally suggested, and you said we should use text/srt 
because it is what people use, even though it's not registered. I think 
you were right; it makes no sense to invent a new MIME type here.


> * the file extension of WebSRT resources should be different from SRT 
> files, e.g. wsrt

Extensions are irrelevant on the Web. People can use whatever extension 
they want.


> Right now, there is "plain text", "minimum markup" and "anything" 
> allowed in the cues.

As far as I can tell there's just two modes -- plain text and text with 
WebSRT markup.


> Seeing as WebSRT is built with the particular purpose of bringing 
> time-synchronized text for HTML5 media elements, it makes no sense to 
> exclude all the capabilities of HTML.

I would on the contrary say that it makes no sense to take on all the HTML 
baggage when all we want to do is introduce subtitles to video. :-)


> Also, with all the typical parsers and renderers available in UAs, 
> support of innerHTML in cues should be simple to implement.

Nothing is ever simple when it involves an HTML parser.


> The argument that offline applications don't support it is not relevant 
> since we have no influence on whether standalone media applications will 
> actually follow the HTML5 format choice.

Standalone video players almost certainly won't want to embed a Web 
browser, sure. With WebSRT as currently designed, you can target 
standalone players without them having to change at all, and they can 
adopt the new features with minimum effort. We might even be able to bring 
a greater level of interoperability to the standalone media apps, if we're 
lucky (it would be purely luck of course; that isn't a goal).


> That WebSRT with "plain text" and "minimal markup" can be supported 
> easily in standalone media applications is a positive side effect, but 
> not an aim in itself for HTML5 and it should have no influence on our 
> choices.

It should have influence, but maybe not much.


> In the current form, WebSRT only makes limited use of existing CSS. I 
> see particularly the following limitations:
> 
> * no use of the positioning functionality is made and instead a new 
> means of positioning is introduced; it would be nicer to just have this 
> reuse CSS functionality. It would also avoid having to repeat the 
> positioning information on every single cue.

It doesn't make sense to position cues with CSS, because the position of 
cues is an intrinsic part of the cue's semantics. Where a cue appears can 
change the plot of a show, for example (was it the evil twin who said 
something or the good twin?).


> * little use of formatting functionality is made by restricting it to 
> only use 'color', 'text-shadow', 'text-outline', 'background', 'outline' 
> and 'font'

The restrictions are mostly artificial and should be extended once we 
better understand the constraints here.


> * cue-related metadata ("voice") could be made more generic; why not 
> reuse "class"?

I don't know what this means. What is "class" and how does it differ from 
"voice"?


> * there is no definition of the "canvas" dimensions that the cues are 
> prepared for (width/height) and expected to work with other than saying 
> it is the video dimensions - but these can change and the proportions 
> should be changed with that

I don't understand. It's all defined in terms of percentages of the video 
dimensions.


> * it is not possible to associate CSS styles with segments of text, but 
> only with a whole cue using ::cue-part; it's thus not possible to just 
> highlight a single word in a cue

It is (just use <b> to highlight the word). It's not possible to style 
lots of parts differently, though, because there are no attribute 
analogues on the tags in WebSRT currently.
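
For example, marking up the word in the cue:

   00:00:01,000 --> 00:00:02,000
   I like <b>blue</b> words.

...and then styling it with a rule along these lines (the exact
::cue-part() selector syntax being as per the current draft):

   ::cue-part(b) { color: blue; }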


> * when HTML markup is used in cues, as the specification stands, that 
> markup is not parsed and therefore cannot be associated with CSS; again, 
> this can be fixed by making innerHTML in cues valid

It also doesn't support styling XML in cues, or rendering SVG, or XSL:FO... I 
don't see why that's a problem.


> * I noticed that it is not possible to make a language association with 
> segments of text and thus it is not possible to have text with mixed 
> languages.

Are mixed language subtitles common? I don't know that I've ever seen 
that.


> * Is it possible to reuse the HTML font systems?

What is the HTML font system?


> Having proposed a xml-based format, it would be good to understand 
> reasons for why it is not a good idea and why a plain text format that 
> has no structure other than that provided through newlines and start/end 
> time should be better and more extensible.

I don't understand what you mean by structure.

XML in general is a terrible authoring format. I don't see why we'd want 
to reuse XML for captions.


On Fri, 6 Aug 2010, Philip Jägenstedt wrote:
> 
> I really like the idea of letting everything before the first timestamp 
> in WebSRT be interpreted as the header. I'd want to use it like this:
> 
> # author: Fan Subber
> # voices: <1> Boy
> #         <2> Girl
> 
> 01:23:45.678 --> 01:23:46.789
> <1> Hello
> 
> 01:23:48.910 --> 01:23:49.101
> <2> Hello
> 
> It's not critical that the format of the header be machine-readable, but 
> we could of course make up a key-value syntax, use JSON, or something 
> else.

We could put blocks like that anywhere we need to in a future version, so 
long as we design the format of such blocks such that they don't conflict 
with what the parser does today.


On Fri, 6 Aug 2010, Philip Jägenstedt wrote:
> 
> I'm not particularly fond of the current voice markup, mainly for 2 
> reasons:
> 
> First, a cue can only have 1 voice, which makes it impossible to style 
> cues spoken/sung simultaneously by 2 or more voices. There's a karaoke 
> example of this in 
> <http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_voices>

That's just two cues.


> I would prefer if voices could be mixed, as such:
> 
> 00:01.000 --> 00:02.000
> <1> Speaker 1
> 
> 00:03.000 --> 00:04.000
> <2> Speaker 2
> 
> 00:05.000 --> 00:06.000
> <1><2> Speaker 1+2

What's the use case?


> Second, it makes it impossible to target a smaller part of the cue for 
> styling. We have <i> and <b>, but there are also cases where part of the 
> cue should be in a different color, see 
> <http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_colors>

Well you can always restyle <i> or <b>.


> If one allows multiple voices, it's not hard to predict that people will 
> start using magic numbers just to work around this, which would both be 
> wrong semantically and ugly to look at:
> 
> 00:01.000 --> 00:02.000
> <1> I like <1234>blue</1234> words.
> 
> They'd then target 1234 with CSS to color it blue.
> 
> I'm not sure of the best solution. I'd quite like the ability to use 
> arbitrary voices, e.g. to use the names/initials of the speaker rather 
> than a number, or to use e.g. <shouting> in combination with CSS :before 
> { content 'Shouting: ' } or similar to adapt the display for different 
> audiences (accessibility, basically).

Yeah, there are some difficult-to-satisfy constraints here. On the one 
hand having a predefined set of voices leads to better semantics, 
usability for authors, and accessibility; on the other hand we need 
something open-ended because we can't think of everything. We also have to 
make sure we don't enable voices to conflict with future tag names, so 
whatever we do that's open-ended would have to use a specific syntax (like 
being all numbers, which is what I currently have). I'm not sure how to 
improve on what we have now, but it's certainly not perfect.


On Wed, 11 Aug 2010, Philip Jägenstedt wrote:
> 
> What should numerical voices be replaced with? Personally I'd much 
> rather write <philip> and <silvia> to mark up a conversation between us 
> two, as I think it'd be quite hard to keep track of the numbers if 
> editing subtitles with many different speakers.

We could say that a custom voice has to start with some punctuation or 
other, say <:philip>?
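
E.g.:

   01:23:45.678 --> 01:23:46.789
   <:philip> Hello

   01:23:48.910 --> 01:23:49.101
   <:silvia> Hello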


On Wed, 11 Aug 2010, Silvia Pfeiffer wrote:
> 
> In HTML it is <span class="philip">..</span> and <span 
> class="silvia">...</span>. I don't see anything wrong with that. And 
> it's only marginally longer than <philip> ... </philip> and 
> <silvia>...</silvia>.

It's quite a lot more verbose than what the spec has now... (just "<1>".)


On Thu, 12 Aug 2010, Philip Jägenstedt wrote:
> 
> The core "problem" is that WebSRT is far too compatible with existing 
> SRT usage. Regardless of the file extension and MIME type used, it's 
> quite improbable that anyone will have different parsers for the same 
> format. Once media players have been forced to handle the extra markup 
> in WebSRT (e.g. by ignoring it, as many already do) the two formats will 
> be the same, and using WebSRT markup in .srt files will just work, so 
> that's what people will do. We may avoid being seen as arrogant 
> format-hijackers, but the end result is two extensions and two different 
> MIME types that mean exactly the same thing.

I think we'll look equally arrogant if we ignore years of experience with 
subtitling formats and just make up an entirely new format. It's not like 
the world is short of subtitling formats.


On Wed, 18 Aug 2010, Silvia Pfeiffer wrote:
>
> It actually boils down to the question: do we want the simple SRT format 
> to survive as its own format and be something that people can rely upon 
> as not having "weird stuff" in it - or do we not. I believe that it's 
> important that it survives.

Does that format still exist? Is it materially different than WebSRT?


On Sat, 21 Aug 2010, Silvia Pfeiffer wrote:
> 
> It's not just about implementation cost - it's also the problem of 
> maintaining another spec that can grow to have eventually all the 
> features that HTML5 has and more. Do you really eventually want to 
> re-spec and re-implement a whole innerHTML parser plus the extra <t> 
> element when we start putting <svg> and <canvas> and all sorts of other 
> more complex HTML features into captions? Just because the <t> element 
> is making trouble now? Is this really the time to re-invent HTML?

No, it's not. We should never let subtitles get that crazy.


On Mon, 23 Aug 2010, Philip Jägenstedt wrote:
> 
> I don't expect that SVG, <canvas>, images, etc will ever natively be 
> made part of captions. Rather, I would hope that the metadata state 
> together with scripts is used. If we think that e.g. images in captions 
> are an important use case, then WebSRT is not a good solution.

Indeed.


> If we allow arbitrary HTML and expect browsers to handle it well, it 
> adds some complexity. For example, any videos and images in the cue 
> would have to be fully loaded and ready to be decoded by the time the 
> cue is to be shown, which I really don't want to implement the logic 
> for. Simply having an iframe-like container where the document is 
> replaced for each cue wouldn't be enough, rather one would have to 
> create one document per cue during parsing and wait for all of those to 
> finish loading before beginning playback. I'm not sure, but I'm guessing 
> that amounts to significant memory overhead.

Quite.


> As an aside, I personally see it as a good thing that <font> *doesn't* 
> work in WebSRT, whereas it would using an HTML parser.

Agreed!


> Deployed SRT uses <i>, <b>, <font> and <u>. WebSRT adds <ruby>, <rt> and 
> <1>...<infinity>, extensions which are very much in line with the 
> existing format and already "works" in many players (in the sense that 
> they are ignored, not rendered). I wouldn't call that a huge mess.

Yes.


On Tue, 24 Aug 2010, Silvia Pfeiffer wrote:
> 
> I believe [SVG etc] will be [added to WebSRT]. But since we are only 
> looking at the ways in which captions and subtitles are used currently, 
> we haven't accepted this as an important use case, which is fair enough. 
> I am considering likely future use though, which is always hard to 
> argue.

In all my research for subtitles, I found very few cases of anything like 
this. Even DVDs, whose subtitle tracks are just hardcoded bitmap images, 
don't do anything fancy with them... just plain text and italics, 
generally. Why haven't people started doing fancy stuff with subtitles in 
all the years that we've had TVs? It's not like they can't do it.

My guess is that the real reason is that when you get so fancy that you're 
including graphics and the like, you're no longer doing timed tracks, 
you're just doing content, and the right thing to do is to either burn it 
in, or consider it a separate construct animated on top of the video, e.g. 
an <svg:video> and SMIL.


> It is not at all similar to HTML4 and HTML5. A Web browser cannot 
> suddenly stop working for a Web page, just because it has some extra 
> functionality in it. Thus, the HTML format has been developed such that 
> it can be extended without breaking existing stuff. We can guarantee 
> that no browser will break because that is the way in which the format 
> has been specified.

It's the way it's specified now, but it wasn't before.


> No such thing has happened for SRT and there is simply no way to 
> guarantee that all new WebSRT files will work in all existing SRT 
> software, because SRT has not been specified as a extensible format and 
> because there is no agreement between all parties that have implemented 
> SRT support as to how extensions should be made.

There's almost as much agreement as with HTML4, as far as I can tell. 
Maybe a little less, but the format is so much simpler that it doesn't 
matter as much.


On Tue, 24 Aug 2010, Philip Jägenstedt wrote:
> 
> Here's the SRT research I promised: 
> http://blog.foolip.org/2010/08/20/srt-research/

Awesome! Thanks for this.

Addressing points in the same order:

 - charset: resolved by introducing a charset override.

 - blank lines not separating cues: I couldn't find a client that 
   supported omitting the blank line, so I didn't support that. It's a 
   small number of files, and a small number of cues within those files, 
   I presume, so I'm not too worried.

 - overlapping cues: supporting these is pretty important, so files with 
   overlapping cues will just have some weird artefacts on playback.

The remaining data is interesting but seems to be consistent with our 
expectations before WebSRT was specced.


On Wed, 25 Aug 2010, Silvia Pfeiffer wrote:
> 
> Yeah, I'm totally for adding a hint as to what format is in the cue. 
> Then, a WebSRT file can be identified as to what it contains.

Can't it be identified just by looking? Or looking at its name? I don't 
really understand what problem this is solving. It's not like people are 
loading up random SRT files and seeing what they are; specific SRT files 
are sought out for use with specific videos.


> [...] I think logically text/websrt makes more sense with a .wsrt 
> extension. Then SRT files can also be served as text/websrt to allow 
> them to take part in the WebSRT infrastructure, if indeed they will 
> continue to be valid WebSRT files.

I don't understand the problem with text/srt. Why should we invent our own 
type? People already use text/srt. It smacks of "not invented here" to 
start making up our own types. You convinced me of this. :-)


> Incidentally, [is it] a problem if WebSRT files are served as 
> text/plain, i.e. will the browser not identify them as subtitle files?

On Wed, 25 Aug 2010, Philip Jägenstedt wrote:
> 
> "The tasks queued by the fetching algorithm on the networking task 
> source to process the data as it is being fetched must examine the 
> resource's Content Type metadata, once it is available, if it ever is. 
> If no Content Type metadata is ever available, or if the type is not 
> recognised as a timed track format, then the resource's format must be 
> assumed to be unsupported (this causes the load to fail, as described 
> below)."
> 
> In other words, browsers should have a whitelist of supported text track 
> formats, just like they should for audio and video formats. (Note though 
> that Safari and Chrome ignore the MIME type for audio/video and will 
> likely continue to do so.)
> 
> It seems to me that a side-effect of this is that it will be impossible to 
> test <track> on a local file system, as there's no MIME type and 
> browsers aren't allowed to sniff. Surely this can't be the intention, 
> Hixie?

Local file systems generally use extensions to declare file types (at 
least, on Windows and Mac OS X).
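
(Over the wire, getting the type right is typically a one-line server 
configuration anyway; a minimal sketch, assuming Apache's AddType 
directive and the text/srt type discussed above, with placeholder file 
names:)

  # Apache httpd.conf or .htaccess: map the extension to the type
  AddType text/srt .srt

  <!-- then reference the file from a track element as usual -->
  <video src="video.webm" controls>
    <track kind="subtitles" srclang="en" src="subtitles.srt">
  </video>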


On Thu, 26 Aug 2010, Chris Double wrote:
> 
> Firefox (in the case of video) uses file extensions to identify video
> files. We have an internal mapping of file extensions to MIME types. We
> don't sniff the content. I imagine we'd do the same with whatever file
> extension is used for WebSRT.

(I assume this is only for the filesystem, not data from the wire!)


On Wed, 25 Aug 2010, Silvia Pfeiffer wrote:
>
> Yes, I have no problem with that. Though I believe we have overloaded 
> @kind with too much meaning as I already mentioned earlier. I think it 
> would make more sense to pull the different dimensions into different 
> attributes:
>
> - @type or @format for the format of the cue
>
> - @kind for the semantic meaning of it (subtitle, caption, karaoke etc) 
> - one track could even satisfy several needs, so this would be a list of 
> kinds
>
> - and finally the visual rendering problem, which could possibly be 
> solved by providing a link to a div or p where the data should be 
> rendered as an alternative to the default. Right now, audio and metadata 
> tracks get no rendering at all and I see that as a problem.

I don't understand. What combinations do you think make sense that aren't 
already supported by the few kind="" values? We don't want lots of 
meaningless combinations; that would just make authoring harder and make 
it more likely that there'll be bogus content and implementation bugs in 
edge cases, which we'd all end up having to copy, etc.
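
(For concreteness, here's how the existing kind="" values already cover 
those dimensions in markup; a sketch with made-up file names:)

  <video src="video.webm" controls>
    <track kind="subtitles"    srclang="fr" src="fr.srt">
    <track kind="captions"     srclang="en" src="en-cc.srt">
    <track kind="descriptions" srclang="en" src="en-desc.srt">
    <track kind="chapters"     srclang="en" src="chapters.srt">
    <track kind="metadata"                  src="data.srt">
  </video>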


On Wed, 25 Aug 2010, Philip Jägenstedt wrote:
> 
> The main reason to care about the MIME type is some kind of "doing the 
> right thing" by not letting people get away with misconfigured servers. 
> Sometimes I feel it's just a waste of everyone's time, though; it would 
> generally be less work for both browsers and authors not to bother.

Agreed. Not sure what to do for WebSRT though, since there's no good way 
to recognise a WebSRT file as opposed to some other format.


On Thu, 26 Aug 2010, Silvia Pfeiffer wrote:
> 
> You misunderstand my intent. I am by no means suggesting that no WebSRT 
> content is treated as SRT by any application. All I am asking for is a 
> different file extension and a different mime type and possibly a magic 
> identifier such that *authoring* applications (and authors) can clearly 
> designate this to be a different format, in particular if they include 
> new features.

Wouldn't an authoring application just have two (or more) different "save 
as" or "export" format options? "Save as SRT with no formatting", "Save as 
SRT with <b> only", "Save as WebSRT", or whatnot. Or a list of checkboxes 
for standalone user agents to be compatible with, so that the application 
can pick the common subset.


> Then a *playback application* has the chance to identify them as a 
> different format and provide a specific parser for it, instead of 
> failing like Totem. They can also decide to extend their existing SRT 
> parser to support both WebSRT and SRT. And I also have no issue with a 
> user deciding to give a WebSRT file a go by renaming it to .srt.

I think you think there's more difference between WebSRT and SRT than 
there is. In practice, there is less difference between WebSRT and the 
equivalent SRT file than there is between two random SRT files today. The 
difference between WebSRT and SRT is well within the "error bars" of what 
SRT is today.


> By keeping WebSRT and SRT as different formats we give the applications 
> a choice to support either, or both in the same parser. If we don't, we 
> force them to deal in a single parser with all the oddities of SRT 
> formats as well as all the extra features and all the extensibility of 
> WebSRT.

I don't understand what the difference would be.


On Thu, 26 Aug 2010, Henri Sivonen wrote:
> 
> Why wouldn't it always be a superior solution for all parties to do the 
> following:
>
>  1) Make sure WebSRT never requires processing that'd result in rendering 
> a substantial body of legacy .srt content in a broken way. (This would 
> require supporting non-UTF-8 encodings by sniffing as well as supporting 
> <font> and <u>, which would happen "for free" if my innerHTML proposal 
> were adopted.)
>
>  2) Make playback software that supports WebSRT only have a WebSRT code 
> path and use that code path for legacy .srt content as well.
>
> ?

I agree that that would be simplest. I disagree that you'd have to support 
<font> and <u> to do that; I don't think losing colour or underlining is 
"breaking", and it apparently affects less than 5% of files. The encoding 
thing is basically resolved for Web browsers; I don't know that we can do 
much to resolve it for standalone players with legacy SRT files.


On Tue, 24 Aug 2010, Henri Sivonen wrote:
> 
> I'm rather unconvinced by the voice markup as well. As far as I can 
> tell, the voice markup is syntactic sugar for class for practical 
> purposes. (I don't give much value to arguments that voices are more 
> semantic than classes if the practical purpose is to achieve visual 
> effects for caption rendering.) Common translation subtitling works just 
> fine without voice identification and (based on information in this 
> thread) the original .srt doesn't have voices.

About 3% of SRT files effectively have voices, through the use of <font>. 
If you include <i> in SRT files as equivalent to a voice with some special 
purpose like "narrator", then it's more like 50%.

I think <narrator> is a better way of doing it than <i>, personally.
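
To illustrate, here's a sketch of the two approaches in cue text, using 
the voice syntax under discussion (the timings and text are made up):

  1
  00:00:01,000 --> 00:00:04,000
  <i>Previously, on our show...

  2
  00:00:01,000 --> 00:00:04,000
  <narrator>Previously, on our show...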


> If voices are really needed for captioning use cases, I think it makes 
> sense to balance the rarity of that need within the captioning sphere 
> with the complexity of introducing syntactic sugar over the class 
> attribute and the class selector.

It's not syntactic sugar, since it's the only way. It's just a different 
syntax for a similar concept (indeed, just like element names are a 
different syntax for a similar concept!).


On Wed, 25 Aug 2010, Silvia Pfeiffer wrote:
> 
> How would the Web browser or in fact any parsing application know what 
> to do with the cues? This is actually a question for WebSRT. Unless 
> there is a hint as to how to parse the stuff in the cue, it would need 
> to do something like "content sniffing" to find out if it's "JSON" or 
> "plain text" or "minimal markup". Right now, the hint for how to parse 
> the cue in WebSRT comes from the track @kind attribute. That is not 
> helpful for a stand-alone application.

Standalone applications get told which tracks to display, and always 
display them as subtitles/captions, so this point is moot as far as I can 
tell.
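
(On the browser side, for completeness, script access to a cue would 
look something like this; a rough sketch assuming the timed track API as 
currently drafted, including getCueAsSource(), where the track index and 
the JSON payload are made up:)

  var track = video.tracks[0];                  // assume a kind=metadata track
  var cue   = track.activeCues[0];              // whichever cue is active now
  var data  = JSON.parse(cue.getCueAsSource()); // raw cue text, here JSON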


> [...] You're also excluding roll-on captions then, which is a feature of 
> live broadcasting.

It isn't clear to me that an external file would be a good solution for 
live broadcasting, so I'm not sure this really matters.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

