[whatwg] Timed tracks for <video>

Ian Hickson ian at hixie.ch
Thu Jul 22 22:40:57 PDT 2010


I recently added to the HTML spec a mechanism by which external subtitles 
and captions can be added to videos in HTML.

In designing this feature I went through hundreds and hundreds of e-mails, 
blogs, proposals, etc., trying to collect all the key use cases that 
needed handling. (Replies to the WHATWG e-mails on the topic are included 
below.)

The proposal consists of several components:

 - A <track> element for linking to timed tracks from the markup.
 - A DOM API for manipulating timed tracks dynamically.
 - A specification for a simple captioning format.
 - A set of rules and processing models to hold it all together.

I tried to keep the following principles in mind when designing this 
feature. They are in line with the approach taken in the design of the 
rest of the HTML specification.

 - Keep things simple: features that don't have clear use cases and broad 
   appeal shouldn't be considered.
 - Keep implementation costs for standalone players low.
 - Use real data to determine what use cases are relevant.
 - Use existing technologies where appropriate.
 - Don't innovate, but provide others with the ability to do so.
 - Try as much as possible to have things Just Work.

I first researched (with some help from various other contributors - 
thanks!) what kinds of timed tracks were common. The main classes of use 
cases I tried to handle were:

 - plain text subtitles (translations) and captions (transcriptions), 
   with minimal inline formatting and karaoke support;
 - chapter markers, so that browsers can provide quick jumps to points 
   in the video;
 - text-driven audio descriptions; and
 - application-specific timed data.

Text-driven audio descriptions aren't common (most audio descriptions 
use audio files), so these may be a little more innovative than makes 
sense for a specification. The specification so far doesn't really say 
how they are to be interpreted; I expect this area to change 
significantly based on implementation feedback.

The most controversial feature may be the captioning format. Based on SRT, 
a widely-supported and somewhat widely-used format, it tries to apply the 
principle of separating presentation from semantics. I've defined some 
extensions to CSS to help style the subtitles in browsers, while keeping 
the format compatible with existing players, so that careful authors can 
write files that work both in new browsers and in existing players.

Currently this format's parser rules assume UTF-8; this may be something 
we have to change in the future. Unfortunately, since existing files don't 
seem to declare their encodings, we can't easily reuse existing files 
unless we provide an out-of-band encoding override, along the lines of the 
charset="" attribute on <script>. I haven't gone there yet, but we can 
consider it if supporting existing files without requiring that they be 
converted to UTF-8 turns out to be desired. The parser is designed to be 
very easy to implement while leaving us lots of room to extend the format 
in the future.
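
To give a flavour of how simple the parsing is, here is a minimal sketch 
(just an illustration, not the spec's actual parsing algorithm) of 
reading one timestamp:

   // Parse a WebSRT-style timestamp such as "00:00:15,000" or
   // "05:30.500" into a number of seconds. Hours are optional, and both
   // "," and "." are accepted as the milliseconds separator here.
   function parseTimestamp(s) {
     var m = /^(?:(\d+):)?(\d+):(\d+)[,.](\d{3})$/.exec(s);
     if (!m)
       return null; // not a timestamp
     return (m[1] ? parseInt(m[1], 10) * 3600 : 0) +
            parseInt(m[2], 10) * 60 +
            parseInt(m[3], 10) +
            parseInt(m[4], 10) / 1000;
   }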


The rest of this e-mail consists of replies to various e-mails sent on the 
subject in the last year. I haven't included all e-mails; in particular, a 
number of e-mails that were very useful in the development of the proposed 
format and API consisted of just use cases and links, which I haven't 
reproduced below. Nonetheless, thanks to the authors of those e-mails.


On Thu, 16 Jul 2009, Silvia Pfeiffer wrote:
> 
> [...] the problem with creating a new format is that you start from 
> scratch and already-widespread formats are not supported.

Indeed. Hopefully by reusing SRT we are able to leverage a significant 
amount of existing infrastructure, such as it is.


> I can see that your proposed format is trying to be backwards compatible 
> with SRT, so at least it would work for the large number of existing srt 
> file collections. I am still skeptical, in particular because there are 
> no authoring systems for this format around.

SRT was invented for an authoring tool, so clearly there are some. The 
good thing about this format, though, is that it is really simple to 
support and therefore tool creation should be an easy matter.


> >> In fact, the easiest solution would be if that particular format was 
> >> really only HTML.
> >
> > IMHO that would be absurd. HTML means scripting, embedded videos, an 
> > unbelievably complex rendering system, complex parsing, etc.; plus, 
> > what's more, it doesn't even support timing yet, so we'd have to add 
> > all the timing and karaoke features on top of it. Requiring that video 
> > players embed a timed HTML renderer just to render subtitles is like 
> > saying that we should ship Microsoft Word with every DVD player, to 
> > handle the user input when the user wants to type in a new chapter 
> > number to jump to.
> 
> I agree, it cannot be a format that contains all the complexity of HTML. 
> It would only support a subpart of HTML that is relevant, plus the 
> addition of timing - and in this case is indeed a new format.

If we don't use HTML wholesale, then there's really no reason to use HTML 
at all. (And using HTML wholesale is not really an option, as you say 
above.) Profiling HTML, however, is a huge amount of work for everyone 
involved: you have to write a whole new spec, a whole new set of test 
cases, and a whole new set of implementations. Some code reuse would be 
possible in that last case, but great care would have to be taken to 
ensure that only the right code is reused; otherwise there is a serious 
risk of people implementing more than the profile, leading to interop 
problems.


> I have therefore changed my mind since I sent that email in Dec 08 and 
> am hoping we can do it with existing formats.

Excellent. :-)


> In particular, I have taken an in-depth look at the latest specification 
> from the Timed Text working group that have put years of experiments and 
> decades of experience into developing DFXP. You can see my review of 
> DFXP here: 
> http://blog.gingertech.net/2009/06/28/a-review-of-the-w3c-timed-text-authoring-format/ 
> I think it is both too flexible in a lot of ways, but also too 
> restrictive in others. However, it is a well formulated format that is 
> also getting market traction. In addition, it is possible to formulate 
> profiles to add missing functionality.

I looked at DFXP when deciding which format to use:

   http://wiki.whatwg.org/wiki/Timed_track_formats

Compared to the other formats I examined, it performs poorly on pretty 
much every front:

 - It's based on XML, which means it is hard to hand-author and quite 
   verbose. It uses namespaces, which means it's very prone to authoring 
   errors.

 - It is very much formatting-centric, rather than semantics-based.

 - It uses (amongst other things) pixel-based positioning, which makes it 
   very awkward to use when you don't know the dimensions of the video or 
   of the display.

 - It would be quite difficult to expose sanely in a simple API; you'd 
   basically just have to expose the whole DOM and hope for the best.

 - It is quite hard to use it to do karaoke-style internally timed cues.


> If we want a quick and dirty hack, srt itself is probably the best 
> solution. If we want a well thought-out solution, DFXP is probably a 
> better idea.

I don't really agree with that characterisation.


> >> Here is my example again:
> >> <video src="http://example.com/video.ogv" controls>
> >>  <text category="CC" lang="en" type="text/x-srt" src="caption.srt"></text>
> >>  <text category="SUB" lang="de" type="application/ttaf+xml" src="german.dfxp"></text>
> >>  <text category="SUB" lang="jp" type="application/smil" src="japanese.smil"></text>
> >>  <text category="SUB" lang="fr" type="text/x-srt" src="translation_webservice/fr/caption.srt"></text>
> >> </video>
> >
> > Here's a counterproposal:
> >
> >   <video src="http://example.com/video.ogv"
> >          subtitles="http://example.com/caption.srt" controls>
> >   </video>
> 
> Subtitle files are created to enable users to choose the text in the 
> language that they speak to be displayed. With a simple addition like 
> what you are proposing, I don't think such a choice is possible. Or do 
> you have a proposal on how to choose the adequate language file?

The proposal now looks like:

   <video src="http://example.com/video.ogv" controls>
     <track srclang=en src="caption.srt">
     <track srclang=de kind=subtitles src="german.dfxp">
     <track srclang=jp kind=subtitles src="japanese.smil">
     <track srclang=fr kind=subtitles src="translation_webservice/fr/caption.srt">
     (...fallback...)
   </video>


> Also, the attributes on the proposed text element of course serve a purpose:
> * the "category" attribute is meant to provide a default for styling
> the text track,

I renamed this to "kind" in the spec.


> * the "language" attribute is meant to provide a means to build a menu
> to choose the adequate subtitle file from,

"srclang".


> * the "type" attribute is meant to both identify the mime type of the
> format and the character set used in the file.

It's not clear that the former is useful. The latter may be useful; I 
haven't supported that yet.


> The character set question is actually a really difficult problem to get 
> right, because srt files are created in an appropriate character set for 
> the language, but there is no means to store in a srt file what 
> character set was used in its creation. That's a really bad situation to 
> be in for the Web server, who can then only take an educated guess. By 
> giving the ability to the HTML author to specify the charset of the srt 
> file with the link, this can be solved.

Yeah, if this is a use case people are concerned about, then I agree that 
a solution at the markup level makes sense.


On Thu, 16 Jul 2009, Philip Jägenstedt wrote:
> 
> There are already more formats than you could possibly want on the scale 
> between SRT (dumb text) and complex XML formats like DFXP or USF (used 
> in Matroska).

Indeed. I tried to examine all of them, but many had no documentation that 
I could find. The results are in the wiki page cited above.


> In my layman opinion both extremes make sense, but anything in between 
> I'm rather skeptical to.

Is the SRT variant described in the spec extreme enough to make sense?


> I think that eventually we will want timing/synchronization in HTML for 
> synchronizing multiple video or audio tracks.

Agreed. We need this for various purposes, including, perhaps most 
importantly, sign-language tracks.


> As far as I can tell no browser wants to implement the addCueRange API 
> (removing this should be the topic of a separate mail), so we really 
> need to re-think this part and I think that timed text plays an 
> important part here.

The addCueRange() API has been removed and replaced with a feature based 
on the subtitle mechanism.


On Fri, 31 Jul 2009, Silvia Pfeiffer wrote:
> 
> * I can see a need for a multitude of different categories of 
> time-aligned text that either already exist or will be developed in the 
> future. The list that I can currently grasp is mentioned in the 
> specification. While these text categories are rather diverse (e.g. 
> karaoke text, ticker text, chapter markers, captions), they all share 
> common properties and can be handled in fundamentally the same way by a 
> browser. I therefore propose a common "itext" element (for "included 
> text") to deal with associating such time-aligned text resources with 
> <video> resources.

That's more or less what <track> is now. I started with just a small set 
of kinds of tracks, but we can add more in the future if new ones are 
found to be useful.


> * I can also see a need for internationalisation of each text category. 
> I.e. each text resource will come with an associated language for which 
> it is valid and alternative language resources will be made available. 
> This is why I am suggesting the @lang attribute.

I went with srclang="", since lang="" is already used for another purpose.


> * It is unclear, which of the given alternative text tracks in different 
> languages should be displayed by default when loading an <itext> 
> resource. A @default attribute has been added to the <itext> elements to 
> allow for the Web content author to tell the browser which <itext> 
> tracks he/she expects to be displayed by default. If the Web author does 
> not specify such tracks, the display depends on the user agent (UA - 
> generally the Web browser): for accessibility reasons, there should be a 
> field that allows users to always turn display of certain <itext> 
> categories on. Further, the UA is set to a default language and it is 
> this default language that should be used to select which <itext> track 
> should be displayed.

It's not clear to me that we need a way to do this; by default presumably 
tracks would all be off unless the user wants them, in which case the 
user's preferences are paramount. That's what I've specced currently. 
However, it's easy to override this from script.
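
For instance, something along these lines (a sketch only; the interface 
names and the track mode constants here are those of the current draft, 
and may well change):

   // Turn on the first French subtitle track from script, overriding
   // the default-off behaviour described above.
   var video = document.querySelector('video');
   for (var i = 0; i < video.tracks.length; i++) {
     var track = video.tracks[i];
     if (track.kind == 'subtitles' && track.language == 'fr') {
       track.mode = TimedTrack.SHOWING; // numeric mode constant, per the draft
       break;
     }
   }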


> * Another typical feature of time-aligned text files is that they may be 
> out of sync with the actual video file. For that purpose, a @delay 
> attribute was suggested as an addition to the <itext> element. This has 
> not been implemented into the demo. In the feedback to this proposal, a 
> further "stretch" or "drift" attribute was suggested.

I haven't added this yet, but it's an interesting idea (possibly best kept 
until a "v2" release though). One can implement this from script by 
creating a new track that simply copies the previous one cue-for-cue with 
an offset applied, so we'll be able to see if this is something for which 
there is real demand by seeing if anyone does that.
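
Here is a sketch of that script workaround (it assumes the draft's 
addTrack()/addCue() methods, getCueAsSource(), and a TimedTrackCue 
constructor along the lines of (id, start, end, text); the details may 
well change):

   // Copy every cue from an out-of-sync track into a new track, shifted
   // by a fixed delay in seconds, then show the copy instead.
   function shiftTrack(video, track, delay) {
     var shifted = video.addTrack(track.kind, track.label, track.language);
     for (var i = 0; i < track.cues.length; i++) {
       var cue = track.cues[i];
       shifted.addCue(new TimedTrackCue(cue.id,
                                        cue.startTime + delay,
                                        cue.endTime + delay,
                                        cue.getCueAsSource()));
     }
     track.mode = TimedTrack.OFF;
     shifted.mode = TimedTrack.SHOWING;
     return shifted;
   }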


On Fri, 31 Jul 2009, Philip Jägenstedt wrote:
> 
> * DOM interfaces. These should be the same regardless of source 
> (external/internal) and format. I also believe these ranges have event 
> listeners, so as to replace the cue ranges interface.

Done.


> * Security. What restrictions should apply for cross-origin loading?

Currently the files have to be same-origin. My plan is to wait for CORS to 
be well established and then use it for timed tracks, video files, images 
on <canvas>, text/event-stream resources, etc.


> * Complexity. There is no limit to the complexity one could argue for 
> (bouncing ball multi-color karaoke with fan translations/annotations 
> anyone?). We should accept that some use cases will require creative use 
> of scripts/SVG/etc and not even try to solve them up-front. Draw a line 
> and stick to it.

Agreed. Hopefully you agree with where I drew the line! :-)


On Sun, 11 Apr 2010, Silvia Pfeiffer wrote:
> On Sun, Apr 11, 2010 at 4:18 PM, Robert O'Callahan wrote:
> >
> > This needs to be clarified. Authors can position arbitrary content 
> > over the video, and presumably the browser is not supposed to ensure 
> > rendered text doesn't collide with such content. I presume what you 
> > meant is simply that rendered text must not collide with browser 
> > built-in UI. Although I'm not sure how that can be ensured when 
> > arbitrary styling of the rendered text is supported.
> 
> Yes, the idea was for browser built-in default UI controls. [...]
>
> The main issue is to keep the area that captions or subtitles are 
> rendered into and the area that controls are rendered into separately, 
> since you will always want to have access to both if both are activated.

I've made sure that WebSRT subtitles avoid overlapping the controls.


On Sun, 11 Apr 2010, Chris Double wrote:
> 
> I am wary of being required to implement the entire TTML specification 
> and an underspecified SRT format.

I've specified the SRT format and not required TTML support (though I have 
provided rough guidelines for how to implement it in a manner consistent 
with the rest of the captions model if you do want to implement it).


On Mon, 12 Apr 2010, Philip Jägenstedt wrote:
> 
> For the record, I am also not enthusiastic about TTML, specifically the 
> styling mechanism which even makes creative use of XML namespaces. An 
> example for those that haven't seen it before: 
> http://www.w3.org/TR/ttaf1-dfxp/#style-attribute-backgroundColor-example-1
>
> <region xml:id="r1">
>  <style tts:extent="306px 114px"/>
>  <style tts:backgroundColor="red"/>
>  <style tts:color="white"/>
>  <style tts:displayAlign="after"/>
>  <style tts:padding="3px 40px"/>
> </region>
> ...
> <p region="r1" tts:backgroundColor="purple" tts:textAlign="center">
>  Twinkle, twinkle, little bat!<br/>
>  How <span tts:backgroundColor="green">I wonder</span> where you're at!
> </p>
> 
> While I don't have any suggestions about what to use instead, I'd much 
> prefer something which just uses CSS with the same syntax we're all used 
> to.

I've defined some CSS extensions to allow us to use CSS with SRT.


On Tue, 13 Apr 2010, Jonas Sicking wrote:
> 
> Will implementations want to do the rendering of the subtitles off the 
> main thread? I believe many browsers are, or are planning to, render the 
> actual video graphics using a separate thread. If that is correct, do we 
> want to support rendering of the subtitles on a separate thread too?
> 
> Or is it enough to do the rendering on the main thread, but composite 
> using a separate thread?
> 
> If rendering is expected to happen on a separate thread, then CSS is 
> possibly not the right solution as most CSS engines are main-thread-only 
> today.

I've tried to scope the complexity of the way CSS is used to style WebSRT 
so as to avoid making it too hard to precompute the information needed to 
style the cues and then shove the data to the other thread for final 
processing and rendering. I don't know how successful I have been (you 
might still have to send updates regularly, basically any time the 
computed style of the <video> element itself changes, or any time the DOM 
changes in a way that affects which selectors match the cues), but those 
updates can be sent asynchronously, it seems. The actual rendering is 
somewhat scoped; there's only so much you can do. For example, you can't 
change the 'display' value, use 'float's, collapse margins, or anything 
like that.


On Wed, 14 Apr 2010, Jonas Sicking wrote:
> 
> I like this approach, though I wonder how it's intended to attach a 
> stylesheet to the SRT+HTML file?

Currently, styles are attached to the HTML file that contains the <video>, 
and apply to the SRT file. If this is successful, it would make sense to 
add a mechanism to SRT files to link straight to the CSS files, but I 
don't think that's a priority yet.


> Of course, even better would be to have a markup language for marking up 
> the meaning of the timed text. For example, it's unfortunate that the 
> DFXP markup contains things like
> 
> [Water dropping]<br/>
> [<span tts:fontStyle="italic" tts:color="lime">plop, plop, plop, …</span>]
> 
> Where the brackets clearly mean that the contained text isn't being 
> said, but that they are sound effects. This would be much better done 
> with markup like:
> 
> <descriptive>Water dropping</descriptive>
> <soundeffect>plop, plop, plop</soundeffect>

In WebSRT, this is now:

   <sound>Water dropping
   plop, plop, plop

...or, if you want the square brackets to render:

   <sound>[Water dropping]
   [plop, plop, plop]

To style it as lime italics, the CSS is:

   ::cue-part(sound) { font-style: italic; color: lime; }


> On a separate note, I note that the DFXP file seems to be tied to a 
> specific size of the video. If I resize the video, the captions that go 
> on top of the video don't move appropriately. This could very well 
> simply be due to this being a demo. Or due to a bug in the 
> "implementation". Or a simple mistake on the part of the author of the 
> specific DFXP file.

In WebSRT, the positioning is all done with percentages (or relative line 
numbers), side-stepping this issue. (This also neatly solves the problem 
that occurs when the aspect ratio changes, e.g. due to someone 
accidentally forcing something into anamorphic view.)


On Thu, 15 Apr 2010, Silvia Pfeiffer wrote:
> 
> A spec would need to be written for this new extended SRT format.

Done. :-)


> Also, if we are introducing HTML markup inside SRT time cues, then it 
> would make sense to turn the complete SRT file into markup, not just the 
> part inside the time cue.

That doesn't seem to follow. Could you elaborate?


> Further, SRT has no way to specify which language it is written in

What's the use case?


> further such general mechanisms that already exist for HTML.

Such as?


> That extension doesn't give us anything anyway, since no existing SRT 
> application would be able to do much with it.

Why not?


> It is not hard to replicate the SRT functionality in something new. If 
> we really want to do "SRT + HTML + CSS", then we should start completely 
> from a blank page.

I'm not a big fan of reinventing wheels. Only when what already exists 
simply doesn't handle all the use cases and can't be extended to do so 
does it make sense to start over.


On Thu, 15 Apr 2010, Philip Jägenstedt wrote:
> 
> While I don't favor TTML, I also don't think that extending SRT is a 
> great way forward, mostly because I don't see how to specify the 
> language (which sometimes helps font selection),

That's done in the <track> element. It can't be in the file, since you 
need to know it before downloading the file (otherwise you'd have to 
download all the files to update the UI).


> apply document-wide styling,

I just used the document's own styles.


> reference external style sheets,

I just did that from the document.


> use webfonts, etc...

Since the styles come from the document, the fonts come from there too.


> I actually quite like the general idea behind Silvia's 
> http://wiki.xiph.org/Timed_Divs_HTML
> 
> This is somewhat similar to the <timerange> proposal that David Singer 
> and Eric Carlson from Apple have brought up a few times.

I am very reluctant to have such a verbose format be used for data as 
dense as captions. Verbosity works for HTML because the use of markup is 
balanced with the text (though it can get pretty borderline; the HTML 
spec itself has a high ratio of markup to text, for example). It's not a 
nice format for denser data, IMHO.


> No matter the syntax, the idea is basically to allow marking up certain 
> parts of HTML as being relevant for certain time ranges. A CSS 
> pseudo-selector matches the elements which are currently active, based 
> on the current time of the media.
> 
> So, the external subtitle file could simply be HTML, [...]
>
> Cons:
> - Politics.
> - New format for subtitle authors and tools.
> - Not usable outside the web platform (i.e. outside of web browsers).

The last of these is pretty critical, IMHO.

It would also result in some pretty complicated situations, like captions 
containing <video>s themselves.


> Pros:
> + Styling using CSS and only CSS.

We'd need extensions (for timing, and to avoid different caption streams 
overlapping), so I think this would end up being no better than where 
we've ended up with WebSRT.


> + Well known format to web authors and tools.

SRT is pretty well-known in the subtitling community.


> + High reuse of existing implementations.

I think the incremental cost of implementing WebSRT is pretty minimal; I 
tried to make it possible for a browser to reuse all the CSS 
infrastructure, for instance.


> + You could author CSS to make the HTML document read as a transcript 
> when opened directly.

That isn't a use case I considered. Is it a use case we should address?


> + <timerange> reusable for page-embedded timed markup, which was the 
> original idea.

I didn't end up addressing this use case. I think if we do this we should 
seriously consider how it interacts with SMIL/SVG. I also think it's 
something we should look at in conjunction with synchronising multiple 
<video> or <audio> elements, e.g. to do audio descriptions, dubbing, 
sign-language video overlays, split-screen video, etc.


On Wed, 14 Apr 2010, Jonas Sicking wrote:
> 
> I really do hate to come up with a new format. But I think TTML is 
> severely off the mark for what we want. Am I wrong in that marking up 
> dialogue vs. sound effects vs. narrator vs. descriptions is important? 
> Or at least more useful than for example the ability to set the text 
> outline blur radius?

Incidentally, I ended up with the following predefined voices: the 
strings "narrator", "music", "lyric", "sound", "comment", and "credit". Are 
there others we should add?


On Fri, 16 Apr 2010, Silvia Pfeiffer wrote:
> 
> I guess the problem is more with char sets. For HTML pages and other Web 
> content, there is typically information inside the resource that tells 
> you what character set the document is written in. E.g. HTML pages have 
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. 
> Such functionality is not available for SRT, so it is impossible for a 
> browser to tell what charset to use to render the content in.

Indeed. See the discussion at the top of this e-mail for my current 
thoughts on this.


> And yes, we have made an adjustment in the Media Associations spec for 
> <track> to contain a hint on what mime type and charset the external 
> document is specified in. But that is only a bad fix of SRT's problem. 
> It should be available inside the file so that any application can use 
> the SRT file without requiring additional information.

It's not clear to me how this information would be used. Do you have 
examples of other formats having this metadata? What do UAs do with it in 
those cases? I didn't come across anything compelling in my research of 
other formats and UAs.


> The extended SRT file will barely have anything in common with the 
> original ones.

That isn't really the case.


> For example:
> 
> (1) original SRT file:
> 
> ---
> 1
> 00:00:15,000 --> 00:00:17,951
> At the left we can see...
> 
> 2
> 00:00:18,166 --> 00:00:20,083
> At the right we can see the...
> ---

That's a complete and valid WebSRT file.


> (2) possibly new extended SRT file:
> 
> ---
> Content-Type: text/html; charset=UTF-8
> Content-Language: en_us
> Styles:
> {
>   div.left-align {
>     font-family:Arial,Helvetica,sans-serif;
>     text-align: left;
>   }
>   div.left-right {
>     font-family:Courier New, monospace;
>     text-align: right;
>   }
>   div.speaker {
>     font-family:Courier New, monospace;
>     text-align: left;
>     font-weight: bold;
>   }
> 
> 1
> 00:00:15,000 --> 00:00:17,951
> <div class="left-align speaker"><img src="proof.png"
> role="presentation" alt="Proog icon"/>Proog:</div>
> <div class="left-align" style="color: green;">At the <span
> style="font-style:italic;">left</span> we can <a
> href="looking_left.html">see</a>...</div>
> 
> 2
> 00:00:18,166 --> 00:00:20,083
> <div class="right-align" style="color: blue;">At the right we can <a
> href="looking_right.html">see</a> the...</div>
> ---

That isn't. :-)


> (3) TTML file: (no hyperlinks, no images - just for comparison)
> 
> ---
> <?xml version="1.0" encoding="utf-8"?>
> <tt xml:lang="en_us" xmlns="http://www.w3.org/ns/ttml">
>   <head>
>     <styling>
>       <style xml:id="left-align"
>         tts:fontFamily="proportionalSansSerif"
>         tts:textAlign="left"
>       />
>       <style xml:id="right-align"
>         tts:fontFamily="monospaceSerif"
>         tts:textAlign="right"
>       />
>       <style xml:id="speaker"
>         tts:fontFamily="monospaceSerif"
>         tts:textAlign="left"
>         tts:fontWeight="bold"
>       />
>     </styling>
>     <layout>
>       <region xml:id="subtitleArea"
>         tts:extent="560px 62px"
>         tts:padding="5px 3px"
>       />
>     </layout>
>   </head>
>   <body region="subtitleArea">
>     <div>
>       <p style="left-align" begin="0.15s" end="0.17s 951ms">
>         <div style="speaker">Proog:</div>
>         <div tts:color="green">At the <span
> tts:fontStyle="italic">left</span> we can see...</div>
>       </p>
>       <p style="right-align" begin="0.18s 166ms" end="0.20s 83ms">
>         <div tts:color="green">At the right we can see the...</div>
>       </p>
>     </div>
>   </body>
> </tt>
> ---

Here's the minimalist WebSRT version:

 ---
 00:00:15,000 --> 00:00:17,951 A:start L:-2
 <1>Proog:

 00:00:15,000 --> 00:00:17,951 A:start L:-1
 At the left we can see...
 
 00:00:18,166 --> 00:00:20,083 A:end L:-1
 At the right we can see the...
 ---

You can style it further:

 --
 ::cue { color: green; font: 1em sans-serif; }
 ::cue-part(1) { font: bold 1em serif; }
 --

That seems way simpler than the versions above. :-)


On Fri, 16 Apr 2010, Richard Watts wrote:
> 
> [...] but the consensus in broadcast (OCAP, HbbTV, DVB) is that the 
> Right Way to do this is to fire Javascript events from events in the 
> input stream - one of them being timecode, so something like:
> 
>  video.request_events(video.EVENT_TYPE_TIMED, { from: 10.0, to: 20.0},
>    myfunc)
> 
> And myfunc then gets called whenever you go into or out of that time 
> range. Obviously there's horribleness around the edges for requesting 
> periodic events, etc.
> 
> This should let you write your SRT viewer in Javascript, should you feel 
> that way perverted^Winclined.

The API handles this case in a similar way (you create a timed track for 
yourself, then create a cue for each of the time ranges you want, and 
can register an event either on the track or on individual cues).
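
Something very like request_events() can be written in a few lines on top 
of this (a sketch, using the draft's interface names, which may change):

   // Call myfunc whenever playback enters or leaves the range [from, to).
   function requestTimedEvents(video, from, to, myfunc) {
     var track = video.addTrack('metadata');
     var cue = new TimedTrackCue('', from, to, ''); // assumed signature
     cue.onenter = myfunc;
     cue.onexit = myfunc;
     track.addCue(cue);
   }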


On Fri, 23 Apr 2010, Sam Dutton wrote:
>
> [...] film archives used to have no more than a one-sentence description 
> for an entire can of film, illegibly written in a book or on index cards 
> or (more recently) with a few lines in a Word document. Now, digitised 
> footage will often be provided alongside timed, digitised metadata: high 
> quality, structured, detailed, shot-level, frame accurate data about 
> content, location, personnel, dialogue, rights, ratings and more. 
> Accelerating digitisation is at the heart of this 'granularisation', 
> obviously, but a variety of technologies contribute: linked data and 
> semantic markup, temporal URLs, image recognition (show me frames in 
> this video with a car in them), M3U / HTTP streaming, and so on -- even 
> the new iPhone seekToTime method.
> 
> So, in addition to what's on offer in the specs, I'm wondering if it 
> might be possible to have time-aligned *data*, with custom roles.

WebSRT can be used for this, yes.


> For example, imagine a video with a synchronised 'chapter' carousel 
> below it (like the R&DTV demo at 
> http://www.bbc.co.uk/blogs/rad/2009/08/html5.html). The video element 
> would have a track with 'chapter' as its role attribute, and the 
> location of the chapter data file as its src. The data file would 
> consist of an array of 'chapter' objects, each representing some timed 
> data. Every object in the track source would require a start and/or end 
> values, and a content value with arbitrary properties:
> 
> {
>     start: 10.00,
>     end: 20.00,
>     content: {
>         title: "Chapter 2",
>         description: "Some blah relating to chapter 2",
>         image: "/images/chapter2.png"
>    }
> },
> {
>     start: 20.00,
>     end: 30.00,
>     content: {
>         title: "Chapter 3",
>         description: "Chapter 3 blah",
>         image: "/images/chapter3.png"
>    }
> }

In WebSRT, this would be:

   00:10.000 --> 00:20.000
   { title: "Chapter 2", description: "Some blah relating to chapter 2", image: "/images/chapter2.png" }

   00:20.000 --> 00:30.000
   { title: "Chapter 3", description: "Chapter 3 blah", image: "/images/chapter3.png" }

(Here I'm assuming that you want to store the data as JSON. For 
kind=metadata files, you can put anything you want in the cue so long as 
you don't have a blank line in there.)


> In this example, selecting the chapter track for the video would cause 
> the video element to emit segment entry/exit events -- a bit like the 
> Cue Ranges idea. In this example, each event would correspond to an 
> object in the chapter data source.

This is consistent with what the spec now has. Each WebSRT cue gets events 
fired on it when it goes in and out.
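
A chapter carousel along these lines could be driven by something like 
the following sketch (it assumes the cue's raw text is exposed via 
getCueAsSource(), per the current draft):

   // Parse each kind=metadata cue as JSON as it becomes active, and hand
   // the resulting chapter object to a callback.
   function watchChapters(track, callback) {
     for (var i = 0; i < track.cues.length; i++) {
       track.cues[i].onenter = function (event) {
         callback(JSON.parse(event.target.getCueAsSource()));
       };
     }
   }
   // e.g. watchChapters(chapterTrack, function (chapter) {
   //   carousel.show(chapter.title, chapter.description, chapter.image);
   // });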


On Tue, 20 Apr 2010, Silvia Pfeiffer wrote:
> 
> Firstly about the Lyrics. I think they are just the same as captions and 
> should go back into the first document. In particular since we are 
> talking about captions and subtitles for both the <video> and the 
> <audio> element and this shows some good examples of how lyrics are 
> being displayed as time-aligned text with audio resources. Most of these 
> examples are widgets used on the Web, so I think they are totally 
> relevant.
> 
> Lyrics (LRC) files typically look like this:
> 
> [ti:Can't Buy Me Love]
> [ar:Beatles, The]
> [au:Lennon & McCartney]
> [al:Beatles 1 - 27 #1 Singles]
> [by:Wooden Ghost]
> [re:A2 Media Player V2.2 lrc format]
> [ve:V2.20]
> [00:00.45]Can't <00:00.75>buy <00:00.95>me <00:01.40>love,
> <00:02.60>love<00:03.30>, <00:03.95>love, <00:05.30>love<00:05.60>
> [00:05.70]<00:05.90>Can't <00:06.20>buy <00:06.40>me <00:06.70>love,
> <00:08.00>love<00:08.90>
> 
> There is some metadata at the start and then there are time fragments, 
> possibly overloaded with explicit subtiming for individual words in 
> karaoke-style. This is not very different from SRT and in fact should 
> fit with your Karaoke use case.

I've used the <mm:ss.hh> inline timestamp idea from LRC to give WebSRT 
this kind of karaoke timing.


> I'm also confused about the removal of the chapter tracks. These are
> also time-aligned text files and again look very similar to SRT.

I've also included support for chapters. Currently this support is not 
fully fleshed out; in particular it's not defined how a UA should 
get chapter names out of the WebSRT file. I would like implementation 
feedback on this topic -- what do browser vendors envisage exposing in 
their UI when it comes to chapters? Just markers in the timeline? A 
dropdown of times? Chapter titles? Styled, unstyled?

Currently a cue payload can be either cue text (simple markup) or metadata 
text (arbitrary data for scripts). We could add a third form consisting of 
just plain text for chapter titles, or we could reuse cue text, depending 
on what is needed here. Currently the spec requires them to be cue text 
but doesn't say how to get plain text out of them.


On Sat, 22 May 2010, Carlos Andrés Solís wrote:
> 
> As you might know, there are basic subtitle formats that are formed by 
> timed plain text and little else (like SRT or the proposed WebSRT), and 
> there are full-blown subtitle formats that allow for extreme formatting 
> and typesetting (like Advanced SubStation Alpha).

WebSRT allows for "extreme" formatting and typesetting too, at least in 
principle, via CSS. The difference is that ASSA has inline formatting, 
like the "bad old days" of <font> on the Web, whereas WebSRT has 
structured text and a separate styling layer, which leads to more 
consistent effects and easier-to-maintain captions.


> The basic subtitles have the advantage of being easily editable by hand, 
> but sacrificing capabilities that advanced formats allow with the cost 
> of harder-to-understand syntax.

I don't think that sacrifice is necessary; as noted above, WebSRT can do 
much just with CSS. I haven't checked it feature-for-feature against ASSA, 
but I would expect it to be roughly a wash. Having everything that CSS can 
do, or even a big chunk of it, is not to be scoffed at. For example, as 
soon as we add gradient paint servers to CSS, we have them in WebSRT. Add 
3D perspective transforms to CSS, and we have them in WebSRT. (Both of 
these are being added to CSS; this isn't hypothetical.)


On Sun, 23 May 2010, Odin Omdal Hørthe wrote:
> 
> I want to use it for slide sync information. Right now I have a 
> websocket that listens for the file name it wants to show (as next image 
> slide), but it's IMPOSSIBLE to sync video/slides with the features that 
> are in browsers now. Because I can't get the real timestamp from the Ogg 
> Theora video. Also, having that "what slide are we on" information in 
> the video stream is also rather nice.
> 
> If WebSRT had classes, it could be used for all sorts of things. You 
> would parse the incoming WebSRT-stream in javascript and use stylesheets 
> to build text overlays like youtube has etc. Always a tradeoff between 
> easy format and features. If you could optionally put html in there to 
> do advanced stuff that might work. With some rules for readers that 
> don't understand 3., just stripping all tags and showing the result; or 
> even have a method for saying "don't show this if you don't support 
> fancy stuff". I might be trying to mix this in a bad way.
> 
> Anyway, as of now I'm just waiting for a way to tell my webapp what 
> slide we're on (sync with the live streaming video).

WebSRT has classes, if I understand you correctly (search for "voice").

With metadata tracks, you can do everything YouTube does -- you could 
basically put paint commands into your WebSRT, and paint a new canvas each 
time a new cue arrives, if you wanted.

With the events cues get, you can also sync slides pretty easily now.
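
For the slide case, that could look something like this (a sketch, with 
the same caveats about draft interface names as above):

   // Use a metadata track whose cue text is just a slide's file name;
   // when the cue becomes active, swap the displayed slide.
   var slides = video.addTrack('metadata');
   var cue = new TimedTrackCue('slide-2', 600.0, 900.0, 'slide2.png');
   cue.onenter = function () {
     document.getElementById('slide').src = this.getCueAsSource();
   };
   slides.addCue(cue);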


On Sun, 23 May 2010, Odin Omdal Hørthe wrote:
> 
> I start the streaming.
> 
> 10 minutes later, Leslie connects and gets the video maybe 1 minute 
> delayed because of IceCast2 buffering.
> 
> Her browser (Chromium and Firefox, haven't tested Opera), starts telling 
> «currentTime» from when SHE started to look at the live stream. Instead 
> of showing when the stream started.
> 
> So it's quite impossible to use that for syncing. I asked about that 
> here in this list, and got the answer that this is what we have 
> startTime property for, -- but it is not implemented correctly in any 
> browsers. startTime would then maybe say 0:00:00 for most clips, but on 
> streaming Leslie would have 0:10:00, and then I can use that for 
> syncing.

Per spec, everyone's currentTime should be the same, and their startTime 
should be whatever the earliest time they can seek to is. If browsers 
haven't implemented that, file bugs.


On Sun, 30 May 2010, Carlos Andrés Solís wrote:
>
> I've been thinking on using an XML-like markup as a format to implement 
> subtitles. XML is reasonable enough to be implemented by both media 
> players and web browsers, and so leaves us with little problems 
> regarding compatibility. My proposal takes most of the elements from 
> ASSA (save for vectorial drawing, that can be done later or be replaced 
> by dingbat fonts in the meanwhile) and renders them in a nifty and 
> easy-to-read file by using markup. Right now I've thought of three or 
> four main markup segments: styles, karaoke-specialized styles, 
> animation-specialized styles and, of course, the time divisions to 
> insert the text. Here is my mock-up file, which I hope is as 
> self-explanatory as I think.
> 
> <subtitle name="Betatesting X c001" creator="AnonIsLegion no Fansub"
> date="2010/05/30">
> 
> <styledef>
> <!-- rtl: right-to-left text; utd: up-to-down text; fontstyle=bold, italic,
> strike, and some others; clipstyle:per_line, per_block; rotations are in
> degrees -->
> <style name="Kyoh" fontsize=18 fontname="serif, DejaVu Sans, Arial Unicode
> MS" fontxscale=100% fontyscale=150% rtl=false utd=false fontstyle="bold"
> fontcolor="red" fontalpha=100% outlinesize=1 outlinecolor="white"
> outlinealpha=fontalpha karaokecolor="orange" karaokealpha=75%
> karaokeoutlinecolor=karaokecolor karaokeoutlinealpha=karaokealpha
> karaokeoutlinesize=outlinesize karaokeoutlinealpha=karaokealpha shadowsize=1
> shadowcolor="black" shadowalpha=50% blursize=2 blurtype="square"
> blurintensity=1 positionx=90% positiony=50% marginup=1 margindown=1
> marginleft=2 marginright=2 rotationx=0 rotationy=0 rotationz=0
> clipstyle="per_line" clipup=0 clipdown=0 clipleft=0 clipright=0 />
> <style name="Kenji" inherits="Kyoh" fontcolor="blue" />
> 
> <karstyle name="kardefault" start=0:00 end=0:00.1 karaokecolor="white"
> karaokealpha=75% karaokeoutlinecolor="gray" karaokeoutlinealpha=karaokealpha
> karaokeoutline size=outlinesize />
> 
> <animation name="animdefault">
>     <!-- start and end times here are either absolute from the section
> start, or relative to the chronological order (default); ordering can be
> overriden using from=element_you_want_to_start_times_from -->
>     <animsection name="1" order="absolute" start=0:00 end=0:00.5
> acceleration=100% accelerationtype="constant"> <!-- accelerationtype:
> constant, exponential -->
>         <style positionx=90% positiony=50%>
>     </animsection>
>     <animsection name="2" from="1" inherits="1" start=0:00 end=0:00.5>
>         <style positionx=10% rotationz=180>
>     </animsection>
> </animation>
> </styledef>
> 
> <defaultstyle setstyle="Kyoh" setkarstyle="kardefault"
> setanimstyle="animdefault">
> 
> <time start=0:00 end=0:00.001>
> A very brief line.<br>
> Above another very brief line.
> </time>
> 
> <time start=0:05 end=0:05.5>
> <kar>Ka</kar><kar>ra</kar><kar><style karaokecolor="red">o</style></kar><kar
> end=0:00.2>ke!</kar>
> </time>
> 
> <time start=0:07 end=0:08>
> <animation name="animdefault">
> This text shall move.
> </animation>
> 
> <!-- TODO: Importing images and SVGs to be used in subtitles. In the
> meanwhile a good set of dingbat fonts should suffice. -->
> 
> </subtitle>

Here's the equivalent WebSRT (I took some guesses as to what each bit 
meant):

   00:00.000 --> 00:00.001
   A very brief line.
   Above another very brief line.

   00:05.000 --> 00:05.500
   Ka<00:00.000>ra<00:00.000><b>o</b><00:00.200>ke

   00:07.000 --> 00:08.000
   <1>This text shall move

...with the following CSS (as far as I could work out):

   ::cue {
     font: bold 18px serif, DejaVu Sans, Arial Unicode MS;
     color: red;
     text-shadow: rgba(0,0,0,0.5) 1px 1px 2px;
   }
   ::cue(future) { color: rgba(255,128,0,0.75); }

   ::cue-part(b future) { color: red; }
   ::cue-part(1) {
      -webkit-animation-name: animdefault; 
      -webkit-animation-duration: 1s;
   }

   @-webkit-keyframes animdefault {
     /* currently movement isn't possible; we should add 
        left/right padding at some point */
     from { ... }
     to { ... }
   }

The moving text thing isn't currently supported, but that's just because I 
haven't gone down the list of properties to make everything work yet.

I feel pretty confident in saying that the WebSRT version is clearer.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

