[whatwg] WebVTT feedback (and some other <video> feedback that snuck in)

Ian Hickson ian at hixie.ch
Thu Dec 1 16:34:15 PST 2011


Please note that WebVTT has moved to a Community Group. The specification 
is now here:

   http://dev.w3.org/html5/webvtt/

I recommend sending further feedback on this specification to the CG's 
mailing list:

   http://lists.w3.org/Archives/Public/public-texttracks/


On Thu, 2 Jun 2011, Glenn Maynard wrote:
> On Thu, Jun 2, 2011 at 7:28 PM, Ian Hickson <ian at hixie.ch> wrote:
> > We can add comments pretty easily (e.g. we could say that "<!" starts 
> > a comment and ">" ends it -- that's already being ignored by the 
> > current parser), if people really need them. But are comments really 
> > that useful? Did SRT have problems due to not supporting inline 
> > comments? (Or did it support inline comments?)
> 
> I've only worked with SSA subtitles (fansubbing), where {text in braces} 
> effectively worked as a comment.  We used them a lot to communicate 
> between editors on a phrase-by-phrase basis.
> 
> But for that use case, using hidden spans makes more sense, since you 
> can toggle them on and off to view them inline, etc.
> 
> Given that, I'd be fine with a comment format that doesn't allow mid-cue 
> comments, if it makes the format simpler.

Well, right now we don't "allow" comments at all, but we do technically 
support them at both the inline and block level, so we can add them later 
if there's a good use case.
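
For illustration only (this isn't sanctioned syntax, just what the current 
parser happens to discard), an inline comment along those lines would look 
like:

   WEBVTT

   00:00.000 --> 00:02.000
   Hello <! editors: check this translation > world

The parser drops everything from the "<!" to the next ">", so only the 
text outside it would be rendered.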

There is some discussion of use cases for comments here:

   http://www.w3.org/Bugs/Public/show_bug.cgi?id=14552


> >> The text on the left is a transcription, the top is a 
> >> transliteration, and the bottom is a translation.
> >
> > Aren't these three separate text tracks?
> 
> They're all in the same track, in practice, since media players don't 
> play multiple subtitle tracks.
> 
> It's true that having them in separate tracks would be better, so they 
> can be disabled individually.  This is probably rare enough that it 
> should just be sorted out with scripts, at least to start.

Since we support multiple tracks, I think that's sufficient.
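
For instance, the three renderings could be offered as three <track> 
elements that the user (or a script) can enable independently; a sketch, 
with made-up file names:

   <video src="movie.webm" controls>
     <track kind="subtitles" src="transcription.vtt" srclang="ja"
            label="Transcription">
     <track kind="subtitles" src="romaji.vtt" srclang="ja-Latn"
            label="Transliteration">
     <track kind="subtitles" src="english.vtt" srclang="en"
            label="Translation">
   </video>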


> > It's not clear to me that we need language information to apply proper 
> > font selection and word wrapping, since CSS doesn't do it.
> 
> But it doesn't have to, since HTML does this with @lang.

HTML doesn't do any font selection or word wrapping.

Per the HTML and CSS specs, lang="" has no effect on rendering.


> > Mixing one CJK language with one non-CJK language seems fine. That 
> > should always work, assuming you specify good fonts in the CSS.
> 
> The font is ultimately in the user's control.  I tell Firefox to always 
> use Tahoma for Western text and MS Gothic for Japanese text, ignoring 
> the often ugly site-specified fonts.

Yeah, you won't be able to do that with WebVTT, at least not today.


> The most straightforward solution would seems to be having @lang be a 
> CSS property; I don't know the rationale for this being done by HTML 
> instead.

Language is a property of the content; it's not a presentation feature 
that could change based on the device, media, author whim, etc.


> > I don't understand why we can't have good typography for CJK and 
> > non-CJK together. Surely there are fonts that get both right?
> 
> I've never seen a Japanese font that didn't look terrible for English 
> text.  Also, I don't want my font selection to be severely limited due 
> to the need to use a single font for both languages, instead of using 
> the right font for the right text.

Instead of working around poor fonts in all our various languages, we 
should just fix the fonts.


> >> One example of how this can be tricky: at 0:17, a caption on the 
> >> bottom wraps and takes two lines, which then pushes the line at 0:19 
> >> upward (that part's simple enough).  If instead the top part had 
> >> appeared first, the renderer would need to figure out in advance to 
> >> push it upwards, to make space for the two-line caption underneath 
> >> it. Otherwise, the captions would be forced to switch places.
> >
> > Right, without lookahead I don't know how you'd solve it. With 
> > lookahead things get pretty dicey pretty quickly.
> 
> The problem is that, at least here, the whole scene is nearly 
> incomprehensible if the top/bottom arrangement isn't maintained. Lacking 
> anything better, I suspect authors would use similar brittle hacks with 
> WebVTT.

Authors who want to handle this can specify explicitly which lines each 
cue should appear on. That's already supported. But I don't see any way to 
do this automatically in a sane manner.
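
For example, using the line position cue setting ("L:"). The particular 
values here are made up, but the idea is that each cue is pinned to its 
own line, so the stacking order can never flip:

   WEBVTT

   00:17.000 --> 00:21.000 L:-3
   This cue is pinned near the bottom,
   even when it wraps to two lines.

   00:19.000 --> 00:21.000 L:-5
   This cue always renders above it.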


> >> I think that, no matter what you do, people will insert line breaks 
> >> in cues.  I'd follow the HTML model here: convert newlines to spaces 
> >> and have a separate, explicit line break like <br> if needed, so 
> >> people don't manually line-break unless they actually mean to.
> >
> > The line-breaks-are-line-breaks feature is one of the features that 
> > originally made SRT seem like a good idea. It still seems like the 
> > neatest way of having a line break.
> 
> But does this matter?  Line breaks within a cue are relatively uncommon 
> in my experience (perhaps it's different for other languages), compared 
> to how many people will insert line breaks in a text editor simply to 
> break lines while authoring.

In English, at least, it seems pretty common. Almost all cues seem to get 
manually line-broken so as to maintain line length balance, for instance.
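
In WebVTT that balance is authored directly, since a newline in the cue 
text is a line break in the rendering:

   WEBVTT

   00:00.000 --> 00:03.000
   I didn't expect to find you here,
   of all places.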


> If you do this while testing on a large monitor, it's likely to look 
> reasonable when rendered; the brokenness won't show up until it's played 
> in a smaller window.  Anyone using a non-programmer's text editor that 
> doesn't handle long lines cleanly is likely to do this.

Actually the size isn't such a big deal since the font size is just based 
on the video size.


> Wrapping lines manually in SRTs also appears to be common (even 
> standard) practice, perhaps due to inadequate line wrapping in SRT 
> renderers.  Making line breaks explicit should help keep people from 
> translating this habit to WebVTT.

I don't really see this habit as a problem, personally.


> >> Related to line breaking, should there be an &nbsp; escape? 
> >>  Inserting nbsp literally into files is somewhat annoying for 
> >> authoring, since they're indistinguishable from regular spaces.
> >
> > How common would &nbsp; be?
> 
> I guess the main cases I've used nbsp for don't apply so much to 
> captions, eg. © 2011 (likely to come at the start of a caption, so 
> not likely to be wrapped anyway).

Yeah.


> >> We might also consider leaning on users a bit to tell us what they 
> >> want. For example, I think people are pretty used to hitting play and 
> >> then pause to buffer until the end of the video. What if we just used 
> >> our bandwidth heuristics while in the play state, and buffered 
> >> blindly when a pause occurs less than X seconds into a video? I won't 
> >> argue that this is a wonderful solution (or a habit we should 
> >> encourage), but I figured I'd throw a random idea out there...
> >
> > That seems like pretty ugly UI. :-)
> 
> Changing buffering modes based on *when* the user pauses is an ugly UI.  
> Pausing to let a video buffer when it's underrunning (regardless of when 
> it's paused) is something easy to understand and that people are used 
> to, though.  I don't know if this is relevant to the spec or just an 
> implementation issue.

The spec allows pretty much any buffering strategy. I don't see what else 
we could do here, really. Browsers will do what they think is best 
regardless of what the spec says.


> >> I think that pausing shouldn't affect read-ahead buffering behavior. 
> >> I'd suggest another preload value, preload=buffer, sitting between 
> >> "metadata" and "auto".  In addition to everything loaded by 
> >> "metadata", it also fills the read-ahead buffer (whether the video is 
> >> playing or not).
> >>
> >> - If a page wants prebuffering only (not full preloading), it sets 
> >> preload=buffer.  This can be done even when the video is paused, so 
> >> when the user presses play, the video starts instantly without 
> >> pausing for a server round-trip like preload=metadata.
> >
> > So this would be to buffer enough to play through assuming the network 
> > remains at the current bandwidth, but no more?
> 
> I suppose that wouldn't work too well: if the video is small then you 
> may as well preload the whole thing, and if it's large then long-term 
> bandwidth estimates aren't going to be very accurate.  (I'm dubious of 
> any behavior based on bandwidth estimations.)

Yeah. I'll just leave this up to the UA ("auto").


On Fri, 3 Jun 2011, Philip Jägenstedt wrote:
> > > > > 
> > > > > There was also some discussion about metadata. Language is 
> > > > > sometimes necessary for the font engine to pick the right glyph.
> > > > 
> > > > Could you elaborate on this? My assumption was that we'd just use 
> > > > CSS, which doesn't rely on language for this.
> > > 
> > > It's not in any spec that I'm aware of, but some browsers (including 
> > > Opera) pick different glyphs depending on the language of the text, 
> > > which really helps when rendering CJK when you have several CJK 
> > > fonts on the system. Browsers will already know the language from 
> > > <track srclang>, so this would be for external players.
> > 
> > How is this problem solved in SRT players today?
> 
> Not at all, it seems. Both VLC and Totem allow setting the character 
> encoding and font used for subtitles in the (global) preferences menu, 
> so presumably you would change that if the default doesn't work. Font 
> switching seems to mainly be an issue when your system has other default 
> fonts than the text you're reading, and it appears that is rare enough 
> that very little software does anything about it, browsers perhaps being 
> an exception.

Interesting.

If we are to add language information to the format, there are four ways 
to do it: inline, cue-level, block-level (a section of the file, e.g. 
setting a default at different points in the file), and file-level.

Inline would look like this:

   WEBVTT

   cue id
   00:00:00.000 --> 00:00:01.000
   <lang en>cue text says <lang fr>bonjour</lang></lang>

File-level would look like this:

   WEBVTT
   language: fr

   cue id
   00:00:00.000 --> 00:00:01.000
   bonjour

I suppose we'd need both. I wouldn't propose cue-level or block-level.

How important is this for v1?


> > If we just ignore content until we hit a line that happens to look 
> > like a timing line, then we are much more constrained in what we can 
> > do in the future. For example, we couldn't introduce a "comment block" 
> > syntax, since any comment containing a timing line wouldn't be 
> > ignored. On the other hand if we keep the syntax as it is now, we can 
> > introduce a comment block just by having its first line include a 
> > "-->" but not have it match the timestamp syntax, e.g. by having it be 
> > "--> COMMENT" or some such.
> 
> One of us must be confused, do you mean something like this?
> 
> 1
> --> COMMENT
> 00:00.000 --> 00:01.000
> Cue text
> 
> Adding this syntax would break the *current* parser, as it would fail in 
> step 39 (Collect WebVTT cue timings and settings) and then skip the rest 
> of the cue. If we want any room for extensions along these lines, then 
> multiple lines preceding the timing line must be handled gracefully.

The spec has unfortunately changed since this e-mail, so we no longer have 
the forwards-compatibility I was referring to. At this point, any line 
containing "-->" is treated as the start of a cue. This does unfortunately 
mean we can no longer introduce comment blocks in a backwards-compatible 
way.


> [line wrapping]
> 
> To expand a bit more on the problem and suggested solution, consider the 
> example cue "This sentence is spoken by a single speaker and is 
> presented as a single cue."
> 
> If simple line-wrapping (how browsers currently render text) is used it 
> might be:
> 
> "This sentence is spoken by a single speaker and is presented as a 
> single cue."
> 
> Subtitles tend to be line-wrapped to have more balanced line width, and 
> at least I would certainly much prefer this line wrapping:
> 
> "This sentence is spoken by a single speaker
> and is presented as a single cue."
> 
> Apart from being easier to read, this is also much more suitable for 
> left/right-alignment in cases where that is used to associate the cue 
> with a speaker on screen. With WebVTT, one would have to manually 
> line-break the text to get this result. Apart from wasting the time of 
> the captioner, it will also break if a slightly larger font is used -- 
> you might get this rendering instead:
> 
> "This sentence is spoken by a single
> speaker
> and is presented as a single cue."
> 
> In other cases you might get 4 lines where 3 would have been enough. 
> This is not a theoretical issue; I see it fairly often with SRT 
> subtitles rendered at another size than they were tested with.
> 
> My suggested solution is to first layout the text using all of the 
> available width. Then, decrease the width as much as possible without 
> increasing the number of line breaks. The algorithm should also prefer 
> to make the first line the longest, as this is IMO more aesthetically 
> pleasing.
> 
> I would like to see this specified and would gladly implement it in 
> Opera, but in which spec does it belong? It seems fairly 
> subtitling-specific to me, so if it could be in the WebVTT rendering 
> rules to begin with (as opposed to CSS with vendor prefixes) that would 
> be at least short-term awesome. It's only if this is the default 
> line-wrapping for <track>+WebVTT that people are going to discover this 
> and stop manually line-breaking their captions.

I think the basic algorithm should be in CSS, as it is useful in other 
contexts too (e.g. headings). We can then use that white-space value.

Currently we do have line wrapping (white-space:normal is assumed, with 
the additional hard-wrap-at-edge constraint).
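
As a rough sketch of the algorithm Philip describes (the balanceWrap and 
wrap names are mine, and wrap() is an assumed greedy line-wrapper that 
returns the list of wrapped lines; nothing here is specced):

   // Lay the text out at the full available width, then use binary
   // search to shrink the width as far as possible without increasing
   // the number of line breaks.
   function balanceWrap(text, maxWidth, wrap) {
     var lineCount = wrap(text, maxWidth).length;
     var lo = 0, hi = maxWidth;
     while (hi - lo > 1) {
       var mid = Math.floor((lo + hi) / 2);
       if (wrap(text, mid).length > lineCount) {
         lo = mid;   // too narrow: a new line break appeared
       } else {
         hi = mid;   // still the original number of lines
       }
     }
     return wrap(text, hi);  // narrowest width with the same line count
   }

A second pass would be needed for the first-line-longest preference; this 
only handles the width minimisation.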


> > On Wed, 19 Jan 2011, Philip Jägenstedt wrote:
> > > 
> > > The 3 preload states imply 3 simple buffering strategies:
> > > 
> > > none: don't touch the network at all
> > > preload: buffer as little as possible while still reaching readyState
> > > HAVE_METADATA
> > > auto: buffer as fast and much as possible
> > 
> > "auto" isn't "as fast and much as possible", it's "as fast and much as
> > will make the user happy". In some configurations, it might be the same as
> > "none" (e.g. if the user is paying by the byte and hates video).
> 
> The way I see it, that's just a matter of a user preference to limit the
> internal preload state to "none" regardless of what the content attribute
> says.

My point is that the preload="" attribute is a hint, it does not have hard 
conformance requirements. It can't have hard conformance requirements 
since browsers have to be able to ignore it based on user preferences.


> > > However, the state we're discussing is when the user has begun 
> > > playing the video. The spec doesn't talk about it, but I call it:
> > > 
> > > invoked: buffer as little as possible without readyState dropping 
> > > below HAVE_FUTURE_DATA (in other words: being able to play from 
> > > currentTime to duration at playbackRate without waiting for the 
> > > network)
> > 
> > There's also a fifth state, let's call it "aggressive", where even 
> > while playing the video the UA is trying to download the whole thing 
> > in case the connection drops.
> 
> This is the same as "auto" for now, but sure, that could be improved.

Again, my point is just that the number of possible behaviours is far in 
excess of the number of possible hints. I list some other possible 
behaviours here:

   http://www.w3.org/Bugs/Public/show_bug.cgi?id=11602#c8


> > If you like I can make the spec explicitly describe what the 
> > preload="" hints mean while video is playing, too.
> 
> That would be a good start. In Opera, playing the video causes the 
> internal preload state to go to "invoked".

Ok. I'll make that change. Proposed diff:

Index: source
===================================================================
--- source	(revision 6837)
+++ source	(working copy)
@@ -31227,74 +31227,88 @@ interface <dfn>HTMLAudioElement</dfn> : 
     will end up firing a <code
     title="event-media-suspend">suspend</code> event, as described
     earlier.</p>
 
    </li>
 
    <!-- this step is mentioned above, search for "FINAL STEP" -->
    <li><p>If the user agent ever reaches this step (which can only
    happen if the entire resource gets loaded and kept available):
    abort the overall <span
    title="concept-media-load-algorithm">resource selection
    algorithm</span>.</p></li>
 
   </ol>
 
   </div>
 
   <hr>
 
   <p>The <dfn title="attr-media-preload"><code>preload</code></dfn>
-  attribute is an <span>enumerated attribute</span>. The following table
-  lists the keywords and states for the attribute — the keywords
-  in the left column map to the states in the cell in the second
-  column on the same row as the keyword.</p>
+  attribute is an <span>enumerated attribute</span>. The following
+  table lists the keywords and states for the attribute — the
+  keywords in the left column map to the states in the cell in the
+  second column on the same row as the keyword. The attribute can be
+  changed even once the <span>media resource</span> is being buffered
+  or played; the descriptions in the table below are to be interpreted
+  with that in mind.</p>
 
   <table>
    <thead>
     <tr>
      <th> Keyword
      <th> State
      <th> Brief description
    <tbody>
     <tr>
      <td><dfn title="attr-media-preload-none"><code>none</code></dfn>
      <td><dfn title="attr-media-preload-none-state">None</dfn>
      <td>Hints to the user agent that either the author does not expect the user to need the media resource, or that the server wants to minimise unnecessary traffic.
+         This state does not provide a hint regarding how aggressively to actually download the media resource if buffering starts anyway (e.g. once the user hits "play").
     <tr>
      <td><dfn title="attr-media-preload-metadata"><code>metadata</code></dfn>
      <td><dfn title="attr-media-preload-metadata-state">Metadata</dfn>
      <td>Hints to the user agent that the author does not expect the user to need the media resource, but that fetching the resource metadata (dimensions, first frame, track list, duration, etc) is reasonable. If the user agent precisely fetches no more than the metadata, then the <span>media element</span> will end up with its <code title="dom-media-readyState">readyState</code> attribute set to <code title="dom-media-HAVE_METADATA">HAVE_METADATA</code>; typically though, some frames will be obtained as well and it will probably be <code title="dom-media-HAVE_CURRENT_DATA">HAVE_CURRENT_DATA</code> or <code title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code>.
+         When the media resource is playing, hints to the user agent that bandwidth is to be considered scarce, e.g. suggesting throttling the download so that the media data is obtained at the slowest possible rate that still maintains consistent playback.
     <tr>
      <td><dfn title="attr-media-preload-auto"><code>auto</code></dfn>
      <td><dfn title="attr-media-preload-auto-state">Automatic</dfn>
      <td>Hints to the user agent that the user agent can put the user's needs first without risk to the server, up to and including optimistically downloading the entire resource.
   </table>
 
   <p>The empty string is also a valid keyword, and maps to the <span
   title="attr-media-preload-auto-state">Automatic</span> state. The
   attribute's <i>missing value default</i> is user-agent defined,
   though the <span
   title="attr-media-preload-metadata-state">Metadata</span> state is
   suggested as a compromise between reducing server load and providing
   an optimal user experience.</p>
 
+  <p class="note">Authors might switch the attribute from "<code
+  title="attr-media-preload-none">none</code>" or "<code
+  title="attr-media-preload-metadata">metadata</code>" to "<code
+  title="attr-media-preload-auto">auto</code>" dynamically once the
+  user begins playback. For example, on a page with many videos this
+  might be used to indicate that the many videos are not to be
+  downloaded unless requested, but that once one <em>is</em> requested
+  it is to be downloaded aggressively.</p>
+
   <div class="impl">
 
   <p>The <code title="attr-media-preload">preload</code> attribute is
   intended to provide a hint to the user agent about what the author
   thinks will lead to the best user experience. The attribute may be
   ignored altogether, for example based on explicit user preferences
   or based on the available connectivity.</p>
 
   <p>The <dfn
   title="dom-media-preload"><code>preload</code></dfn> IDL
   attribute must <span>reflect</span> the content attribute of the
   same name, <span>limited to only known values</span>.</p>
 
   </div>
 
   <p class="note">The <code
   title="attr-media-autoplay">autoplay</code> attribute can override
   the <code title="attr-media-preload">preload</code> attribute (since
   if the media plays, it naturally has to buffer first, regardless of
   the hint given by the <code

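
From the page's side, the dynamic switching described in that note would 
look something like this (a sketch):

   // Many videos on one page: fetch nothing up front, but once the
   // user starts one, hint that it should be buffered aggressively.
   var videos = document.querySelectorAll('video');
   for (var i = 0; i < videos.length; i++) {
     videos[i].preload = 'none';
     videos[i].addEventListener('play', function (event) {
       event.target.preload = 'auto';
     }, false);
   }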

> > On Thu, 20 Jan 2011, Philip Jägenstedt wrote:
> > > 
> > > There have been two non-trivial changes to the seeking algorithm in the
> > > last year:
> > > 
> > > Discussed at
> > > http://lists.w3.org/Archives/Public/public-html/2010Feb/0003.html
> > > led to http://html5.org/r/4868
> > > 
> > > Discussed at
> > > http://lists.w3.org/Archives/Public/public-html/2010Jul/0217.html
> > > led to http://html5.org/r/5219
> > 
> > Yeah. In particular, sometimes there's no way for the UA to know 
> > asynchronously if the seek can be done, which is why the attribute is 
> > set after the method returns. It's not ideal, but the alternative is 
> > not always implementable.
> > 
> > > With that said, it seems like there's nothing that guarantees that 
> > > the asynchronous section doesn't start running while the script is 
> > > still running.
> > 
> > Yeah. It's not ideal, but I don't really see what we can do about it.
> 
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=12267
> 
> By only updating the media state between tasks (or as tasks), the script 
> that issued the seek could not see the state changed as a result of it.

I'm willing to consider concrete focused suggestions for making specific 
changes to this API to make it more predictable, but at a high level I 
think I've done pretty much as much as can be done on this issue. A slow 
script can still observe asynchronous changes after seeking, e.g. if you 
try to seek a media resource that can't be seeked, then you'll see seeking 
briefly return true before going back to false.
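
For example (a sketch, assuming the resource is unseekable):

   var video = document.querySelector('video');
   video.currentTime = 10;        // issue a seek
   console.log(video.seeking);    // true: the seek has been initiated
   video.addEventListener('seeked', function () {
     // The asynchronous section has run; the position was clamped or
     // restored, and seeking has gone back to false.
     console.log(video.seeking);  // false
   }, false);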


> Changing currentTime synchronously doesn't mean that seeking to that 
> position will actually succeed, so I don't see why that would be a 
> problem. currentTime would just be updated again once it's been clamped 
> in the asynchronous section of the seek algorithm.

This is more or less what the spec now says.


On Sat, 4 Jun 2011, Silvia Pfeiffer wrote:
> 
> Icecast streams have chained files, so streaming Ogg to an audio element 
> would hit this problem. There is a bug in FF for this: 
> https://bugzilla.mozilla.org/show_bug.cgi?id=455165 [...]. There's also 
> a webkit bug for icecast streaming, which is probably related 
> https://bugs.webkit.org/show_bug.cgi?id=42750 . I'm not sure how Opera 
> is able to deal with icecast streams, but it seems to deal with it.
> 
> The thing is: you can implement playback and seeking without any further 
> changes to the spec. But then the browser-internal metadata states will 
> change depending on the chunk you're on. Should that also update the 
> exposed metadata in the API then? Probably yes, because otherwise the JS 
> developer may deal with contradictory information. Maybe we need a 
> "metadatachange" event for this?

What metadata will change? Just videoWidth and videoHeight? (duration can 
change too, but that's already handled in the spec) Do the height and 
width values change again when you seek back to the previous link in the 
chain? Presumably they do. Do we need to expose the videoWidth and 
videoHeight at any particular time, or can we just say it's the current 
values and fire an event when they change? (What's the use case for 
knowing when they change?) Notice that the spec already handles the 
timeline aspect of chained resources; basically the first file sets the 
timeline and the others share it, ignoring their internal times.
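
Lacking such an event, a script today can only poll for dimension changes, 
e.g. (a sketch):

   var video = document.querySelector('video');
   var w = 0, h = 0;
   video.addEventListener('timeupdate', function () {
     if (video.videoWidth !== w || video.videoHeight !== h) {
       w = video.videoWidth;
       h = video.videoHeight;
       // React to the new dimensions, e.g. re-lay-out the page.
     }
   }, false);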


> > On Tue, 24 May 2011, Silvia Pfeiffer wrote:
> >>
> >> Ian and I had a brief conversation recently where I mentioned a 
> >> problem with extended text descriptions with screen readers (and 
> >> worse still with braille devices) and the suggestion was that the 
> >> "paused for user interaction" state of a media element may be the 
> >> solution. I would like to pick this up and discuss in detail how that 
> >> would work to confirm my sketchy understanding.
> >>
> >> *The use case:*
> >>
> >> In the specification for media elements we have a <track> kind of
> >> "descriptions", which are:
> >>
> >> "Textual descriptions of the video component of the media resource, 
> >> intended for audio synthesis when the visual component is unavailable 
> >> (e.g. because the user is interacting with the application without a 
> >> screen while driving, or because the user is blind). Synthesized as a 
> >> separate audio track."
> >>
> >> I'm for now assuming that the synthesis will be done through a screen 
> >> reader and not through the browser itself, thus making the 
> >> descriptions available to users as synthesized audio or as braille if 
> >> the screen reader is set up for a braille device.
> >>
> >> The textual descriptions are provided as chunks of text with a start 
> >> and an end time (so-called "cues"). The cues are processed during 
> >> video playback as the video's playback time starts to fall within the 
> >> time frame of the cue. Thus, it is expected that the cues are 
> >> consumed during the cue's time frame and are not present any more 
> >> when the end time of the cue is reached, so they don't conflict with 
> >> the video's normal audio.
> >>
> >> However, on many occasions, it is not possible to consume the cue 
> >> text in the given time frame. In particular not in the following 
> >> situations:
> >>
> >> 1. The screen reader takes longer to read out the cue text than the 
> >> cue's time frame provides for. This is particularly the case with 
> >> long cue text, but also when the screen reader's reading rate is 
> >> slower than what the author of the cue text expected.
> >>
> >> 2. The braille device is used for reading. Since reading braille is 
> >> much slower than listening to read-out text, the cue time frame will 
> >> invariably be too short.
> >>
> >> 3. The user seeked right into the middle of a cue and thus the time 
> >> frame that is available for reading out the cue text is shorter than 
> >> the cue author calculated with.
> >>
> >> Correct me if I'm wrong, but it seems that what we need is a way for 
> >> the screen reader to pause the video element from continuing to play 
> >> while the screen reader is still busy delivering the cue text. (In 
> >> a11y talk: what is required is a means to deal with "extended 
> >> descriptions", which extend the timeline of the video.) Once it's 
> >> finished presenting, it can resume the video element's playback.
> >
> > Is it a requirement that the user be able to use the regular video 
> > pause, play, rewind, etc, controls to seek inside the extended 
> > descriptions
> 
> No, the audio descriptions (which are only text to the browser and turn 
> into audio only through the screen reader) are controlled by the 
> screenreader, not by the video controls. When the user navigates using 
> the video controls, the cues of the audio description change and will be 
> handed to the screenreader, too, so can be read out in sync. But the 
> video controls have no direct control over the read-out audio.
> 
> > or should they literally pause the video while playing, with the audio 
> > descriptions being controlled by the same UI as the screen reader?
> 
> The audio descriptions cannot control the video, since they are just 
> text cues with a start and end time that is supposed to be in sync with 
> the video. The only component that actually knows whether the user has 
> heard the full text of a text cue is the screen reader, since it is 
> turning the text into sound. So, the control over pausing the video must 
> come from there. Indeed, the user should be able to control this through 
> the screen reader UI - e.g. hit a button to skip reading a cue and let 
> the video continue playing uninterrupted.

It sounds to me like what you're saying is that for the case of an audio 
description cue whose text is longer than the cue itself, the UA should 
act as if it had "paused for user interaction". Here's a proposed patch to 
make this more explicit:

Index: source
===================================================================
--- source	(revision 6837)
+++ source	(working copy)
@@ -31816,46 +31816,47 @@ interface <dfn>HTMLAudioElement</dfn> : 
       <code title="dom-media-HAVE_ENOUGH_DATA">HAVE_ENOUGH_DATA</code>,
       then the relevant steps below must then be run also.</p>
 
      </dd>
 
      <!-- going down -->
      <dt>If the previous ready state was <code
      title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code> or more,
      and the new ready state is <code
      title="dom-media-HAVE_CURRENT_DATA">HAVE_CURRENT_DATA</code> or
      less</dt>
 
      <dd>
 
       <p id="fire-waiting-when-waiting">If the <span>media
       element</span> was <span>potentially playing</span> before its
       <code title="dom-media-readyState">readyState</code> attribute
       changed to a value lower than <code
       title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code>, and
       the element has not <span>ended playback</span>, and playback
-      has not <span>stopped due to errors</span>, and playback has not
-      <span>paused for user interaction</span>, the user agent must
-      <span>queue a task</span> to <span>fire a simple event</span>
-      named <code title="event-media-timeupdate">timeupdate</code> at
-      the element, and <span>queue a task</span> to <span>fire a
-      simple event</span> named <code
+      has not <span>stopped due to errors</span>, <span>paused for
+      user interaction</span>, or <span>paused for in-band
+      content</span>, the user agent must <span>queue a task</span> to
+      <span>fire a simple event</span> named <code
+      title="event-media-timeupdate">timeupdate</code> at the element,
+      and <span>queue a task</span> to <span>fire a simple
+      event</span> named <code
       title="event-media-waiting">waiting</code> at the element.</p>
 
      </dd>
 
      <!-- going up to future -->
      <dt>If the previous ready state was <code
      title="dom-media-HAVE_CURRENT_DATA">HAVE_CURRENT_DATA</code> or
      less, and the new ready state is <code
      title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code></dt>
 
      <dd>
 
       <p>The user agent must <span>queue a task</span> to <span>fire a
       simple event</span> named <code
       title="event-media-canplay">canplay</code>.</p>
 
       <p>If the element's <code title="dom-media-paused">paused</code>
       attribute is false, the user agent must <span>queue a task</span>
       to <span>fire a simple event</span> named <code
       title="event-media-playing">playing</code>.</p>
@@ -32052,51 +32053,60 @@ interface <dfn>HTMLAudioElement</dfn> : 
     <p>Sets the <code title="dom-media-paused">paused</code> attribute
     to true, loading the <span>media resource</span> if necessary.</p>
 
    </dd>
 
   </dl>
 
   <div class="impl">
 
   <p>The <dfn title="dom-media-paused"><code>paused</code></dfn>
   attribute represents whether the <span>media element</span> is
   paused or not. The attribute must initially be true.</p>
 
   <p>A <span>media element</span> is a <dfn>blocked media
   element</dfn> if its <code
   title="dom-media-readyState">readyState</code> attribute is in the
   <code title="dom-media-HAVE_NOTHING">HAVE_NOTHING</code> state, the
   <code title="dom-media-HAVE_METADATA">HAVE_METADATA</code> state, or
   the <code
   title="dom-media-HAVE_CURRENT_DATA">HAVE_CURRENT_DATA</code> state,
-  or if the element has <span>paused for user interaction</span>.</p>
+  or if the element has <span>paused for user interaction</span> or
+  <span>paused for in-band content</span>.</p>
 
   <p>A <span>media element</span> is said to be <dfn>potentially
   playing</dfn> when its <code title="dom-media-paused">paused</code>
   attribute is false, the element has not <span>ended playback</span>,
   playback has not <span>stopped due to errors</span>, 
   the element either has no <span>current media controller</span> or
   has a <span>current media controller</span> but is not <span>blocked
   on its media controller</span>,
   and the element is not a <span>blocked media element</span>.</p>
 
+  <p class="note">A <code title="event-media-waiting">waiting</code>
+  DOM event <a href="#fire-waiting-when-waiting">can be fired</a> as a
+  result of an element that is <span>potentially playing</span>
+  stopping playback due to its <code
+  title="dom-media-readyState">readyState</code> attribute changing to
+  a value lower than <code
+  title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code>.</p>
+
   <p>A <span>media element</span> is said to have <dfn>ended
   playback</dfn> when:</p>
 
   <ul>
 
    <li>The element's <code
    title="dom-media-readyState">readyState</code> attribute is <code
    title="dom-media-HAVE_METADATA">HAVE_METADATA</code> or greater,
    and
 
    <li>
 
     <p>Either:
 
     <ul>
 
      <li>The <span>current playback position</span> is the end of the
      <span>media resource</span>, and
 
      <li>The <span>direction of playback</span> is forwards, and
@@ -32140,60 +32150,82 @@ interface <dfn>HTMLAudioElement</dfn> : 
   data</span>, and due to that error, is not able to play the content
   at the <span>current playback position</span>.</p>
 
   <p>A <span>media element</span> is said to have <dfn>paused for user
   interaction</dfn> when its <code
   title="dom-media-paused">paused</code> attribute is false, the <code
   title="dom-media-readyState">readyState</code> attribute is either
   <code title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code> or
   <code title="dom-media-HAVE_ENOUGH_DATA">HAVE_ENOUGH_DATA</code> and
   the user agent has reached a point in the <span>media
   resource</span> where the user has to make a selection for the
   resource to continue.
   If the <span>media element</span> has a <span>current media
   controller</span> when this happens, then the user agent must
   <span>report the controller state</span> for the <span>media
   element</span>'s <span>current media controller</span>. If the
   <span>media element</span> has a <span>current media
   controller</span> when the user makes a selection, allowing playback
   to resume, the user agent must similarly <span>report the controller
   state</span> for the <span>media element</span>'s <span>current
-  media controller</span>.
-  </p>
+  media controller</span>.</p>
 
   <p>It is possible for a <span>media element</span> to have both
   <span>ended playback</span> and <span>paused for user
   interaction</span> at the same time.</p>
 
   <p>When a <span>media element</span> that is <span>potentially
   playing</span> stops playing because it has <span>paused for user
   interaction</span>, the user agent must <span>queue a task</span> to
   <span>fire a simple event</span> named <code
   title="event-media-timeupdate">timeupdate</code> at the element.</p>
 
-  <p class="note">A <code title="event-media-waiting">waiting</code>
-  DOM event <a href="#fire-waiting-when-waiting">can be fired</a> as a
-  result of an element that is <span>potentially playing</span>
-  stopping playback due to its <code
-  title="dom-media-readyState">readyState</code> attribute changing to
-  a value lower than <code
-  title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code>.</p>
+  <p>A <span>media element</span> is said to have <dfn>paused for
+  in-band content</dfn> when its <code
+  title="dom-media-paused">paused</code> attribute is false, the <code
+  title="dom-media-readyState">readyState</code> attribute is either
+  <code title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code> or
+  <code title="dom-media-HAVE_ENOUGH_DATA">HAVE_ENOUGH_DATA</code> and
+  the user agent has suspended playback of the <span>media
+  resource</span> in order to play content that is temporally anchored
+  to the <span>media resource</span> and has a non-zero length, or to
+  play content that is temporally anchored to a segment of the
+  <span>media resource</span> but has a length longer than that
+  segment. If the <span>media element</span> has a <span>current media
+  controller</span> when this happens, then the user agent must
+  <span>report the controller state</span> for the <span>media
+  element</span>'s <span>current media controller</span>. If the
+  <span>media element</span> has a <span>current media
+  controller</span> when the user agent unsuspends playback, the user
+  agent must similarly <span>report the controller state</span> for
+  the <span>media element</span>'s <span>current media
+  controller</span>.</p>
+
+  <p class="example">One example of when a <span>media element</span>
+  would be <span>paused for in-band content</span> is when the user
+  agent is playing <span title="attr-track-kind-descriptions">audio
+  descriptions</span> from an external WebVTT file, and the
+  synthesized speech generated for a cue is longer than the time
+  between the <span>text track cue start time</span> and the
+  <span>text track cue end time</span>.</p>
+
+  <hr>
 
   <p>When the <span>current playback position</span> reaches the end
   of the <span>media resource</span> when the <span>direction of
   playback</span> is forwards, then the user agent must follow these
   steps:</p>
 
   <ol>
 
    <li><p>If the <span>media element</span> has a <code
    title="attr-media-loop">loop</code> attribute specified
    and does not have a <span>current media controller</span>,
    then <span title="dom-media-seek">seek</span> to the <span>earliest
    possible position</span> of the <span>media resource</span> and
    abort these steps.</p></li> <!-- v2/v3: We should fire a 'looping'
    event here to explain why this immediately fires a 'playing' event,
    otherwise the 'playing' event that fires from the readyState going
    from HAVE_CURRENT_DATA back to HAVE_FUTURE_DATA will seem
    inexplicable (since the normally matching 'ended' given below event
    doesn't fire in the loop case). -->
 
@@ -39203,48 +39235,48 @@ dictionary <dfn>TrackEventInit</dfn> : <
      <td><dfn title="event-media-loadeddata"><code>loadeddata</code></dfn>
      <td><code>Event</code>
      <td>The user agent can render the <span>media data</span> at the <span>current playback position</span> for the first time.
      <td><code title="dom-media-readyState">readyState</code> newly increased to <code title="dom-media-HAVE_CURRENT_DATA">HAVE_CURRENT_DATA</code> or greater for the first time.
     <tr>
      <td><dfn title="event-media-canplay"><code>canplay</code></dfn>
      <td><code>Event</code>
      <td>The user agent can resume playback of the <span>media data</span>, but estimates that if playback were to be started now, the <span>media resource</span> could not be rendered at the current playback rate up to its end without having to stop for further buffering of content.
      <td><code title="dom-media-readyState">readyState</code> newly increased to <code title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code> or greater.
     <tr>
      <td><dfn title="event-media-canplaythrough"><code>canplaythrough</code></dfn>
      <td><code>Event</code>
      <td>The user agent estimates that if playback were to be started now, the <span>media resource</span> could be rendered at the current playback rate all the way to its end without having to stop for further buffering.
      <td><code title="dom-media-readyState">readyState</code> is newly equal to <code title="dom-media-HAVE_ENOUGH_DATA">HAVE_ENOUGH_DATA</code>.
     <tr>
      <td><dfn title="event-media-playing"><code>playing</code></dfn>
      <td><code>Event</code>
      <td>Playback is ready to start after having been paused or delayed due to lack of <span>media data</span>.
      <td><code title="dom-media-readyState">readyState</code> is newly equal to or greater than <code title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code> and <code title="dom-media-paused">paused</code> is false, or <code title="dom-media-paused">paused</code> is newly false and <code title="dom-media-readyState">readyState</code> is equal to or greater than <code title="dom-media-HAVE_FUTURE_DATA">HAVE_FUTURE_DATA</code>. Even if this event fires, the element might still not be <span>potentially playing</span>, e.g. if
      the element is <span>blocked on its media controller</span> (e.g. because the <span>current media controller</span> is paused, or another <span title="slaved media elements">slaved media element</span> is stalled somehow, or because the <span>media resource</span> has no data corresponding to the <span>media controller position</span>), or
-     the element is <span>paused for user interaction</span>.
+     the element is <span>paused for user interaction</span> or <span>paused for in-band content</span>.
     <tr>
      <td><dfn title="event-media-waiting"><code>waiting</code></dfn>
      <td><code>Event</code>
      <td>Playback has stopped because the next frame is not available, but the user agent expects that frame to become available in due course.
      <td><code title="dom-media-readyState">readyState</code> is equal to or less than <code title="dom-media-HAVE_CURRENT_DATA">HAVE_CURRENT_DATA</code>, and <code title="dom-media-paused">paused</code> is false. Either <code title="dom-media-seeking">seeking</code> is true, or the <span>current playback position</span> is not contained in any of the ranges in <code title="dom-media-buffered">buffered</code>. It is possible for playback to stop for other reasons without <code title="dom-media-paused">paused</code> being false, but those reasons do not fire this event (and when those situations resolve, a separate <code title="event-media-playing">playing</code> event is not fired either): e.g.
      the element is newly <span>blocked on its media controller</span>, or
-     <span title="ended playback">playback ended</span>, or playback <span>stopped due to errors</span>, or the element has <span>paused for user interaction</span>.
+     <span title="ended playback">playback ended</span>, or playback <span>stopped due to errors</span>, or the element has <span>paused for user interaction</span> or <span>paused for in-band content</span>.
    <tbody>
     <tr>
      <td><dfn title="event-media-seeking"><code>seeking</code></dfn>
      <td><code>Event</code>
      <td>The <code title="dom-media-seeking">seeking</code> IDL attribute changed to true.
      <td>
     <tr>
      <td><dfn title="event-media-seeked"><code>seeked</code></dfn>
      <td><code>Event</code>
      <td>The <code title="dom-media-seeking">seeking</code> IDL attribute changed to false.
      <td>
     <tr>
      <td><dfn title="event-media-ended"><code>ended</code></dfn>
      <td><code>Event</code>
      <td>Playback has stopped because the end of the <span>media resource</span> was reached.
      <td><code title="dom-media-currentTime">currentTime</code> equals the end of the <span>media resource</span>; <code title="dom-media-ended">ended</code> is true.
 
    <tbody>
     <tr>
      <td><dfn title="event-media-durationchange"><code>durationchange</code></dfn>


> Do you have an example, then, for when a video actually goes into the 
> state "paused for user interaction"?

e.g. when you're playing a flash video and it reaches a frame where the 
user has to click one of two buttons to decide where the story goes.

It would never happen for Ogg, H.264, or WebM streams.


On Tue, 7 Jun 2011, Philip Jägenstedt wrote:
> 
> An Icecast stream is conceptually just one infinite audio stream, even 
> though at the container level it is several chained Ogg streams. 
> duration will be Infinity and currentTime will be constantly increasing. 
> This doesn't seem to be a case where any spec change is needed. Am I 
> missing something?

Agreed. It's only for chained video that there seems to be anything that 
might change.


On Wed, 8 Jun 2011, Silvia Pfeiffer wrote:
> 
> That is all correct. However, because it is a sequence of Ogg streams, 
> there are new Ogg headers in the middle. These new Ogg headers will lead 
> to new metadata loaded in the media framework - e.g. because the new Ogg 
> stream is encoded with a different audio sampling rate and a different 
> video width/height etc. So, therefore, the metadata in the media 
> framework changes. However, what the browser reports to the JS developer 
> doesn't change. Or if it does change, the JS developer is not informed 
> of it because it is a single infinite audio (or video) stream. Thus the 
> question whether we need a new "metadatachange" event to expose this to 
> the JS developer. It would then also signify that potentially the number 
> of tracks that are available may have changed and other such 
> information.

None of that information is exposed in the first place.


On Wed, 8 Jun 2011, Philip Jägenstedt wrote:
> 
> As for Ogg and WebM, I'm inclined to say that we just shouldn't support 
> that, unless there's some compelling use case for it. There's also the 
> option of tweaking the muxers so that all the streams are known 
> up-front, even if there won't be any data arriving for them until 
> half-way through the file.

Not supporting changes in videoWidth or videoHeight is fine by me...


On Wed, 8 Jun 2011, Eric Carlson wrote:
>
> The characteristics of an Apple HTTP live stream can change on the 
> fly. For example if the user's bandwidth to the streaming server 
> changes, the video width and height can change as the stream resolution 
> is switched up or down, or the number of tracks can change when a stream 
> switches from video+audio to audio only. In addition, a server can 
> insert segments with different characteristics into a stream on the fly, 
> eg. inserting an ad or emergency announcement.
> 
> It is not possible to predict these changes before they occur.

All except the change in video width and height are already supported in 
the current API, as far as I can tell.

For the videoWidth and videoHeight cases, it would help to know what the 
use cases are for those attributes in the case of dynamic changes, so that 
we can determine if we need to expose the state at any point in time, or 
only the state at the current time, or only the state at the last buffered 
point in time, or the dimensions with the highest product of width and 
height, or some other value.


On Thu, 9 Jun 2011, Eric Carlson wrote:
>
>   In addition, it is possible for a stream to lose or gain an audio 
> track. In this case the dimensions won't change but a script may want to 
> react to the change in audioTracks.

This is already handled. (Actually, you can't lose a track currently, 
since you can always seek back to the point where the track was present. 
But I will probably be adding an event to report that the "earliest 
possible position" has increased past the end point of a track, so that 
the track can be GCed. It's not yet specced.)


On Mon, 20 Jun 2011, Mark Watson wrote:
> 
> Also, as Eric (C) pointed out, one of the things which can change is 
> which of several available versions of the content is being rendered 
> (for adaptive bitrate cases). This doesn't necessarily change any of the 
> metadata currently exposed on the video element, but nevertheless it's 
> information that the application may need. It would be nice to expose 
> some kind of identifier for the currently rendered stream and have an 
> event when this changes. I think that a stream-format-supplied 
> identifier would be sufficient.

Is the track identifier insufficient for this purpose?


On Mon, 20 Jun 2011, Silvia Pfeiffer wrote:
> 
> Well, if videoWidth and videoHeight change and no dimensions on the 
> video are provided through CSS, then surely the video will change size 
> and the display will shrink. That would be a terrible user experience. 
> For that reason I would suggest that such a change not be made in 
> alternative adaptive streams.

Realistically, people will almost always set explicit dimensions on their 
<video> elements, especially if they will do adaptive streaming, so I 
don't think we should optimise for the case where the dimensions are not 
explicit.


On Mon, 20 Jun 2011, Mark Watson wrote:
> 
> I think it would be a bad idea to try and re-invent adaptive streaming 
> in HTML itself.

Agreed. IMHO this should be done over the network protocol.


On Sun, 5 Jun 2011, Silvia Pfeiffer wrote [channeling a coworker]:
>
> * A:[start|middle|end]
>  -- If the [subtitle box] and also the [subtitle text] are aligned by
> the designer within a CSS (file), which setting dominates: CSS or cue
> setting, for both [subtitle box] and [subtitle text]?

I don't understand the question.


>  -- As it is text alignment, for me it is alignment of text within the 
> [subtitle text] element only, but not also alignment/positioning of 
> [subtitle text] element in relation to the [subtitle box]! However, 
> Silvia reckons the anchoring of the box changes with the alignment, so 
> that it is possible to actually middle align the [subtitle box] with 
> A:middle. We wonder which understanding is correct.

I don't really understand this question either. Do you have examples that 
would demonstrate what you mean?


> * T:[number]%
>  -- If the [subtitle box] and also the [subtitle text] are aligned by
> the designer within a CSS (file), which setting dominates: CSS or cue
> setting, for both [subtitle box] and [subtitle text]?
> 
> -- What about it if "T" is used together with A:[start|middle|end]?

Isn't this answered exhaustively by the rendering rules? I don't really 
understand why this would be undefined.


> * S:[number]
>  -- If using S:[number] without "%" (percentage) it is not clear
> whether "px" or "em" is the unit for the text size.

Neither; the unit is based on "vw"s. Again, the rendering part of the spec 
seems to make this clear.


> * cue voice tag
>  -- why are we not using voice name declaration like in the cue class
> tags with a dot separation like <v.VoiceName>voice text</v> and
> without spaces (eg. <v VoiceName>). This could avoid errors by .vtt
> file writers and would also be much clearer to implement.

How would you include both a class and a voice name if you did that?
Or include a name with spaces?



> >> Using this syntax, I would expect some confusion when you omit the closing
> >> </v>, when it's *not* a cue spoken by two voices at the same time, such as:
> >>
> >> <v Jim>- Boo!
> >> <v Bob>- Gah!
> >>
> >> Gah! is spoken by both Jim and Bob, but that was likely not intended. If
> >> this causes confusion, we should make validators warn about multiple
> >> voices with no closing </v>.
> >
> > No need to just warn, the spec says the above is outright invalid, so
> > they would raise an error.
> 
> It would still need parsing. Do we expect it to result in
> <v Jim>- Boo!</v>
> <v Bob>- Gah!
> or
> <v Jim>- Boo!
> <v Jim, Bob>- Gah!
> ?

They would end up nested, not siblings, per the current parser.
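
That is, the resulting cue text tree would be roughly equivalent to (my 
rendering, with the implied end tags made explicit):

   <v Jim>- Boo!
   <v Bob>- Gah!</v></v>

so "- Gah!" ends up inside both voice spans.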


> Also, that raises a question: if the "annotation" in the <v> element 
> concerns multiple people, how do we specify that? Are we leaving this 
> completely to author preference or do we want it to be machine parsable?

The use case was for showing the text to the user, so it wouldn't be 
machine readable. You'd just write it the same way you want it to appear 
to the user. You can use class names if you need a machine-referencable 
hook, e.g. for styling:

   <v.male Bob>Hello</v>

   <v.female Nadine>Hi!</v>

   <v.female.male Both>Are you--</v>


> > On Tue, 4 Jan 2011, Alex Bishop wrote:
> >>
> >> Firefox too. If you visit 
> >> http://people.mozilla.org/~jdaggett/webfonts/serbianglyphs.html in 
> >> Firefox 4, the text explicitly marked-up as being Serbian Cyrillic 
> >> (using the lang="sr-Cyrl" attribute) uses some different glyphs to 
> >> the text with no language metadata.
> >
> > This seems to be in violation of CSS; we should probably fix it there 
> > before fixing it in WebVTT since WebVTT relies on CSS.
> 
> Only when used within browsers...

There's no reason standalone players couldn't also support CSS for WebVTT.


> > I'm not sure what you mean by "made part of the WebVTT specification", 
> > but if you mean that WebVTT should support inline CSS, that does seem 
> > line something we can add, e.g. using syntax like this:
> >
> >   WEBVTT
> >
> >   STYLE-->
> >   ::cue(v[voice=Bob]) { color: green; }
> >   ::cue(c.narration) { font-style: italic; }
> >   ::cue(c.narration i) { font-style: normal; }
> 
> Yup, that's exactly what we need.

I haven't added this yet, but I've filed this bug to not forget about it:

   http://www.w3.org/Bugs/Public/show_bug.cgi?id=15023


> >> WebVTT requires a structure to add header-style metadata. We are here 
> >> talking about lists of name-value pairs as typically in use for 
> >> header information. The metadata can be optional, but we need a 
> >> defined means of adding them.
> >>
> >> Required attributes in WebVTT files should be the main language in 
> >> use and the kind of data found in the WebVTT file - information that 
> >> is currently provided in the <track> element by the @srclang and 
> >> @kind attributes. These are necessary to allow the files to be 
> >> interpreted correctly by non-browser applications, for transcoding or 
> >> to determine if a file was created as a caption file or something 
> >> else, in particular the @kind=metadata. @srclang also sets the base 
> >> directionality for BiDi calculations.
> >>
> >> Further metadata fields that are typically used by authors to keep 
> >> specific authoring information or usage hints are necessary, too. As 
> >> examples of current use see the format of MPlayer mpsub’s header 
> >> metadata [2], EBU STL’s General Subtitle Information block [3], and 
> >> even CEA-608’s Extended Data Service with its StartDate, Station, 
> >> Program, Category and TVRating information [4]. Rather than 
> >> specifying a specific subset of potential fields, we recommend just 
> >> having the means to provide name-value pairs and leaving it to the 
> >> negotiation between the author and the publisher which fields they 
> >> expect of each other.
> >>
> >> [2] http://www.mplayerhq.hu/DOCS/tech/mpsub.sub
> >> [3] https://docs.google.com/viewer?a=v&q=cache:UKnzJubrIh8J:tech.ebu.ch/docs/tech/tech3264.pdf
> >> [4] http://edocket.access.gpo.gov/cfr_2007/octqtr/pdf/47cfr15.119.pdf
> >
> > I don't understand the use cases here.
> >
> > CSS and JS don't have anything like this, why should WebVTT? What 
> > problem is this solving? How did SRT solve this problem?
> 
> SRT doesn't solve it. That's why it's not being used by professionals 
> for subtitling. Most other subtitling formats, however, have means for 
> including metadata, including formats like LRC for music lyrics. CSS and 
> JS don't have metadata, but HTML has through the meta tag.

If HTML's metadata stuff is enough for CSS and JS, why is it not enough 
for WebVTT? I really don't understand the use case here.


> > Adding defaults seems like a reasonable feature. We could add this just by
> > adding the ability to have a block in a VTT file like this:
> >
> >   WEBVTT
> >
> >   DEFAULTS --> A:vertical A:end
> >
> >   00:00.000 --> 00:02.000
> >   This is vertical and end-aligned.
> >
> >   00:02.500 --> 00:05.000
> >   As is this.
> >
> >   DEFAULTS --> A:start
> >
> >   00:05.500 --> 00:07.000
> >   This is horizontal and start-aligned.
> >
> > However, again I suggest that we wait until WebVTT has been deployed 
> > in at least one browser before adding more features like this.
> 
> This is a good idea. Happy to wait, though there are now implementations 
> that are starting to emerge and in particular these DEFAULTS will be 
> very useful to reduce repetition in authoring from the start.

Filed this bug to keep track of this issue:
   http://www.w3.org/Bugs/Public/show_bug.cgi?id=15024


> There are, in particular, questions about what L:100% and T:100% mean - 
> do they position the boxes outside the video viewport?

T:100% just puts the bottom of the box at the bottom of the viewport. 
L:100% puts the end of the box at the end of the viewport. I don't see why 
this would go outside the viewport.
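
Concretely, a sketch with the current settings syntax:

  WEBVTT

  00:00.000 --> 00:02.000 L:100%
  The end edge of this box lines up with the end edge of the viewport;
  the cue stays entirely on screen.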


> Incidentally: would it make sense to have a pixel-based (or em-based) 
> font size specification for "S" as well as the percentage based one?

Not as far as I can tell.


> >> * naming: The usage of single letter abbreviations for cue settings 
> >> has created quite a discussion here at Google. We all agree that 
> >> file-wide cue settings are required and that this will reduce the 
> >> need for cue-specific cue settings. We can thus afford a bit more 
> >> readability in the cue settings. We therefore believe that it would 
> >> be better if the cue settings were short names rather than single 
> >> letter codes. This would be more like CSS, too, and easier to learn 
> >> and get right. In the interface description, the 5 dimensions have 
> >> proper names which could be re-used (“direction”, 
> >> “linePosition”, “textPosition”, “size” and “align"). We 
> >> therefore recommend replacing the single-letter cue commands with 
> >> these longer names.
> >
> > That would massively bloat these files and make editing them a huge 
> > pain, as far as I can tell. I agree that defaults would make it 
> > better, but many cues would still need their own positioning and 
> > sizing information, and anything beyond a very few letters would IMHO 
> > quickly become far too verbose for most people. "L", "A", and "S" are 
> > pretty mnemonic, "T" would quickly become familiar to people writing 
> > cues, and "D" is only going to be relevant to some authors but for 
> > those authors it's pretty self-explanatory as well, since the value is 
> > verbose.
> 
> It took me 6 months before I got used to them for authoring subtitle 
> files, but indeed I have grown accustomed and can deal with them now.

These will most likely change after all, for this bug:

   http://www.w3.org/Bugs/Public/show_bug.cgi?id=14646
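
Purely to illustrate the verbosity trade-off being discussed, using the 
interface names from the quote above (not necessarily the spelling the 
bug will settle on):

  00:00.000 --> 00:02.000 D:vertical L:0 T:50% A:end
  The short forms, as currently specified.

  00:02.500 --> 00:05.000 direction:vertical linePosition:0 textPosition:50% align:end
  The same settings spelled with the longer names.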


> >> We are happy to see the introduction of the magic file identifier for 
> >> WebVTT which will make it easier to identify the file format. We do 
> >> not believe the “FILE” part of the string is necessary.
> >
> > I have removed it.
> 
> Thanks. You should also remove the text ", or the seventh character is 
> neither a U+0020 SPACE character nor a U+0009 CHARACTER TABULATION (tab) 
> character," from step 7 of the parsing, since such a seventh character 
> does not need to exist at all.

That would make it impossible to e.g. have Emacs modelines on the first 
line.


> >> However, we recommend also introducing a format version number that 
> >> the file adheres to, e.g. “WEBVTT 0.7”.
> >
> > Version numbers are an antipattern on the Web, so I have not added 
> > one.
> 
> We can have it underneath the file magic in another line of the header 
> now where the metadata will be (or will be when we make V2 of the format 
> ;-), so that's fine.

We should never have a version number, anywhere.


> >> * Voice synthesis of e.g. mixed English/French captions. Given that 
> >> this would only be useful to people who know both languages, it 
> >> seems not worth complicating the format for.
> 
> I disagree with the third case. Many people speak more than one language 
> and even if they don't speak the language that is in use in a cue, it is 
> still bad to render it using the wrong language model, in particular 
> if it is rendered by a screen reader. We really need a mechanism to 
> attach a language marker to a cue segment.

You think we should add a feature to WebVTT specifically for the use case 
of audio synthesis of subtitles that contain text in two languages for 
users who understand both languages?

That seems like a rather esoteric case to be something to handle in v1, 
if ever.


> > On Wed, 9 Feb 2011, Silvia Pfeiffer wrote:
> >>
> >> We're trying to avoid the need for multiple transcodings and are 
> >> trying to achieve something like the following pipeline: broadcast 
> >> captions -> transcode to WebVTT -> show in browser -> transcode to 
> >> broadcast devices -> show
> >
> > Why not just do:
> >
> >   broadcast captions -> transcode to WebVTT -> show in browser
> >
> > ...for browsers and:
> >
> >   broadcast captions -> show
> >
> > ...for legacy broadcast devices?
> 
> Eventually, we will want to get rid of the legacy format and just 
> deliver WebVTT, but they still need to display as though they came from 
> the original broadcast caption format for contractual reasons.

Change the contracts. We shouldn't be designing a format for the next 
hundred plus years around today's contracts for legacy content. That's 
completely backwards.


On Wed, 8 Jun 2011, Philip Jägenstedt wrote:
> > > 
> > > When would one want these descriptions to be multi-language?
> > 
> > When they are describing something that is inherently multi-cultural. 
> > For example, the name of a restaurant which is in French, while the 
> > describer language is English.
> 
> Does this kind of thing currently work with screen readers? Non-French 
> people speaking English don't switch to proper French pronunciation when 
> saying something like "I'm really into film noir" or "The general 
> assumed political power through a coup d'etat", so what do screen reader 
> users actually want? If one doesn't know French, it seems like it would 
> be harder to understand.
> 
> For languages further removed from English I'm fairly certain no English 
> speaker would want to hear the original pronunciation. Imagine 
> pronouncing "Mexico" in Spanish or "Beijing" in Mandarin Chinese in the 
> middle of an English text... I'm certain it would confuse people more 
> than help them understand.

Indeed.


On Fri, 10 Jun 2011, Silvia Pfeiffer wrote:
> 
> In the parsing section for cues, step 27, the cue size default is set to 
> 100. This means that every cue that has no explicit size setting ("S:") 
> will occupy the full width of the video viewport (height if vertical 
> rendering), even if the displayed text is only short, such as "[music]". 
> I believe that is not the best default means of rendering subtitles and 
> captions, because more of the video's pixels are obstructed than is 
> necessary by the cue background box with its dark grey background 
> rgba(0,0,0,0.8).

The gray background only applies to the inline wrapper, not the box.


> In the parsing section for cues, step 25, the default line position for 
> cues is 'auto' and the default snap-to-lines flag is true. For cues that 
> have no explicit line position setting ("L:"), this means that the 
> height of the cue ends up getting a y-position of 0 (see Section 2 with 
> the WebVTT cue text rendering rules, step 10, substep 9, first case). 

I'm not sure what you mean by "the height of the cue ends up getting a 
y-position". The cue in that step gets a y-position of zero, but that is 
not its final position if the snap-to-lines flag is set; that's just the 
position used to get the line box height before the actual y-position is 
determined.


> 3. Calculation of Text Track cue line position
> 
> Assuming we've set an "L:100%" on a cue, then according to Section 2,
> step 10, substep 9, second case we arrive at a y-position of 100,
> leading to the setting of "top" to 100% of the video's height. This
> means that the cue will disappear beyond the bottom of the video
> viewport. Is that intended?

Again, that position is not the final position, it's just the position in 
order to calculate the box dimensions. The actual position is then 
recalculated a few steps lower, such that in the case you describe, the 
box is aligned with its bottom at the bottom of the viewport.


> Also, shouldn't the caption text box have been centered in the middle of 
> the caption text box's height at the L position rather than at the top 
> of that box?

I don't understand the question. Can you elaborate with an example?


> Similarly as for the vertical line positioning, I wonder whether there 
> is a problem with the horizontal "T:" text positioning. When we specify 
> T:25% on an A:middle cue box, the box is moved half its size to the left 
> of the T position, i.e. it ends up at -12.5% of the video viewport's 
> width. Is that intended? Should there be a way to limit how far a box 
> can be moved off the video viewport? Should it continue to be visible 
> when moved off the video viewport?

I think you are misreading the spec. The x-position in the case you 
describe is entirely unaffected by the alignment and T: values (modulo 
line wrapping). Do you mean the y-position? The size in the case you give 
is twice the T: position (so 50%), which leads to a y-position of zero 
(T:25% minus half the 50% size is 25 minus 25 is zero).
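
Spelling out that arithmetic:

  size       = 2 * T      = 2 * 25% = 50%
  y-position = T - size/2 = 25% - 25% = 0%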


On Mon, 27 Jun 2011, Silvia Pfeiffer wrote:
> 
> What Ronny says there is that in his implementation the default display 
> size of the cue (i.e. the dark box that the cue is displayed in) is only 
> as wide as the longest line in the cue (or high where we're dealing with 
> vertical direction). Currently, the spec specifies a default of S:100%.

How can you know the width of the widest line before you know the size?

The size we're talking about here isn't the size of the background box, 
it's the size of the block into which the captions are rendered.


> 2. Cue voice tag:
> "this differs from specs in the way that opened <v> voice tags should
> be closed with </v>"
> 
> Ronny's point is that the <v> element is expected to be closed,
> because it makes it easier to parse. So, instead of:
> 
> 00:01:07.395 --> 00:01:10.246
> <v John Do>Hey!
> <v Jane Doe>Hello!
> 
> he expects:
> 
> 00:01:07.395 --> 00:01:10.246
> <v John Do>Hey!</v>
> <v Jane Doe>Hello!</v>
> 
> I think the same is true for his implementation of the <c> class tags.

As far as I can tell, this is based on a misreading of the specification. 
Either I misunderstand your comment, or the spec already says what this 
suggests it should say.


On Sun, 7 Aug 2011, Silvia Pfeiffer wrote:
> 
> I am right now trying to figure out how vertical growing left cues (i.e. 
> cues with a cue rendering setting of "D:vertical") are rendered.
> 
> If nothing else is set on the cue, my expectation would be that the cue 
> would be rendered on the right side of the video viewport, since it's 
> growing to the left.
> 
> As I follow through the algorithm at
> http://www.whatwg.org/specs/web-apps/current-work/webvtt.html#webvtt-cue-text-rendering-rules
> , I find that the default settings are:
> * the text track cue line position default is "auto",
> * the snap-to-lines flag is "true" by default,
> * block flow is left to right
> and in step 9 we get:
> "If the text track cue writing direction is vertical growing left, and
> the text track cue snap-to-lines flag is set, let x-position be zero".
> 
> I think this is incorrect and should be "..., let x-position be 100"
> so as to allow the text boxes to flow onto the video viewport from the
> right boundary, rather than off its left border.

Again, this number is not the number used for positioning. It's just a 
temporary number used for sizing.


On Tue, 9 Aug 2011, Silvia Pfeiffer wrote:
> 
> It seems that where we have specified how to parse the cue settings, we 
> only allow a single white space as separator between subsequent cue 
> settings: 
> http://www.whatwg.org/specs/web-apps/current-work/webvtt.html#parse-the-webvtt-settings
> 
> Thus, something like this is allowed: "D:vertical A:middle"
> but not something like this: "D:vertical         A:middle".
> 
> I think we need to add a skip white space in step three.

I think this is fixed now.


> While the syntax spec says "The time represented by this WebVTT 
> timestamp must be greater than or equal to the start time offsets of all 
> previous cues in the file.", there is no step in the parser that will 
> ascertain that cues that come out of order are dropped on the floor. Do 
> we need to include such a requirement before step 40 of the parser?

Why would we drop them on the floor?


On Wed, 20 Jul 2011, Marc 'Tafouk' wrote:
> 
> I have another question about self-closing tags in cue text. It seems 
> they're not supported at all. The U+002F SOLIDUS character (/) is only 
> handled in the WebVTT tag state.
> 
> Test case 1-a):
>    WEBVTT
> 
>    00:00.000 --> 00:02.000
>    Initial <b/> test
> 
> U+0062 (b) triggers "WebVTT start tag state"; U+002F is then handled as 
> "Anything else" and is appended to result (tagname = "b/").

Right.


> Test case 1-b):
>    WEBVTT
> 
>    00:00.000 --> 00:02.000
>    Initial <b /> test
> 
> U+0062 (b) triggers "WebVTT start tag state"; U+0020 (space) triggers 
> "WebVTT start tag annotation state"; U+002F is handled as "Anything 
> else" and is appended to buffer (annotation = "/").

Right.


> I am aware those may be moot atm because there is no void element AFAIK, 
> and the current tags make no sense when immediately closed.

Well, also, the /> syntax thing is an XMLism and this isn't XML.


> I also found a slight issue when following the parser specs: there is no 
> validation of the class attribute.
> 
> Test case 2):
>    WEBVTT
> 
>    00:00.000 --> 00:02.000
>    Second <c.......... [my annotation]> test
> 
> classes is a list of 10 empty strings.

When you create the WebVTT Internal Node Object, empty classes are 
dropped.


On Wed, 20 Jul 2011, Silvia Pfeiffer wrote:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.html#attach-a-webvtt-internal-node-object 
> says to attach the list of classes to the element. Right now, all 
> characters are allowed for class names bar space, tab, "." and ">". It 
> might indeed be an idea to restrict these characters to those allowed for 
> class names in HTML.

HTML allows even more characters (everything except whitespace).


On Sun, 19 Jun 2011, Rodger Combs wrote:
>
> There are a few possible cases when JavaScript may need to add, remove, 
> read, or modify a cue from a <track>:
>
> 1. A web-based caption editor
>
> 2. Parsing captions from an external non-WebVTT file (retrieved with 
> XHR, EventSource [for live videos], WebSocket, etc.)
>
> 3. Live translating of captions using an external translation API
>
> Adding a set of methods to the TextTrackCueList for cue modification 
> could be useful. Here's an example interface:
> 
> TextTrackCue addCue(in double startTime, in double endTime, in DOMString text, in optional DOMString[] flags);

Already exists as:

track.addCue(new TextTrackCue(id, startTime, endTime, text, settings, 
pauseOnExit));
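
For example, the XHR case in your list could look something like this 
sketch (parseMyFormat() is a hypothetical stand-in for whatever 
non-WebVTT parser the page provides, and the track is assumed to come 
from addTextTrack()):

  var video = document.querySelector('video');
  var track = video.addTextTrack('captions', 'English', 'en');
  var xhr = new XMLHttpRequest();
  xhr.open('GET', 'captions.xyz');
  xhr.onload = function () {
    // Assume parseMyFormat() returns an array of cue records of the
    // form { id, startTime, endTime, text }.
    parseMyFormat(xhr.responseText).forEach(function (c) {
      track.addCue(new TextTrackCue(c.id, c.startTime, c.endTime,
                                    c.text, '', false));
    });
  };
  xhr.send();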


> void removeCue(in TextTrackCue);

Already exists as:

track.removeCue(cue);


> void removeCueById(in DOMString id);

Already exists as:

track.removeCue(track.cues.getCueById(id));

(Should we make getCueById() a getter?)


> Also, I recommend that in TextTrackCue, startTime, endTime, and 
> pauseOnExit are made non-readonly, and that "attribute DOMString text;" 
> is added to the interface.

Done.


On Wed, 21 Sep 2011, Philip Jägenstedt wrote:
> 
> If you look at the source of the spec, you'll find comments as a v2 
> feature request:
> 
> COMMENT -->
> this is a comment, bla bla
> 
> I do not think this would be very useful. As a one-line comment at the 
> top of the file (for authorship, etc) it is rather verbose and ugly, 
> while for commenting out cues you would have to comment out each cue 
> individually. It also doesn't work inside cues, where something like <! 
> comment > is what would be backwards compatible with the current parser. 
> If comments are left for v2, the above is what it'll be, because of 
> compatibility constraints. If anyone is less than impressed with that, 
> now would be the time to suggest an alternative and have it spec'd.

I've removed the comment in the spec. It wouldn't work well with the 
recent parser changes anyway.


> The WebVTT layout algorithm tries to not move cues around once they've 
> been displayed and to never obscure other cues. This means that for cues 
> that overlap in time, the rendering will often be out of order, with the 
> earliest cue at the bottom. This is quite contrary to the (mainly US?) 
> style of (live) scrolling captions, where cues are always in order and 
> scroll to bring new captions into view. (I am not suggesting any 
> specific change.)

Live captions are different, IMHO. They're one cue that happens to have 
lots of data over a long period of time, not multiple cues.


> Scaling the font size with the video will not be optimal for either 
> small screens (text will be too small) or very large screens (text will 
> be too big). Do we change the default rendering in some way, or do we 
> let users override the font size? If users can override it, do we care 
> that this may break the intended layout of the author?

How small are we talking about here? DVD content seems to work fine with 
subtitles being a fixed size relative to the video size. If you can see 
the video content, you can see the subtitles, surely.

Naturally, people will always be able to override the author.


> The parser is fairly strict in some regards:
> 
> * double id line discards entire cue
> (http://www.w3.org/Bugs/Public/show_bug.cgi?id=13943)
> * must use exactly 2 digits for minutes and seconds
> * minutes and seconds must be <60
> * must use "." as the decimal separator
> * must use exactly 3 decimal digits
> * stray "<" consumes the rest of the cue text

For an overview of the design philosophy here, please see this comment on 
the above bug:

   https://www.w3.org/Bugs/Public/show_bug.cgi?id=13943#c17


> In most systems chapters are really chapter markers, a point in time. A 
> chapter implicitly ends when the next begins. For nested chapters this 
> isn't so, as the end time is used to determine nesting. Do we expect 
> that UIs for chapter navigation make the end time visible in some 
> fashion (e.g. highlighting the chapter on the timeline)

Realistically, I don't expect to see timelines... but if the start and end 
times are different, that is certainly a sensible thing to use them for.


> or that when a chapter is chosen, it will pause at the end time?

I certainly don't expect any pausing behaviour, though UAs are of course 
welcome to do that if that's what their users want.


> A suggestion that was brought up when discussing chapters. When one 
> simply wants the chapter to end when the next starts, it's a bit of a 
> hassle to always include the end time. Some additional complexity in the 
> parser could allow for this:
> 
> 00:00.000 --> next
> Chapter 1
> 
> 01:00.000 --> next
> Intermezzo
> 
> 02:00.000 --> next
> Last Chapter
> 
> Cues would be created with endTime = Infinity, and be modified to the 
> startTime of the following cue (in source order) if there is a following 
> cue. This would IMO be quite neat, but is the use case strong enough?

The same feature would be useful for captions sometimes, actually. I don't 
know how strong the use case is. It's purely syntactic sugar. It probably 
depends on how common it is for VTT files to be hand-authored vs written 
with tools.
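
For what it's worth, the processing model you describe is trivial for a 
tool or script to emulate today; a sketch, assuming open-ended cues were 
given endTime = Infinity as described and the array is in source order:

  // Give each open-ended cue an end time equal to the start time of the
  // next cue in source order; the last such cue keeps Infinity.
  function resolveNextEndTimes(cues) {
    for (var i = 0; i < cues.length - 1; i++) {
      if (cues[i].endTime === Infinity)
        cues[i].endTime = cues[i + 1].startTime;
    }
    return cues;
  }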


On Wed, 21 Sep 2011, Ralph Giles wrote:
> 
> I don't like the format either. I do think it's very important we have 
> some mechanism for multi-line file level metadata, embedded css, etc. so 
> the files can live on their own.

What is the use case?


> The syntax section also suggests all metadata has to be on the signature 
> line, while the parser will actually skip everything between the 
> signature and the first double line terminator.

The stuff on the signature line is not metadata; what suggests that it is? 
I just allow stuff there to allow things like Emacs mode lines to be 
ignored.


> For in-caption, <! comment> is a good idea. Semantically it's a bit 
> weird to not mention it in the spec, since everything else has an end 
> tag, but the parser will ignore it as we want.

What's the use case for inline comments?


> I'm not normally one for restrictions, but the parser also says the 
> (optional) hours field must have "two or more" digits, with no maximum 
> value specified.
> 
> If we all agree on an implementation limit, it could be helpful to 
> specify one. Storing milliseconds in a 32 bit type gives a little over 
> 1000 hours of timestamps. Single-precision float runs out of useful 
> precision after about 50 hours. I'd suggest a two or three digit limit 
> on hours to avoid requiring a 64 bit type. If we don't care about that, 
> then 10 digits is a reasonable limit to avoid running out of precision 
> with doubles.

As a general rule, I think we should avoid defining limits for things that 
will naturally get less limited over time.


On Wed, 21 Sep 2011, Glenn Maynard wrote:
> 
> My take on in-caption comments was to put them in a class and hide the 
> class, with the advantage of the comments being available 
> programmatically (you can toggle them on for editing purposes by 
> un-hiding the class) and requiring no additional specification, though 
> the disadvantage that you need a stylesheet to properly view the 
> resulting file.  To avoid that, maybe a separate span type would be 
> better, analogous to the "hidden" HTML attribute.

What's the use case, though? If it's notes to a translator, or notes about 
uncertain captioning, presumably you would want to strip those out before 
publishing the captions.
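
(For reference, the approach described would presumably look something 
like this sketch, with the hiding left to the page's style sheet, and 
assuming ::cue() ends up allowing a property that can suppress the span, 
e.g. something along the lines of "::cue(c.note) { color: transparent; }":)

  WEBVTT

  00:01.000 --> 00:04.000
  Actual caption text <c.note>editor note: check this phrasing</c.note>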


> (I don't want to restart the whole comment discussion, but it'd be 
> unfortunate if the desired syntax for inline comments never happens due 
> to being punted to v2 and then being impossible to make 
> backwards-compatible with v1.)

I don't think that's a concern, since unknown tags get dropped. We can 
always add it later if it is something we need.



On Thu, 29 Sep 2011, Silvia Pfeiffer wrote:
> 
> Also note that YouTube is experimenting with richer captions, see 
> http://www.youtube.com/watch?v=0xTURXWoJ6A (check the different caption 
> tracks). These represent some of the features that the US TV standard 
> CEA608/708 captions support, so we need to make sure they are also 
> supported by browsers, otherwise we get a lower quality result with 
> captions on the Web than we get with captions on TV.

I think we are already at a higher quality level with WebVTT as it is 
today. We don't need to be a superset to be better.


> What happens with the new lines that are created by wrapping should, 
> however, be defined better than what we have right now. In other 
> existing caption formats, there is the concept of an "anchor". The box 
> into which the caption text is rendered is "anchored" to the video by 
> choosing a point inside the one-line caption cue box and a point on the 
> video viewport and pinning those two points together. The box then 
> grows around that point in equal parts. For example, if the box is 
> anchored at its top 
> middle point and assuming horizontally rendered text, the box will grow 
> down from that point. If it's anchored at the bottom middle point, the 
> box will grow up (even if the text is wrapped down and grows down - i.e. 
> the first line will be moved up before the second line is rendered).

For block-progression-direction positioning, we already essentially have 
this. For inline-progression-direction positioning, if we need to support 
the same model as for block-position we can add support for it as a new 
unit on the inline-position setting ("T:" currently). But I don't think we 
need to; we already provide a way to anchor to a specific position.


On Wed, 5 Oct 2011, Simon Pieters wrote:
>
> I did some research on authoring errors in SRT timestamps to inform 
> whether WebVTT parsing of timestamps should be changed.

Thanks!


> [...] 65,000 files [...]
> Grepping for lines that contain "-->" resulted in 52,000,000 lines [...]
>
> Of those, there were 31,900 lines that are invalid, i.e. don't match the 
> python regexp 
> '\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.

Wow, 31,900 is 0.06%, which is a really low error rate, at least compared 
to HTML (which is in the single digit percentages at best, probably double 
digits).

Looking at the errors you listed, ignoring those that occurred in at most 
1% of files (i.e. fewer than 650 occurrences), we get:

> 00834: hours too many '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'

...but in another e-mail you said that 671 of these came from one file 
whose hours were all "255", so I'll ignore this one.


> 00889: seconds too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'

We could zero pad by default, but that seems a bit dodgy -- what if the 
missing digit is not zero? Given how rare this is, I wonder what causes 
> it. Is it a hand-authoring mistake? Were the seconds always "0" in these 
cases, or were they non-zero seconds?

How many files did this affect?


> 00922: spaces in timestamp '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not '(\d+[:\.,]){2,3}\d+'

Odd. Anecdotally, any idea what was going on with these?

How many files did this affect?


> 02085: decimals too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'

I wonder if people saying 0.1s mean 0.001s or 0.100s. If the latter, we 
can probably support this without too much trouble. Do you have any 
insight into this? e.g. what were the other times around such shortened 
times? Were they also short? e.g. did it ever go ...:0.9 --> ...:0.800 
(meaning the time was intended to be milliseconds), or was 0.9 always 
followed by a time greater than a second later (meaning the time was 
intended to be a fraction)?

How many files did this affect?


> 25372: dot instead of comma '\d+[:\.,]\d+[:\.,]\d+\.\d+'

The spec actually only allows a dot, so really that's 51,974,628 lines 
that used a comma instead of a dot... This was an intentional choice; part 
of converting an SRT file to VTT is to mechanically change this. (It seems 
more likely that people would use a period mistakenly instead of a comma 
than vice versa, so hopefully this isn't an authoring problem for us.)
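
The mechanical part of that conversion is trivial, for what it's worth; 
e.g. a sketch of the timestamp fix-up in script:

  // Turn an SRT timing line's comma decimal separators into dots, e.g.
  // "00:00:01,000 --> 00:00:02,000" becomes the WebVTT equivalent.
  function fixTimingLine(line) {
    return line.replace(/(\d\d),(\d{3})/g, '$1.$2');
  }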



On Wed, 5 Oct 2011, David Singer wrote:
> 
> I rather expect that there may be people tempted to write an 
> implementation that will ingest SRT and VTT, and unify their parsing to 
> cope with either. "Be strict with what you produce, and liberal with 
> what you accept" is a maxim for at least some people, also.  And being 
> strict with HTML (I seem to recall that one of the features of XHTML was 
> that nothing was supposed to show when documents had errors) didn't get 
> a lot of traction, either.

Yeah, but HTML didn't have a defined parser. VTT does. (HTML does too now, 
and browsers are converging on it.)


On Wed, 5 Oct 2011, Ralph Giles wrote:
> 
> A point Philip Jägenstedt has made is that it's sufficiently tedious to 
> verify correct subtitle playback that authors are unlikely to do so with 
> any vigilance. Therefore the better trade-off is to make the parser 
> forgiving, rather than inflict the occasional missing cue on viewers.

Being forgiving but misinterpreting the times will still inflict missing 
cues, it'll just additionally inflict broken cues elsewhere.


On Thu, 6 Oct 2011, Philip Jägenstedt wrote:
> 
> To clarify, I have certainly never suggested that implementation do 
> anything other than follow the spec to the letter. I *have* suggested 
> that the parsing spec be more tolerant of certain errors, but looking at 
> the extremely low error rates in our sample I have to conclude that 
> either (1) the data is biased or (2) most of these errors are not common 
> enough that they need to be handled.

Agreed (on both counts).


On Mon, 24 Oct 2011, Simon Pieters wrote:
>
> I wanted to research how common it is to fail to separate cues in SRT, 
> and for what reason.
> 
> SRT parsers usually interpret a timings line as a new cue, while WebVTT 
> wants a blank line (two line terminators) before a new cue.
>
> I took the 65k SRT files we've got, replaced comma with dot and 
> prepended "WEBVTT\n\n", then ran them in Opera's <track> impl, looking 
> for '-->' in cue data.
> 
> There were 840 files with --> in cue data. This is 1.3% of the files.
> 
> Looking at the cue data, there were 11,118 lines that contained -->. 
> There were 8830 lines of only whitespace.
>
> In the cue data, if I look at valid-looking timing lines 
> (/^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)/) and 
> check the line before that, or the line before *that* if it looks like 
> an SRT id (/^\d+\s*$/), then I see 7030 lines of only whitespace and 
> 3761 lines of something else.
> 
> Failing to separate cues results in an unpleasant experience for the 
> user, since basically the screen is filled with several "cues" including 
> their IDs and timing lines.
> 
> Some files had most or all of their cues parsed as a single cue with the 
> WebVTT parser, e.g. because all lines ended with one or more spaces. 
> Looking at such a file in a text editor, it's not immediately obvious 
> that there's an error, because the spaces are not visible. Moreover, the 
> file is not non-conforming, so a validator wouldn't help either.
> 
> So what about the cases that aren't whitespace? It seems to be mostly 
> just missing the newline completely. Some omitted the ID also. One file 
> had a "|" between all cues.
> 
> My recommendation is http://www.w3.org/Bugs/Public/show_bug.cgi?id=14550

Thanks for this data. The spec has since been updated to scan for the 
timing line rather than just a blank line. A blank line still ends a cue, 
though. Is that a problem?

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

