[whatwg] Video feedback

Ian Hickson ian at hixie.ch
Thu Jun 2 16:28:45 PDT 2011


(Note that I have tried to only reply to each suggestion once, so 
subsequent requests for the same feature are not included below.)

(I apologise for the somewhat disorganised state of this e-mail. I 
normally try to group topics together, but the threads I'm responding to 
here jumped back and forth across different issues quite haphazardly and 
trying to put related things together broke some of the flow and context 
of the discussions, so I opted in several places to leave the context as 
it was originally presented, and just jump back and forth amongst the 
topics raised. Hopefully it's not too confusing.)

On Thu, 9 Dec 2010, Silvia Pfeiffer wrote:
> >> > >
> >> > > Sure, but this is only a snippet of an actual application. If, 
> >> > > e.g., you want to step through a list of videos (maybe an 
> >> > > automated playlist) using script and you need to provide at least 
> >> > > two different formats with <source>, you'd want to run this 
> >> > > algorithm frequently.
> >> >
> >> > Just have a bunch of <video>s in the markup, and when one ends, 
> >> > hide it and show the next one. Don't start dynamically manipulating 
> >> > <source> elements, that's just asking for pain.
> >> >
> >> > If you really must do it all using script, just use canPlayType and 
> >> > the <video src=""> attribute, don't mess around with <source>.
> >>
> >> Thanks for adding that advice. I think it's important to point that 
> >> out.
> >
> > I can add it to the spec too if you think that would help. Where would 
> > a good place for it be?
> 
> There is a note in the <source> element section that reads as follows: 
> "Dynamically modifying a source element and its attribute when the 
> element is already inserted in a video or audio element will have no 
> effect. To change what is playing, either just use the src attribute on 
> the media element directly, or call the load() method on the media 
> element after manipulating the source elements."
> 
> Maybe you can add some advice there to use canPlayType to identify what 
> type of resource to add in the @src attribute on the media element. 
> Also, you should remove the last half of the second sentence in this 
> note if that is not something we'd like to encourage.

Done.


On Wed, 8 Dec 2010, Kevin Marks wrote:
> 
> One case where posters come back after playback is complete is when 
> there are multiple videos on the page, and only one has playback focus 
> at a time, such as a page of preview movies for longer ones to purchase.
> 
> In that case, showing the poster again on blur makes sense conceptually.
> 
> It seems that getting back into the pre-playback state, showing the 
> poster again would make sense in this context.
> 
> That would imply adding an unload() method that reverted to that state, 
> and could be used to make any cached media data purgeable in favour of 
> another video that is subsequently loaded.

You don't need unload(), you can just use load(). It essentially resets 
the media element.

It's not hugely efficient, but if we find people are trying to do this a 
lot, then we can add a more efficient variant that just resets the poster 
frame state, I guess. (I'd probably call it stop(), though, not unload().)
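
For example, something like this (untested, and assuming one <video> per 
"slot" on the page) would reset whichever video was previously playing 
back to its poster frame:

   function focusVideo(next) {
     var videos = document.getElementsByTagName('video');
     for (var i = 0; i < videos.length; i += 1) {
       if (videos[i] != next && !videos[i].paused) {
         // load() aborts playback, empties the element, and re-runs
         // resource selection, which brings the poster back.
         videos[i].load();
       }
     }
     next.play();
   }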


On Thu, 9 Dec 2010, David Singer wrote:
>
> I think if you want that effect, you flip what's visible in an area of 
> the page between a playing video, and an image.  Relying on the poster 
> is not effective, IMHO.

I don't know, I think it would make semantic sense to have all the videos 
be <video> elements if they're actually going to be played right there.


On Thu, 9 Dec 2010, Kevin Marks wrote:
>
> I know it's not effective at the moment; it is a common use case. 
> QuickTime had the 'badge' ux for years that hardly anyone took advantage 
> of:
> 
> http://www.mactech.com/articles/mactech/Vol.16/16.02/Feb00QTToolkit/index.html
> 
> What we're seeing on the web is a converged implementation of the 
> YouTube-like overlaid grey play button, but this is effectively 
> reimplemented independently by each video site that enables embedding.
> 
> As we see HTML used declaratively for long-form works like ebooks on 
> lower performance devices, having embedded video that doesn't 
> cumulatively absorb all the memory available is going to be like the old 
> CD-ROM use cases the QT Badge was meant for.

This seems like a presentational issue, for which CSS would be better 
positioned to provide a solution.


On Thu, 9 Dec 2010, Boris Zbarsky wrote:
> On 12/8/10 8:19 PM, Ian Hickson wrote:
> > Boris wrote:
> > > You can't sniff in a toplevel browser window.  Not the same way that 
> > > people are sniffing in <video>. It would break the web.
> > 
> > How so?
> 
> People actually rely on the not-sniffing behavior of UAs in actual 
> browser windows in some cases.  For example, application/octet-stream at 
> toplevel is somewhat commonly used to force downloads without a 
> corresponding Content-Disposition header (poor practice, but support for 
> Content-Disposition hasn't been historically great either).
> 
> > (Note that the spec as it stands takes a compromise position: the 
> > content is only accepted if the Content-Type and type="" values are 
> > supported types (if present) and the content sniffs as a supported 
> > type, but nothing in the spec checks that all three values are the 
> > same.)
> 
> Ah, I see.  So similar to the way <img> is handled...
> 
> I can't quite decide whether this is the best of both worlds, or the 
> worst. ;)

Yeah, I hear ya.


> It certainly makes it simpler to implement video by delegating to 
> QuickTime or the like, though I suspect such an implementation would 
> also end up sniffing types the UA doesn't necessarily claim to 
> support.... so maybe it's not simpler after all.

Indeed.

At this point I'm basically just waiting to see what implementations end 
up doing. When I tried moving us more towards sniffing, there was 
pushback; when I tried moving us more towards honouring types, there was 
equal and opposite pushback. So at this point, I'm letting the market 
decide it. :-)


On Thu, 9 Dec 2010, Simon Pieters wrote:
> On Thu, 09 Dec 2010 02:58:12 +0100, Ian Hickson <ian at hixie.ch> wrote:
> > On Wed, 1 Sep 2010, Simon Pieters wrote:
> > > 
> > > I think it might be good to run the media element load algorithm 
> > > when setting or changing src on <source> (that has a media element 
> > > as its parent), but not type and media (what's the use case for type 
> > > and media?). However it would fire an 'emptied' event for each 
> > > <source> that changed, which is kind of undesirable. Maybe the media 
> > > element load algorithm should only be invoked if src is set or 
> > > changed on a <source> that has no previous sibling <source> 
> > > elements?
> > 
> > What's the use case? Just set .src before you insert the element.
> 
> The use case under discussion is changing to another video. So the 
> element is already inserted and already has src.
> 
> Something like:
> 
> <video controls autoplay>
> <source src=video1.webm type=video/webm>
> <source src=video1.mp4 type=video/mp4>
> </video>
> <script>
> function loadVideo(src) {
>  var video = document.getElementsByTagName('video')[0];
>  var sources = video.getElementsByTagName('source');
>  sources[0].src = src + '.webm';
>  sources[1].src = src + '.mp4';
> }
> </script>
> <input type="button" value="See video 1" onclick="loadVideo('video1')">
> <input type="button" value="See video 2" onclick="loadVideo('video2')">
> <input type="button" value="See video 3" onclick="loadVideo('video3')">

Well if you _really_ want to do that, just call video.load() at the end of 
loadVideo(). But really, you're better off poking around with 
canPlayType() and setting video.src directly instead of using <source> 
for these dynamic cases.
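
Concretely, the first option is just the quoted loadVideo() with one 
extra line (a sketch):

   function loadVideo(src) {
     var video = document.getElementsByTagName('video')[0];
     var sources = video.getElementsByTagName('source');
     sources[0].src = src + '.webm';
     sources[1].src = src + '.mp4';
     video.load(); // re-runs the resource selection algorithm
   }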


On Thu, 9 Dec 2010, Kevin Carle wrote something more or less like:
> 
> function loadVideo(src) {
>  var video = document.getElementsByTagName('video')[0];
>  if (video.canPlayType("video/webm") != "")
>    video.src = src + '.webm';
>  else
>    video.src = src + '.mp4';
> }

Yeah.

And hopefully this will become moot when there's a common video format, 
anyway.


On Fri, 10 Dec 2010, Simon Pieters wrote:
> 
> You'd need to remove the <source> elements to keep the document valid.

You don't need them in the first place if you're doing things by script, 
as far as I can tell.


> The author might want to have more than two <source>s, maybe with 
> media="", onerror="" etc. Then it becomes simpler to rely on the 
> resource selection algorithm.

It's hard to comment without seeing a concrete use case.


On Tue, 14 Dec 2010, Philip Jägenstedt wrote:
> On Wed, 24 Nov 2010 17:11:02 +0100, Eric Winkelman <E.Winkelman at cablelabs.com>
> wrote:
> >
> > I'm investigating how TimedTracks can be used for in-band-data-tracks 
> > within MPEG transport streams (used for cable television).
> > 
> > In this format, the number and types of in-band-data-tracks can change 
> > over time.  So, for example, when the programming switches from a 
> > football game to a movie, an alternate language track may appear that 
> > wasn't there before. Later, when the programming changes again, that 
> > language track may be removed.
> > 
> > It's not clear to me how these changes are exposed by the proposed 
> > Media Element events.
> 
> The thinking is that you switch between different streams by setting the 
> src="" attribute to point to another stream, in which case you'll get an 
> emptied event along with another bunch of events. If you have a single 
> source where audio/video/text streams appear and disappear, there's not 
> really any way to handle it.

As specified, there's no way for a media element's in-band text tracks to 
change after the 'loadedmetadata' event has fired.


> > The "loadedmetadata" event is used to indicate that the TimedTracks 
> > are ready, but it appears that it is only fired before playback 
> > begins.  Is this event fired again whenever a new track is discovered?  
> > Is there another event that is intended for this situation?
> > 
> > Similarly, is there an event that indicates when a track has been 
> > removed? Or is this also handled by the "loadedmetadata" event 
> > somehow?
> 
> No, the loadedmetadata event is only fired once per resource, it's not 
> the event you're looking for.
> 
> As for actual solutions, I think that having loadedmetadata fire again 
> if the number or type of streams change would make some sense.

It would be helpful to know more about these cases where there are dynamic 
changes to the audio, video, or text tracks. Does this really happen on 
the Web? Do we need to handle it?


On Thu, 16 Dec 2010, Silvia Pfeiffer wrote:
> 
> I do not know how technically the change of stream composition works in 
> MPEG, but in Ogg we have to end a current stream and start a new one to 
> switch compositions. This has been called "sequential multiplexing" or 
> "chaining". In this case, stream setup information is repeated, which 
> would probably lead to creating a new steam handler and possibly a new 
> firing of "loadedmetadata". I am not sure how chaining is implemented in 
> browsers.

Per spec, chaining isn't currently supported. The closest thing I can find 
in the spec to this situation is handling a non-fatal error, which causes 
the unexpected content to be ignored.


On Fri, 17 Dec 2010, Eric Winkelman wrote:
> 
> The short answer for changing stream composition is that there is a 
> Program Map Table (PMT) that is repeated every 100 milliseconds and 
> describes the content of the stream.  Depending on the programming, the 
> stream's composition could change entering/exiting every advertisement.

If this is something that browser vendors want to support, I can specify 
how to handle it. Anyone?


On Sat, 18 Dec 2010, Robert O'Callahan wrote:
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#dom-media-duration says:
> [...]
> 
> What if the duration is not currently known?

The user agent must determine the duration of the media resource before 
playing any part of the media data and before setting readyState to a 
value equal to or greater than HAVE_METADATA, even if doing so requires 
fetching multiple parts of the resource.


> I think in general it will be very difficult for a user-agent to know 
> that a stream is unbounded. In Ogg or WebM a stream might not contain an 
> explicit duration but still eventually end. Maybe it would make more 
> sense for the last sentence to read "If the media resource is not known 
> to be bounded, ..."

Done.


On Sat, 18 Dec 2010, Philip Jägenstedt wrote:
> 
> Agreed, this is how I've interpreted the spec already. If a server 
> replies with 200 OK instead of 206 Partial Content and the duration 
> isn't in the header of the resource, then the duration is reported to be 
> Infinity. If the resource eventually ends another durationchange event 
> is fired and the duration is reported to be the (now known) length of 
> the resource.

That's fine.
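
From the page's side that behaviour would be observable along these lines 
(a sketch; "video" is assumed to be a reference to the media element):

   video.addEventListener('durationchange', function () {
     // duration is Infinity while the stream is unbounded; when the
     // resource ends, it changes to the now-known length.
     console.log('duration is now ' + video.duration);
   }, false);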


On Mon, 20 Dec 2010, Robert O'Callahan wrote:
> 
> That sounds good to me. We'll probably do that. The spec will need to be 
> changed though.

I changed it as you suggest above.


On Fri, 31 Dec 2010, Bruce Lawson wrote:
> > On Fri, 5 Nov 2010, Bruce Lawson wrote:
> > > 
> > > http://www.whatwg.org/specs/web-apps/current-work/complete/video.html#sourcing-in-band-timed-tracks 
> > > says to create TimedTrack objects etc for in-band tracks which are 
> > > then exposed in the API - so captions/subtitles etc that are 
> > > contained in the media container file are exposed, as well as those 
> > > tracks pointed to by the <track> element.
> > > 
> > > But 
> > > http://www.whatwg.org/specs/web-apps/current-work/complete/video.html#timed-track-api 
> > > implies that the array is only of tracks in the track element:
> > > 
> > > "media . tracks . length
> > > 
> > > Returns the number of timed tracks associated with the media element 
> > > (e.g. from track elements). This is the number of timed tracks in 
> > > the media element's list of timed tracks."
> > 
> > I don't understand why you interpret this as implying anything about 
> > the track element. Are you interpreting "e.g." as "i.e."?
> > 
> > > Suggestion: amend to say "Returns the number of timed tracks 
> > > associated with the media element (e.g.  from track elements and any 
> > > in-band track files inside the media container file)" or some such.
> > 
> > I'd rather avoid talking about the in-band ones here, in part because 
> > I think it's likely to confuse authors at least as much as help them, 
> > and in part because the terminology around in-band timed tracks is a 
> > little unclear to me and so I'd rather not talk about them in 
> > informative text. :-)
> > 
> > If you disagree, though, let me know. I can find a way to make it 
> > work.
> 
> I disagree, but not aggressively vehemently. My confusion was conflating 
> "track elements" with the three instances of the phrase "timed tracks" 
> in close proximity.
> 
> I suggest that "Returns the number of timed tracks associated with the 
> media element (i.e. from track elements and any packaged along with the 
> media in its container file)" would be clearer and avoid use of the 
> confusing phrase "in-band tracks".

That's still confusing, IMHO. "Packaged" doesn't imply in-band; most 
subtitle files are going to be "packaged" with the video even if they're 
out of band.

Also, your 'i.e.' here is wrong. There's at least one other source of 
tracks: the ones added by the script.

The non-normative text is intentionally not overly precise, because if it 
was precise it would just be the same as the normative text and wouldn't 
be any simpler, defeating its entire purpose.


On Mon, 3 Jan 2011, Philip Jägenstedt wrote:
> >
> > + I've added a magic string that is required on the format to make it
> >   recognisable in environments with no or unreliable type labeling.
> 
> Is there a reason it's "WEBVTT FILE" instead of just "WEBVTT"? "FILE" 
> seems redundant and like unnecessary typing to me.

It seemed more likely that non-WebVTT files would start with a line that 
said just "WEBVTT" than a line that said just "WEBVTT FILE". But I guess 
"WEBVTT FILE FORMAT" is just as likely and it'll be caught.

I've changed it to just "WEBVTT"; there may be existing implementations 
that only accept "WEBVTT FILE" so for now I recommend that authors still 
use the longer header.


> > On Wed, 8 Sep 2010, Philip Jägenstedt wrote:
> > > 
> > > In the discussion on public-html-a11y <trackgroup> was suggested to 
> > > group together mutually exclusive tracks, so that enabling one 
> > > automatically disables the others in the same trackgroup.
> > > 
> > > I guess it's up to the UA how to enable and disable <track>s now, 
> > > but the only option is making them all mutually exclusive (as 
> > > existing players do) or a weird kind of context menu where it's 
> > > possible to enable and disable tracks completely independently. 
> > > Neither option is great, but as a user I would almost certainly 
> > > prefer all tracks being mutually exclusive and requiring scripts to 
> > > enable several at once.
> > 
> > It's not clear to me what the use case is for having multiple groups 
> > of mutually exclusive tracks.
> > 
> > The intent of the spec as written was that a browser would by default 
> > just have a list of all the subtitle and caption tracks (the latter 
> > with suitable icons next to them, e.g. the [CC] icon in US locales), 
> > and the user would pick one (or none) from the list. One could easily 
> > imagine a UA allowing the user to enable multiple tracks by having the 
> > user ctrl-click a menu item, though, or some similar solution, much 
> > like with the commonly seen select box UI.
> 
> In the vast majority of cases, all tracks are intended to be mutually 
> exclusive, such as English+English HoH or subtitles in different 
> languages. No media player UI (hardware or software) that I have ever 
> used allows enabling multiple tracks at once. Without any kind of hint 
> about which tracks make sense to enable together, I can't see desktop 
> Opera allowing multiple tracks (of the same kind) to be enabled via the 
> main UI.

Personally I think it's quite reasonable to want to see two languages at 
once, or even two forms of the same language at once, especially for, 
e.g., reviewing subtitles. But I don't think it would be a bad thing if 
some browsers didn't expose that in the UI; that's something that could 
be left to bookmarklets, for example.
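
As a rough sketch of the bookmarklet idea (this assumes a script-facing 
textTracks list and a settable mode on each track; the exact names of 
that API are still somewhat in flux):

   var video = document.querySelector('video');
   for (var i = 0; i < video.textTracks.length; i += 1) {
     var track = video.textTracks[i];
     // show every subtitle and caption track at once, e.g. to review
     // a translation against the original
     if (track.kind == 'subtitles' || track.kind == 'captions')
       track.mode = 'showing';
   }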


> Using this syntax, I would expect some confusion when you omit the closing
> </v>, when it's *not* a cue spoken by two voices at the same time, such as:
> 
> <v Jim>- Boo!
> <v Bob>- Gah!
> 
> Gah! is spoken by both Jim and Bob, but that was likely not intended. If 
> this causes confusion, we should make validators warn about multiple 
> voices with no closing </v>.

No need to just warn, the spec says the above is outright invalid, so 
they would raise an error.


> > > For captions and subtitles it's less common, but rendering it 
> > > underneath the video rather than on top of it is not uncommon, e.g. 
> > > http://nihseniorhealth.gov/video/promo_qt300.html or
> > 
> > Conceptually, that's in the video area, it's just that the video isn't 
> > centered vertically. I suppose we could allow UAs to do that pretty 
> > easily, if it's commonly desired.
> 
> It's already possible to align the video to the top of its content box 
> using <http://dev.w3.org/csswg/css3-images/#object-position>:
> 
> video { object-position: center top }
> 
> (This is already supported in Opera, but prefixed: -o-object-position)

Sounds good.


> Note that in Sweden captioning for the HoH is delivered via the teletext 
> system, which would allow ASCII-art to be displayed. Still, I've never 
> seen it. The only case of graphics being used in "subtitles" I can 
> remember ever seeing is the DVD of 
> <http://en.wikipedia.org/wiki/Cat_Soup>, where the subtitle system is 
> (ab)used to overlay some graphics.

Yeah, I'm not at all concerned about not supporting graphics in subtitles. 
It's nowhere near the 80% bar.


> If we ever want comments, we need to add support in the parser before 
> any content accidentally uses the syntax, in other words pretty soon 
> now.

No, we can use any syntax that the parser currently ignores. It won't 
break backwards compat with content that already uses it by then, since 
the whole point of comments is to be ignored. The only difference is 
whether validators complain or not.


> > On Tue, 14 Sep 2010, Anne van Kesteren wrote:
> > > 
> > > Apart from text/plain I cannot think of a "web" text format that 
> > > does not have comments.
> > 
> > But what's the use case? Is it really useful to have comments in a 
> > subtitle file?
> 
> Being able to put licensing/contact information at the top of the file 
> would be useful, just as it is in JavaScript/CSS.

Well the parser explicitly skips over anything in the header block 
(everything up to the first blank line IIRC), so if we find that people 
want this then we can allow it without having to change any UAs except the 
validators.


> > On Fri, 22 Oct 2010, Simon Pieters wrote:
> > > > 
> > > > It can still be inspired by it though so we don't have to change 
> > > > much. I'd be curious to hear what other things you'd clean up 
> > > > given the chance.
> > > 
> > > WebSRT has a number of quirks to be compatible with SRT, like 
> > > supporting both comma and dot as decimal separators, the weird 
> > > parsing of timestamps, etc.
> > 
> > I've cleaned the timestamp parsing up. I didn't see others.
> 
> I consider the cue id line (the line preceding the timing line) to be 
> cruft carried over from SRT. Now that we have both classes and the 
> possibility of getting a cue by index, why do we need it?

It's optional, but it is useful, especially for metadata tracks, as a way 
to grab specific cues. For example, consider a metadata or chapter track 
that contains cues with specific IDs that the site would use to jump to 
particular parts of the video in response to key presses, such as "start 
of content after intro", or maybe for a podcast with different segments, 
where the user can jump to "news" and "reviews" and "final thought" -- you 
need an ID to be able to find the right cue quickly.
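
For example (a sketch; it assumes the cue list exposes a getCueById() 
lookup, and the track index and cue IDs here are made up):

   function jumpTo(id) {
     var video = document.querySelector('video');
     var track = video.textTracks[0]; // the chapter/metadata track
     var cue = track.cues.getCueById(id); // e.g. 'news', 'reviews'
     if (cue)
       video.currentTime = cue.startTime;
   }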


> > > There was also some discussion about metadata. Language is sometimes 
> > > necessary for the font engine to pick the right glyph.
> > 
> > Could you elaborate on this? My assumption was that we'd just use CSS, 
> > which doesn't rely on language for this.
> 
> It's not in any spec that I'm aware of, but some browsers (including 
> Opera) pick different glyphs depending on the language of the text, 
> which really helps when rendering CJK when you have several CJK fonts on 
> the system. Browsers will already know the language from <track 
> srclang>, so this would be for external players.

How is this problem solved in SRT players today?


On Mon, 14 Feb 2011, Philip Jägenstedt wrote:
>
> Given that most existing subtitle formats don't have any language 
> metadata, I'm a bit skeptical. However, if implementors of non-browser 
> players want to implement WebVTT and ask for this I won't stand in the 
> way (not that I could if I wanted to). For simplicity, I'd prefer the 
> language metadata from the file to not have any effect on browsers 
> though, even if no language is given on <track>.

Indeed.


On Tue, 4 Jan 2011, Alex Bishop wrote:
> 
> Firefox too. If you visit 
> http://people.mozilla.org/~jdaggett/webfonts/serbianglyphs.html in 
> Firefox 4, the text explicitly marked-up as being Serbian Cyrillic 
> (using the lang="sr-Cyrl" attribute) uses some different glyphs to the 
> text with no language metadata.

This seems to be in violation of CSS; we should probably fix it there 
before fixing it in WebVTT since WebVTT relies on CSS.


On Mon, 3 Jan 2011, Philip Jägenstedt wrote:
>
> > > * The "bad cue" handling is stricter than it should be. After 
> > > collecting an id, the next line must be a timestamp line. Otherwise, 
> > > we skip everything until a blank line, so in the following the 
> > > parser would jump to "bad cue" on line "2" and skip the whole cue.
> > > 
> > > 1
> > > 2
> > > 00:00:00.000 --> 00:00:01.000
> > > Bla
> > > 
> > > This doesn't match what most existing SRT parsers do, as they simply 
> > > look for timing lines and ignore everything else. If we really need 
> > > to collect the id instead of ignoring it like everyone else, this 
> > > should be more robust, so that a valid timing line always begins a 
> > > new cue. Personally, I'd prefer if it is simply ignored and that we 
> > > use some form of in-cue markup for styling hooks.
> > 
> > The IDs are useful for referencing cues from script, so I haven't 
> > removed them. I've also left the parsing as is for when neither the 
> > first nor second line is a timing line, since that gives us a lot of 
> > headroom for future extensions (we can do anything so long as the 
> > second line doesn't start with a timestamp and "-->" and another 
> > timestamp).
> 
> In the case of feeding future extensions to current parsers, it's way 
> better fallback behavior to simply ignore the unrecognized second line 
> than to discard the entire cue. The current behavior seems unnecessarily 
> strict and makes the parser more complicated than it needs to be. My 
> preference is just ignore anything preceding the timing line, but even 
> if we must have IDs it can still be made simpler and more robust than 
> what is currently spec'ed.

If we just ignore content until we hit a line that happens to look like a 
timing line, then we are much more constrained in what we can do in the 
future. For example, we couldn't introduce a "comment block" syntax, since 
any comment containing a timing line wouldn't be ignored. On the other 
hand if we keep the syntax as it is now, we can introduce a comment block 
just by having its first line include a "-->" but not have it match the 
timestamp syntax, e.g. by having it be "--> COMMENT" or some such.

Looking at the parser more closely, I don't really see how doing anything 
more complex than skipping the block entirely would be simpler than what 
we have now, anyway.


On Mon, 3 Jan 2011, Glenn Maynard wrote:
>
> By the way, the WebSRT hit from Google 
> (http://www.whatwg.org/specs/web-apps/current-work/websrt.html) is 404.  
> I've had to read it out of the Google cache, since I'm not sure where it 
> went.

I added a redirect.


> Inline comments (not just line comments) in subtitles are very important 
> for collaborative editing: for leaving notes about a translation, noting 
> where editing is needed or why a change was made, and so on.
> 
> If a DOM-like interface is specified for this (presumably this will 
> happen later), being able to access inline comments like DOM comment 
> nodes would be very useful for visual editors, to allow displaying 
> comments and to support features like "seek to next comment".

We can add comments pretty easily (e.g. we could say that "<!" starts a 
comment and ">" ends it -- that's already being ignored by the current 
parser), if people really need them. But are comments really that useful? 
Did SRT have problem due to not supporting inline comments? (Or did it 
support inline comments?)


On Tue, 4 Jan 2011, Glenn Maynard wrote:
> On Tue, Jan 4, 2011 at 4:24 AM, Philip Jägenstedt <philipj at opera.com> 
> wrote:
> > If you need an intermediary format while editing, you can just use any 
> > syntax you like and have the editor treat it specially.
> 
> If I'd need to write my own parser to write an editor for it, that's one 
> thing--but I hope I wouldn't need to create yet another ad hoc caption 
> format, mirroring the features of this one, just to work around a lack 
> of inline comments.

An editor would need a custom parser anyway to make sure it round-tripped 
syntax errors, presumably.


> The cue text already vaguely resembles HTML.  What about <!-- comments 
> -->?  It's universally understood, and doesn't require any new escape 
> mechanisms.

The current parser would end a comment at the first ">", but so long as 
you didn't have a ">" in the comment, "<!--...-->" would work fine within 
cue text. (We would have to be careful in standalone blocks to define it 
in such a way that it could not be confused with a timing line.)


On Wed, 5 Jan 2011, Philip Jägenstedt wrote:
> 
> The question is rather if the comments should be exposed as DOM comment 
> nodes in getCueAsHTML, which seems to be what you're asking for. That 
> would only be possible if comments were only allowed inside the cue 
> text, which means that you couldn't comment out entire cues, as such:
>
> 00:00.000 --> 00:01.000
> one
> 
> /*
> 00:02.000 --> 00:03.000
> two
> */
> 
> 00:04.000 --> 00:05.000
> three
> 
> Therefore, my thinking is that comments should be removed during parsing 
> and not be exposed to any layer above it.

We can support both, if there's really demand for it.

For example:

 00:00.000 --> 00:01.000
 one <! inline comment > one
 
 COMMENT-->
 00:02.000 --> 00:03.000
 two; this is entirely
 commented out
 
 <! this is the ID line
 00:04.000 --> 00:05.000
 three; last line is a ">"
 which is part of the cue
 and is not a comment.
 >

The above would work today in a conforming UA. The question really is what 
parts of this do we want to support and what do we not care enough about.


On Wed, 5 Jan 2011, Anne van Kesteren wrote:
> On Wed, 05 Jan 2011 10:58:56 +0100, Philip Jägenstedt 
> <philipj at opera.com> wrote:
> > Therefore, my thinking is that comments should be removed during 
> > parsing and not be exposed to any layer above it.
> 
> CSS does that too. It has not caused problems so far. It does mean 
> editing tools need a slightly different DOM, but that is always the case 
> as they want to preserve whitespace details, etc., too. At least editors 
> that have both a text and visual interface.

Right.


On Fri, 14 Jan 2011, Silvia Pfeiffer wrote:
> 
> We are concerned, however, about the introduction of WebVTT as a 
> universal captioning format *when used outside browsers*. Since a subset 
> of CSS features is required to bring HTML5 video captions on par with TV 
> captions, non-browser applications will need to support these CSS 
> features, too. However, we do not believe that external CSS files are an 
> acceptable solution for non-browser captioning and therefore think that 
> those CSS features (see [1]) should eventually be made part of the 
> WebVTT specification.
> 
> [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#the-'::cue'-pseudo-element

I'm not sure what you mean by "made part of the WebVTT specification", but 
if you mean that WebVTT should support inline CSS, that does seem like 
something we can add, e.g. using syntax like this:

   WEBVTT

   STYLE-->
   ::cue(v[voice=Bob]) { color: green; }
   ::cue(c.narration) { font-style: italic; }
   ::cue(c.narration i) { font-style: normal; }

   00:00.000 --> 00:02.000
   Welcome.

   00:02.500 --> 00:05.000
   To WebVTT.

I suggest we wait until WebVTT and '::cue' in particular have shipped in 
at least one browser and been demonstrated as being useful before adding 
this kind of feature though.


> 1. Introduce file-wide metadata
> 
> WebVTT requires a structure to add header-style metadata. We are here 
> talking about lists of name-value pairs as typically in use for header 
> information. The metadata can be optional, but we need a defined means 
> of adding them.
> 
> Required attributes in WebVTT files should be the main language in use 
> and the kind of data found in the WebVTT file - information that is 
> currently provided in the <track> element by the @srclang and @kind 
> attributes. These are necessary to allow the files to be interpreted 
> correctly by non-browser applications, for transcoding or to determine 
> if a file was created as a caption file or something else, in particular 
> the @kind=metadata. @srclang also sets the base directionality for BiDi 
> calculations.
>
> Further metadata fields that are typically used by authors to keep 
> specific authoring information or usage hints are necessary, too. As 
> examples of current use see the format of MPlayer mpsub's header 
> metadata [2], EBU STL's General Subtitle Information block [3], and 
> even CEA-608's Extended Data Service with its StartDate, Station, 
> Program, Category and TVRating information [4]. Rather than specifying a 
> specific subset of potential fields we recommend to just have the means 
> to provide name-value pairs and leave it to the negotiation between the 
> author and the publisher which fields they expect of each other.
>
> [2] http://www.mplayerhq.hu/DOCS/tech/mpsub.sub
> [3] https://docs.google.com/viewer?a=v&q=cache:UKnzJubrIh8J:tech.ebu.ch/docs/tech/tech3264.pdf
> [4] http://edocket.access.gpo.gov/cfr_2007/octqtr/pdf/47cfr15.119.pdf

I don't understand the use cases here.

CSS and JS don't have anything like this, why should WebVTT? What problem 
is this solving? How did SRT solve this problem?


> 2. Introduce file-wide cue settings
> 
> At the moment if authors want to change the default display of cues,
> they can only set them per cue (with the D:, S:, L:, A: and T:. cue
> settings) or have to use an external CSS file through a HTML page with
> the ::cue pseudo-element. In particular when considering that all
> Asian language files would require a "D:vertical" marker, it becomes
> obvious that this replication of information in every cue is
> inefficient and a waste of bandwidth, storage, and application speed.
> A cue setting default section should be introduced into a file
> header/setup area of WebVTT which will avoid such replication.
> 
> An example document with cue setting defaults in the header could look
> as follows:
> ==
> WEBVTT
> Language=zh
> Kind=Caption
> CueSettings= A:end D:vertical
> 
> 00:00:15.000 --> 00:00:17.950
> 在左边我们可以看到...
> 
> 00:00:18.160 --> 00:00:20.080
> 在右边我们可以看到...
> 
> 00:00:20.110 --> 00:00:21.960
> ...捕蝇草械.
> ==
> 
> Note that you might consider that the solution to this problem is to use 
> external CSS to specify a change to all cues. However, this is not 
> acceptable for non-browser applications and therefore not an acceptable 
> solution to this problem.

Adding defaults seems like a reasonable feature. We could add this just by 
adding the ability to have a block in a VTT file like this:

   WEBVTT

   DEFAULTS --> D:vertical A:end

   00:00.000 --> 00:02.000
   This is vertical and end-aligned.

   00:02.500 --> 00:05.000
   As is this.

   DEFAULTS --> A:start

   00:05.500 --> 00:07.000
   This is horizontal and start-aligned.

However, again I suggest that we wait until WebVTT has been deployed in at 
least one browser before adding more features like this.


> * positioning: Generally the way in which we need positioning to work is 
> to provide an anchor position for the text and then explain in which 
> direction font size changes and the addition of more text allows the 
> text segment to grow. It seems that the line position cue (L) provides a 
> baseline position and the alignment cue (A) provides the growing 
> direction start/middle/end. Can we just confirm this understanding?

It's more the other way around: the line boxes are laid out and then the 
resulting line boxes are positioned according to the A: and L: lines. In 
particular, the L: lines when given with a % character position the line 
boxes in the same manner that CSS background-position positions the 
background image, and L: lines without a % character set the position of 
the line boxes based on the height of the first line box. A: lines then 
just set the position of these line boxes relative to the other dimension.
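
For example (purely illustrative values, using the current single-letter 
settings):

   00:00:05.000 --> 00:00:07.000 L:10% A:start
   Line boxes placed 10% of the way down, against the start edge.

   00:00:08.000 --> 00:00:10.000 L:90% A:end
   Line boxes placed near the bottom, against the end edge.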


> * fontsize: When changing text size in relation to the video changing 
> size or resolution, we need to make sure not to reduce the text size 
> below a specific font size for readability reasons. And we also need to 
> make sure not to make it larger than a specific font size, since 
> otherwise it will dominate the display. We usually want the text to be 
> at least Xpx, but no bigger than Ypx. Also, one needs to pay attention 
> to the effect that significant player size changes have on relative 
> positioning - in particular for the minimum caption text size. Dealing 
> with min and max sizes is missing from the current specification in our 
> understanding.

That's a CSS implementation issue. Minimum font sizes are commonly 
supported in CSS implementations. Maximum font sizes would be similar.


> * bidi text: In our experience from YouTube, we regularly see captions 
> that contain mixed languages/directionality, such as Hebrew captions 
> that have a word of English in it. How do we allow for bidi text inside 
> cues? How do we change directionality mid-cue? Do we deal with the 
> zero-width LTR-mark and RTL-mark unicode characters? It would be good to 
> explain how these issues are dealt with in WebVTT.

There's nothing special about how they work in WebVTT; they are handled 
the same as in CSS.


> * internationalisation: D:vertical and D:vertical-lr seem to only work 
> for vertical text - how about horizontal-rl? For example, Hebrew is a 
> prime example of a language being written from right to left 
> horizontally. Is that supported and how?

What exactly would horizontal-rl do?


> * naming: The usage of single letter abbreviations for cue settings has 
> created quite a discussion here at Google. We all agree that file-wide 
> cue settings are required and that this will reduce the need for 
> cue-specific cue settings. We can thus afford a bit more readability in 
> the cue settings. We therefore believe that it would be better if the 
> cue settings were short names rather than single letter codes. This 
> would be more like CSS, too, and easier to learn and get right. In the 
> interface description, the 5 dimensions have proper names which could be 
> re-used (¡°direction¡±, ¡°linePosition¡±, ¡°textPosition¡±, ¡°size¡± and 
> ¡°align"). We therefore recommend replacing the single-letter cue 
> commands with these longer names.

That would massively bloat these files and make editing them a huge pain, 
as far as I can tell. I agree that defaults would make it better, but many 
cues would still need their own positioning and sizing information, and 
anything beyond a very few letters would IMHO quickly become far too 
verbose for most people. "L", "A", and "S" are pretty mnemonic, "T" would 
quickly become familiar to people writing cues, and "D" is only going to 
be relevant to some authors but for those authors it's pretty 
self-explanatory as well, since the value is verbose.

What I really would like to do is use "X" and "Y" instead of "T" and "L", 
but those terms would be very confusing when we flip the direction, which 
is why I used the less obvious "T" and "L".


> * textcolor: In particular on European TV it is common to distinguish 
> between speakers by giving their speech different colors. The following 
> colors are supported by EBU STL, CEA-608 and CEA-708 and should be 
> supported in WebVTT without the use of external CSS: black, red, green, 
> yellow, blue, magenta, cyan, and white. As default we recommend white on 
> a grey transparent background.

This is supported as 'color' and 'background'.


> * underline: EBU STL, CEA-608 and CEA-708 support underlining of 
> characters.

I've added support for 'text-decoration'.


> The underline character is also particularly important for some Asian 
> languages.

Could you elaborate on this?


> Please make it possible to provide text underlines without the use of 
> CSS in WebVTT.

Why without CSS?


> * blink: As much as we would like to discourage blinking subtitles, they 
> are actually a core requirement for EBU STL and CEA-608/708 captions and 
> in use in particular for emergency messages and similar highly important 
> information. Blinking can be considered optional for implementation, but 
> we should allow for it in the standard.

This is part of 'text-decoration'.


> * font face: CEA-708 provides a choice of eight font tags: undefined, 
> monospaced serif, proportional serif, monospaced sans serif, 
> proportional sans serif, casual, cursive, small capital. These fonts 
> should be available for WebVTT as well. Is this the case?

Yes.


> We are not sure about the best solution to these needs. Would it be best 
> to introduce specific tags for these needs?

CSS seems to handle these needs adequately.
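
For example, the CEA-708 styles map reasonably well onto CSS generic font 
families and font properties, something like the following (the class 
names here are invented; they would be whatever the author uses in the 
cue text):

   ::cue(c.serif)     { font-family: serif; }
   ::cue(c.sans)      { font-family: sans-serif; }
   ::cue(c.mono)      { font-family: monospace; }
   ::cue(c.casual)    { font-family: cursive; }
   ::cue(c.smallcaps) { font-variant: small-caps; }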


> We have a couple of recommendations for changes mostly for aesthetic and 
> efficiency reasons. We would like to point out that Google is very 
> concerned with the dense specification of data and every surplus 
> character, in particular if it is repeated a lot and doesn't fulfill a 
> need, should be removed to reduce the load created on worldwide 
> networking and storage infrastructures and help render Web pages faster.

This seems to contradict your earlier request to make the language more 
verbose...


> * Time markers: WebVTT time stamps follow no existing standard for time 
> markers. Has the use of NPT as introduced by RTSP[5] for time markers 
> been considered (in particular npt-hhmmss)?
> 
> [5] http://www.ietf.org/rfc/rfc2326.txt

WebVTT follows the SRT format, with commas replaced by periods for 
consistency with the rest of the platform.


> * Suggest dropping "-->": In the context of HTML, "-->" is an end 
> comment marker. It may confuse Web developers and parsers if such a sign 
> is used as a separator. For example, some translation tools expect HTML 
> or XML-based interchange formats and interpret the ">" as part of a 
> tag. Also, common caption convention often uses ">" to represent 
> speaker identification. Thus it is more difficult to write a filter 
> which correctly escapes "-->" but retains ">" for speaker ID.

"-->" seems pretty mnemonic to me. I don't see why we'd want to drop it.


> * Duration specification: WebVTT time stamps are always absolute time 
> stamps calculated in relation to the base time of synchronisation with 
> the media resource. While this is simple to deal with for machines, it 
> is much easier for hand-created captions to deal with relative time 
> stamps for cue end times and for the timestamp markers within cues. Cue 
> start times should continue to stay absolute time stamps. Timestamp 
> markers within cues should be relative to the cue start time. Cue end 
> times should be possible to be specified either as absolute or relative 
> timestamps. The relative time stamps could be specified through a prefix 
> of "+" in front of a "ss.mmm" second and millisecond specification. 
> These are not only simpler to read and author, but are also more compact 
> and therefore create smaller files.

I think if anything is absolute, it doesn't really make anything much 
simpler for anything else to be relative, to be honest. Take the example 
you give here:

> An example document with relative timestamps is:
> ==
> WEBVTT
> Language=en
> Kind=Subtitle
> 
> 00:00:15.000   +2.950
> At the left we can see...
> 
> 00:00:18.160    +1.920
> At the right we can see the...
> 
> 00:00:20.110   +1.850
> ...the <+0.400>head-<+0.800>snarlers
> ==

If the author were to change the first time stamp because the video gained 
a 30 second advertisement at the start, then he would still need to change 
the hundreds of subsequent timestamps for all the additional cues. What 
does the author gain from not having to change the relative stamps? It's 
not like he's going to be doing it by hand, and once a tool is involved, 
the tool can change everything just as easily.


> We are happy to see the introduction of the magic file identifier for 
> WebVTT which will make it easier to identify the file format. We do not 
> believe the "FILE" part of the string is necessary.

I have removed it.


> However, we recommend to also introduce a format version number that the 
> file adheres to, e.g. "WEBVTT 0.7".

Version numbers are an antipattern on the Web, so I have not added one.


> This helps to make non-browser systems that parse such files become 
> aware of format changes.

The format will never change in a non-backwards-compatible fashion once it 
is deployed, so that is not a concern.


> It can also help identify proprietary standard metadata sets as used by 
> a specific company, such as "WEBVTT 0.7 ABC-meta1" which could signify 
> that the file adheres to WEBVTT 0.7 format specification with the 
> ABC-meta1 metadata schema.

If we add metadata, then that can be handled just by having the metadata 
include that information itself.


> CEA-708 captions support automatic line wrapping in a more sophisticated 
> way than WebVTT -- see http://en.wikipedia.org/wiki/CEA-708#Word_wrap.
> 
> In our experience with YouTube we have found that in certain situations 
> this type of automatic line wrapping is very useful. Captions that were 
> authored for display in a full-screen video may contain too many words 
> to be displayed fully within the actual video presentation (note that 
> mobile / desktop / internet TV devices may each have a different amount 
> of space available, and embedded videos may be of arbitrary sizes). 
> Furthermore, user-selected fonts or font sizes may be larger than 
> expected, especially for viewers who need larger print.
> 
> WebVTT as currently specified wraps text at the edge of their containing 
> blocks, regardless of the value of the 'white-space' property, even if 
> doing so requires splitting a word where there is no line breaking 
> opportunity. This will tend to create poor quality captions.  For 
> languages where it makes sense, line wrapping should only be possible at 
> carriage return, space, or hyphen characters, but not on non-breaking 
> space characters.  (Note that CEA-708 also contains non-breaking space and 
> non-breaking transparent space characters to help control wrapping.) 
> However, this algorithm will not necessarily work for all languages.
> 
> We therefore suggest that a better solution for line wrapping would be 
> to use the existing line wrapping algorithms of browsers, which are 
> presumably already language-sensitive.
> 
> [Note: the YouTube line wrapping algorithm goes even further by 
> splitting single caption cues into multiple cues if there is too much 
> text to reasonably fit within the area. YouTube then adjusts the times 
> of these caption cues so they appear sequentially.  Perhaps this could 
> be mentioned as another option for server-side tools.]

I've adjusted the text in the spec to more clearly require that 
line-breaking follow normal CSS rules but with the additional requirement 
that there not be overflow, which is what I had intended.


> 1. Pop-on/paint-on/roll-up support
> 
> Three different types of captions are common on TV: pop-on, roll-up and 
> paint-on. Captions according to CEA-608/708 need to support captions of 
> all three of these types. We believe they are already supported in 
> WebVTT, but see a need to re-confirm.
> 
> For pop-on captions, a complete caption cue is timed to appear at a 
> certain time and disappear a few seconds later. This is the typical way 
> in which captions are presented and also how WebVTT/<track> works in our 
> understanding. Is this correct?

As far as I understand, yes.


> For roll-up captions, individual lines of captions are presented 
> successively with older lines moving up a line to make space for new 
> lines underneath. Assuming we understand the WebVTT rendering rules 
> correctly, WebVTT would identify each of these lines as an individual, 
> but time-overlapping cue with the other cues. As more cues are created 
> and overlap in time, newer cues are added below the currently visible 
> ones and move the currently visible ones up, basically creating a 
> roll-up effect. If this is a correct understanding, then this is an 
> acceptable means of supporting roll-up captions.

I am not aware of anything currently in the WebVTT specification which 
will cause a cue to move after it has been placed on the video, so I do 
not believe this is a correct understanding.

However, you can always have a cue be replaced by a cue with the same text 
but on a higher line, if you're willing to do some preprocessing on the 
subtitle file. It won't be a smoothly animated scroll, but it would work.

If there is convincing evidence that this kind of subtitle is used on the 
Web, though, we can support it more natively. So far I've only seen it in 
legacy scenarios that do not really map to expected WebVTT use cases.

For supporting those legacy scenarios, you need script anyway (to handle, 
e.g., backspace and moving the cursor). If you have script, doing 
scrolling is possible either by moving the cue, or by not using the 
default UA rendering of the cues at all and doing it manually (e.g. using 
<div>s or <canvas>).
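
A rough sketch of the scripted route, bypassing the default rendering 
entirely (the names here -- textTracks, mode, activeCues, cuechange, and 
the author-supplied overlay <div id=rollup> -- are illustrative; the 
exact API may differ):

   var video = document.querySelector('video');
   var track = video.textTracks[0];
   var shown = [];
   track.mode = 'hidden'; // cues still become active, but aren't rendered
   track.addEventListener('cuechange', function () {
     var rollup = document.getElementById('rollup');
     for (var i = 0; i < track.activeCues.length; i += 1) {
       var cue = track.activeCues[i];
       if (shown.indexOf(cue) != -1)
         continue; // already on screen
       shown.push(cue);
       var line = document.createElement('div');
       line.textContent = cue.text;
       rollup.appendChild(line);
       // keep only the last few lines so older ones roll off the top
       while (rollup.childNodes.length > 3)
         rollup.removeChild(rollup.firstChild);
     }
   }, false);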


> Finally, for paint-on captions, individual letters or words are 
> displayed successively on screen. WebVTT supports this functionality 
> with the cue timestamps <xx:xx:xx.xxx>, which allows to specify 
> characters or words to appear with a delay from within a cue. This 
> essentially realizes paint-on captions. Is this correct?

Yes.


> (Note that we suggest using relative timestamps inside cues to make this 
> feature more usable.)

It makes it modestly easier to do by hand, but hand-authoring a "paint-on" 
style caption seems like a world of pain regardless of the timestamp 
format we end up using, so I'm not sure it's a good argument for 
complicating the syntax with a second timestamp format.


> The HTML spec specifies that it is not allowed to have two tracks that 
> provide the same kind of data for the same language (potentially empty) 
> and for the same label (potentially empty). However, we need 
> clarification on what happens if there is a duplicate track, ie: does 
> the most recent one win or the first one or will both be made available 
> in the UI and JavaScript?

They are both available.


> The spec only states that the combination of {kind, type, label} must be 
> unique. It doesn't say what happens if they are not.

Nothing different happens if they are not than if they are. It's just a 
conformance requirement.


> Further, the spec says nothing about duplicate labels altogether - what 
> is a browser supposed to do when two tracks have been marked with the 
> same label?

The same as it does if they have different labels.


> It is very important that there is a possibility for users to 
> auto-activate tracks. Which track is chosen as the default track to 
> activate depends on the language preferences of the user. The user is 
> assumed to have a list of language preferences which leads this choice.

I've added a "default" attribute so that sites can control this.


> In YouTube, if any tracks exist that match the first language
> preference, the first of those is used as the default.  A track with
> no name sorts ahead of one with a name.  The sorting is done according
> to that language's collation order. In order to override this you
> would need (1) a default=true attribute for a track which gives it
> precedence if its language matches, and (2) a way to force the
> language preference. If no tracks exist for the first language pref,
> the second language pref is checked, and so on.
> 
> If the user's language preferences are known, and there are no tracks
> in that language, you have other options:
>   (1) offer to do auto-translation (or just do it)
>   (2) use a track in the same language that the video's audio is in (if known)
>   (3) if only one track, use the first available track
> 
> Also make sure the language choice can be overriden by the user
> through interaction.
> 
> We'd like to make sure this or a similar algorithm is the recommended
> way in which browsers deal with caption tracks.

This seems to me to be a user agent quality of implementation issue. User 
preferences almost by definition can't be interoperable, so it's not 
something we can specify.


> As far as we understand, you can currently address all cues through 
> ::cue and you can address a cue part through ::cue-part(<voice> || 
> <part> || <position> || <future-compatibility>). However, if we 
> understand correctly, it doesn't seem to be possible to address an 
> individual cue through CSS, even though cues have individual 
> identifiers. This is either an oversight or a misunderstanding on our 
> parts. Can you please clarify how it is possible to address an 
> individual cue through CSS?

I've made the ID referencable from the ::cue() selector argument as an ID 
on the anonymous root element.
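
So, given a cue whose ID line is, say, "intro", something like this would 
select just that cue:

   ::cue(#intro) { color: yellow; }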


> Our experience with automated caption creation and positioning on 
> YouTube indicates that it is almost impossible to always place the 
> captions out of the way of where a user may be interested to look at. We 
> therefore allow users to dynamically move the caption rendering area to 
> a different viewport position to reveal what is underneath. We recommend 
> such drag-and-drop functionality also be made available for TimedTrack 
> captions on the Web, especially when no specific positioning information 
> is provided.

I've added text to explicitly allow this.


On Sat, 22 Jan 2011, Philip Jägenstedt wrote:
> 
> Indeed, repeating settings on each cue would be annoying. However, 
> file-wide settings seems like it would easily be too broad, and you'd 
> have to explicitly reverse the effect on the cues where you don't want 
> it to apply. Maybe classes of cue settings or some kind of macros would 
> work better.

My assumption is that similar cues will typically be grouped together, so 
that one could introduce the group with a "DEFAULTS" block and then not 
have to repeat those settings on every cue in the group.


> Nitpick: Modern Chinese, including captions, is written left-to-right, 
> top-to-bottom, just like English.

Indeed. I don't expect there will be much vertical text captioning. I 
added it primarily to support some esoteric Anime cases.



> That the intra-cue timings are relative but the timing lines are 
> absolute has bugged me a bit, so if the distinction was more obvious 
> just from the syntax, that'd be great!

They're all absolute.


> [for the file signature] "WebSRT" is prettier than "WEBSRT".

The idea is not to be pretty, the idea is to stand out. :-)


> I'm inclined to say that we should normalize all whitespace during 
> parsing and not have explicit line breaks at all. If people really want 
> two lines, they should use two cues. In practice, I don't know how well 
> that would fare, though. What other solutions are there?

I think we definitely need line breaks, e.g. for cases like:

  -- Do you want to go to the zoo?
  -- Yes!
  -- Then put your shoes on!

...which is quite common style in some locales.

However, I agree that we should encourage people to let browsers wrap the 
lines. Not sure how to encourage that more.


On Sun, 23 Jan 2011, Glenn Maynard wrote:
>
> It should be possible to specify language per-cue, or better, per block 
> of text mid-cue.  Subtitles making use of multiple languages are common, 
> and it should be possible to apply proper font selection and word 
> wrapping to all languages in use, not just the primary language.

It's not clear to me that we need language information to apply proper 
font selection and word wrapping, since CSS doesn't do it.


> When both English subtitles and Japanese captions are on screen, it 
> would be very bad to choose a Chinese font for the Japanese text, and 
> worse to choose a Western font and use it for everything, even if 
> English is the predominant language in the file.

Can't you get around this using explicit styles, e.g. against classes? 
Unless this really is going to be a common problem, I'm not particularly 
concerned about it.


On Mon, 24 Jan 2011, Philip Jägenstedt wrote:
> 
> Multi-languaged subtitles/captions seem to be extremely uncommon, 
> unsurprisingly, since you have to understand all the languages to be 
> able to read them.
> 
> The case you mention isn't a problem, you just specify Japanese as the 
> main language.

Indeed.


> There are a few other theoretical cases:
> 
> * Multi-language CJK captions. I've never seen this, but outside of 
> captioning, it seems like the foreign script is usually transcribed to 
> the native script (e.g. writing Japanese names with simplified Chinese 
> characters).
> 
> * Use of Japanese or Chinese words in a mostly non-CJK subtitles. This 
> would make correct glyph selection impossible, but I've never seen it.
> 
> * Voice synthesis of e.g. mixed English/French captions. Given that this 
> would only be useful to people who know both languages, it seems not 
> worth complicating the format for.

Agreed on all fronts.


> Do you have any examples of real-world subtitles/captions that would 
> benefit from more fine-grained language information?

This kind of information would indeed be useful.


On Mon, 24 Jan 2011, Glenn Maynard wrote:
> 
> They're very common in anime fansubs:
> 
> http://img339.imageshack.us/img339/2681/screenshotgg.jpg
> 
> The text on the left is a transcription, the top is a transliteration, 
> and the bottom is a translation.

Aren't these three separate text tracks?


> I'm pretty sure I've also seen cases of translation notes mixing 
> languages within the same caption, eg. "jinja (神社): shrine", but 
> it's less common and I don't have an example handy.

Mixing one CJK language with one non-CJK language seems fine. That should 
always work, assuming you specify good fonts in the CSS.


> > The case you mention isn't a problem, you just specify Japanese as the 
> > main language. There are a few other theoretical cases:
> 
> Then you're indicating that English text is Japanese, which I'd expect 
> to cause UAs to render everything with a Japanese font.  That's what 
> happens when I load English text in Firefox and force SJIS: everything 
> is rendered in MS PGothic.  That's probably just what Japanese users 
> want for English text mixed in with Japanese text, too--but it's 
> generally not what English users want with the reverse.

I don't understand why we can't have good typography for CJK and non-CJK 
together. Surely there are fonts that get both right?


On Mon, 24 Jan 2011, Glenn Maynard wrote:
> >
> > [ use multiple tracks ]
>
> Personally I'd prefer that, but it would require a good deal of metadata 
> support--marking which tracks are meant to be used together, tagging 
> auxiliary track types so browsers can choose (eg. an "English subtitles 
> with no song caption tracks" option), and so on.  I'm sure that's a 
> non-starter (and I'd agree).

It's not that much metadata. It's far less effort than making the 
subtitles in the first place.


> I don't think you should need to resort to fine-grained font control to get
> reasonable default fonts.

I agree entirely, but I don't think you should need to resort to 
fine-grained language tagging either...


> The above--semantics vs. presentation--brings something else to mind.  
> One of the harder things to subtitle well is when you have two 
> conversations talking on top of each other.  This is generally done by 
> choosing a vertical spot for each conversation (generally augmented with 
> a color), so the viewer can easily follow one or the other.  Setting the 
> line position *sort of* lets you do this, but that's hard to get right, 
> since you don't know how far apart to put them.  You'd have to err 
> towards putting them too far apart (guessing the maximum number of lines 
> text might be wrapped to, and covering up much more of the screen than 
> usually needed), or putting one set on the top of the screen (making it 
> completely impossible to read both at once, rather than just 
> challenging).
> 
> If I remember correctly, SSA files do this with a hack: wherever there's 
> a blank spot in one or the other conversation, a transparent dummy cue 
> is added to keep the other conversation in the correct relative spot, so 
> the two conversations don't swap places.
> 
> I mention this because it comes to mind as something well-authored, 
> well-rendered subtitles need to get right, and I'm curious if there's a 
> reliable way to do this currently with WebVTT.  If this isn't handled, 
> some scenes just fall apart.

It's intended to be done using the L: feature to pick the lines. If the 
cues have more line wrapping than the author expected, it'll break. The 
only way around that would be to go through the whole file (or at least, 
the whole scene, somehow marked up as such) pre-rendering each cue to work 
out what the maximum line heights would be and then using that offset for 
each cue, etc, but that seems like a whole lot of complexity for a minor 
use case. Is line wrapping really going to be that unpredictable?
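
For illustration, a rough sketch using the L: cue setting (timings and line 
numbers invented); one conversation is pinned to the top line, the other is 
counted from the bottom:

  00:00:15.000 --> 00:00:19.000 L:0
  -- Conversation A stays on the top line.

  00:00:17.000 --> 00:00:21.000 L:-1
  -- Conversation B stays on the bottom line.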


On Mon, 24 Jan 2011, Philip Jägenstedt wrote:
> 
> My main point here is that the use cases are so marginal. If there were 
> more compelling ones, it's not hard to support intra-cue language 
> settings using syntax like <lang en>bla</lang> or similar.

Indeed.


On Mon, 24 Jan 2011, Glenn Maynard wrote:
> 
> Here's one that I think was done very well, rendered statically to make 
> sure we're all seeing the same thing:
> 
> http://zewt.org/~glenn/multiple%20conversation%20example.mpg
> 
> The results are pretty straightforward.  One always stays on top, one 
> always stays on the bottom, and most of the time the spacing between the 
> two is correct--the normal distance the UA uses between two vertical 
> captions (which would be lost by specifying the line height explicitly).  
> Combined with the separate coloring (which is already possible, of 
> course), it's possible to read both conversations and intuitively track 
> which is which, and it's also very easy to just pick one or the other to 
> read.

As far as I can tell, the WebVTT algorithm would handle this case pretty 
well. 


> One example of how this can be tricky: at 0:17, a caption on the bottom 
> wraps and takes two lines, which then pushes the line at 0:19 upward 
> (that part's simple enough).  If instead the top part had appeared 
> first, the renderer would need to figure out in advance to push it 
> upwards, to make space for the two-line caption underneath it.  
> Otherwise, the captions would be forced to switch places.

Right, without lookahead I don't know how you'd solve it. With lookahead 
things get pretty dicey pretty quickly.


On Mon, 24 Jan 2011, Tab Atkins Jr. wrote:
> 
> Right now, the WebVTT spec handles this by writing the text in white on 
> top of a partially-transparent black background.  The text thus never 
> has contrast troubles, at the cost of a dark block covering up part of 
> the display.
> 
> Stroking text is easy, though.  Webkit has an experimental property for 
> doing it directly.  Using existing CSS, it's easy to adapt text-shadow 
> to produce a good outline - just make four shadows, offset by 1px in 
> each direction, and you're good.

WebVTT allows both text-shadow and text-outline.
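
For example, an outline in the style Tab describes (assuming the ::cue 
selector applies here):

  ::cue {
    color: white;
    text-shadow: 1px 0 black, -1px 0 black, 0 1px black, 0 -1px black;
  }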


On Wed, 9 Feb 2011, Silvia Pfeiffer wrote:
>
> We're trying to avoid the need for multiple transcodings and are trying 
> to achieve something like the following pipeline: broadcast captions -> 
> transcode to WebVTT -> show in browser -> transcode to broadcast devices 
> -> show

Why not just do:

   broadcast captions -> transcode to WebVTT -> show in browser

...for browsers and:

   broadcast captions -> show

...for legacy broadcast devices?


In any case the amount of legacy broadcast captions pales in comparison to 
the volume of new captions we will see for the Web. I'm not really 
convinced that legacy broadcast captions are an important concern here.


> What is the argument against using <u> in captions?

What is the argument _for_ using <u> in captions? We don't add features 
due to a lack of reasons not to. We add features due to a plethora of 
reasons to do so.


> > [ foolip suggests using multiple cues to do blinking ]
> 
> But from a captioning/subtitling point of view it's probably hard to 
> convert that back to blinking text, since we've just lost the semantic 
> by ripping it into multiple cues (and every program would use different 
> ways of doing this).

I do not think round-tripping legacy broadcast captions through WebVTT is 
an important use case. If that is something that we should support, then 
we should first establish why it is an important use case, and then 
reconsider WebVTT within that context, rather than adding features to 
handle it piecemeal.


> I guess what we are discovering is that we can define the general format 
> of WebVTT for the Web, but that there may be an additional need to 
> provide minimum implementation needs (a "profile" if you want - as much 
> as I hate this word).

Personally I have nothing against the word "profile", but I do have 
something against providing for "minimum implementation needs".

Interoperability means everything works the same everywhere.


> [re versioning the file format]
> In a contract between a caption provider and a caption consumer (I am 
> talking about companies here), the caption consumer will want to tell 
> the caption provider what kind of features they expect the caption files 
> to contain and features they want avoided. This links back to the 
> earlier identified need for "profiles". This is actually probably 
> something outside the scope of this group, but I am sure there is a need 
> for such a feature, in particular if we want to keep the development of 
> the WebVTT specification open for future extensions.

I don't see why there would be a need for anything beyond "make sure it 
works with deployed software", maybe with that being explicitly translated 
to specific features and workarounds for known bugs, e.g. "you can use 
ruby, but make sure you don't have timestamps out of order".

This, however, has no correlation to versions of the format.


On Mon, 14 Feb 2011, Philip Jägenstedt wrote:
> >
> > [line wrapping]
>
> There's still plenty of room for improvements in line wrapping, though. 
> It seems to me that the main reason that people line wrap captions 
> manually is to avoid getting two lines of very different length, as that 
> looks quite unbalanced. There's no way to make that happen with CSS, and 
> AFAIK it's not done by the WebVTT rendering spec either.

WebVTT just defers to CSS for this. I agree that it would be nice for CSS 
to allow UAs to do more clever things here and (more importantly) for UAs 
to actually do more clever things here.


On Tue, 15 Feb 2011, Silvia Pfeiffer wrote:
> foolip wrote:
> >
> > Sure, it's already handled by the current parsing spec, since it 
> > ignores everything up to the first blank line.
> 
> That's not quite how I'm reading the spec.
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#webvtt-0
> allows
> "Optionally, either a U+0020 SPACE character or a U+0009 CHARACTER
> TABULATION (tab) character followed by any number of characters that
> are not U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)
> characters."
> after the "WEBVTT FILE" magic.
> To me that reads like all of the extra stuff has to be on the same line.
> I'd prefer if this read "any character except for two WebVTT line
> terminators", then it would all be ready for such header-style
> metadata.

That's the syntax rules. It's not the parser.


> I'm told <u> is fairly common in traditional captions.

I've never seen it. Do you have any data on this?


> > Personally, I think we're going to see more and more devices running 
> > full browsers with webfonts support, and that this isn't going to be a 
> > big problem.
> 
> I tend to agree and in fact I see that as the shiny future. Just not 
> quite yet.

We're not quite at WebVTT yet either. Currently, there's more support for 
WebFonts than WebVTT.


On Tue, 15 Feb 2011, Glenn Maynard wrote:
>
> I think that, no matter what you do, people will insert line breaks in 
> cues.  I'd follow the HTML model here: convert newlines to spaces and 
> have a separate, explicit line break like <br> if needed, so people 
> don't manually line-break unless they actually mean to.

The line-breaks-are-line-breaks feature is one of the features that 
originally made SRT seem like a good idea. It still seems like the neatest 
way of having a line break.


> Related to line breaking, should there be an &nbsp; escape?  Inserting 
> nbsp literally into files is somewhat annoying for authoring, since 
> they're indistinguishable from regular spaces.

How common would &nbsp; be?


On Thu, 10 Feb 2011, Silvia Pfeiffer wrote:
> 
> Further discussions at Google indicate that it would be nice to make 
> more components optional. Can we have something like this:
> 
>       [[h*:]mm:]ss[.[d[c[m]]]]  | s*[.d[c[m]]]
> 
> Examples:
>     23  = 23 seconds
>     23.2  = 23 sec, 1 decisec
>     1:23.45   = 1 min, 23 sec, 45 centisec
>     123.456  = 123 sec, 456 millisec

Currently the syntax is [h*:]mm:ss.sss; what's the advantage of making 
this more complicated? It's not like most subtitled clips will be shorter 
than a minute. Also, why would we want to support multiple redundant ways 
of expressing the same time? (e.g. 01:00.000 and 60.000)

Readability of VTT files seems like it would be helped by consistency, 
which suggests using the same format everywhere, as much as possible.
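
For comparison, the example times written in the existing [h*:]mm:ss.sss 
form:

    23       ->  00:23.000
    23.2     ->  00:23.200
    1:23.45  ->  01:23.450
    123.456  ->  02:03.456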


On Sun, 16 Jan 2011, Mark Watson wrote:
> 
> I have been looking at how the video element might work in an adaptive 
> streaming context where the available media are specified with some kind 
> of manifest file (e.g. MPEG DASH Media Presentation Description) rather 
> than in HTML.
> 
> In this context there may be choices available as to what to present, 
> many but not all related to accessibility:
>
> - multiple audio languages
> - text tracks in multiple languages
> - audio description of video
> - video with open captions (in various languages)
> - video with sign language
> - audio with directors commentary
> - etc.
> 
> It seems natural that for text tracks, loading the manifest could cause 
> the video element to be populated with associated <track> elements, 
> allowing the application to discover the choices and activate/deactivate 
> the tracks.

Not literal <track> elements, hopefully, but in-band text tracks (known as 
"media-resource-specific text track" in the spec).


> But this seems just for text tracks. I know discussions are underway on 
> what to do for other media types, but my question is whether it would be 
> better to have a consistent solution for selection amongst the available 
> media that applies for all media types ?

They're pretty different from each other, so I don't know that one 
solution would make sense for all of these.

Does the current solution (the videoTracks, audioTracks, and textTracks 
attributes) adequately address your concern?
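
For illustration, a minimal sketch of enumerating the exposed tracks from 
script (only the TextTrack attributes already named in the spec are assumed 
here):

  var video = document.querySelector('video');
  video.addEventListener('loadedmetadata', function () {
    for (var i = 0; i < video.textTracks.length; i++) {
      var track = video.textTracks[i];
      console.log(track.kind, track.label, track.language);
    }
    // audioTracks and videoTracks can be inspected similarly once implemented.
  });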


On Mon, 17 Jan 2011, Jeroen Wijering wrote:
> 
> We are getting some questions from JW Player users that HTML5 video is 
> quite wasteful on bandwidth for longer videos (think 10min+). This is 
> because browsers download the entire movie once playback starts, 
> regardless of whether a user pauses the player. If throttling is used, 
> it seems very conservative, which means a lot of unwatched video is in 
> the buffer when a user unloads a video.
> 
> I did a simple test with a 10 minute video: playing it; pausing after 30 
> seconds and checking download progress after another 30 seconds. With 
> all browsers (Firefox 4, Safari 5, Chrome 8, Opera 11, iOS 4.2), the 
> video would indeed be fully downloaded after 60 seconds. Some throttling 
> seems to be applied by Safari / iOS, but this could also be bandwidth 
> fluctuations on my side. Either way, all browsers downloaded the 10min 
> video while only 30 seconds were being watched.
> 
> The HTML5 spec is a bit generic on this topic, allowing mechanisms such 
> as stalling and throttling but not requiring them, or prescribing a 
> scripting interface:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-resource

Right, this is an area that is left up to implementations; a quality of 
implementation issue.


> A suggestion would be to implement / expose a property called 
> "downloadBufferTarget". It would be the amount of video in seconds the 
> browser tries to keep in the download buffer.

Wouldn't this be very situation-specific? e.g. if I know I'm about to go 
into a tunnel for five minutes, I want five minutes of buffered data. If 
my connection has a high packet loss rate and could stall for upwards of 
10 seconds, I want way more than 10 seconds in my buffer. If my connection 
is such that I can't download data in realtime, I want the whole video in 
my buffer. If my connection is such that I have 8ms latency to the video 
server and enough bandwidth to transfer the whole four hour file in 3 
seconds, then really I don't need anything in my buffer.


On Mon, 17 Jan 2011, Roger Hågensen wrote:
> On 2011-01-17 18:36, Markus Ernst wrote:
> > 
> > Could this be done at the user side, e.g. with some browser setting? 
> > Or even by a "stop downloading" control in the player? An intuitive 
> > user control would be separate stop and pause buttons, as we know them 
> > from tape and CD players. Pause would then behave as it does now, 
> > while stop would cancel downloading.
> 
> I think that's the right way to do it, this should be in the hands of 
> the user and exposed as a preference in the browsers.

Agreed.


> Although exposing (read only?) the user's preferred buffer setting to 
> the HTML App/Plugin etc. would be a benefit I guess as the desired 
> buffering could be communicated back to the streaming server for example 
> for a better bandwidth utilization.

How would the information be used?


On Mon, 17 Jan 2011, Zachary Ozer wrote:
>
> What no one has mentioned so far is that the real issue isn't the 
> network utilization or the memory capacity of the devices, it's 
> bandwidth cost.
> 
> The big issue for publishers is that they're incurring higher costs when 
> using the <video> tag, which is a disincentive for adoption.
> 
> Since there are situations where both the publisher and the user are 
> potentially incurring bandwidth costs (or have other limitations), we 
> could allow the publisher to specify downloadBufferTarget and the user 
> to specify a setting in the browser's config. The browser would then 
> actually buffer min(user setting, downloadBufferTarget). At that point 
> there would probably need to be another read-only property that 
> specified what value the browser is currently using as its buffer 
> length, but maybe the getter for downloadBufferTarget is sufficient.

I think before we get something that elaborate set up, we should just try 
getting preload="" implemented. :-) That might be sufficient.
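
For reference, that hint is just markup (file name invented):

   <video src="clip.webm" preload="metadata" controls></video>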


On Tue, 18 Jan 2011, Robert O'Callahan wrote:
>
> One solution that could work here is to honour dynamic changes to 
> 'preload', so switching preload to 'none' would stop buffering. Then a 
> script could do that, for example, after the user has paused the video 
> for ten seconds. The script could also look at 'buffered' to make its 
> decision.

If browsers want to do that I'm quite happy to add something explicitly to 
that effect to the spec. Right now the spec doesn't disallow it.
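
For illustration, a sketch of the pattern Robert describes, assuming UAs 
honour dynamic changes to preload="" (the ten-second delay is arbitrary):

  var video = document.querySelector('video');
  var idleTimer;
  video.addEventListener('pause', function () {
    idleTimer = setTimeout(function () { video.preload = 'none'; }, 10000);
  });
  video.addEventListener('play', function () {
    clearTimeout(idleTimer);
    video.preload = 'auto';
  });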


On Wed, 19 Jan 2011, Philip Jägenstedt wrote:
>
> The only difference between preload=none and preload=metadata is how 
> much is fetched if the user doesn't interact at all with the video. Once 
> the user has begun playing, I think the two mean the same thing: "please 
> don't waste my bandwidth more than necessary". In other words, I think 
> that for preload=metadata, browsers should be somewhat conservative even 
> after playback has begun, not going all the way to the preload=auto 
> behavior.

The descriptions are somewhat loose, but something like this could work, 
yes. (Though I'd say after playing preload=metadata and preload=auto are 
the same and preload=none is the one that says to avoid bandwidth usage, 
but that's just an artifact of the way I wrote the descriptions.)


On Tue, 18 Jan 2011, Zachary Ozer wrote:
> 
> Currently, there's no way to stop / limit the browser from buffering - 
> once you hit play, you start downloading and don't stop until the 
> resource is completely loaded. This is largely the same as Flash, save 
> the fact that some browsers don't respect the preload attribute. (Side 
> note: I also haven't found a browser that stops loading the resource 
> even if you destroy the video tag.)
> 
> There have been a few suggestions for how to deal with this, but most 
> have revolved around using downloadBufferTarget - a settable property 
> that determines how much video to buffer ahead in seconds. Originally, 
> it was suggested that the content producers should have control over 
> this, but most seem to favor the client retaining some control since 
> they are the most likely to be in low bandwidth situations. (Publishers 
> who want strict bandwidth control could use a more advanced server and 
> communication layer ala YouTube).
> 
> The simplest enhancement would be to honor the downloadBufferTarget only 
> when readyState=HAVE_ENOUGH_DATA and playback is paused, as this would 
> imply that there is not a low bandwidth situation.

It seems the simplest enhancement would be to have the browsers do the 
right thing (e.g. download enough to get to HAVE_ENOUGH_DATA and stop if 
the video is paused, or some such), not to add a feature that all Web 
authors would have to handle.


On Tue, 18 Jan 2011, Boris Zbarsky wrote:
> 
> In general, depending on finalizers to release resources (which is 
> what's happening here) is not really a workable setup.  Maybe we need an 
> api to explicitly release the data on an audio/video tag?

The spec suggests removing the element's src="" attribute and <source> 
elements and then calling the element's load() method.

The spec also suggests that implementors release all resources used by a 
media element when that media element is an orphan when the event loop 
spins.

See the "Best practices for authors using media elements" and "Best 
practices for implementors of media elements" sections.
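
For illustration, the author-side suggestion amounts to something like this:

  var video = document.querySelector('video');
  video.removeAttribute('src');
  var sources = video.querySelectorAll('source');
  for (var i = 0; i < sources.length; i++)
    video.removeChild(sources[i]);
  video.load();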


On Wed, 19 Jan 2011, Andy Berkheimer wrote:
> 
> In the case where the viewer does not have enough bandwidth to stream
> the video in realtime, there are two basic options for the experience:
> - buffer the majority of the video (per Glenn and Boris' discussion)
> - switch to a lower bitrate that can be streamed in realtime
> 
> This thread has focused primarily on the first option and this is an 
> experience that we see quite a bit.  This is the option favored amongst 
> enthusiasts and power users, and also makes sense when a viewer has made 
> a purchase with an expectation of quality.  And there's always the 
> possibility that the user does not have enough bandwidth for even the 
> lowest available bitrate.
> 
> But the second option is the experience that the majority of our viewers 
> expect.
> 
> The ideal interface would have a reasonable default behavior but give an 
> application the ability to implement either experience depending on user 
> preference (or lack thereof), viewing context, etc.

Agreed. This is the kind of thing that a good streaming protocol can 
negotiate in realtime.


> I believe Chrome's current implementation _does_ stall the HTTP 
> connection (stop reading from the socket interface but keep it open) 
> after some amount of readahead - a magic hardcoded constant. We've run 
> into issues there - their browser readahead buffer is too small and 
> causing a lot of underruns.

It's early days. File bugs!


> No matter how much data you pass between client and server, there's 
> always some useful playback state that the client knows and the server 
> does not - or the server's view of the state is stale.  This is 
> particularly true if there's an HTTP proxy between the user agent and 
> the server.  Any behavior that could be implemented through an advanced 
> server/communication layer can be achieved in a simpler, more robust 
> fashion with a solid buffer management implementation that provides 
> "advanced" control through javascript and attributes.

The main difference is that a protocol will typically be implemented a few 
times by experienced programmers writing servers and clients, which will 
then be deployed and used by less experienced (in this kind of thing) Web 
developers, while if we just expose it to JavaScript, the people 
implementing it will be a combination of experienced library authors and 
those same Web developers, and the result will likely be less successful.

However, the two aren't mutually exclusive. We could do one and then later 
(or at the same time) do the other.


On Tue, 18 Jan 2011, Roger Hågensen wrote:
>
> It may sound odd but in low storage space situations, it may be 
> necessary to unbuffer what has been played. Is this supported at all 
> currently?

Yes.


> I think that the buffering should basically be a "moving window" (I hope 
> most here are familiar with this term?), and that the size of the moving 
> window should be determined by storage space and bandwidth and browser 
> preference and server preference, plus make sure the window supports 
> skipping anywhere without needing to buffer up to it, and avoid 
> buffering from the start just because the user skipped back a little to 
> catch something they missed (another annoyance). This is the only 
> logical way to do this really. Especially since HTTP 1.1 has byterange 
> support there is nothing preventing it from being implemented, and I 
> assume other popular streaming protocols supports byterange as well?

Implementations are allowed to do that.
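
For illustration only, a byte-range request of the kind a moving-window 
implementation might issue (URL and offsets invented):

  GET /videos/clip.webm HTTP/1.1
  Host: example.com
  Range: bytes=5242880-7340031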


On Tue, 18 Jan 2011, Silvia Pfeiffer wrote:
> 
> I think that's indeed one obvious improvement, i.e. when going to pause 
> state, stop buffering when readyState=HAVE_ENOUGH_DATA (i.e. we have 
> reached canplaythrough state).

The spec allows this already.


> However, again, I don't think that's sufficient. Because we will also 
> buffer during playback and it is possible that we buffer fast enough to 
> have buffered e.g. the whole of a 10min video by the time we hit pause 
> after 1 min and stop watching. That's far beyond canplaythrough and 
> that's 9min worth of wasted video download bandwidth. This is where the 
> suggested downloadBufferTarget would make sense. It would basically 
> specify how much more to download beyond HAVE_ENOUGH_DATA before pausing 
> the download.

I don't understand how a site can know what the right value is for this. 
Users aren't going to understand that they have to control the buffering 
if (e.g.) they're about to go into a tunnel and they want to make sure 
it's buffered all the way through. It should just work, IMHO.


On Tue, 18 Jan 2011, David Singer wrote:
>
> If you want a more tightly coupled supply/consume protocol, then use 
> one.  As long as it's implemented by client and server, you're on.
> 
> Note that the current move of the web towards download in general and 
> HTTP in particular is due in no small part to the fact that getting more 
> tightly coupled protocols -- actually, any protocol other than HTTP -- 
> out of content servers, across firewalls, through NATs, and into clients 
> is...still a nightmare.  So, we've been given a strong incentive by all 
> those to use HTTP.  It's sad that some of them are not happy with that 
> result, but it's going to be hard to change now.

Agreed, though in practice there are certainly ways to get two-way 
protocols through. WebSocket does a pretty good job, for example. But 
designing a protocol for this is out of scope for this list, really.


On Tue, 18 Jan 2011, David Singer wrote:
> 
> In RTSP-controlled RTP, there is a tight relationship between the play 
> point, and play state, the protocol state (delivering data or paused) 
> and the data delivered (it is delivered in precisely real-time, and 
> played and discarded shortly after playing).  The server delivers very 
> little more data than is actually watched.
> 
> In HTTP, however, the entire resource is offered to the client, and 
> there is no protocol to convey play/paused back to the server, and the 
> typical behavior when offered a resource in HTTP is to make a simple 
> binary decision to either load it (all) or not load it (at all).  So, by 
> providing a media resource over HTTP, the server should kinda be 
> expecting this 'download' behavior.
> 
> Not only that, but if my client downloads as much as possible as soon as 
> possible and caches as much as possible, and yours downloads as little 
> as possible as late as possible, you may get brownie points from the 
> server owner, but I get brownie points from my local user -- the person 
> I want to please if I am a browser vendor.  There is every incentive to 
> be resilient and 'burn' bandwidth to achieve a better user experience.
> 
> Servers are at liberty to apply a 'throttle' to the supply, of course 
> ("download as fast as you like at first, but after a while I'll only 
> supply at roughly the media rate").  They can suggest that the client be 
> a little less aggressive in buffering, but it's easily ignored and the 
> incentive is to ignore it.
> 
> So I tend to return to "if you want more tightly-coupled behavior, use a 
> more tightly-coupled protocol"...

Indeed.


On Wed, 19 Jan 2011, Philip Jägenstedt wrote:
> 
> The 3 preload states imply 3 simple buffering strategies:
> 
> none: don't touch the network at all
> metadata: buffer as little as possible while still reaching readyState
> HAVE_METADATA
> auto: buffer as fast and much as possible

"auto" isn't "as fast and much as possible", it's "as fast and much as 
will make the user happy". In some configurations, it might be the same as 
"none" (e.g. if the user is paying by the byte and hates video).


> However, the state we're discussing is when the user has begun playing the
> video. The spec doesn't talk about it, but I call it:
> 
> invoked: buffer as little as possible without readyState dropping below
> HAVE_FUTURE_DATA (in other words: being able to play from currentTime to
> duration at playbackRate without waiting for the network)

There's also a fifth state, let's call it "aggressive", where even while 
playing the video the UA is trying to download the whole thing in case the 
connection drops.


> If the available bandwidth exceeds the bandwidth of the resource, some 
> kind of throttling must eventually be used. There are mainly 2 options 
> for doing this:
> 
> 1. Throttle at the TCP level by not reading data from the socket (not at all
> to suspend, or at a controlled rate to buffer ahead)
> 2. Use HTTP byte ranges, making many smaller requests with any kind of
> throttling at the TCP level

There's also option 3, to handle the fifth state above: don't throttle.


> When HTTP byte ranges are used to achieve bandwidth management, it's 
> hard to talk about a single downloadBufferTarget that is the number of 
> seconds buffered ahead. Rather, there might be an upper and lower limit 
> within which the browser tries to stay, so that each request can be of a 
> reasonable size. Neither an author-provided minimum nor maximum value can 
> be followed particularly closely, but could possibly be taken as a hint 
> of some sort.

Would it be a more useful hint than "preload"? I'm skeptical about adding 
many hints with no requirements. If there's some specific further 
information we can add, though, it might make sense to add more features 
to "preload".


> The above buffering strategies are still not enough, because users seem 
> to expect that in a low-bandwidth situation, the video will keep 
> buffering until they can watch it through to the end. These seem to be 
> the options for solving the problem:
> 
> * Make sites that want this behavior set .preload='auto' in the 'paused' 
> event handler
>
> * Add an option in the context menu to "Preload Video" or some such
>
> * Cause an invoked (see dfn above) but paused video to behave like 
> preload=auto
>
> * As above, but only when the available bandwidth is limited
> 
> I don't think any of these solutions are particularly good, so any input 
> on other options is very welcome!

If users expect something, it seems logical that it should just happen. I 
don't have a problem with saying that it should depend on preload="", 
though. If you like I can make the spec explicitly describe what the 
preload="" hints mean while video is playing, too.


On Wed, 19 Jan 2011, Zachary Ozer wrote:
> 
> What if, instead of trying to solve this problem, we leave it up to the 
> publishers. The current behavior would be unchanged, but we could add 
> explicit bandwidth management API calls, ie startBuffer() and 
> stopBuffer(). This would let developers / site publishers control how 
> much to buffer and when.

We couldn't depend on it (most people presumably won't want to do anything 
but give the src="" of their video).


> We might also consider leaning on users a bit to tell us what they want. 
> For example, I think people are pretty used to hitting play and then 
> pause to buffer until the end of the video. What if we just used our 
> bandwidth heuristics while in the play state, and buffered blindly when 
> a pause occurs less than X seconds into a video? I won't argue that this 
> is a wonderful solution (or a habit we should encourage), but I figured 
> I'd throw a random idea out there…

That seems like pretty ugly UI. :-)


On Thu, 20 Jan 2011, Glenn Maynard wrote:
> 
> I think that pausing shouldn't affect read-ahead buffering behavior.  
> I'd suggest another preload value, preload=buffer, sitting between 
> "metadata" and "auto".  In addition to everything loaded by "metadata", 
> it also fills the read-ahead buffer (whether the video is playing or 
> not).
> 
> - If a page wants prebuffering only (not full preloading), it sets 
> preload=buffer.  This can be done even when the video is paused, so when 
> the user presses play, the video starts instantly without pausing for a 
> server round-trip like preload=metadata.

So this would be to buffer enough to play through assuming the network 
remains at the current bandwidth, but no more?


> - If a page wants prebuffering while playing, but unlimited buffering when
> paused (per Zachary's suggestion), it sets preload=buffer when playing and
> preload=auto when paused.

Again, note that "auto" doesn't mean "buffer everything", it means "do 
whatever is best for the user".

I don't mind adding new values if the browser vendors are going to use 
them.


On Sat, 22 Jan 2011, David Singer wrote:
>
> When the HTML5 states were first proposed, I went through a careful 
> exercise to make sure that they were reasonably delivery-technology 
> neutral, i.e. that they applied equally well if say RTSP/RTP was used, 
> some kind of dynamic streaming, simple HTTP, and so on.
> 
> I am concerned that we all tend to assume that HTML==HTTP, but the 
> source URL for the media might have any protocol type, and the HTML 
> attributes, states etc. should apply (or clearly not apply) to anything.
> 
> Assuming only HTTP, in the markup, is not a good direction.

Agreed.


On Thu, 20 Jan 2011, Matthew Gregan wrote:
> 
> The media seek algorithm (4.8.10.9) states that the current playback 
> position should be set to the new playback position during the 
> asynchronous part of the algorithm, just before the seeking event is 
> fired. [...]

On Thu, 20 Jan 2011, Philip Jägenstedt wrote:
> 
> There have been two non-trivial changes to the seeking algorithm in the 
> last year:
> 
> Discussed at http://lists.w3.org/Archives/Public/public-html/2010Feb/0003.html
> lead to http://html5.org/r/4868
> 
> Discussed at http://lists.w3.org/Archives/Public/public-html/2010Jul/0217.html
> lead to http://html5.org/r/5219

Yeah. In particular, sometimes there's no way for the UA to know 
synchronously if the seek can be done, which is why the attribute is set 
after the method returns. It's not ideal, but the alternative is not 
always implementable.
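
For illustration, the timing issue under discussion looks like this to a 
script (per the draft text quoted above):

  var video = document.querySelector('video');
  video.currentTime = 42;          // request a seek
  console.log(video.currentTime);  // may still report the old position; it is
                                   // updated in the asynchronous section, just
                                   // before 'seeking' fires
  video.addEventListener('seeked', function () {
    console.log('seek finished at', video.currentTime);
  });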


> With that said, it seems like there's nothing that guarantees that the 
> asynchronous section doesn't start running while the script is still 
> running.

Yeah. It's not ideal, but I don't really see what we can do about it.


> It's also odd that currentTime is updated before the seek has actually 
> been completed, but the reason for this is that the UI should show the 
> new position.

Not just the UI. The current position is what the browser is trying to 
play; if the current position didn't move, then the browser wouldn't be 
trying to play it.


On Fri, 4 Feb 2011, Matthew Gregan wrote:
> 
> For anyone following along, the behaviour has now been changed in the 
> Firefox 4 nightly builds.

On Mon, 24 Jan 2011, Robert O'Callahan wrote:
> 
> I agree. I think we should change behavior to match author expectations 
> and the other implementations, and let the spec change to match.

How do you handle the cases where it's not possible?


If all the browsers can do it, I'm all for going back to having 
currentTime change synchronously.


On Sat, 29 Jan 2011, Lubomir Toshev wrote:
> 
> [W]hen the video tag has embedded browser controls displayed and I click 
> anywhere on the controls, they cause a video tag click event. If I want 
> to toggle play/pause on video area click, then I cannot do this, because 
> clicking on the play control button, fires play, then click event fires 
> for video tag and when I toggle, it pauses. So this behavior that every 
> popular flash player has cannot be achieved. There is no way to 
> understand that the click.target is the embedded browser controls area. 
> I think that a nice improvement will be to expose this information, in 
> the target, that it actually is embedded browser controls. Or clicking 
> the embedded browser controls should not produce a click event for video 
> tag. After all browser controls are native and do not have 
> representation in the DOM. Let me know what you think about this?

On Sat, 29 Jan 2011, Aryeh Gregor wrote:
> 
> Well, to begin with, you could just use your own controls rather than 
> the browser's built-in controls.  Then you have no problem.  If you're 
> using the browser's built-in controls, maybe you should stick with the 
> browser's control conventions throughout, which presumably doesn't 
> include toggling play/pause on click.
> 
> I'm not sure this is a broad enough problem to warrant exposing the 
> extra information in the target.  Are there any other use-cases for such 
> info?

On Sun, 30 Jan 2011, Lubomir Toshev wrote:
>
> To elaborate a bit, I'm a control developer and I have my own custom 
> controls. But we want to allow for the customer to use the default 
> browser controls if they want to. This can be done by switching an 
> option in my jQuery widget - browserControls - true/false. Or through 
> browser context menu shown by default on right click. So I'm trying to 
> be flexible enough for the customer.
> 
> I was thinking about this
> 1) that adding a transparent overlay over the browser controls 
> Or
> 2) to detect the click position and if it is some pixels away from the 
> bottom of the video tag
> 
> will fix this, but every browser has different height for its embedded 
> controls and I should hardcode this height in my code, which is just not 
> manageable.
> 
> I can always add a limitation when using browser controls, toggle 
> play/pause on video area click will be turned off, but I want to achieve 
> similar behavior in all the browsers no matter whether they use embedded 
> controls or not.
> 
> So I think this tiny click.target thing will be very useful.

On Sun, 30 Jan 2011, Glenn Maynard wrote:
> 
> Even as a bad hack it's simply not possible; for example, there's no way 
> to tell whether a pop-out volume control is open or not.
> 
> I think the primary use case browser controls are meant for is when 
> scripting isn't available at all.  They aren't very useful when you're 
> using any kind of scripts with the video.  Another problem, related to 
> your other post about captioning, is that it's impossible to put 
> anything between the video and the controls, so your captions will draw 
> *on top of* browser controls.

On Mon, 31 Jan 2011, Simon Pieters wrote:
> 
> See http://lists.w3.org/Archives/Public/public-html/2009Jun/0395.html
> 
> I suggested that the browser would not generate an event at all when 
> using the native controls. Seemingly there was no reply to Hixie's 
> request for opinion from other implementors.

On Mon, 31 Jan 2011, Glenn Maynard wrote:
>
> There are other meaningful ways to respond to these events; for example, 
> to pull its container to the top of the draw order if it's a floating 
> window. I should be able to capture mousedown on the container to do 
> this, regardless of content.

On Mon, 31 Jan 2011, Simon Pieters wrote:
> 
> How about just suppressing activation events like click?

On Mon, 31 Jan 2011, Glenn Maynard wrote:
> 
> That makes more sense than suppressing the entire mousedown/mouseup 
> events (and keydown, touchstart, etc).
> 
> Also, it means you can completely emulate the event behavior of the 
> default browser controls with scripts: preventDefault on mousedown to 
> prevent click events.  That's probably not what you actually want to do, 
> but it means the default controls aren't doing anything special: their 
> effect on events can be understood entirely in terms of what scripted 
> events can already do.

On Mon, 31 Jan 2011, Lubomir Toshev wrote:
>
> I totally agree that events should not be raised, when they originate 
> from the native browser controls. This would make it much simpler. I 
> filed the same bug for Opera 11 last week.

As with the post Simon cites above, I'm happy to do this kind of thing, if 
multiple vendors agree that it makes sense. If you would like this to be 
done, I recommend getting other browser vendors to tell me it sounds good!
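
For context, the pattern that currently breaks with native controls is just:

  var video = document.querySelector('video');
  video.addEventListener('click', function () {
    // Clicks on the built-in play button also reach this handler, so the
    // toggle immediately undoes the control's own action.
    if (video.paused) video.play();
    else video.pause();
  });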


On Sat, 29 Jan 2011, Lubomir Toshev wrote:
> 
> [V]ideo should expose API for currentFrame, so that when control 
> developers want to add support for subtitles on their own, to be able to 
> support formats that display the subtitles according to the current 
> video frame. This is a limitation to the current design of the video 
> tag.

On Sun, 30 Jan 2011, Lubomir Toshev wrote:
>
> We were trying to add support for subtitles for our player control that 
> uses video tag as its base. There are two popular subtitle formats *.srt 
> which uses currentTime to show the subtitles where they should be. Like 
> 0:01:00 - 0:01:30 - "What a nice hotel." While the other popular format 
> is *.sub which uses the currentFrame to show the proper subtitles. Like 
> {45600}, {45689} - "What a nice hotel". And if I want to add this 
> support it would be good if video tag exposes currentFrame, so that I 
> can show properly the subtitles in a span positioned over the video. Now 
> does it make more sense?
> 
> I know video will have embedded subtitle support, but I think that it 
> should be flexible enough to allow building such features like the one 
> above. What do you think? To me this is worth adding because, it should 
> be really easy to implement?

We'll probably add that along with the metrics, when we add those, if 
there's a strong use case for it. I'm not sure that supporting frame-based 
subtitles is a good use case though.
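
For what it's worth, frame-based formats can already be handled without a 
currentFrame API if the author knows (or assumes) the frame rate; a 
hypothetical helper:

  // MicroDVD-style cues give frame numbers; convert them to seconds.
  function frameToTime(frame, fps) {
    return frame / fps;
  }
  var start = frameToTime(45600, 25);  // 25 fps assumed
  var end   = frameToTime(45689, 25);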


On Mon, 14 Feb 2011, David Flanagan wrote:
>
> The draft specification defines 20+ media event handler IDL attributes 
> on HTMLElement.  These events are non-bubbling and are always targeted 
> at <audio> and <video> tags, so I wonder if they wouldn't be better 
> defined on HTMLMediaElement instead.

All event handlers are on HTMLElement, to make implementations easier and 
to make the platform simpler.


On Tue, 15 Feb 2011, David Flanagan wrote:
> 
> Fair enough, though I do think it will confuse developers who will think 
> that those media events bubble.  (I'll be documenting them as properties 
> of HTMLMediaElement).

Whether an event bubbles or not is up to the place that dispatches the 
event, not the place that hears the event.


> What about Document and Window?  What's the justification for defining 
> the media event handler attributes on those objects?

Same. It allows the same logic to be used everywhere.


On Mon, 14 Feb 2011, Kevin Marks wrote:
> On Mon, Feb 14, 2011 at 2:39 PM, Ian Hickson <ian at hixie.ch> wrote:
> > On Fri, 19 Nov 2010, Per-Erik Brodin wrote:
> > >
> > > We are about to start implementing stream.record() and 
> > > StreamRecorder. The spec currently says that “the file must be in 
> > > a format supported by the user agent for use in audio and video 
> > > elements” which is a reasonable restriction. However, there is 
> > > currently no way to set the output format of the resulting File that 
> > > you get from recorder.stop(). It is unlikely that specifying a 
> > > default format would be sufficient if you in addition to container 
> > > formats and codecs consider resolution, color depth, frame rate etc. 
> > > for video and sample size and rate, number of channels etc. for 
> > > audio.
> > >
> > > Perhaps an argument should be added to record() that specifies the 
> > > output format from StreamRecorder as a MIME type with parameters? 
> > > Since record() should probably throw when an unsupported type is 
> > > supplied, it would perhaps be useful to have a canRecordType() or 
> > > similar to be able to test for supported formats.
> >
> > I haven't added anything here yet, mostly because I've no idea what to 
> > add. The ideal situation here is that we have one codec that everyone 
> > can read and write and so don't need anything, but that may be 
> > hopelessly optimistic.
> 
> That isn't the ideal, as it locks us into the current state of the art 
> forever. The ideal is to enable multiple codecs +formats that can be 
> swapped out over time. That said, uncompressed audio is readily 
> codifiable, and we could pick a common file format, sample rate, 
> bitdepth and channel count specification.

It doesn't lock us in to one format, we can always add more formats later. 
Right now, we have zero formats, so one format would be a huge step up.


On Fri, 4 Mar 2011, Philip Jägenstedt wrote:
> On Thu, 03 Mar 2011 22:15:58 +0100, Aaron Colwell <acolwell at google.com> 
> wrote:
> > 
> > I was looking at the resource fetch 
> > algorithm <http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-resource> section 
> > and fetching resources 
> > <http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#fetch> 
> > sections of the HTML5 spec to determine what the proper behavior is 
> > for handling redirects. Both YouTube and Vimeo do 302 redirects to 
> > different hostnames from the URLs specified in the src attribute. It 
> > looks like the spec says that playback should fail in these cases 
> > because they are from different origins (Section 2.7 Fetching 
> > resources bullet 7). This leads me to a few questions.
> > 
> > 1. Is my interpretation of the spec correct? Sample YouTube & Vimeo URLs are
> >   shown below.
> >   YouTube : src      : http://v22.lscache6.c.youtube.com/videoplayback? ...
> >             redirect : http://tc.v22.cache6.c.youtube.com/videoplayback?
> > ...
> > 
> >   Vimeo   : src      : http://player.vimeo.com/play_redirect? ...
> >             redirect : http://av.vimeo.com/05 ...
> 
> Yes, from what I can tell you're correct, but I think it's not 
> intentional. The behavior was changed by <http://html5.org/r/5111> in 
> 2010-06-25, and this is the first time I've noticed it. Opera (and I 
> assume most if not all other browsers) already supports HTTP redirects 
> for <video> and I don't think it makes much sense to disallow it. For 
> security purposes, the origin of the resource is considered to be the 
> final destination, not any of the origins in the redirect chain.

This was fixed recently.


On Fri, 18 Mar 2011, Eric Winkelman wrote:
> 
> For in-band metadata tracks, there is neither a standard way to 
> represent the type of metadata in the HTMLTrackElement interface nor is 
> there a standard way to represent multiple different types of metadata 
> tracks.

There can be a standard way. The idea is that all the types of metadata 
tracks that browsers will support should be specified so that all browsers 
can map them the same way. I'm happy to work with anyone interested in 
writing such a mapping spec, just let me know.


> Proposal:
> 
> For TimedTextTracks with kind=metadata the @label attribute should 
> contain a MIME type for the metadata and that a track only contain Cues 
> created from metadata of that MIME type.
> 
> This implies that streams with multiple types of metadata require the 
> creation of multiple metadata track objects, one for each MIME type.

This might make sense if we had a defined way of getting such a MIME type 
(and assuming you're talking about the IDL attributes, not the content 
attributes).


On Tue, 22 Mar 2011, Eric Winkelman wrote:
> 
> Ah, yes, now I understand the confusion.  Within the whatwg specs, the 
> word "attribute" is generally used and I was trying to be consistent.

The WHATWG specs refer to content attributes (those on elements) and IDL 
attributes (those on objects, which generate properties in JS). The @foo 
syntax is never used in the WHATWG specs. It's usually used in a W3C 
context just to refer to content attributes, by analogy to the XPath 
syntax. (Personally I prefer foo="" since it's less ambiguous.)


On Mon, 21 Mar 2011, Eric Winkelman wrote:
> 
> No, I'm not saying that, but as far as I can tell from the spec, it is 
> undefined how the user agent should map in-band data to metadata tracks.  
> I am proposing that the algorithm should be that different types of data 
> should go into different Timed Text Tracks, and that the track's @label 
> should reflect the type.

To the extent that it is defined, it is defined here:

   http://www.whatwg.org/specs/web-apps/current-work/complete.html#sourcing-in-band-text-tracks

But the theory, as mentioned above, is that specific types of in-band 
metadata tracks would have explicit specs written to define how the 
mapping is done.


> Recent updates to the spec, section 4.8.10.12.2 
> (http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#sourcing-in-band-text-tracks) 
> appear to address my concern in step 2:
> 
> "2.  Set the new text track's kind, label, and language based on the 
> semantics of the relevant data, as defined by the relevant 
> specification."
> 
> Provided that the relevant specification defines the metadata type 
> encoding to be put in the label, e.g. application/x-eiss, 
> application/x-scte35, application/x-contentadvisory, etc.

Well the problem is that there typically is no applicable specification, 
or that it is too vague.


On Tue, 22 Mar 2011, Lachlan Hunt wrote:
>
> This is regarding the recently added audioTracks and videoTracks APIs to 
> the HTMLMediaElement.
> 
> The design of these APIs seems to be done a little strangely, in that 
> dealing with each track is done by passing an index to each method on 
> the TrackList interfaces, rather than treating the audioTracks and 
> videoTracks as collections of individual audio/video track objects. This 
> design is inconsistent with the design of the TextTrack interface, and 
> seems sub-optimal.

It is intended to avoid an explosion of objects. TextTrack needs to be an 
object because it has separate state, gets targeted for events, has 
different versions (e.g. MutableTextTrack), etc. Audio and Video tracks 
are, on the other hand, rather trivial constructs.


> The use of ExclusiveTrackList for videoTracks also seems rather 
> limiting. What about cases where the second video track is a 
> sign-language track, or some other video overlay.

You use a separate <video> element.

I considered this in some depth. The main problem is that you end up 
having to define a layout mechanism for videos if you allow multiple 
videos to be enabled from script (e.g. consider what the behaviour should 
be if you enable the main video, then the PiP sign language video, then 
disable the main video. What is the intrinsic dimension of the <video> 
element? Does it matter if you do it in a different order?).

By making <video> be a single video's output layer, we can bypass many of 
these problems without removing expressibility (the author can still 
support multiple PiP videos).
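
For illustration, a sign-language overlay as a second, independently 
controlled video might look like this (file names and sizing invented):

  <div style="position: relative">
    <video src="main.webm" controls></video>
    <video src="signing.webm" muted
           style="position: absolute; right: 0; bottom: 0; width: 25%"></video>
  </div>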


> There are also the use cases for controlling the volume of individual 
> tracks that are not addressed by the current spec design.

Can you elaborate on these use cases?

My assumption has been that in the long term, if you want to manipulate 
specific audio tracks, you would use an <audio> element and plug it into 
the Audio API for separate processing.


On Sat, 2 Apr 2011, Bruce Lawson wrote:
>
> From a comment in a blog post of mine about longdesc 
> (http://www.brucelawson.co.uk/2011/longdesc-in-html5/comment-page-1/#comment-749853) 
> I'm wondering if this is an appropriate use of <details>
> 
> <details>
>   <summary>
>     <img src=chart.png alt="Graph of percentage of total U.S.
>     non-institutionalized population age 16-64 declaring one or more
>     disabilities">
>   </summary>
>   <p>The bar graph shows the percentage of total U.S. non-institutionalized
>   population age 16-64 declaring one or more disabilities. The percentage
>   value for each category is as follows:</p>
>   <ul>
>     <li>Total declaring one or more disabilities: 18.6 percent</li>
>     <li>Sensory (visual and hearing): 2.3 percent</li>
>     <li>Physical: 6.2 percent</li>
>     <li>Mental: 3.8 percent</li>
>     <li>Self-care: 1.8 percent</li>
>     <li>Difficulty going outside the home: 6.4 percent</li>
>     <li>Employment disability: 11.9 percent</li>
>   </ul>
>   <p>data retrieved from <a
>   href="http://www.census.gov/prod/2003pubs/c2kbr-17.pdf" title="Link to
>   External Site" class="external">2000 U.S. Census<span> -
>   external link</span></a></p>
> </details>
> 
> .. thereby acting as a discoverable-by-anyone longdesc. (The example is
> adapted from the longdesc example at
> http://webaim.org/techniques/images/longdesc#longdesc)
> 
> Note to grumpy people: I'm not trying to advocate abolishing longdesc,
> just seeing whether details can be used as an alternative.

It's a bit weird, but sure.

(Well, except for your alt="" text, which is a title="", not an alt="".)


On Sat, 2 Apr 2011, John Foliot wrote:
> 
> Interesting question. Referring to the spec, I think that you may have 
> in fact uncovered a bug in the text. The spec states:
> 
> 	"The user agent should allow the user to request that the details 
> be shown or hidden."
> 
> The problem (or potential problem) here is that the behaviour is defined 
> in visual terms -

The spec explicitly says that these terms have non-visual meaning.


On Mon, 4 Apr 2011, Bjartur Thorlacius wrote:
>
> IMO, the specification of the <details> element is overly focused on 
> expected renderings. Rather than explicitly defining the semantics of 
> <details> with or without an @open attribute, and with or without a 
> <summary> child, sane renderings for medium to large displays with which 
> the user can interact are described, and usage is to be inferred 
> therefrom. This is suboptimal, as it allows hiding <details open>s on 
> small output windows but shoulds against it as strongly as ignoring 
> addition of the open attribute. Note that the <details> element 
> represents a disclosure widget, but the contents are nowhere defined 
> (neither as additional information (that a user-agent may or may not 
> render, depending on factors such as scarcity of screen estate), nor as 
> spoiling information that shouldn't be provided to the user without 
> explicit consent). I regard the two different use cases as different, 
> even though vendors might implement both with { binding: details; } on 
> some media. <Details> can't serve both. It's often spoken of as if 
> intended for something else than the YouTube video description use case. 
> <Details> mustn't be used for hiding spoilers, or else browsers won't be 
> able to intelligently choose to render the would-be concealed contents.

I've clarified <details> to be better defined in this respect. I hope it 
addresses your concern.


On Fri, 22 Apr 2011, Dimitri Glazkov wrote:
>
> I wonder if it makes sense to introduce a set of pseudo-classes on the 
> video/audio elements, each reflecting a state of the media on the 
> controls (playing/paused/error/etc.)? Then, we could use just CSS to 
> style media controls (whether native or custom), and not have to listen 
> to DOM events just to tweak their appearance.

On Sat, 23 Apr 2011, Philip Jägenstedt wrote:
> 
> With a sufficiently large set of pseudo-classes it might be possible to 
> *display* most of the interesting state, but how would you *change* 
> the state without using scripts? Play/pause, seek, volume, etc...

On Sat, 23 Apr 2011, Dimitri Glazkov wrote:
> 
> This is not the goal of using pseudo-classes: they just provide you with 
> a uniform way to react to changes.

On Sat, 23 Apr 2011, Philip Jägenstedt wrote:
> 
> In other words, one would still have to rely heavily on scripts to 
> actually implement custom controls?
> 
> Also, how would one style a progress bar using pseudo-classes? How about 
> displaying elapsed/remaining time in the form MM:SS?

On Sat, 23 Apr 2011, Dimitri Glazkov wrote:
> 
> I am not in any way trying to invent a magical way to style media 
> controls entirely in CSS. Just trying to make the job of controls 
> developers easier and use CSS where it's well... useful? :)

On Sat, 23 Apr 2011, Philip Jägenstedt wrote:
> 
> Very well, what specific set of pseudo-classes do you think would be 
> useful?

On Sat, 23 Apr 2011, Dimitri Glazkov wrote:
> 
> I can infer what would be useful from WebKit's media controls as a first 
> stab?

On Mon, 25 Apr 2011, Silvia Pfeiffer wrote:
>
> A markup and CSS example would make things clearer. How do you think it 
> would look?

On Sun, 24 Apr 2011, Dimitri Glazkov wrote:
>
> Based on WebKit's current media controls, let's start with these pseudo-classes:
> 
> Play state:
> - loading
> - playing
> - streaming
> - error
> 
> Capabilities:
> - no-audio
> - no-video
> - has-closed-captioning
> 
> So, to show a status message while the control is loading or streaming
> and hide when it's done:
> 
> video -webkit-media-controls-status-display {
>     display: none;
> }
> 
> 
> video:loading -webkit-media-controls-status-display, video:streaming
> -webkit-media-controls-status-display {
>     display: initial;
>     ...
> }
> 
> Similarly, to hide volume controls when there's no audio:
> 
> video:no-audio -webkit-media-controls-volume-slider-container {
>     display: none;
> }
> 
> Once I put these pseudo-classes in place for WebKit, a lot of the code in 
> http://codesearch.google.com/codesearch/p#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/html/shadow/MediaControlRootElement.cpp&exact_package=chromium 
> will go away, being replaced with straight CSS.

Sounds to me like a poor man's XBL. I'd much rather see this addressed 
using a full-on binding solution, since it seems like it would be only a 
little more complex yet orders of magnitude more powerful.
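
(As an illustration of the status quo this proposal reacts to, here is a 
rough sketch of what scripted custom controls have to do today: listen to 
DOM events and mirror the media state into class names by hand. The class 
names and the controls-wrapper selector are hypothetical, not taken from 
any spec or implementation.)

  var video = document.querySelector('video');
  var controls = document.querySelector('.my-controls'); // hypothetical wrapper

  function reflectState() {
    // Mirror the media element's state into a class so CSS can style on it.
    var state = 'idle';
    if (video.error)
      state = 'error';
    else if (video.readyState < video.HAVE_FUTURE_DATA)
      state = 'loading';
    else if (!video.paused && !video.ended)
      state = 'playing';
    controls.className = 'my-controls ' + state;
  }

  ['loadstart', 'canplay', 'play', 'pause', 'ended', 'error', 'waiting']
    .forEach(function (name) {
      video.addEventListener(name, reflectState, false);
    });
  reflectState();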


On Fri, 13 May 2011, Narendra Sisodiya wrote:
> 
> What I want is a general-purpose synchronization mechanism where 
> resources (text, video, graphics, etc.) are played over a general-purpose 
> timer (timeline) with interaction.
> 
> Ex -
> 
>        <resource type="html" src="asd.html" x="50%"  y="50%"  width="10%"
> height="10%" z="6" xpath="page1" tIn="5000ms" tOut="9400ms"
> inEffect="fadein" outEffect="fadeout" inEffectDur="1000ms"
> outEffectDur="3000ms"/>
> 
>        <resource type="html" src="Indian.ogv" x="50%"  y="50%"  width="10%"
> height="10%" z="6" xpath="page2" tIn="5000ms" tOut="9400ms"
> inEffect="fadein" outEffect="fadeout" inEffectDur="1000ms"
> outEffectDur="3000ms"/>

Sounds like SMIL. I recommend looking into SMIL and SVG (which includes 
parts of SMIL).


On Fri, 13 May 2011, Philip Jägenstedt wrote:
>
> Problem:
> 
> <video src="video.webm"></video>
> ...
> <script>
> document.querySelector('video').oncanplay = function() {
>  /* will it run? */
> };
> </script>
> 
> In the above the canplay event can be replaced with many others, like 
> loadedmetadata and loadeddata. Whether or not the event handler has been 
> registered by the time the event is fired depends on how fast decoding 
> is, how fast the network is and how much "..." there is.

Yes, if you add an event listener in a task that runs after the task that 
fires the event could have run, you won't always catch the event.

That's just a bug in the JS.


On Fri, 13 May 2011, Henri Sivonen wrote:
> 
> <iframe src=foo.html></iframe>
> <script>
> document.querySelector('iframe').onload = function() {
>    /* will it run? */
> };
> </script>
> has the same problem. The solution is using the onload markup attribute
> that calls a function declared in an earlier <script>:
> 
> <script>
> function iframeLoaded() {
>   /* It will run! */
> }
> </script>
> <iframe src=foo.html onload=iframeLoaded()></iframe>

Exactly.


On Sat, 14 May 2011, Ojan Vafai wrote:
> 
> If someone proposed a workable solution, browsers would likely implement 
> it. I can't think of a backwards-compatible solution to this, so I agree 
> that developers just need to learn that this is a bad pattern. I 
> could imagine browsers logging a warning to the console in these cases, 
> but I worry that it would fire too much in today's web.

Indeed.


> It's unfortunate that you need to use an inline event handler instead of 
> one registered via addEventListener to avoid the race condition. 
> Exposing something to the platform like jquery's live event handlers ( 
> http://api.jquery.com/live/) could mitigate this problem in practice, 
> e.g. it would be just as easy or easier to register the event handler 
> before the element is created.

You can also work around it by setting src="" from script after you've 
used addEventListener, or by checking the state manually after you've 
added the handler and calling the handler if it is too late (though you 
have to be aware of the situation where the event is actually already 
scheduled and you added the listener between the time it was scheduled and 
the time it fired, so your function really has to be idempotent).
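
For what it's worth, a minimal sketch of that second workaround. The guard 
flag keeps the handler idempotent, so it is harmless if it ends up running 
both from the event and from the manual check:

  var video = document.querySelector('video');
  var initialised = false;

  function onCanPlay() {
    // Idempotent: the event may already have been queued when we check
    // readyState below, in which case this function is called twice.
    if (initialised) return;
    initialised = true;
    // ... set up custom controls, start playback, etc.
  }

  video.addEventListener('canplay', onCanPlay, false);

  // If the element has already reached HAVE_FUTURE_DATA, "canplay" has
  // already fired (or is about to), so call the handler ourselves.
  if (video.readyState >= video.HAVE_FUTURE_DATA)
    onCanPlay();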


On Sun, 15 May 2011, Olli Pettay wrote:
> 
> There is no need to use inline event handler.
> One can always add capturing listener to window for example.
> window.addEventListener("canplay",
>   function(e) {
>     if (e.target == document.querySelector('video')) {
>       // Do something.
>     }
>   }
> , true);
> And just do that before the <video> element occurs in the page.
> That is simple, IMHO.

Indeed, that is another option.


> (I wonder why the "Firing a simple event named e" defaults to 
> non-bubbling. It makes many things harder than they should be.)

The default is arbitrary and doesn't affect the platform (since I have 
to decide with each event whether to use the default or not). Changing the 
default would make no difference (I'd just have to go to every call site 
of the algorithm and switch it from "bubbles" to nothing and from nothing 
to "does not bubble").


On Sun, 15 May 2011, Glenn Maynard wrote:
> 
> If a MediaController is being used it's more complicated; there seems to 
> be no way to query the readyState of a MediaController (almost, but not 
> quite, the "most recently reported readiness state"), or to get a list 
> of slaved media elements from a MediaController without searching for 
> them by hand.

If you're scripting the MediaController, the assumption is that you 
created it, so there's no problem. The implied MediaControllers are for the 
declarative case where you don't need scripting at all.
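
For the scripted case, a rough sketch based on the MediaController API as 
it stood in the spec at the time; because the script creates the 
controller and slaves the elements itself, it can simply keep its own 
references to both:

  var controller = new MediaController();
  var video = document.querySelector('video');
  var audio = document.querySelector('audio');

  // Slave both elements to the controller we created, and remember them
  // ourselves instead of asking the controller for its slaved elements.
  video.controller = controller;
  audio.controller = controller;
  var slaved = [video, audio];

  controller.play(); // drive both elements in lockstep via the controller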


On Mon, 16 May 2011, Simon Pieters wrote:
> 
> The state can have changed before the event has actually fired, since 
> state changes are sync but the events are queued. So if the script 
> happens to run in between, then func is run twice.

That's true.


On Mon, 16 May 2011, Remy Sharp wrote:
> 
> Now you're right, whoever pointed out the 7am alarm example, if you 
> attach the event too late, then you'll miss the boat.  However, it's a 
> chicken an egg situation.  You don't have the DOM so you can't attach 
> the event handler, and if you do have the DOM, the damn event has fired 
> already.
> 
> What's the fix?  Well, the workarounds are certainly viable, again from 
> an everyman developer point of view:
> 
> 1) Attach higher up, on the window object and listen for the 
> canplay/loadedmetadata/etc and check the event.target
>
> 2) Attach an inline event handler (not nice, but will do)
> 
> The fix?  Since ultimately we have exactly the same potential "bug" with 
> image load events

Not just those, also iframes, own document navigation, sockets, XHR, 
anything that does asynchronous work, in fact.


> is to update the specification and make it clear that, depending on the 
> speed of the connection and decoding, the following "xyz" events can 
> fire **before** your script runs.  Therefore, here are a couple of 
> workarounds - or just be aware.

I don't really know where to put this that would actually help.


On Tue, 17 May 2011, Philip Jägenstedt wrote:
> 
> Still, I don't think just advocacy is any kind of solution. Given that 
> you (the co-author of an HTML5 book) make certain assumptions about the 
> outcome of this race condition, it's safe to assume that hordes of web 
> developers will do the same.
> 
> To target this specific pattern, one hypothetical solution would be to 
> special-case the first script that attaches event handlers to a <video> 
> element. After it has run, all events that were already fired before the 
> script are fired again. However, this seems awfully messy if the script 
> also observes readyState or networkState. It might also interfere with 
> browsers that use scripts behind the scenes to implement the native 
> controls.
> 
> Although a kludge, another solution might be to block events from being fired
> until x more bytes of the document have been parsed or it has finished
> loading.

On Wed, 18 May 2011, Robert O'Callahan wrote:
> 
> For certain kinds of events ("load", the video events, maybe more), 
> delay the firing of such events until, say, after DOMContentLoaded has 
> fired. If you're careful you might be able to make this a strict subset 
> of the behaviors currently allowed by the spec ... i.e. you're 
> pretending that your frame, image and video loads simply didn't complete 
> until after DOMContentLoaded fired in the outer page. That would mean 
> it's compatible with properly-written legacy content ... if there is 
> any.
> 
> Of course I have no idea whether that approach is actually feasible :-). 
> It obviously isn't compatible with what browsers currently do, so 
> authors wouldn't want to rely on it for a long time if ever.

These don't seem like workable solutions. We can't delay load events for 
every image on the Web, surely. Remembering every event that's ever fired 
for any <img> or <video> just in case a handler is later attached seems a 
bit intractable, too.

This has been a problem since JavaScript was added in the 90s. I find it 
hard to believe that we have to suddenly fix it now.


On Tue, 24 May 2011, Silvia Pfeiffer wrote:
> 
> Ian and I had a brief conversation recently where I mentioned a problem 
> with extended text descriptions with screen readers (and worse still 
> with braille devices) and the suggestion was that the "paused for user 
> interaction" state of a media element may be the solution. I would like 
> to pick this up and discuss in detail how that would work to confirm my 
> sketchy understanding.
> 
> *The use case:*
> 
> In the specification for media elements we have a <track> kind of
> "descriptions", which are:
> "Textual descriptions of the video component of the media resource,
> intended for audio synthesis when the visual component is unavailable
> (e.g. because the user is interacting with the application without a
> screen while driving, or because the user is blind). Synthesized as a
> separate audio track."
> 
> I'm for now assuming that the synthesis will be done through a screen
> reader and not through the browser itself, thus making the
> descriptions available to users as synthesized audio or as braille if
> the screen reader is set up for a braille device.
> 
> The textual descriptions are provided as chunks of text with a start
> and a end time (so-called "cues"). The cues are processed during video
> playback as the video's playback time starts to fall within the time
> frame of the cue. Thus, it is expected that the cues are consumed
> during the cue's time frame and are not present any more when the end
> time of the cue is reached, so they don't conflict with the video's
> normal audio.
> 
> However, on many occasions, it is not possible to consume the cue text
> in the given time frame. In particular not in the following
> situations:
> 
> 1. The screen reader takes longer to read out the cue text than the
> cue's time frame provides for. This is particularly the case with long
> cue text, but also when the screen reader's reading rate is slower
> than what the author of the cue text expected.
> 
> 2. The braille device is used for reading. Since reading braille is
> much slower than listening to read-out text, the cue time frame will
> invariably be too short.
> 
> 3. The user seeked right into the middle of a cue and thus the time
> frame that is available for reading out the cue text is shorter than
> the cue author calculated with.
> 
> Correct me if I'm wrong, but it seems that what we need is a way for
> the screen reader to pause the video element from continuing to play
> while the screen reader is still busy delivering the cue text. (In
> a11y talk: what is required is a means to deal with "extended
> descriptions", which extend the timeline of the video.) Once it's
> finished presenting, it can resume the video element's playback.

Is it a requirement that the user be able to use the regular video pause, 
play, rewind, etc, controls to seek inside the extended descriptions, or 
should they literally pause the video while playing, with the audio 
descriptions being controlled by the same UI as the screen reader?


> IIUC, a video is "paused for user interaction" basically when the UA has 
> decided to pause the video without the user asking to pause it (i.e. the 
> paused attribute is false) and the pausing happened not for network 
> buffering reasons, but for other reasons. IIUC one concrete situation 
> where this state is used is when the UA has reached the end of the 
> resource and is waiting for more data to come (e.g. on a live stream).

That latter state is not "paused for user interaction", it's just stalled 
due to lack of data. The rest is accurate though.


> To use "paused for user interaction" for extending descriptions, we need 
> to introduce a means for the screen reader to tell the UA to pause the 
> video when it reaches the end of the cue and it's still busy delivering 
> a cue's text. Then, as it finishes, it will un-pause the video to let it 
> continue playing.
> 
> To me it sounds like a feasible solution.
> 
> The screen reader could even provide a user setting and a short-cut so a 
> user can decide that they don't want this pausing to happen or that they 
> want to move on from the current cue.
> 
> Another advantage of this approach is that e.g. a deaf-blind user could 
> hook up their braille device such that it will deliver the extended 
> descriptions and also deliver captions through braille with such 
> extension pausing happening. (Not sure that such a user would even want 
> to play the video, but it would be possible.)
> 
> Now, I think there is one problem though (at least as far as I can 
> tell). Right now, IIUC, screen readers are only passive listeners on the 
> UA. They don't influence the behaviour of the UA. The accessibility API 
> is basically only a one-way street from the UA to the AT. I wonder if 
> that is a major inhibitor of using this approach or whether it's easy 
> for UAs to overcome this limitation? (Or if such a limitation even 
> exists - I don't know enough about how AT work...).
> 
> Is that an issue? Are there other issues that I have overlooked?

That seems to be entirely an implementation issue.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

