[whatwg] <track> / WebVTT issues

Wed Sep 21 11:12:40 PDT 2011

On 21/09/11 02:15 AM, Philip Jägenstedt wrote:
> Implementors of <track> / WebVTT from several browser vendors (Opera,
> Mozilla, Google, Apple) met at the Open Video Conference recently. There
> was a session on video accessibility,[1] a bunch of new bugs were filed
> [2] and there was much rejoicing.
>
> There were a few issues that weren't concrete enough to file bugs on,
> but which I think are still worthwhile discussing further:
>
> == Comments ==
>
> If you look at the source of the spec, you'll find comments as a v2
> feature request:
>
> COMMENT -->
> this is a comment, bla bla

I don't like the format either. I do think it's very important that we
have some mechanism for multi-line, file-level metadata, embedded CSS,
and so on, so the files can stand on their own.

The syntax section also suggests all metadata has to be on the signature 
line, while the parser will actually skip everything between the 
signature and the first double line terminator.
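
To make the mismatch concrete, here is a minimal sketch of the parser
behaviour as I described it above. It is not the spec's algorithm: the
function name is made up, the "Kind:" and "Language:" lines in the usage
note are hypothetical metadata, and BOM handling is left out.

```python
from typing import Tuple


def split_header(text: str) -> Tuple[str, str]:
    """Split a WebVTT file into (header, body) at the first blank line.

    Everything between the signature and the first double line
    terminator is treated as header and would be skipped by the parser.
    """
    # Normalise CRLF and lone CR to LF so one separator check is enough.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if not text.startswith("WEBVTT"):
        raise ValueError("missing WEBVTT signature")
    header, _sep, body = text.partition("\n\n")
    return header, body
```

Given "WEBVTT\nKind: captions\nLanguage: en\n\n00:01.000 --> ...", the
two extra lines end up in the header and never reach cue parsing, even
though the syntax section would not allow them there.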

For in-caption comments, <! comment> is a good idea. It's semantically a
bit odd not to mention it in the spec, since everything else has an end
tag, but the parser will ignore it, which is what we want.

> The parser is fairly strict in some regards:
>
> * must use exactly 2 digits for minutes and seconds
> * minutes and seconds must be <60

I'm not normally one for restrictions, but the parser also says the
(optional) hours field must have "two or more" digits, with no maximum
value specified.
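
For reference, the rules quoted above plus the hours rule amount to
roughly the following. This is only a sketch: the function name is made
up, and the exactly-three-digit fraction is my assumption, since it
isn't stated in this thread.

```python
import re
from typing import Optional

# Optional hours (two or more digits), then exactly two digits each for
# minutes and seconds. The three-digit fraction is an assumption.
_TIMESTAMP = re.compile(r"(?:([0-9]{2,}):)?([0-9]{2}):([0-9]{2})\.([0-9]{3})")


def parse_timestamp(s: str) -> Optional[int]:
    """Return the timestamp in milliseconds, or None if it is rejected."""
    m = _TIMESTAMP.fullmatch(s)
    if m is None:
        return None
    hours = int(m.group(1)) if m.group(1) is not None else 0
    minutes = int(m.group(2))
    seconds = int(m.group(3))
    millis = int(m.group(4))
    # Minutes and seconds must be below 60; hours have no upper bound.
    if minutes >= 60 or seconds >= 60:
        return None
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
```

With this, "02:00.000" is accepted, "2:00.000" and "00:60.000" are
rejected, and an hours field of any length goes through unchecked.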

If we all agree on an implementation limit, it could be helpful to
specify one. Storing milliseconds in an unsigned 32-bit type gives a
little over 1000 hours of timestamps (about 1193). Single-precision
float runs out of useful precision after about 50 hours. I'd suggest a
two or three digit limit on hours to avoid requiring a 64-bit type. If
we don't care about that, then 10 digits is a reasonable limit to avoid
running out of precision with doubles.
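
The integer part of that arithmetic is easy to check. A quick sketch
(the helper names are mine, and it only covers the 32-bit integer case,
not the float or double estimates):

```python
MS_PER_HOUR = 60 * 60 * 1000  # 3,600,000 ms


def hours_representable(bits: int, signed: bool = False) -> float:
    """Hours of millisecond timestamps an integer of this width can hold."""
    value_bits = bits - 1 if signed else bits
    return (2 ** value_bits) / MS_PER_HOUR


def max_ms_for_hour_digits(digits: int) -> int:
    """Largest timestamp in ms when the hours field has at most `digits` digits."""
    max_hours = 10 ** digits - 1
    # Add the largest minutes, seconds and fraction: 59:59.999
    return max_hours * MS_PER_HOUR + 59 * 60_000 + 59 * 1000 + 999
```

hours_representable(32) comes out just over 1193, and about 596 for a
signed type. A three-digit hours field tops out at 3,599,999,999 ms,
which fits an unsigned 32-bit value but not a signed one; two digits
fit either way.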

> A small percentage of cues (or cue text) will be dropped because of
> these constraints and this is not very likely to be noticed unless the
> entire video+captions are watched.

This is a very good point.

> 02:00.000 --> next
> Last Chapter
>
> Cues would be created with endTime = Infinity, and be modified to the
> startTime of the following cue (in source order) if there is a following
> cue. This would IMO be quite neat, but is the use case strong enough?

This would also nicely solve the latency issue with generating live 
captions. With both use cases together, I'd be in favour of this, but we 
have other issues to address before live VTT streams work in the <track> 
element. See https://www.w3.org/Bugs/Public/show_bug.cgi?id=14104
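
For the file-based case, the "next" behaviour quoted above could look
something like this. It's a sketch only: the Cue class and resolve_next
are hypothetical names, and I'm using None to stand for an end time
written as "next".

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cue:
    start: float
    end: Optional[float]  # None means the end time was written as "next"
    text: str


def resolve_next(cues: List[Cue]) -> List[Cue]:
    """Fill in open end times from the following cue in source order."""
    resolved = []
    for i, cue in enumerate(cues):
        if cue.end is not None:
            end = cue.end
        elif i + 1 < len(cues):
            # There is a following cue: end where it starts.
            end = cues[i + 1].start
        else:
            # No following cue: the end time stays at Infinity.
            end = math.inf
        resolved.append(Cue(cue.start, end, cue.text))
    return resolved
```

In a live stream the same rule would apply incrementally: a cue keeps
endTime = Infinity until the next cue arrives and closes it.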

  -r