[whatwg] WebVTT feedback (was Re: Video feedback)

Silvia Pfeiffer silviapfeiffer1 at gmail.com
Sat Jun 4 08:05:55 PDT 2011


Hi Ian, all,

I am very excited by the possibilities that Ian outlined for WebVTT
and how we can add V2 features.

I have some comments on the discussion below, but first I'd like to
point people to a piece of work that Ronny Mennerich from
LeanbackPlayer has recently undertaken (with a little of my help).
Ronny has created this Web page:
http://leanbackplayer.com/other/webvtt.html . It summarizes the WebVTT
file format and provides visual clarifications on how the cue settings
work.

I would like to point out that Ronny has done the drawings according
to how we understand the WebVTT / HTML spec, so we would appreciate
somebody checking whether they are correct.

I would also like to point out the issues that Ronny lists at the
bottom of that page and that we need to resolve. I've copied them here
for discussion and added some more detail:

* A:[start|middle|end]
 -- If the [subtitle box] and also the [subtitle text] are aligned by
the designer in a CSS file, which setting dominates, CSS or the cue
setting, for both the [subtitle box] and the [subtitle text]?

 -- As it is text alignment, for me it is alignment of the text
within the [subtitle text] element only, not also the alignment or
positioning of the [subtitle text] element in relation to the
[subtitle box]. However, Silvia reckons the anchoring of the box
changes with the alignment, so that it is possible to actually
middle-align the [subtitle box] with A:middle. We wonder which
understanding is correct (see the example cue after this list).


* T:[number]%
 -- If the [subtitle box] and also the [subtitle text] are aligned by
the designer in a CSS file, which setting dominates, CSS or the cue
setting, for both the [subtitle box] and the [subtitle text]?

-- What happens if "T" is used together with A:[start|middle|end]?


* S:[number]
 -- If S:[number] is used without "%" (percent), it is not clear
whether "px" or "em" is the unit for the text size.

 -- If using "em" as unit it has to be cleared how to set and
calculate the text size value! Because there is no real value, only
integer, for [number] we can not make S:1.2 so we need a note for it
like e.g. S:120 is an example value, than the text size has to be
"text-size: (120/100)em;"
If using "px" as unit it is easy, no calculation needed, [number]
could be the new text size! If e.g. S:12 is an example value, than the
text size has to be "text-size: 12px;"

* cue voice tag
 -- Why are we not using a voice name declaration like the one in the
cue class tags, with a dot separator and no spaces, i.e.
<v.VoiceName>voice text</v> instead of <v VoiceName>? This would
avoid errors by .vtt file writers and would also be much clearer to
implement.
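
To make these cue-setting questions concrete, here is the kind of cue
Ronny has been drawing against (the values are purely illustrative and
reflect our reading of the current cue-setting syntax):

  00:00:01.000 --> 00:00:04.000 A:middle T:50% S:80
  A cue aligned with A:middle and positioned with T:50%;
  whether S:80 means a percentage, px or em is exactly the
  S: question raised above.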

Please keep Ronny in the CC when you answer, because he is not
subscribed to the list.


Now to my feedback on the WebVTT discussion in Ian's Video feedback email:

On Fri, Jun 3, 2011 at 9:28 AM, Ian Hickson <ian at hixie.ch> wrote:
> On Mon, 3 Jan 2011, Philip Jägenstedt wrote:
>> >
>> > + I've added a magic string that is required on the format to make it
>> >   recognisable in environments with no or unreliable type labeling.
>>
>> Is there a reason it's "WEBVTT FILE" instead of just "WEBVTT"? "FILE"
>> seems redundant and like unnecessary typing to me.
>
> It seemed more likely that non-WebVTT files would start with a line that
> said just "WEBVTT" than a line that said just "WEBVTT FILE". But I guess
> "WEBVTT FILE FORMAT" is just as likely and it'll be caught.
>
> I've changed it to just "WEBVTT"; there may be existing implementations
> that only accept "WEBVTT FILE" so for now I recommend that authors still
> use the longer header.

I'll tweet the changes to help spread the news. I like it this short. :-)


>> > On Wed, 8 Sep 2010, Philip Jägenstedt wrote:
>> > >
>> > > In the discussion on public-html-a11y <trackgroup> was suggested to
>> > > group together mutually exclusive tracks, so that enabling one
>> > > automatically disables the others in the same trackgroup.
>> > >
>> > > I guess it's up to the UA how to enable and disable <track>s now,
>> > > but the only option is making them all mutually exclusive (as
>> > > existing players do) or a weird kind of context menu where it's
>> > > possible to enable and disable tracks completely independently.
>> > > Neither option is great, but as a user I would almost certainly
>> > > prefer all tracks being mutually exclusive and requiring scripts to
>> > > enable several at once.
>> >
>> > It's not clear to me what the use case is for having multiple groups
>> > of mutually exclusive tracks.
>> >
>> > The intent of the spec as written was that a browser would by default
>> > just have a list of all the subtitle and caption tracks (the latter
>> > with suitable icons next to them, e.g. the [CC] icon in US locales),
>> > and the user would pick one (or none) from the list. One could easily
>> > imagine a UA allowing the user to enable multiple tracks by having the
>> > user ctrl-click a menu item, though, or some similar solution, much
>> > like with the commonly seen select box UI.
>>
>> In the vast majority of cases, all tracks are intended to be mutually
>> exclusive, such as English+English HoH or subtitles in different
>> languages. No media player UI (hardware or software) that I have ever
>> used allows enabling multiple tracks at once. Without any kind of hint
>> about which tracks make sense to enable together, I can't see desktop
>> Opera allowing multiple tracks (of the same kind) to be enabled via the
>> main UI.
>
> Personally I think it's quite reasonable to want to see two languages at
> once, or even two forms of the same language at once, especially for,
> e.g., reviewing subtitles. But I don't think it would be a bad thing if
> some browsers didn't expose that in the UI; that's something that could
> be left to bookmarklets, for example.

I can particularly imagine people running the original captions in
e.g. English, and having a subtitle file in a language that they are
interested in (because they are learning it or it's their mother
tongue). But I agree that this is below the 80% mark and can be done with
custom UI.


>> Using this syntax, I would expect some confusion when you omit the closing
>> </v>, when it's *not* a cue spoken by two voices at the same time, such as:
>>
>> <v Jim>- Boo!
>> <v Bob>- Gah!
>>
>> Gah! is spoken by both Jim and Bob, but that was likely not intended. If
>> this causes confusion, we should make validators warn about multiple
>> voices with no closing </v>.
>
> No need to just warn, the spec says the above is outright invalid, so
> they would raise an error.

It would still need parsing. Do we expect it to result in
<v Jim>- Boo!</v>
<v Bob>- Gah!
or
<v Jim>- Boo!
<v Jim, Bob>- Gah!
?

Also, that raises a question: if the "annotation" in the <v> element
concerns multiple people, how do we specify that? Are we leaving this
completely to author preference or do we want it to be machine
parsable?


>> > > For captions and subtitles it's less common, but rendering it
>> > > underneath the video rather than on top of it is not uncommon, e.g.
>> > > http://nihseniorhealth.gov/video/promo_qt300.html or
>> >
>> > Conceptually, that's in the video area, it's just that the video isn't
>> > centered vertically. I suppose we could allow UAs to do that pretty
>> > easily, if it's commonly desired.
>>
>> It's already possible to align the video to the top of its content box
>> using <http://dev.w3.org/csswg/css3-images/#object-position>:
>>
>> video { object-position: center top }
>>
>> (This is already supported in Opera, but prefixed: -o-object-position)
>
> Sounds good.

I know that people have also asked to move the text actually off the
video and onto other areas of the Web page. This will have to be done
with script at the moment, which is probably ok, in particular since
you can retrieve the cue text as HTML.


>> Note that in Sweden captioning for the HoH is delivered via the teletext
>> system, which would allow ASCII-art to be displayed. Still, I've never
>> seen it. The only case of graphics being used in "subtitles" I can
>> remember ever seeing is the DVD of
>> <http://en.wikipedia.org/wiki/Cat_Soup>, where the subtitle system is
>> (ab)used to overlay some graphics.
>
> Yeah, I'm not at all concerned about not supporting graphics in subtitles.
> It's nowhere near the 80% bar.

That's what we have the metadata track type for, IIUC.


>> If we ever want comments, we need to add support in the parser before
>> any content accidentally uses the syntax, in other words pretty soon
>> now.
>
> No, we can use any syntax that the parser currently ignores. It won't
> break backwards compat with content that already uses it by then, since
> the whole point of comments is to be ignored. The only difference is
> whether validators complain or not.
>
>
>> > On Fri, 22 Oct 2010, Simon Pieters wrote:
>> > > >
>> > > > It can still be inspired by it though so we don't have to change
>> > > > much. I'd be curious to hear what other things you'd clean up
>> > > > given the chance.
>> > >
>> > > WebSRT has a number of quirks to be compatible with SRT, like
>> > > supporting both comma and dot as decimal separators, the weird
>> > > parsing of timestamps, etc.
>> >
>> > I've cleaned the timestamp parsing up. I didn't see others.
>>
>> I consider the cue id line (the line preceding the timing line) to be
>> cruft carried over from SRT. When we now both have classes and the
>> possibility of getting a cue by index, so why do we need it?
>
> It's optional, but it is useful, especially for metadata tracks, as a way
> to grab specific cues. For example, consider a metadata or chapter track
> that contains cues with specific IDs that the site would use to jump to
> particular parts of the video in response to key presses, such as "start
> of content after intro", or maybe for a podcast with different segments,
> where the user can jump to "news" and "reviews" and "final thought" -- you
> need an ID to be able to find the right cue quickly.

We even have a media fragment URI addressing approach for this:
http://www.w3.org/TR/media-frags/#naming-name
The group recently renamed #id to #chapter.
It would still link based on the ID of a cue, such as
http://example.com/video.ogv#chapter=news
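
For example, a chapter track whose cue identifiers double as fragment
names could look like this (the IDs and times are made up for
illustration):

  WEBVTT

  intro
  00:00:00.000 --> 00:01:30.000
  Introduction

  news
  00:01:30.000 --> 00:10:00.000
  News

  reviews
  00:10:00.000 --> 00:20:00.000
  Reviews

http://example.com/video.ogv#chapter=news would then land on the
second cue.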


>> > > There was also some discussion about metadata. Language is sometimes
>> > > necessary for the font engine to pick the right glyph.
>> >
>> > Could you elaborate on this? My assumption was that we'd just use CSS,
>> > which doesn't rely on language for this.
>>
>> It's not in any spec that I'm aware of, but some browsers (including
>> Opera) pick different glyphs depending on the language of the text,
>> which really helps when rendering CJK when you have several CJK fonts on
>> the system. Browsers will already know the language from <track
>> srclang>, so this would be for external players.
>
> How is this problem solved in SRT players today?

SRT players don't deal with it at all. They often have problems with fonts.


> On Mon, 14 Feb 2011, Philip Jägenstedt wrote:
>>
>> Given that most existing subtitle formats don't have any language
>> metadata, I'm a bit skeptical. However, if implementors of non-browser
>> players want to implement WebVTT and ask for this I won't stand in the
>> way (not that I could if I wanted to). For simplicity, I'd prefer the
>> language metadata from the file to not have any effect on browsers
>> though, even if no language is given on <track>.
>
> Indeed.

I'd say that's fine. I'd still like this information for applications
other than Web browsers. For example, VLC displays "Elephant's Dream"
with 24 subtitle tracks in SRT as a list of track1, track2, track3,
etc., rather than displaying the language that the tracks are in,
which I find rather unsatisfactory.


> On Tue, 4 Jan 2011, Alex Bishop wrote:
>>
>> Firefox too. If you visit
>> http://people.mozilla.org/~jdaggett/webfonts/serbianglyphs.html in
>> Firefox 4, the text explicitly marked-up as being Serbian Cyrillic
>> (using the lang="sr-Cyrl" attribute) uses some different glyphs to the
>> text with no language metadata.
>
> This seems to be in violation of CSS; we should probably fix it there
> before fixing it in WebVTT since WebVTT relies on CSS.

Only when used within browsers...


> On Mon, 3 Jan 2011, Philip Jägenstedt wrote:
>>
>> > > * The "bad cue" handling is stricter than it should be. After
>> > > collecting an id, the next line must be a timestamp line. Otherwise,
>> > > we skip everything until a blank line, so in the following the
>> > > parser would jump to "bad cue" on line "2" and skip the whole cue.
>> > >
>> > > 1
>> > > 2
>> > > 00:00:00.000 --> 00:00:01.000
>> > > Bla
>> > >
>> > > This doesn't match what most existing SRT parsers do, as they simply
>> > > look for timing lines and ignore everything else. If we really need
>> > > to collect the id instead of ignoring it like everyone else, this
>> > > should be more robust, so that a valid timing line always begins a
>> > > new cue. Personally, I'd prefer if it is simply ignored and that we
>> > > use some form of in-cue markup for styling hooks.
>> >
>> > The IDs are useful for referencing cues from script, so I haven't
>> > removed them. I've also left the parsing as is for when neither the
>> > first nor second line is a timing line, since that gives us a lot of
>> > headroom for future extensions (we can do anything so long as the
>> > second line doesn't start with a timestamp and "-->" and another
>> > timestamp).
>>
>> In the case of feeding future extensions to current parsers, it's way
>> better fallback behavior to simply ignore the unrecognized second line
>> than to discard the entire cue. The current behavior seems unnecessarily
>> strict and makes the parser more complicated than it needs to be. My
>> preference is just ignore anything preceding the timing line, but even
>> if we must have IDs it can still be made simpler and more robust than
>> what is currently spec'ed.
>
> If we just ignore content until we hit a line that happens to look like a
> timing line, then we are much more constrained in what we can do in the
> future. For example, we couldn't introduce a "comment block" syntax, since
> any comment containing a timing line wouldn't be ignored. On the other
> hand if we keep the syntax as it is now, we can introduce a comment block
> just by having its first line include a "-->" but not have it match the
> timestamp syntax, e.g. by having it be "--> COMMENT" or some such.
>
> Looking at the parser more closely, I don't really see how doing anything
> more complex than skipping the block entirely would be simpler than what
> we have now, anyway.

Yes, I think that can work. A line with "-->" but without time
markers is currently ignored, so we can introduce something based on
it for special content like comments, styles and defaults.


> On Mon, 3 Jan 2011, Glenn Maynard wrote:
>>
>> By the way, the WebSRT hit from Google
>> (http://www.whatwg.org/specs/web-apps/current-work/websrt.html) is 404.
>> I've had to read it out of the Google cache, since I'm not sure where it
>> went.
>
> I added a redirect.
>
>
>> Inline comments (not just line comments) in subtitles are very important
>> for collaborative editing: for leaving notes about a translation, noting
>> where editing is needed or why a change was made, and so on.
>>
>> If a DOM-like interface is specified for this (presumably this will
>> happen later), being able to access inline comments like DOM comment
>> nodes would be very useful for visual editors, to allow displaying
>> comments and to support features like "seek to next comment".
>
> We can add comments pretty easily (e.g. we could say that "<!" starts a
> comment and ">" ends it -- that's already being ignored by the current
> parser), if people really need them. But are comments really that useful?
> Did SRT have problem due to not supporting inline comments? (Or did it
> support inline comments?)

Works for me, though I don't really like that it is similar to HTML
comments but different.


> On Tue, 4 Jan 2011, Glenn Maynard wrote:
>> On Tue, Jan 4, 2011 at 4:24 AM, Philip Jägenstedt <philipj at opera.com>
>> wrote:
>> > If you need an intermediary format while editing, you can just use any
>> > syntax you like and have the editor treat it specially.
>>
>> If I'd need to write my own parser to write an editor for it, that's one
>> thing--but I hope I wouldn't need to create yet another ad hoc caption
>> format, mirroring the features of this one, just to work around a lack
>> of inline comments.
>
> An editor would need a custom parser anyway to make sure it round-tripped
> syntax errors, presumably.
>
>
>> The cue text already vaguely resembles HTML.  What about <!-- comments
>> -->?  It's universally understood, and doesn't require any new escape
>> mechanisms.
>
> The current parser would end a comment at the first ">", but so long as
> you didn't have a ">" in the comment, "<!--...-->" would work fine within
> cue text. (We would have to be careful in standalone blocks to define it
> in such a way that it could not be confused with a timing line.)

Yeah, it's that part that I don't really like.


> On Wed, 5 Jan 2011, Philip Jägenstedt wrote:
>>
>> The question is rather if the comments should be exposed as DOM comment
>> nodes in getCueAsHTML, which seems to be what you're asking for. That
>> would only be possible if comments were only allowed inside the cue
>> text, which means that you couldn't comment out entire cues, as such:
>>
>> 00:00.000 --> 00:01.000
>> one
>>
>> /*
>> 00:02.000 --> 00:03.000
>> two
>> */
>>
>> 00:04.000 --> 00:05.000
>> three
>>
>> Therefore, my thinking is that comments should be removed during parsing
>> and not be exposed to any layer above it.
>
> We can support both, if there's really demand for it.
>
> For example:
>
>  00:00.000 --> 00:01.000
>  one <! inline comment > one
>
>  COMMENT-->
>  00:02.000 --> 00:03.000
>  two; this is entirely
>  commented out
>
>  <! this is the ID line
>  00:04.000 --> 00:05.000
>  three; last line is a ">"
>  which is part of the cue
>  and is not a comment.
>  >
>
> The above would work today in a conforming UA. The question really is what
> parts of this do we want to support and what do we not care enough about.

I think both of these would be good to have, as they serve different uses.


> On Fri, 14 Jan 2011, Silvia Pfeiffer wrote:
>>
>> We are concerned, however, about the introduction of WebVTT as a
>> universal captioning format *when used outside browsers*. Since a subset
>> of CSS features is required to bring HTML5 video captions on par with TV
>> captions, non-browser applications will need to support these CSS
>> features, too. However, we do not believe that external CSS files are an
>> acceptable solution for non-browser captioning and therefore think that
>> those CSS features (see [1]) should eventually be made part of the
>> WebVTT specification.
>>
>> [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#the-'::cue'-pseudo-element
>
> I'm not sure what you mean by "made part of the WebVTT specification", but
> if you mean that WebVTT should support inline CSS, that does seem like
> something we can add, e.g. using syntax like this:
>
>   WEBVTT
>
>   STYLE-->
>   ::cue(v[voice=Bob]) { color: green; }
>   ::cue(c.narration) { font-style: italic; }
>   ::cue(c.narration i) { font-style: normal; }

Yup, that's exactly what we need.


>   00:00.000 --> 00:02.000
>   Welcome.
>
>   00:02.500 --> 00:05.000
>   To WebVTT.
>
> I suggest we wait until WebVTT and '::cue' in particular have shipped in
> at least one browser and been demonstrated as being useful before adding
> this kind of feature though.

Fair enough.


>> 1. Introduce file-wide metadata
>>
>> WebVTT requires a structure to add header-style metadata. We are here
>> talking about lists of name-value pairs as typically in use for header
>> information. The metadata can be optional, but we need a defined means
>> of adding them.
>>
>> Required attributes in WebVTT files should be the main language in use
>> and the kind of data found in the WebVTT file - information that is
>> currently provided in the <track> element by the @srclang and @kind
>> attributes. These are necessary to allow the files to be interpreted
>> correctly by non-browser applications, for transcoding or to determine
>> if a file was created as a caption file or something else, in particular
>> the @kind=metadata. @srclang also sets the base directionality for BiDi
>> calculations.
>>
>> Further metadata fields that are typically used by authors to keep
>> specific authoring information or usage hints are necessary, too. As
>> examples of current use see the format of MPlayer mpsub’s header
>> metadata [2], EBU STL’s General Subtitle Information block [3], and
>> even CEA-608’s Extended Data Service with its StartDate, Station,
>> Program, Category and TVRating information [4]. Rather than specifying a
>> specific subset of potential fields we recommend to just have the means
>> to provide name-value pairs and leave it to the negotiation between the
>> author and the publisher which fields they expect of each other.
>>
>> [2] http://www.mplayerhq.hu/DOCS/tech/mpsub.sub
>> [3] https://docs.google.com/viewer?a=v&q=cache:UKnzJubrIh8J:tech.ebu.ch/docs/tech/tech3264.pdf
>> [4] http://edocket.access.gpo.gov/cfr_2007/octqtr/pdf/47cfr15.119.pdf
>
> I don't understand the use cases here.
>
> CSS and JS don't have anything like this, why should WebVTT? What problem
> is this solving? How did SRT solve this problem?

SRT doesn't solve it. That's why it's not being used by professionals
for subtitling. Most other subtitling formats, however, have a means
of including metadata; even formats like LRC for music lyrics do. CSS
and JS don't have metadata, but HTML does, through the meta tag.


>> 2. Introduce file-wide cue settings
>>
>> At the moment if authors want to change the default display of cues,
>> they can only set them per cue (with the D:, S:, L:, A: and T:. cue
>> settings) or have to use an external CSS file through a HTML page with
>> the ::cue pseudo-element. In particular when considering that all
>> Asian language files would require a “D:vertical” marker, it becomes
>> obvious that this replication of information in every cue is
>> inefficient and a waste of bandwidth, storage, and application speed.
>> A cue setting default section should be introduced into a file
>> header/setup area of WebVTT which will avoid such replication.
>>
>> An example document with cue setting defaults in the header could look
>> as follows:
>> ==
>> WEBVTT
>> Language=zh
>> Kind=Caption
>> CueSettings= A:end D:vertical
>>
>> 00:00:15.000 --> 00:00:17.950
>> 在左边我们可以看到...
>>
>> 00:00:18.160 --> 00:00:20.080
>> 在右边我们可以看到...
>>
>> 00:00:20.110 --> 00:00:21.960
>> ...捕蝇草械.
>> ==
>>
>> Note that you might consider that the solution to this problem is to use
>> external CSS to specify a change to all cues. However, this is not
>> acceptable for non-browser applications and therefore not an acceptable
>> solution to this problem.
>
> Adding defaults seems like a reasonable feature. We could add this just by
> adding the ability to have a block in a VTT file like this:
>
>   WEBVTT
>
>   DEFAULTS --> D:vertical A:end
>
>   00:00.000 --> 00:02.000
>   This is vertical and end-aligned.
>
>   00:02.500 --> 00:05.000
>   As is this.
>
>   DEFAULTS --> A:start
>
>   00:05.500 --> 00:07.000
>   This is horizontal and start-aligned.
>
> However, again I suggest that we wait until WebVTT has been deployed in at
> least one browser before adding more features like this.

This is a good idea. Happy to wait, though implementations are now
starting to emerge, and these DEFAULTS in particular will be very
useful for reducing repetition in authoring from the start.


>> * positioning: Generally the way in which we need positioning to work is
>> to provide an anchor position for the text and then explain in which
>> direction font size changes and the addition of more text allows the
>> text segment to grow. It seems that the line position cue (L) provides a
>> baseline position and the alignment cue (A) provides the growing
>> direction start/middle/end. Can we just confirm this understanding?
>
> It's more the other way around: the line boxes are laid out and then the
> resulting line boxes are positioned according to the A: and L: lines. In
> particular, the L: lines when given with a % character position the line
> boxes in the same manner that CSS background-position positions the
> background image, and L: lines without a % character set the position of
> the line boxes based on the height of the first line box. A: lines then
> just set the position of these line boxes relative to the other dimension.

There are examples on Ronny's page and it would be good to check
whether they were done correctly, so that we can come to a common
understanding and perhaps even include such examples in the spec to
clarify the meaning of the cue settings.

There are questions in particular about what L:100% and T:100% mean:
do they position the boxes outside the video viewport?
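
To make that question concrete, the cases where our reading is least
certain look like this (the values are purely illustrative):

  00:00:05.000 --> 00:00:08.000 L:100%
  Is this line box flush with the bottom edge of the viewport,
  or does it end up partly outside it?

  00:00:08.500 --> 00:00:11.000 T:100% A:end
  And where exactly does T:100% put an end-aligned cue?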


>> * fontsize: When changing text size in relation to the video changing
>> size or resolution, we need to make sure not to reduce the text size
>> below a specific font size for readability reasons. And we also need to
>> make sure not to make it larger than a specific font size, since
>> otherwise it will dominate the display. We usually want the text to be
>> at least Xpx, but no bigger than Ypx. Also, one needs to pay attention
>> to the effect that significant player size changes have on relative
>> positioning - in particular for the minimum caption text size. Dealing
>> with min and max sizes is missing from the current specification in our
>> understanding.
>
> That's a CSS implementation issue. Minimum font sizes are commonly
> supported in CSS implementations. Maximum font sizes would be similar.


OK, it seems we can get it with CSS, though we cannot specify it with the "S" cue setting.

Incidentally: would it make sense to have a pixel-based (or em-based)
font size specification for "S" as well as the percentage-based one?


>> * bidi text: In our experience from YouTube, we regularly see captions
>> that contain mixed languages/directionality, such as Hebrew captions
>> that have a word of English in it. How do we allow for bidi text inside
>> cues? How do we change directionality mid-cue? Do we deal with the
>> zero-width LTR-mark and RTL-mark unicode characters? It would be good to
>> explain how these issues are dealt with in WebVTT.
>
> There's nothing special about how they work in WebVTT; they are handled
> the same as in CSS.

Except that CSS has a directionality property. So we are basically
restricted to the Unicode directionality characters. These are
probably sufficient.


>> * internationalisation: D:vertical and D:vertical-lr seem to only work
>> for vertical text - how about horizontal-rl? For example, Hebrew is a
>> prime example of a language being written from right to left
>> horizontally. Is that supported and how?
>
> What exactly would horizontal-rl do?

I was under the impression that we need a @dir attribute to provide
rtl text. But I am now aware of the Unicode characters for
directionality, and they are indeed better to support than more markup.


>> * naming: The usage of single letter abbreviations for cue settings has
>> created quite a discussion here at Google. We all agree that file-wide
>> cue settings are required and that this will reduce the need for
>> cue-specific cue settings. We can thus afford a bit more readability in
>> the cue settings. We therefore believe that it would be better if the
>> cue settings were short names rather than single letter codes. This
>> would be more like CSS, too, and easier to learn and get right. In the
>> interface description, the 5 dimensions have proper names which could be
>> re-used (“direction”, “linePosition”, “textPosition”, “size” and
>> “align"). We therefore recommend replacing the single-letter cue
>> commands with these longer names.
>
> That would massively bloat these files and make editing them a huge pain,
> as far as I can tell. I agree that defaults would make it better, but many
> cues would still need their own positioning and sizing information, and
> anything beyond a very few letters would IMHO quickly become far too
> verbose for most people. "L", "A", and "S" are pretty mnemonic, "T" would
> quickly become familiar to people writing cues, and "D" is only going to
> be relevant to some authors but for those authors it's pretty
> self-explanatory as well, since the value is verbose.

It took me about 6 months to get used to them for authoring subtitle
files, but I have indeed grown accustomed to them and can deal with them now.


> What I really would like to do is use "X" and "Y" instead of "T" and "L",
> but those terms would be very confusing when we flip the direction, which
> is why I used the less obvious "T" and "L".

Yeah, don't change them again.


>> * textcolor: In particular on European TV it is common to distinguish
>> between speakers by giving their speech different colors. The following
>> colors are supported by EBU STL, CEA-608 and CEA-708 and should be
>> supported in WebVTT without the use of external CSS: black, red, green,
>> yellow, blue, magenta, cyan, and white. As default we recommend white on
>> a grey transparent background.
>
> This is supported as 'color' and 'background'.

OK.


>> * underline: EBU STL, CEA-608 and CEA-708 support underlining of
>> characters.
>
> I've added support for 'text-decoration'.

And for <u>. I am happy now, thanks. :-)


>> We have a couple of recommendations for changes mostly for aesthetic and
>> efficiency reasons. We would like to point out that Google is very
>> concerned with the dense specification of data and every surplus
>> character, in particular if it is repeated a lot and doesn’t fulfill a
>> need, should be removed to reduce the load created on worldwide
>> networking and storage infrastructures and help render Web pages faster.
>
> This seems to contradict your earlier request to make the language more
> verbose...

Yeah, we had that discussion at Google, too, but the frequency of
"-->" is much higher than the frequency of cue settings, in particular
once we remove the need for duplication.

I personally think we need something fairly unique to identify a line
of start/end time markers. "-->", while not pretty and somewhat in
conflict with HTML comments, can work.


>> * Duration specification: WebVTT time stamps are always absolute time
>> stamps calculated in relation to the base time of synchronisation with
>> the media resource. While this is simple to deal with for machines, it
>> is much easier for hand-created captions to deal with relative time
>> stamps for cue end times and for the timestamp markers within cues. Cue
>> start times should continue to stay absolute time stamps. Timestamp
>> markers within cues should be relative to the cue start time. Cue end
>> times should be possible to be specified either as absolute or relative
>> timestamps. The relative time stamps could be specified through a prefix
>> of “+” in front of a “ss.mmm” second and millisecond specification.
>> These are not only simpler to read and author, but are also more compact
>> and therefore create smaller files.
>
> I think if anything is absolute, it doesn't really make anything much
> simpler for anything else to be relative, to be honest. Take the example
> you give here:
>
>> An example document with relative timestamps is:
>> ==
>> WEBVTT
>> Language=en
>> Kind=Subtitle
>>
>> 00:00:15.000   +2.950
>> At the left we can see...
>>
>> 00:00:18.160    +1.920
>> At the right we can see the...
>>
>> 00:00:20.110   +1.850
>> ...the <+0.400>head-<+0.800>snarlers
>> ==
>
> If the author were to change the first time stamp because the video gained
> a 30 second advertisement at the start, then he would still need to change
> the hundreds of subsequent timestamps for all the additional cues. What
> does the author gain from not having to change the relative stamps? It's
> not like he's going to be doing it by hand, and once a tool is involved,
> the tool can change everything just as easily.

Much will continue to be hand-coded. Also, the duration becomes
visible. And many timecode specifications in formats other than SRT
use such relative timestamps, so allowing them would make transcoding
easier.

But we can leave this for now: having multiple timestamp formats does
indeed make parsing harder.


>> We are happy to see the introduction of the magic file identifier for
>> WebVTT which will make it easier to identify the file format. We do not
>> believe the “FILE” part of the string is necessary.
>
> I have removed it.

Thanks. You should also remove the text ", or the seventh character is
neither a U+0020 SPACE character nor a U+0009 CHARACTER TABULATION
(tab) character," from step 7 of the parsing, since such a seventh
character does not need to exist at all.


>> However, we recommend to also introduce a format version number that the
>> file adheres to, e.g. “WEBVTT 0.7”.
>
> Version numbers are an antipattern on the Web, so I have not added one.

We can have it underneath the file magic on another line of the
header, where the metadata will be (or will be once we make V2 of the
format ;-), so that's fine.


>> It can also help identify proprietary standard metadata sets as used by
>> a specific company, such as “WEBVTT 0.7 ABC-meta1” which could signify
>> that the file adheres to WEBVTT 0.7 format specification with the
>> ABC-meta1 metadata schema.
>
> If we add metadata, then that can be handled just by having the metadata
> include that information itself.

Yes, that's right and it's also a good enough solution.


> I've adjusted the text in the spec to more clearly require that
> line-breaking follow normal CSS rules but with the additional requirement
> that there not be overflow, which is what I had intended.

Ah, ok. Thanks for clarifying.


>> 1. Pop-on/paint-on/roll-up support
>>
[..]
>> For roll-up captions, individual lines of captions are presented
>> successively with older lines moving up a line to make space for new
>> lines underneath. Assuming we understand the WebVTT rendering rules
>> correctly, WebVTT would identify each of these lines as an individual,
>> but time-overlapping cue with the other cues. As more cues are created
>> and overlap in time, newer cues are added below the currently visible
>> ones and move the currently visible ones up, basically creating a
>> roll-up effect. If this is a correct understanding, then this is an
>> acceptable means of supporting roll-up captions.
>
> I am not aware of anything currently in the WebVTT specification which
> will cause a cue to move after it has been placed on the video, so I do
> not believe this is a correct understanding.

Hmm, this is a problem, because text generally grows by adding more
lines underneath a given one. If we don't allow captions to change
position, then we cannot move text up when another cue is added in
exactly the same position (typically the bottom center of the video
viewport). Roll-up is very typical on TV, in particular for everything
that is live captioned. Marking it up with text repetitions in
subsequent cues is a very bad hack for trying to get roll-up
functionality, like this:

00:00.000 --> 00:01.000
one

00:01.000 --> 00:03.000
one
two

00:03.000 --> 00:05.000
two
three

So I'd prefer if we specified it such that a cue can move out of the
way if another cue is rendered in the same location: up for horizontal
rendering, left for vertical rendering, and right for vertical-lr
rendering. If it drops off the top or sides of the viewport by doing
so, well, that's an authoring error.
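
With that behaviour, the same roll-up could be authored without
repeating text, simply as overlapping cues, something like:

00:00.000 --> 00:03.000
one

00:01.000 --> 00:05.000
two

00:03.000 --> 00:07.000
three

Here "two" would push "one" up at 00:01.000, and "three" would push
"two" up at 00:03.000.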


> However, you can always have a cue be replaced by a cue with the same text
> but on a higher line, if you're willing to do some preprocessing on the
> subtitle file. It won't be a smoothly animated scroll, but it would work.
>
> If there is convincing evidence that this kind of subtitle is used on the
> Web, though, we can support it more natively. So far I've only seen it in
> legacy scenarios that do not really map to expected WebVTT use cases.

It's been used in live Web captioning scenarios. For example, the live
captioning that Google used at Google I/O was provided by StreamText,
and their captions are provided as paint-on with text growing
underneath. See http://streamtext.net/demos/video-captions.aspx for an
example. I would guess that many other services do it in a similar
fashion.


> For supporting those legacy scenarios, you need script anyway (to handle,
> e.g., backspace and moving the cursor). If you have script, doing
> scrolling is possible either by moving the cue, or by not using the
> default UA rendering of the cues at all and doing it manually (e.g. using
> <div>s or <canvas>).

Complex editing effects such as backspace, cursor display and moving
the cursor around are not used in live webcast captioning scenarios,
so I am not worried about replicating them. But adding caption cues
underneath existing displayed cues and moving the existing cues up is
important. Further, it would make sense to limit the amount of
scrolling, such that a previously moved-up caption cue is only moved
up by a specified number of lines (e.g. max 4 lines).


>> Finally, for paint-on captions, individual letters or words are
>> displayed successively on screen. WebVTT supports this functionality
>> with the cue timestamps <xx:xx:xx.xxx>, which allows to specify
>> characters or words to appear with a delay from within a cue. This
>> essentially realizes paint-on captions. Is this correct?
>
> Yes.
>
>
>> (Note that we suggest using relative timestamps inside cues to make this
>> feature more usable.)
>
> It makes it modestly easier to do by hand, but hand-authoring a "paint-on"
> style caption seems like a world of pain regardless of the timestamp
> format we end up using, so I'm not sure it's a good argument for
> complicating the syntax with a second timestamp format.

Fair enough. Though I think we may want to add this feature in the future.
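
For reference, the paint-on example from my earlier mail written with
the absolute inner timestamps that the current spec requires would
read something like:

00:00:20.110 --> 00:00:21.960
...the <00:00:20.510>head-<00:00:20.910>snarlers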


>> As far as we understand, you can currently address all cues through
>> ::cue and you can address a cue part through ::cue-part(<voice> ||
>> <part> || <position> || <future-compatibility>). However, if we
>> understand correctly, it doesn’t seem to be possible to address an
>> individual cue through CSS, even though cues have individual
>> identifiers. This is either an oversight or a misunderstanding on our
>> parts. Can you please clarify how it is possible to address an
>> individual cue through CSS?
>
> I've made the ID referencable from the ::cue() selector argument as an ID
> on the anonymous root element.

Excellent, thanks.
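
So, assuming the cue ID maps straight through to the selector
argument, something like the following should now address a single
cue (the ID "news" is just an example):

  ::cue(#news) { color: yellow; }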


>> Our experience with automated caption creation and positioning on
>> YouTube indicates that it is almost impossible to always place the
>> captions out of the way of where a user may be interested to look at. We
>> therefore allow users to dynamically move the caption rendering area to
>> a different viewport position to reveal what is underneath. We recommend
>> such drag-and-drop functionality also be made available for TimedTrack
>> captions on the Web, especially when no specific positioning information
>> is provided.
>
> I've added text to explicitly allow this.

OK.


> On Sun, 23 Jan 2011, Glenn Maynard wrote:
>>
>> It should be possible to specify language per-cue, or better, per block
>> of text mid-cue.  Subtitles making use of multiple languages are common,
>> and it should be possible to apply proper font selection and word
>> wrapping to all languages in use, not just the primary language.
>
> It's not clear to me that we need language information to apply proper
> font selection and word wrapping, since CSS doesn't do it.

How would we mark up mid-cue text in a different language from the
text around it? Directionality can be solved with a Unicode character,
so the display should be fine. But I am worried about the case where
we use WebVTT for "descriptions" and a screen reader is supposed to
read out the text. It needs to choose the correct language model to
read out the cue text, and it will only get that from language
markup.


>> When both English subtitles and Japanese captions are on screen, it
>> would be very bad to choose a Chinese font for the Japanese text, and
>> worse to choose a Western font and use it for everything, even if
>> English is the predominant language in the file.
>
> Can't you get around this using explicit styles, e.g. against classes?
> Unless this really is going to be a common problem, I'm not particularly
> concerned about it.
>
>
> On Mon, 24 Jan 2011, Philip Jägenstedt wrote:
>>
>> Multi-languaged subtitles/captions seem to be extremely uncommon,
>> unsurprisingly, since you have to understand all the languages to be
>> able to read them.
>>
>> The case you mention isn't a problem, you just specify Japanese as the
>> main language.
>
> Indeed.
>
>
>> There are a few other theoretical cases:
>>
>> * Multi-language CJK captions. I've never seen this, but outside of
>> captioning, it seems like the foreign script is usually transcribed to
>> the native script (e.g. writing Japanese names with simplified Chinese
>> characters).
>>
>> * Use of Japanese or Chinese words in a mostly non-CJK subtitles. This
>> would make correct glyph selection impossible, but I've never seen it.
>>
>> * Voice synthesis of e.g. mixed English/French captions. Given that this
>> would only be useful to be people who know both languages, it seem not
>> worth complicating the format for.
>
> Agreed on all fronts.

I disagree on the third case. Many people speak more than one
language, and even if they don't speak the language used in a cue, it
is still bad to render it using the wrong language model, in
particular if it is rendered by a screen reader. We really need a
mechanism to attach a language marker to a cue segment.


>> Do you have any examples of real-world subtitles/captions that would
>> benefit from more fine-grained language information?
>
> This kind of information would indeed be useful.

Note that I'm not so much worried about captions and subtitles here,
but rather about audio descriptions as rendered from cue text.

Glenn may have a different opinion on mixed-language subtitles/captions, though.


> On Mon, 24 Jan 2011, Glenn Maynard wrote:
>>
>> They're very common in anime fansubs:
>>
>> http://img339.imageshack.us/img339/2681/screenshotgg.jpg
>>
>> The text on the left is a transcription, the top is a transliteration,
>> and the bottom is a translation.
>
> Aren't these three separate text tracks?

I would think they are three different cues, but not necessarily three
different tracks.


>> I'm pretty sure I've also seen cases of translation notes mixing
>> languages within the same caption, e.g. "jinja (神社): shrine", but
>> it's less common and I don't have an example handy.
>
> Mixing one CJK language with one non-CJK language seems fine. That should
> always work, assuming you specify good fonts in the CSS.
>
>
>> > The case you mention isn't a problem, you just specify Japanese as the
>> > main language. There are a few other theoretical cases:
>>
>> Then you're indicating that English text is Japanese, which I'd expect
>> to cause UAs to render everything with a Japanese font.  That's what
>> happens when I load English text in Firefox and force SJIS: everything
>> is rendered in MS PGothic.  That's probably just what Japanese users
>> want for English text mixed in with Japanese text, too--but it's
>> generally not what English users want with the reverse.
>
> I don't understand why we can't have good typography for CJK and non-CJK
> together. Surely there are fonts that get both right?
>
>
> On Mon, 24 Jan 2011, Philip Jägenstedt wrote:
>>
>> My main point here is that the use cases are so marginal. If there were
>> more compelling ones, it's not hard to support intra-cue language
>> settings using syntax like <lang en>bla</lang> or similar.
>
> Indeed.


A <lang> tag would indeed be useful.
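
With the syntax Philip sketches, a mixed-language cue could then be
marked up something like this (markup that is of course not in the
spec yet):

  00:01:02.000 --> 00:01:05.000
  <v Teacher>The word <lang ja>神社</lang> (jinja) means "shrine".</v>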


> On Mon, 24 Jan 2011, Glenn Maynard wrote:
>>
>> Here's one that I think was done very well, rendered statically to make
>> sure we're all seeing the same thing:
>>
>> http://zewt.org/~glenn/multiple%20conversation%20example.mpg
>>
>> The results are pretty straightforward.  One always stays on top, one
>> always stays on the bottom, and most of the time the spacing between the
>> two is correct--the normal distance the UA uses between two vertical
>> captions (which would be lost by specifying the line height explicitly).
>> Combined with the separate coloring (which is already possible, of
>> course), it's possible to read both conversations and intuitively track
>> which is which, and it's also very easy to just pick one or the other to
>> read.
>
> As far as I can tell, the WebVTT algorithm would handle this case pretty
> well.
>
>
>> One example of how this can be tricky: at 0:17, a caption on the bottom
>> wraps and takes two lines, which then pushes the line at 0:19 upward
>> (that part's simple enough).  If instead the top part had appeared
>> first, the renderer would need to figure out in advance to push it
>> upwards, to make space for the two-line caption underneith it.
>> Otherwise, the captions would be forced to switch places.
>
> Right, without lookahead I don't know how you'd solve it. With lookahead
> things get pretty dicey pretty quickly.

If we introduced the scrolling behaviour that I described above,
where cues rendered into the same location as a previous, still
active cue push that previous cue up, we would get this behaviour
covered too.


> On Wed, 9 Feb 2011, Silvia Pfeiffer wrote:
>>
>> We're trying to avoid the need for multiple transcodings and are trying
>> to achieve something like the following pipeline: broadcast captions ->
>> transcode to WebVTT -> show in browser -> transcode to broadcast devices
>> -> show
>
> Why not just do:
>
>   broadcast captions -> transcode to WebVTT -> show in browser
>
> ...for browsers and:
>
>   broadcast captions -> show
>
> ...for legacy broadcast devices?

Eventually, we will want to get rid of the legacy format and just
deliver WebVTT, but the captions still need to display as though they
came from the original broadcast caption format, for contractual
reasons.


> In any case the amount of legacy broadcast captions pales in comparison to
> the volume of new captions we will see for the Web. I'm not really
> convinced that legacy broadcast captions are an important concern here.

There is still a large amount of legacy broadcast captions, and
broadcasters are concerned about their legacy content.


>> What is the argument against using <u> in captions?
>
> What is the argument _for_ using <u> in captions? We don't add features
> due to a lack of reasons not to. We add features due to a plethora of
> reasons to do so.
>
>
>> > [ foolip suggests using multiple cues to do blinking ]
>>
>> But from a captioning/subtitling point of view it's probably hard to
>> convert that back to blinking text, since we've just lost the semantic
>> by ripping it into multiple cues (and every program would use different
>> ways of doing this).
>
> I do not think round-tripping legacy broadcast captions through WebVTT is
> an important use case. If that is something that we should support, then
> we should first establish why it is an important use case, and then
> reconsider WebVTT within that context, rather than adding features to
> handle it piecemeal.
>
>
>> I guess what we are discovering is that we can define the general format
>> of WebVTT for the Web, but that there may be an additional need to
>> provide minimum implementation needs (a "profile" if you want - as much
>> as I hate this word).
>
> Personally I have nothing against the word "profile", but I do have
> something against providing for "minimum implementation needs".
>
> Interoperability means everything works the same everywhere.

Yeah, what I meant was that we can create CSS feature sets that,
e.g., cover the needs of broadcast captions and thus provide a
"broadcast caption profile". Any application (including non-browsers)
would then just need to support that subset of CSS, rather than full
CSS functionality, to support that profile.
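
Concretely, such a profile could boil down to a small required subset
of CSS along these lines (the class names are hypothetical, the
colours are from the EBU/CEA list mentioned earlier):

  ::cue { color: white; background: rgba(127, 127, 127, 0.7); }
  ::cue(c.speaker1) { color: yellow; }
  ::cue(c.speaker2) { color: cyan; }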


> On Thu, 10 Feb 2011, Silvia Pfeiffer wrote:
>>
>> Further discussions at Google indicate that it would be nice to make
>> more components optional. Can we have something like this:
>>
>>       [[h*:]mm:]ss[.d[c[m]]]  | s*[.d[c[m]]]
>>
>> Examples:
>>     23  = 23 seconds
>>     23.2  = 23 sec, 1 decisec
>>     1:23.45   = 1 min, 23 sec, 45 centisec
>>     123.456  = 123 sec, 456 millisec
>
> Currently the syntax is [h*:]mm:ss.sss; what's the advantage of making
> this more complicated? It's not like most subtitled clips will be shorter
> than a minute. Also, why would we want to support multiple redundant ways
> of expressing the same time? (e.g. 01:00.000 and 60.000)
>
> Readability of VTT files seems like it would be helped by consistency,
> which suggests using the same format everywhere, as much as possible.

Yup, fair enough.

Cheers,
Silvia.

