[whatwg] Google Feedback on the HTML5 media a11y specifications

Tue Feb 15 02:09:19 PST 2011

On Tue, 15 Feb 2011 04:28:36 +0100, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> Hi Philip,
>
> On Tue, Feb 15, 2011 at 3:27 AM, Philip Jägenstedt <philipj at opera.com>  
> wrote:
>> On Wed, 09 Feb 2011 03:57:37 +0100, Silvia Pfeiffer
>> <silviapfeiffer1 at gmail.com> wrote:
>>
>>>>> A. Feedback on the WebVTT format
>>>>
>>>>> 1. Introduce file-wide metadata
>>>>>
>>>>> WebVTT requires a structure to add header-style metadata. We are here
>>>>> talking about lists of name-value pairs as typically in use for  
>>>>> header
>>>>> information. The metadata can be optional, but we need a defined  
>>>>> means
>>>>> of adding them.
>>>>>
>>>>> Required attributes in WebVTT files should be the main language in  
>>>>> use
>>>>> and the kind of data found in the WebVTT file - information that is
>>>>> currently provided in the <track> element by the @srclang and @kind
>>>>> attributes. These are necessary to allow the files to be interpreted
>>>>> correctly by non-browser applications, for transcoding or to  
>>>>> determine
>>>>> if a file was created as a caption file or something else, in
>>>>> particular the @kind=metadata. @srclang also sets the base
>>>>> directionality for BiDi calculations.
>>>>
>>>> Are there non-browsers that use the language for font-selection or
>>>> bidi? Is auto-detection not likely to give a better user experience?
>>>> Are there any other use cases for knowing the language of the
>>>> captions *after* they've been opened?
>>>
>>>
>>> I can't see a different way to let non-browser applications know what
>>> font to choose, even how to provide the user with a menu of available
>>> caption tracks for a video, or to set the base directionality for
>>> BiDi. Also, language auto-detection is a huge burden to put onto
>>> non-browser applications. Having a readable language tag at the
>>> beginning of the file is useful to quickly figure it all out.
>>>
>>> The language set in <track> would certainly overrule what is in the
>>> file. Also, the last language attribute in the header would probably
>>> win.
>>>
>>> I guess it would also be ok to have language and kind optional -
>>> different applications may then default to interpreting WebVTT files
>>> differently, such as by default English and Captions - or English and
>>> Descriptions, but that's probably acceptable from context.
>>
>> Given that most existing subtitle formats don't have any language
>> metadata, I'm a bit skeptical. However, if implementors of non-browser
>> players want to implement WebVTT and ask for this I won't stand in the
>> way (not that I could if I wanted to). For simplicity, I'd prefer the
>> language metadata from the file to not have any effect on browsers
>> though, even if no language is given on <track>.
>
> There is also the Content-Language response header of HTTP, which
> could have an influence on the browser, too. I'm not sure about the
> best way to deal with all this overlapping information, but I'm sure
> it can be sorted out.

My preference is ignoring everything except what is given in <track>. In  
particular language can't be given in the resource or its headers, because  
then one has to fetch all the tracks in order to provide a track selection  
menu with language information or to automatically activate the suitable  
tracks.

>>>> Why do non-browser players need to know the kind? All kinds are
>>>> processed in the same way except metadata, and there's no reason to
>>>> use metadata tracks with external players.
>>>
>>> Maybe I have a different view of what applications will make use of
>>> WebVTT files than most. My thinking is that there will also be uses
>>> for metadata tracks in external applications. Aside from this, there
>>> will be authoring applications and players, yes, but there will also
>>> be automated processing tools. So, to know what type of content is
>>> inside a file without having to look at more than the file's headers
>>> is really important.
>>
>> For both of these cases, putting some magic strings inside comments
>> that are ignored by browsers sounds like it would be sufficient.
>> Name-value metadata that is ignored by browsers would be fine as well.
>
> I'm for the second option: name-value metadata that is ignored by the
> browser. I think in fact the browser should in general ignore all
> name-value metadata with the exception of file-wide cue settings.

I agree, browsers should ignore in-file metadata. (That's one reason I  
think using comments for it is quite fine most of the time.)

>>>>> Further metadata fields that are typically used by authors to keep
>>>>> specific authoring information or usage hints are necessary, too. As
>>>>> examples of current use see the format of MPlayer mpsub’s header
>>>>> metadata [2], EBU STL’s General Subtitle Information block [3], and
>>>>> even CEA-608’s Extended Data Service with its StartDate, Station,
>>>>> Program, Category and TVRating information [4]. Rather than  
>>>>> specifying
>>>>> a specific subset of potential fields we recommend to just have the
>>>>> means to provide name-value pairs and leave it to the negotiation
>>>>> between the author and the publisher which fields they expect of each
>>>>> other.
>>>>
>>>> This approach has worked very well with Vorbis Comments, probably  
>>>> mostly
>>>> because all interesting fields have been pre-defined in
>>>> http://www.xiph.org/vorbis/doc/v-comment.html
>>>>
>>>> For a web format though, wouldn't some kind of wiki registry be good  
>>>> to
>>>> avoid total mayhem, especially if there are some predefined fields?  
>>>> (Not
>>>> having file-wide metadata would also avoid such mayhem.)
>>>
>>> It might be good to define a base set - the Vorbis Comments one or the
>>> ID3 ones could be appropriate. Even the old Dublin Core set (the first
>>> ones, not the current chaos) could be good. I could also analyse the
>>> sets used in current typical caption formats and propose a superset of
>>> those.
>>>
>>> While I think you're right with suggesting a predefined set of fields,
>>> I am mostly keen right now to agree on the general format of the
>>> fields and how we need to parse them rather than what they actually
>>> are.
>>>
>>> So, I would suggest we allow lines of "name=value" under the WEBVTT
>>> magic string. A blank line defines the end of the header section and
>>> the beginning of the cues. Would be simple enough to parse, right?
>>
>> Sure, it's already handled by the current parsing spec, since it ignores
>> everything up to the first blank line.
>
> That's not quite how I'm reading the spec.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#webvtt-0
> allows
> "Optionally, either a U+0020 SPACE character or a U+0009 CHARACTER
> TABULATION (tab) character followed by any number of characters that
> are not U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)
> characters."
> after the "WEBVTT FILE" magic.
> To me that reads like all of the extra stuff has to be on the same line.
> I'd prefer if this read "any character except for two WebVTT line
> terminators", then it would all be ready for such header-style
> metadata.

See steps 12-17 of  
<http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#parsing-0>,  
it just skips all lines up to the first blank line. Syntax and parsing are  
different :)
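
To illustrate the parsing side, here is a minimal sketch in Python. It is
not the spec's algorithm: it assumes the "name=value" header syntax
proposed above (which is only a suggestion in this thread), accepts a bare
"WEBVTT" prefix as the magic, and the function name parse_header is made
up for the example. It keeps the pairs a non-browser tool might want, where
a browser following the current parsing steps would simply skip the same
lines.

```python
def parse_header(text):
    """Split a WebVTT-like file into (metadata dict, remaining lines).

    Assumption: header lines look like "name=value". That syntax is only
    a proposal from this thread, not something the spec defines.
    """
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT magic")
    metadata = {}
    i = 1
    # Everything up to the first blank line is header material.
    while i < len(lines) and lines[i] != "":
        name, sep, value = lines[i].partition("=")
        if sep:  # lines without "=" are skipped rather than rejected
            metadata[name.strip()] = value.strip()  # a later duplicate wins
        i += 1
    # Skip the blank separator line; the rest is cue data.
    return metadata, lines[i + 1:]
```

Fed the cap1.vtt example quoted further down, this would yield Language
and Kind as metadata and leave the cues untouched.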

>>>>> 4. Cue formatting requirements
>>>>>
>>>>> In analysing the available cue formatting functionality, we have  
>>>>> found
>>>>> that some features are missing. Most of these features can be added
>>>>> through using CSS on cues that have received a <b>, <i>, <c> or <v>
>>>>> marker. The following features are core to traditional TV and exist  
>>>>> in
>>>>> EBU STL and CEA-608/708 captions. Support of these will be a core
>>>>> requirement for browsers as well as non-browser applications and it
>>>>> makes sense to add these to WebVTT rather than relying on external  
>>>>> CSS
>>>>> which cannot be used for non-browser captions:
>>>>
>>>> The unstated requirement here seems to be that WebVTT needs to work
>>>> as an interchange format for various TV captioning formats even in
>>>> user agents without any support for CSS (or JavaScript). I'm trying
>>>> to not make a straw man argument, but if we want an interchange
>>>> format, we should pick TTML, which is explicitly designed to be just
>>>> that and doesn't depend on CSS.
>>>>
>>>> Is it not enough that a lossy conversion can be made from various  
>>>> formats
>>>> into WebVTT+CSS(+JavaScript)? If not, the "Web" in "WebVTT" is highly
>>>> misleading...
>>>
>>>
>>> We're trying to avoid the need for multiple transcodings and are
>>> trying to achieve something like the following pipeline:
>>> broadcast captions -> transcode to WebVTT -> show in browser ->
>>> transcode to broadcast devices -> show
>>>
>>> If we have to plug TTML into this pipeline, too, it will be much
>>> slower and we would need to additionally define a mapping from TTML to
>>> WebVTT and back.
>>>
>>> I'm sure with SMPTE-TT around we will end up seeing things like
>>> broadcast->TTML->WebVTT->browser, but even then we don't want WebVTT
>>> to be a lossy format.
>>
>> I can only disagree. Trying to make WebVTT into an interchange format
>> will inevitably turn it into a highly presentational format with lots
>> of legacy baggage. I can certainly see the use cases for an interchange
>> format, but I don't think it's worth the added complexity. I'd prefer
>> an approach where any format quirks that can't be mapped to WebVTT are
>> expressed using <c.foo> and if it turns out lots of people want the
>> feature, we can add it to a future revision.
>
> I wouldn't go as far as to say it needs to become an interchange
> format. But I can see us specifying what the browser parses, while
> given options such as the header-metadata and span classes that allow
> with some extra information to fully recover the broadcast
> functionality. I actually think that is almost possible already.

After this thread has run for a while, it'd be nice to hear where you  
think <c.foo> isn't enough and new markup is needed, if anything.

>>>>> * underline: EBU STL, CEA-608 and CEA-708 support underlining of
>>>>> characters. The underline character is also particularly important  
>>>>> for
>>>>> some Asian languages. Please make it possible to provide text
>>>>> underlines without the use of CSS in WebVTT.
>>>>
>>>> Which Asian languages? If it's just the Chinese
>>>> <http://en.wikipedia.org/wiki/Proper_name_mark>, then I don't think  
>>>> that
>>>> needs <u> or similar. In my experience, use of the Chinese proper name
>>>> mark
>>>> is in fact extremely rare in Chinese captions, at least in movies and  
>>>> TV
>>>> series from the mainland and Taiwan. It would be best to use e.g.
>>>> 我來自<c.pnm>中國</c> to make it easy to change the style between
>>>> single/double/wavy/no underline.
>>>
>>> OK. So if we need underlined text, it will need to be
>>> <c.underline>..</c> and CSS underline? I guess in a Web context
>>> underline text is usually a hyperlink so it makes sense to discourage
>>> <u> for the Web. But is that also an argument for
>>> captions/subtitles/descriptions? What is the argument against using
>>> <u> in captions?
>>
>> I don't really have an argument against it, I just questioned that it is
>> important for Asian languages in particular. Adding <u> would be really
>> simple, it's just a question of why. I've seldom seen underlining in
>> captions, so it's not clear to me how it's usually used.
>
> I'm told <u> is fairly common in traditional captions. We don't do
> <c.italics> either for such common stuff.
> But if we really don't want this, I guess <c.u> would work, too and is
> not that much longer.

I can't see any underlining when scanning through the samples at  
<http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA>.  
If it is in fact common in some contexts, it'd be great to have samples  
added to the wiki, I'm sure we could learn something from it. If <u> is  
actually useful for something, then we should just add it.

>>> With "-" you are referring to replacing "-->" with "-" to arrive at  
>>> things
>>> like:
>>> 15.000-17.950
>>> At the left we can see...
>>>
>>> as compared to:
>>> 15.000+2.950
>>> At the left we can see...
>>
>> Yes, that's what I meant.
>>
>>> I actually think they read fairly well, given that people are used to
>>> the double meaning of "-": to mean both "from ... to" and "minus".
>>> But we could use a different character for "absolute time" if you
>>> prefer, e.g. "/".
>>> 15.000/17.950
>>> At the left we can see...
>>>
>>> I find this fairly readable, too.
>>
>> Either would work for me. As I mentioned, the room for improvement here
>> isn't only the syntax of the timing line, but also to make it obvious  
>> that
>> cue timestamps like <00:01.000> are relative. Using + for relative
>> timestamps is potentially confusing too, as one might think that many
>> consecutive <+00:01.000> are cumulative, rather than all being 1 second  
>> from
>> the start time of the cue.
>
> That's true and in fact the way in which I have authored my examples,
> now that I look back at them. It makes the timings smaller and I think
> it's a bit more logical. But really we just have to decide on one
> meaning:
>
> 5-10
> This <+1>is <+1>a <+1>simple <+1>example.
>
> I find I actually prefer this over
>
> 5-10
> This <+1>is <+2>a <+3>simple <+4>example.

Right, we just have to pick something. I'd like to get the basic structure  
down soon, though, as changing the timestamp parsing will be very  
difficult once there are implementations.
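
To make the two readings concrete, here is a small Python sketch. It
assumes the shorthand used in the examples above (whole seconds written as
<+N>, which is not any agreed syntax), and word_times is a name invented
for this illustration.

```python
import re

def word_times(cue_start, text, cumulative):
    """Return (absolute_time, word) pairs for text like 'This <+1>is <+1>a'.

    Assumption: '<+N>' holds whole seconds, as in the examples above.
    cumulative=True  -> each offset is added to the previous timestamp.
    cumulative=False -> each offset is measured from the cue start.
    """
    parts = re.split(r"<\+(\d+)>", text)
    # parts alternates: text, offset, text, offset, text, ...
    result = [(cue_start, parts[0].strip())]
    current = cue_start
    for offset, chunk in zip(parts[1::2], parts[2::2]):
        if cumulative:
            current += int(offset)
        else:
            current = cue_start + int(offset)
        result.append((current, chunk.strip()))
    return result
```

With a cue starting at 5, the <+1><+1><+1><+1> form read cumulatively and
the <+1><+2><+3><+4> form read relative to the cue start give the same
absolute times, so the choice is purely about which is less surprising to
authors.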

>>>>> 7. Comments
>>>>
>>>>> we recommend the introduction of comments.
>>>>
>>>> I agree and think it needs to happen before WebVTT starts to get
>>>> implemented
>>>> and used on the web. In other words: now.
>>>
>>> Agreed. I'm happy for the previously suggested "//" at the line start
>>> to be comments, or, for that matter, "#" or ";" or any other special
>>> character. I would prefer not to use "/*" since it implies a "*/" is
>>> required to end the comment. Similarly we should avoid "<!--" and
>>> "-->" or anything else that requires a special comment end mark and
>>> more than one or two characters.
>>
>> I'd quite like to have block comments, so I think the best options are:
>>
>> 1. // and /* */ like JavaScript
>> 2. <!-- --> like HTML/XML
>
> If the main use case for the comments is to comment out a line,
> something at the line start alone would be sufficient. If we have to
> have both, I would prefer the shorter first option.
>
>> I think that the main difficulty is actually not picking a syntax, but
>> deciding how it works in the parser. Unlike HTML, I don't think we want  
>> the
>> comments to show up in the "DOM", since that would only work for  
>> intra-cue
>> comments. Ideally it would be preprocessor-ish, but yet the magic bytes
>> ("WEBVTT FILE") should be checked first as otherwise identifying WebVTT
>> would require implementing its preprocessor steps :/
>
> As I would not want the comments to be handed into the DOM or to
> JavaScript, it doesn't matter if they are not like HTML. I would
> regard them more as pre-processor style comments.

For simplicity, perhaps it would be better to have line-comments only. On  
my wishlist I have a less convoluted parser definition which operates on  
lines instead of sprinkling CR/LF all over, and it'd be easy to add  
line-comments to such a parser. Wish-list item requested at  
<http://www.w3.org/Bugs/Public/show_bug.cgi?id=12076>.
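
As a rough sketch of what I mean by a line-based approach, in Python, and
assuming "//" at the start of a line as the comment marker (nothing has
been decided, and strip_line_comments is just a name for the example):

```python
def strip_line_comments(text, marker="//"):
    """Drop whole-line comments after verifying the magic string."""
    lines = text.splitlines()
    # Check the signature before any preprocessing, so that identifying a
    # file never requires running the comment stripper.
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file")
    # Assumption: a comment is any line that begins with the marker.
    return [line for line in lines if not line.startswith(marker)]
```

Note that under this assumption a cue text line that legitimately starts
with "//" would be dropped too, so a real definition would need some way
to escape the marker.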

>>>>> 8. Line wrapping
>>>>>
>>>>> CEA-708 captions support automatic line wrapping in a more
>>>>> sophisticated way than WebVTT -- see
>>>>> http://en.wikipedia.org/wiki/CEA-708#Word_wrap.
>>>>>
>>>>> In our experience with YouTube we have found that in certain
>>>>> situations this type of automatic line wrapping is very useful.
>>>>> Captions that were authored for display in a full-screen video may
>>>>> contain too many words to be displayed fully within the actual video
>>>>> presentation (note that mobile / desktop / internet TV devices may
>>>>> each have a different amount of space available, and embedded videos
>>>>> may be of arbitrary sizes). Furthermore, user-selected fonts or font
>>>>> sizes may be larger than expected, especially for viewers who need
>>>>> larger print.
>>>>>
>>>>> WebVTT as currently specified wraps text at the edge of their
>>>>> containing blocks, regardless of the value of the 'white-space'
>>>>> property, even if doing so requires splitting a word where there is
>>>>> no line breaking opportunity. This will tend to create poor quality
>>>>> captions.  For languages where it makes sense, line wrapping should
>>>>> only be possible at carriage return, space, or hyphen characters, but
>>>>> not on non-breaking space characters.  (Note that CEA-708 also contains
>>>>> non-breaking space and non-breaking transparent space characters to
>>>>> help control wrapping.) However, this algorithm will not necessarily
>>>>> work for all languages.
>>>>>
>>>>> We therefore suggest that a better solution for line wrapping would  
>>>>> be
>>>>> to use the existing line wrapping algorithms of browsers, which are
>>>>> presumably already language-sensitive.
>>>>>
>>>>> [Note: the YouTube line wrapping algorithm goes even further by
>>>>> splitting single caption cues into multiple cues if there is too much
>>>>> text to reasonably fit within the area. YouTube then adjusts the  
>>>>> times
>>>>> of these caption cues so they appear sequentially.  Perhaps this  
>>>>> could
>>>>> be mentioned as another option for server-side tools.]
>>>>
>>>> Yeah, with SRT people are manually line-wrapping when authoring the
>>>> captions
>>>> and often enough the end result is that you get something rendered:
>>>>
>>>> - Who could have guessed that not all fonts are the same
>>>> size?
>>>> - That's news to me, so I get four lines of text where I
>>>> wanted two!
>>>>
>>>> I'm inclined to say that we should normalize all whitespace during
>>>> parsing and not have explicit line breaks at all. If people really
>>>> want two lines, they should use two cues. In practice, I don't know
>>>> how well that would fare, though. What other solutions are there?
>>>
>>> I don't think I would go that far. The concern has mostly been with
>>> the line wrapping of lines that are too long and the possibility of
>>> splitting words that way. The particular concern was with this
>>> paragraph:
>>>
>>> "Text runs must be wrapped at the edge of their containing blocks,
>>> regardless of the value of the 'white-space' property, even if doing
>>> so requires splitting a word where there is no line breaking
>>> opportunity."
>>> see
>>> http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#timed-text-tracks-0
>>>
>>> So we want to avoid splitting mid-word and we suggest introducing the
>>> ability to have non-breaking spaces.
>>
>> I think splitting in the middle of words would only happen for words  
>> that
>> are longer than the whole line.
>
> Ah ok - I guess you can interpret the sentence above in this way as
> in "splitting a word ONLY where there is no line breaking opportunity".
> Then it's probably ok. It would still make sense to accept
> non-breaking spaces.

Perhaps Hixie would like to clarify in the spec precisely what is meant?

There's already a non-breaking space in Unicode: NO-BREAK SPACE (U+00A0)
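
As an illustration only, Python's textwrap module shows the effect: it
breaks on ASCII whitespace but not on U+00A0. This says nothing about what
the WebVTT rendering rules do; it just demonstrates that a wrapper which
honours the character needs no extra markup.

```python
import textwrap

# With an ordinary space the number and its unit may end up on
# different lines.
plain = textwrap.wrap("about 10 km", width=8)

# With NO-BREAK SPACE (U+00A0) between them they stay together.
glued = textwrap.wrap("about 10\u00a0km", width=8)
```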

>> There's still plenty of room for improvements in line wrapping, though.  
>> It
>> seems to me that the main reason that people line wrap captions  
>> manually is
>> to avoid getting two lines of very different length, as that looks quite
>> unbalanced. There's no way to make that happen with CSS, and AFAIK it's  
>> not
>> done by the WebVTT rendering spec either.
>
> People split manually when they want quality captions and can visually
> test what it will look like.
>
> This endeavor has one big problem: when you change the video size,
> e.g. go to full screen, your optimisation for the previous size is
> likely to not be optimal for the new size any more. There, an
> automatic line balancing that makes use of commas and "and"s for
> choosing likely good line break positions would be nice.
>
> A completely different situation appears when the captions are not
> manually created, as is the case in YouTube. Even when you submit a
> perfect transcript and time-align it through speech recognition, you
> will only do the line breaks as you have to render cues. To achieve a
> better quality there, a better line-break algorithm would help
> massively.
>
> So, I agree with you about improving the line wrapping. I also think
> it is likely something that we have to leave to the browsers - at
> least for now.

Right, some experimentation here would be great, as I haven't seen any  
feature like this in any media players. In the hope of inspiring someone,  
perhaps myself, here's how I tentatively would like things to work:

1. Authors are encouraged to not manually line-break
2. UAs render the text at whatever width the <video> container allows, with
margins and all
3. The text will have been rendered on n lines.
4. Decrease the width on the container as much as possible while having n  
lines.
5. Use that line-breaking and then do whatever left/center/right-alignment  
relative to the original width.
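
Steps 2-4 can be sketched in a few lines of Python. This is only a toy:
it counts width in characters (a stand-in for real font metrics), uses
textwrap's greedy wrapping in place of a browser's line breaker, and
balanced_wrap is a name invented here.

```python
import textwrap

def balanced_wrap(text, max_width):
    """Wrap text into the same number of lines as greedy wrapping at
    max_width, but using the narrowest width that still achieves it.

    Assumption: width is counted in characters, not rendered pixels.
    """
    greedy = textwrap.wrap(text, max_width)   # steps 2 and 3: n lines
    n = len(greedy)
    # Step 4: shrink the container as far as possible while keeping n lines.
    for width in range(1, max_width + 1):
        lines = textwrap.wrap(text, width)
        if len(lines) <= n:
            return lines
    return greedy
```

For "Who could have guessed that not all fonts are the same size?" at a
maximum of 40 characters, greedy wrapping gives lines of 35 and 24
characters, while the narrowest two-line layout is 31 characters wide and
gives lines of 31 and 28. Step 5, the alignment within the original width,
is left out.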

I really should get around to reading the rendering section for WebVTT to  
see what it actually does, perhaps it's already clever...

>>>>> 4. Addressing individual cues through CSS
>>>>>
>>>>> As far as we understand, you can currently address all cues through
>>>>> ::cue and you can address a cue part through ::cue-part(<voice> ||
>>>>> <part> || <position> || <future-compatibility>). However, if we
>>>>> understand correctly, it doesn’t seem to be possible to address an
>>>>> individual cue through CSS, even though cues have individual
>>>>> identifiers. This is either an oversight or a misunderstanding on our
>>>>> parts. Can you please clarify how it is possible to address an
>>>>> individual cue through CSS?
>>>>
>>>> Since I've been arguing against the id's in WebVTT, I'm curious about  
>>>> the
>>>> use case here. Isn't using a unique class good enough?
>>>
>>> This links in with the discussion above on CSS styling and classes.
>>> Rather than define classes of cue settings and reference them from the
>>> cues, this allows them to be applied to individual cues in style
>>> sheets. I thought the whole reason of cue identifiers was to have this
>>> addressing functionality, so this would just close the loop.
>>>
>>> For example:
>>>
>>> Style sheet of the Web page:
>>> <style>
>>> video track#t1 ::cue(cue10) {
>>>  text-decoration: blink;
>>> }
>>> </style>
>>>
>>> The Web page (extract):
>>> <video src="video.webm" controls>
>>>  <track id="t1" label="captions" kind="captions" srclang="en-US"
>>> src="cap1.vtt"/>
>>> </video>
>>>
>>> The caption file cap1.vtt:
>>> WEBVTT
>>> Language=en-US
>>> Kind=Captions
>>>
>>> cue1
>>> 0.000-5.000
>>> blab blah
>>>
>>> cue10
>>> 40.000-60.000
>>> ALERT: Your basement is flooding - evacuate!
>>>
>>>
>>> Cue10 is addressed through CSS and turned into a blinking text without
>>> a need to change the markup at all.
>>
>> My point was that you could just as well do this:
>>
>> 0.000-5.000
>> <c.cue1>blab blah</c>
>>
>> In my view of things, id's in HTML are primarily for addressing via
>> #fragments and as hooks for scripts, for styling class is quite  
>> sufficient,
>> so I'm thinking it would be for WebVTT as well.
>
> I quite like the idea of using the identifiers for named media
> fragment URIs: e.g. http://example.org/video.webm#cue10 . We need
> identifiers for this. Also, I find them less intrusive in the text
> than <c.cue1> which defines a class that is only ever used on this
> single cue.

Hmm, isn't that what we have chapters for? Or do you want to use id's for  
some kind of inline chapters?

>>>>> 5. Ability to move captions out of the way
>>>>>
>>>>> Our experience with automated caption creation and positioning on
>>>>> YouTube indicates that it is almost impossible to always place the
>>>>> captions out of the way of where a user may be interested to look at.
>>>>> We therefore allow users to dynamically move the caption rendering
>>>>> area to a different viewport position to reveal what is underneath.  
>>>>> We
>>>>> recommend such drag-and-drop functionality also be made available for
>>>>> TimedTrack captions on the Web, especially when no specific
>>>>> positioning information is provided.
>>>>
>>>> This would indeed be rather nice, but wouldn't it interfere with text
>>>> selection? Detaching the captions into a floating, draggable window  
>>>> via
>>>> the
>>>> context menu would be a theoretically possible solution, but that's
>>>> getting
>>>> rather far ahead of ourselves before we have basic captioning support.
>>>
>>> On YouTube you can only move them within the video viewport. You
>>> should try it - it's really awesome actually.
>>>
>>> When you say "interfere with text selection" are you suggesting that
>>> the text of captions/subtitles should be able to be cut and pasted? I
>>> wonder what copyright holders think about that.
>>
>> Being able to select the captions just like any other text is a great  
>> thing
>> that I wouldn't want to disable. It's very useful if you want to pause  
>> and
>> look up the definition of a word or to report a typo in the captions  
>> without
>> having to retype the whole text.
>
> I guess you can have all of that as you can have it on Web pages, too.
> If you click and hold, it will be grabbing for moving. If you double
> click it is text selection for cut and paste. So, I don't think there
> would be a problem.

That would work, but I have to admit I've never seen a web page/browser  
combination that does what you suggest. Just single clicking and dragging  
is certainly the most discoverable form of text selection.

>> Premium Captions can be protected using the same tricks that are used to
>> prevent Premium DOM Text Nodes from being copied.
>
> Agreed.

-- 
Philip Jägenstedt
Core Developer
Opera Software