[whatwg] <video><overlay> for captions/subtitles/etc

Mon Nov 30 11:54:51 PST 2009

On Sun, 29 Nov 2009 12:42:13 +0100, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> Philip, all,
>
>
> On Sun, Nov 29, 2009 at 9:37 PM, Philip Jägenstedt <philipj at opera.com>  
> wrote:
>> On Sun, 29 Nov 2009 06:21:45 +0100, Silvia Pfeiffer
>> <silviapfeiffer1 at gmail.com> wrote:
>>> My <itext> wasn't supposed to stay a JavaScript implementation. In
>>> fact, it had the exact same purpose as your <overlay> proposal: to
>>> eventually be added into the HTML5 specification and be properly
>>> integrated, such that it didn't have to rely on the timeupdate.
>>> In fact, the <itextlist>/<itext> proposal, which was my second
>>> improvement, see
>>> https://wiki.mozilla.org/Accessibility/HTML5_captions_v2, doesn't look
>>> very different to what you have there.
>>>
>>
>> Yes, that is very clear. I used it only as an example of what needs to be
>> done to parse SRT with JavaScript. Go ahead and edit the wiki if there's
>> anything that makes it sound like <itext> is something it is not.
>
> I guess what I was just missing is mention of what your proposal
> provides on top of what I had. You're stating that further down in
> your email, so it might be good to mention that. It also shows we are
> making progress. :-)

Added a "diff" statement to the wiki.

>>> I think you've taken the next step with proposing to add a wrapping
>>> <div> into the DOM - something I wasn't quite sure would be possible
>>> and I'm glad you've taken the step.
>>>
>>> Another comment on naming: whether we name the elements <itextlist>
>>> and <itext> or alternatively <overlay> and <source>, I'm not too
>>> fussed. In fact, I've discussed the renaming/reuse of <source> for
>>> <itext> in my recent blog post at
>>>
>>> http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/
>>> . I think it may well make a lot of sense since we can reduce the key
>>> required attributes to the ones that already exist for the <source>
>>> element.
>>>
>>
>> Indeed, my proposal is mainly a remix of <itext> and cue ranges. The main
>> selling point, though, is a consistent markup and DOM for in-band, external
>> and script-created subtitles, and a hook to get content into the fullscreen
>> mode.
>
> These are where we are indeed making progress - excellent!
>
>
> I must admit, I am still a bit dubious about how you are proposing to
> deal with in-band captions. Is a UA expected to take them out of the
> file and directly render them into <overlay>? Then you don't get the
> kind of control you get as a Web author over external captions, e.g.
> to specify a media query.

The UA certainly has to parse and render the in-band captions in some way;
I was just trying to find a way to apply styling to them.

> Also, the user doesn't get exposed to the tracks that are available,
> so he/she could choose interactively. I have been told that such
> interactive choice of the to-be-displayed caption track is a
> requirement, since people may use the subtitles/captions to learn a
> new language or read in their actual native language. YouTube
> certainly exposes all the available alternative language tracks - also
> because some of these tracks are actually created on the fly by
> automated translation. These are some of the reasons I was asked to
> provide declarative markup of all of the available subtitle tracks of
> video, no matter whether they came out of the media file (in-line) or
> not.

Could the people who have given you these requirements possibly join the
WHATWG and/or W3C HTML a11y TF to explain these use cases? AFAICT, no
declarative markup is needed to be able to select between caption tracks.
It can be done either via a native context menu or using script, assuming
that we have an API for exposing the available tracks (which is needed for
multiple audio and video tracks too).
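
To illustrate the script route, here is a minimal sketch, assuming some
API hands the page a list of track descriptors. The { kind, lang } shape
and the pickTrack name are invented for illustration; no such API is
specified anywhere yet.

```javascript
// Hypothetical: pick a caption/subtitle track by the user's language
// preferences. The descriptor shape ({ kind, lang }) is an assumption,
// not an existing API.
function pickTrack(tracks, preferredLangs) {
  for (const wanted of preferredLangs) {
    const w = wanted.toLowerCase();
    // A preference of "de" matches tracks tagged "de" or "de-AT".
    const found = tracks.find((t) => {
      const lang = t.lang.toLowerCase();
      return lang === w || lang.startsWith(w + "-");
    });
    if (found) return found;
  }
  return null; // nothing suitable; the caller decides what to show
}
```

A script-built menu could list every descriptor and call something like
pickTrack only to choose the initial default.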

> So, maybe we can use <source> to not just point at further external
> subtitle tracks, but also at in-band subtitle tracks and thus really
> make in-band identical to out-of-band? We could even use Media
> Fragment URI addressing for such an approach, e.g.
>
> <source src="captions-english.srt" lang="en"></source>
> <source src="video.ogv?track=subtitle[de]" lang="de"></source>
>
> or alternatively if no file was given in the @src attribute of a
> <source> element, it would be clear that it pointed at a track in the
> original media file like so:
>
> <source lang="de"></source>

Using the query string syntax is not possible, as query strings are
completely opaque to the client, but the fragment variant seems OK, if a
bit verbose (part of the URL is repeated). However, what happens if an
author does this:

<video src="video.ogv">
   <source src="captions-english.srt" lang="en"></source>
   <source src="other-video.ogv#track=subtitle[de]" lang="de"></source>
</video>

Authors have no apparent reason to think this would not work, but an
implementation that supports it is very, very unlikely to happen. UAs
which don't understand the MF syntax would presumably download
other-video.ogv and try decoding it as whatever subtitle formats they
support (e.g. SRT).
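
A rough sketch of the client-side split, assuming the track=subtitle[de]
syntax from the example above (it is taken from this thread, not from any
Media Fragments draft) and a made-up helper name, parseTrackReference.
Only the part after "#" is interpreted by the client; a query string
stays part of the resource that is requested from the server.

```javascript
// Hypothetical helper: split a <source> src into the resource to fetch and
// an optional in-band track selector carried in the fragment.
function parseTrackReference(src) {
  const hashIndex = src.indexOf("#");
  const resource = hashIndex === -1 ? src : src.slice(0, hashIndex);
  const fragment = hashIndex === -1 ? "" : src.slice(hashIndex + 1);
  // Assumed syntax from the thread: track=subtitle[de]
  const match = /^track=([a-z]+)\[([A-Za-z-]+)\]$/.exec(fragment);
  return {
    resource,
    track: match ? { kind: match[1], lang: match[2] } : null,
  };
}
```

With this, "other-video.ogv#track=subtitle[de]" yields the resource
"other-video.ogv" plus a German subtitle selector, which makes the problem
visible: the resource is not the one named in the <video> src attribute.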

Perhaps some CSS selector to style in-band captions/subtitles after all?

> About the cue ranges:
>
> If I understand your approach, then it means that if the video ends up
> playing at a time that is between a registered cue range's start and
> end time, the given DOMString text would be added to the <overlay>
> element and displayed. Is this correct?

Not exactly. Registering a cue range would simply cause an event to be
fired at a particular time; the event handler must then do its thing to
display subtitles or silly mustache overlays. I was intending to let
external subtitles also fire these events, but after being challenged on
the use case [1] I am no longer sure I see why.

> Would it not be better to register onenter and onleave functions that
> could do anything to the DOM, rather than restrict the cue's effect to
> the <overlay> part of the DOM? Maybe the slides that I want to show
> should be presented in a different <div> on the page and not as an
> overlay on the video? I must admit I am not quite sure about the best
> approach to solving cue ranges - still trying to figure out all the
> requirements.

This is what I was aiming for. Basically it's the same as the old cue
ranges API but without pauseOnExit (use media fragments instead). The
question is whether external captions or some declarative syntax should
fire the same events, and whether that requires adding some more
parameters to the events, but let's continue this discussion on
public-html-a11y.
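
For reference, the enter/exit behaviour can be emulated in script roughly
as follows. This is only a sketch: CueRanges, add() and update() are
hypothetical names, and it is driven by polling the playback time, which
is exactly what a native implementation would avoid.

```javascript
// Sketch of enter/exit cue ranges driven by the current playback time.
// CueRanges, add() and update() are hypothetical names.
class CueRanges {
  constructor() {
    this.ranges = [];
  }

  // Register a range [start, end) with optional enter/exit callbacks.
  add(start, end, onEnter, onExit) {
    this.ranges.push({ start, end, onEnter, onExit, active: false });
  }

  // Call with the current playback time, e.g. from a timeupdate handler.
  update(currentTime) {
    for (const r of this.ranges) {
      const inside = currentTime >= r.start && currentTime < r.end;
      if (inside && !r.active) {
        r.active = true;
        if (r.onEnter) r.onEnter(r);
      } else if (!inside && r.active) {
        r.active = false;
        if (r.onExit) r.onExit(r);
      }
    }
  }
}
```

In a page, update() would be called with video.currentTime from a
timeupdate listener, and the callbacks are free to touch any part of the
DOM, not just the overlay.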

[1] http://lists.w3.org/Archives/Public/public-html-a11y/2009Nov/0119.html

-- 
Philip Jägenstedt
Core Developer
Opera Software