[whatwg] Video, Closed Captions, and Audio Description Tracks
hsivonen at iki.fi
Mon Oct 8 02:22:43 PDT 2007
(Heavy quote snipping. Picking on particular points.)
On Oct 8, 2007, at 03:14, Silvia Pfeiffer wrote:
> This is both, more generic than captions, and less generic in that
> captions have formatting and are displayed in a particular way.
I think we should avoid overdoing captioning or subtitling by
engineering excessive formatting. If we consider how subtitling works
with legacy channels (TV and movie theaters), the text is always in
the same sans-serif font with white fill and black outline located at
the bottom of the video frame (optionally located at the top when
there's relevant native text at the bottom and optionally italicized).
To get feature parity with the legacy that is "good enough", the only
formatting options you need are putting the text at the top of the
video frame as opposed to the bottom and optionally italicizing text.
(It follows that I think the idea of using SVG for captioning or
subtitles is excessive.)
I wouldn't mind an upgrade path that allowed CSS font properties for
captioning and subtitles, but I think we shouldn't let formatting
hold back the first iteration.
> (colours, alignment etc. - the things that the EBU
> subtitling standard http://www.limeboy.com/support.php?kbID=12 is
The EBU format seems severely legacy from the Unicode point of view. :-(
> Another option would be to disregard CMML completely and invent a new
> timed text logical bitstream for Ogg which would just have the
> subtitles. This could use any existing time text format and would just
> require a bitstream mapping for Ogg, which should not be hard to do at all.
Is 3GPP Timed Text aka. MPEG-4 part 17 unencumbered? (IANAL, this
isn't an endorsement of the format--just a question.)
> an alternate audio track (e.g. speex as suggested by you for
> accessibility to blind people),
My understanding is that at least conceptually an audio description
track is *supplementary* to the normal sound track. Could someone who
knows more about the production of audio descriptions please comment
on whether audio description can in practice be implemented as a
supplementary sound track that plays concurrently with the main sound
track (in that case Speex would be appropriate) or whether the main
sound must be manually mixed differently when description is present?
> and several caption tracks (for different languages),
I think it needs emphasizing that captioning (for the deaf) and
translation subtitling (for people who can hear but who can't follow
the language) are distinctly different in terms of the metadata
flagging needs and the playback defaults. Moreover, although
translations for multiple languages are nice to have, they complicate
UI and metadata considerably and packaging multiple translations in
one file is outside the scope of HTML5 as far as the current Design
Principles draft (from the W3C side) goes.
I think we should first focus on two kinds of qualitatively different
timed text (differing in metadata and playback defaults):
1) Captions for the deaf:
* Written in the same language as the speech content of the video
* May have speaker identification text.
* May indicate other relevant sounds textually.
* Don't indicate text that can be seen in the video frame.
* Not rendered by default.
  * Enabled by a browser-wide "I am deaf or my device doesn't do
sound output" pref.
2) Subtitles for the people who can't follow foreign-language speech:
* Written in the language of the site that embeds video when
there's speech in another language.
* Don't identify the speaker.
* Don't identify sounds.
* Translate relevant text visible in the video frame.
* Rendered by default.
  * As a bonus, suppressible via the context menu or something on a
per-video basis.
When the problem is framed this way, the language of the text track
doesn't need to be specified at all. In case #1 it is "same as
audio". In case #2 it is "same as context site". This makes the text
track selection mechanism super-simple.
Note that #2 isn't an accessibility feature but addressing #2 right
away avoids the abuse of the #1 feature which is for accessibility.
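To illustrate how simple the selection mechanism becomes under this
two-category model, here is a hypothetical sketch (the names and the
kind strings are invented for illustration, not from any spec):

```python
def should_render(track_kind, user_wants_captions):
    """Decide whether a timed text track is rendered by default.

    track_kind: "captions" (for the deaf; same language as the audio)
    or "subtitles" (translation; same language as the embedding site).
    user_wants_captions: the browser-wide "I am deaf or my device
    doesn't do sound output" preference.
    """
    if track_kind == "captions":
        # Captions are not rendered by default; only the pref turns
        # them on.
        return user_wants_captions
    if track_kind == "subtitles":
        # Translation subtitles render by default.
        return True
    raise ValueError("unknown track kind: %r" % track_kind)
```

Note that no per-track language tag enters the decision at all, which
is the point: the language is implied by the kind.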
> I think we need to understand exactly what we expect from the caption
> tracks before being able to suggest an optimal solution. If e.g. we
> want caption tracks with hyperlinks on a temporal basis and some more
> metadata around that which is machine readable, then an extension of
> CMML would make the most sense.
I would prefer Unicode data over bitmaps in order to allow captioning
to be mined by search engines without OCR. In terms of defining the
problem space and metadata modeling, I think we should aim for the
two cases I outlined above instead of trying to cover more ground up
front.
Personally, I'd be fine with a format with these features:
* Metadata flag that tells if the text track is captioning for the
deaf or translation subtitles.
* Sequence of plain-text Unicode strings (incl. forced line breaks
and bidi marks) with the following data:
- Time code when the string appears.
- Time code when the string disappears.
  - Flag for positioning the string at the top of the frame instead
of the bottom.
* A way to do italics (or other emphasis for scripts for which
italics is not applicable), but I think this feature isn't essential.
* A guideline for estimating the amount of text appropriate to be
shown at one time and a matching rendering guideline for UAs. (This
guideline should result in an amount of text that agrees with current
TV best practices.)
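To show how little data per string such a format actually needs, here
is a hypothetical sketch of the data model implied by the list above
(all names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Cue:
    """One timed text string, per the minimal feature list above.

    Times are in seconds. `text` is a plain-text Unicode string (may
    contain forced line breaks and bidi marks). `at_top` positions
    the string at the top of the video frame instead of the bottom.
    """
    start: float          # time code when the string appears
    end: float            # time code when the string disappears
    text: str             # plain-text Unicode
    at_top: bool = False  # top-of-frame flag
    italic: bool = False  # optional emphasis; not essential

def active_cues(cues, t):
    """Return the cues that should be on screen at time t."""
    return [c for c in cues if c.start <= t < c.end]
```

Everything else (font, fill, outline, placement within the bottom or
top band) would be left to the UA, as described below.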
It would be up to the UA to render the text at the bottom of the
video frame in white sans-serif with black outline.
I think it would be inappropriate to put hyperlinks in captioning for
the deaf because it would venture outside the space of accessibility
and effectively hide some links for the non-deaf audience.