[whatwg] Video, Closed Captions, and Audio Description Tracks
hsivonen at iki.fi
Mon Oct 8 02:22:43 PDT 2007
(Heavy quote snipping. Picking on particular points.)
On Oct 8, 2007, at 03:14, Silvia Pfeiffer wrote:
> This is both, more generic than captions, and less generic in that
> captions have formatting and are displayed in a particular way.
I think we should avoid overdoing captioning or subtitling by
engineering excessive formatting. If we consider how subtitling works
with legacy channels (TV and movie theaters), the text is always in
the same sans-serif font with white fill and black outline located at
the bottom of the video frame (optionally located at the top when
there's relevant native text at the bottom and optionally italicized).
To get feature parity with the legacy that is "good enough", the only
formatting options you need are putting the text at the top of the
video frame as opposed to the bottom and optionally italicizing text.
(It follows that I think the idea of using SVG for captioning or
subtitles is excessive.)
I wouldn't mind an upgrade path that allowed CSS font properties for
captioning and subtitles, but I think we shouldn't let formatting
hold back the first iteration.
> (colours, alignment etc. - the things that the EBU
> subtitling standard http://www.limeboy.com/support.php?kbID=12 is
The EBU format seems severely legacy from the Unicode point of view. :-(
> Another option would be to disregard CMML completely and invent a new
> timed text logical bitstream for Ogg which would just have the
> subtitles. This could use any existing time text format and would just
> require a bitstream mapping for Ogg, which should not be hard to do at all.
Is 3GPP Timed Text aka. MPEG-4 part 17 unencumbered? (IANAL, this
isn't an endorsement of the format--just a question.)
> an alternate audio track (e.g. speex as suggested by you for
> accessibility to blind people),
My understanding is that at least conceptually an audio description
track is *supplementary* to the normal sound track. Could someone who
knows more about the production of audio descriptions please comment
on whether audio description can in practice be implemented as a
supplementary sound track that plays concurrently with the main sound
track (in that case Speex would be appropriate) or whether the main
sound must be manually mixed differently when description is present?
> and several caption tracks (for different languages),
I think it needs emphasizing that captioning (for the deaf) and
translation subtitling (for people who can hear but who can't follow
the language) are distinctly different in terms of the metadata
flagging needs and the playback defaults. Moreover, although
translations for multiple languages are nice to have, they complicate
UI and metadata considerably and packaging multiple translations in
one file is outside the scope of HTML5 as far as the current Design
Principles draft (from the W3C side) goes.
I think we should first focus on two kinds of qualitatively different
timed text (differing in metadata and playback defaults):
1) Captions for the deaf:
* Written in the same language as the speech content of the video
* May have speaker identification text.
* May indicate other relevant sounds textually.
* Don't indicate text that can be seen in the video frame.
* Not rendered by default.
  * Enabled by a browser-wide "I am deaf or my device doesn't do
sound output" pref.
2) Subtitles for the people who can't follow foreign-language speech:
* Written in the language of the site that embeds video when
there's speech in another language.
* Don't identify the speaker.
* Don't identify sounds.
* Translate relevant text visible in the video frame.
* Rendered by default.
  * As a bonus, suppressible via the context menu or something on a
per-video basis.
When the problem is framed this way, the language of the text track
doesn't need to be specified at all. In case #1 it is "same as
audio". In case #2 it is "same as context site". This makes the text
track selection mechanism super-simple.
Note that #2 isn't an accessibility feature but addressing #2 right
away avoids the abuse of the #1 feature which is for accessibility.
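To illustrate how simple the selection mechanism becomes under this
two-category model, here is a hypothetical sketch (the names and the
kind strings are invented for illustration, not from any spec):

```python
def should_render(track_kind, user_wants_captions):
    """Decide whether a timed text track is rendered by default.

    track_kind: "captions" (for the deaf; same language as the audio)
    or "subtitles" (translation; same language as the embedding site).
    user_wants_captions: the browser-wide "I am deaf or my device
    doesn't do sound output" preference.
    """
    if track_kind == "captions":
        # Captions are not rendered by default; only the pref turns
        # them on.
        return user_wants_captions
    if track_kind == "subtitles":
        # Translation subtitles render by default.
        return True
    raise ValueError("unknown track kind: %r" % track_kind)
```

Note that no per-track language tag enters the decision at all, which
is the point: the language is implied by the kind.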
> I think we need to understand exactly what we expect from the caption
> tracks before being able to suggest an optimal solution. If e.g. we
> want caption tracks with hyperlinks on a temporal basis and some more
> metadata around that which is machine readable, then an extension of
> CMML would make the most sense.
I would prefer Unicode data over bitmaps in order to allow captioning
to be mined by search engines without OCR. In terms of defining the
problem space and metadata modeling, I think we should aim for the
two cases I outlined above instead of trying to cover more ground up
front.
Personally, I'd be fine with a format with these features:
* Metadata flag that tells if the text track is captioning for the
deaf or translation subtitles.
* Sequence of plain-text Unicode strings (incl. forced line breaks
and bidi marks) with the following data:
- Time code when the string appears.
- Time code when the string disappears.
  - Flag for positioning the string at the top of the frame instead
of the bottom.
* A way to do italics (or other emphasis for scripts for which
italics is not applicable), but I think this feature isn't essential.
* A guideline for estimating the amount of text appropriate to be
shown at one time and a matching rendering guideline for UAs. (This
guideline should result in an amount of text that agrees with current
TV best practices.)
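To show how little data per string such a format actually needs, here
is a hypothetical sketch of the data model implied by the list above
(all names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Cue:
    """One timed text string, per the minimal feature list above.

    Times are in seconds. `text` is a plain-text Unicode string (may
    contain forced line breaks and bidi marks). `at_top` positions
    the string at the top of the video frame instead of the bottom.
    """
    start: float          # time code when the string appears
    end: float            # time code when the string disappears
    text: str             # plain-text Unicode
    at_top: bool = False  # top-of-frame flag
    italic: bool = False  # optional emphasis; not essential

def active_cues(cues, t):
    """Return the cues that should be on screen at time t."""
    return [c for c in cues if c.start <= t < c.end]
```

Everything else (font, fill, outline, placement within the bottom or
top band) would be left to the UA, as described below.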
It would be up to the UA to render the text at the bottom of the
video frame in white sans-serif with black outline.
I think it would be inappropriate to put hyperlinks in captioning for
the deaf because it would venture outside the space of accessibility
and effectively hide some links for the non-deaf audience.