[whatwg] clarification on how to do extended text descriptions for video

Tue May 24 03:45:43 PDT 2011

Hi Ian, all,

Ian and I had a brief conversation recently where I mentioned a
problem with extended text descriptions with screen readers (and worse
still with braille devices) and the suggestion was that the "paused
for user interaction" state of a media element may be the solution. I
would like to pick this up and discuss in detail how that would work
to confirm my sketchy understanding.

*The use case:*

In the specification for media elements we have a <track> kind of
"descriptions", which are:
"Textual descriptions of the video component of the media resource,
intended for audio synthesis when the visual component is unavailable
(e.g. because the user is interacting with the application without a
screen while driving, or because the user is blind). Synthesized as a
separate audio track."

I'm for now assuming that the synthesis will be done through a screen
reader and not through the browser itself, thus making the
descriptions available to users as synthesized audio or as braille if
the screen reader is set up for a braille device.

The textual descriptions are provided as chunks of text with a start
and a end time (so-called "cues"). The cues are processed during video
playback as the video's playback time starts to fall within the time
frame of the cue. Thus, it is expected the that cues are consumed
during the cue's time frame and are not present any more when the end
time of the cue is reached, so they don't conflict with the video's
normal audio.

However, on many occasions, it is not possible to consume the cue text
in the given time frame. In particular not in the following
situations:

1. The screen reader takes longer to read out the cue text than the
cue's time frame provides for. This is particularly the case with long
cue text, but also when the screen reader's reading rate is slower
than what the author of the cue text expected.

2. The braille device is used for reading. Since reading braille is
much slower than listening to read-out text, the cue time frame will
invariably be too short.

3. The user seeked right into the middle of a cue and thus the time
frame that is available for reading out the cue text is shorter than
the cue author calculated with.

*The requirement:*

Correct me if I'm wrong, but it seems that what we need is a way for
the screen reader to pause the video element from continuing to play
while the screen reader is still busy delivering the cue text. (In
a11y talk: what is required is a means to deal with "extended
descriptions", which extend the timeline of the video.) Once it's
finished presenting, it can resume the video element's playback.

*Can we use "paused for user interaction" for extended descriptions?*

As mentioned above: I want to find out if the "paused for user
interaction" state of the video element might allow us to solve this
problem.

IIUC, a video is "paused for user interaction" basically when the UA
has decided to pause the video without the user asking to pause it
(i.e. the paused attribute is false) and the pausing happened not for
network buffering reasons, but for other reasons. IIUC one concrete
situation where this state is used is when the UA has reached the end
of the resource and is waiting for more data to come (e.g. on a live
stream).

To use "paused for user interaction" for extending descriptions, we
need to introduce a means for the screen reader to tell the UA to
pause the video when it reaches the end of the cue and it's still busy
delivering a cue's text. Then, as it finishes, it will un-pause the
video to let it continue playing.

To me it sounds like a feasible solution.

The screen reader could even provide a user setting and a short-cut so
a user can decide that they don't want this pausing to happen or that
they want to move on from the current cue.

Another advantage of this approach is that e.g. a deaf-blind user
could hook up their braille device such that it will deliver the
extended descriptions and also deliver captions through braille with
such extension pausing happening. (Not sure that such a user would
even want to play the video, but it would be possible.)

Now, I think there is one problem though (at least as far as I can
tell). Right now, IIUC, screen readers are only passive listeners on
the UA. They don't influence the behaviour of the UA. The
accessibility API is basically only a one-way street from the UA to
the AT. I wonder if that is a major inhibitor of using this approach
or whether it's easy for UAs to overcome this limitation? (Or if such
a limitation even exists - I don't know enough about how AT work...).

Is that an issue? Are there other issues that I have overlooked?

Cheers,
Silvia.