[whatwg] Timed tracks: feedback compendium

Wed Sep 8 02:27:35 PDT 2010

On Wed, 08 Sep 2010 01:19:17 +0200, Ian Hickson <ian at hixie.ch> wrote:

> On Fri, 23 Jul 2010, Philip Jägenstedt wrote:

>> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#attr-track-kind
>>
>> The distinction between subtitles and captions isn't terribly clear.
>>
>> It says that subtitles are translations, but plain transcriptions
>> without cues for the hard of hearing would also be subtitles.
>>
>> How does one categorize translations that are for the HoH?
>
> I've tried to clarify this.

Thanks, the new definitions look good to me.

>> Alternatively, might it not be better to simply use the voice "sound"
>> for this and let the default stylesheet hide those cues? When writing
>> subtitles I don't want the maintenance overhead of 2 different versions
>> that differ only by the inclusion of [doorbell rings] and similar.
>> Honestly, it's more likely that I just wouldn't bother with
>> accessibility for the HoH at all. If I could add it with <sound>doorbell
>> rings, it's far more likely I would do that, as long as it isn't
>> rendered by default. This is my preferred solution, then keeping only
>> one of kind=subtitles and kind=captions. Enabling the HoH-cues could
>> then be a global preference in the browser, or done from the context
>> menu of individual videos.
>
> I don't disagree with this, but I fear it might be too radical a step for
> the caption-authoring community to take at this point.

Well, I guess the infrastructure in place is enough to do this by changing  
stylesheets.

>> If we must have both kind=subtitles and kind=captions, then I'd suggest
>> making the default subtitles, as that is without a doubt the most common
>> kind of timed text. Making captions the default only means that most
>> timed text will be mislabeled as being appropriate for the HoH when it
>> is not.
>
> Ok, I've changed the default. However, I'm not fighting this battle if it
> comes up again, and will just change it back if people don't defend
> having this as the default. (And then change it back again if the browsers
> pick "subtitles" in their implementations after all, of course.)
>
> Note that captions aren't just for users that are hard-of-hearing. Most
> of the time when I use timed tracks, I want captions, because the reason I
> have them enabled is that I have the sound muted.

OK, thanks!

> On Fri, 23 Jul 2010, Sam Dutton wrote:
>>
>> Is trackgroup out of the spec?
>
> What is trackgroup?

In the discussion on public-html-a11y, <trackgroup> was suggested to group
together mutually exclusive tracks, so that enabling one automatically
disables the others in the same trackgroup.

I guess it's up to the UA how to enable and disable <track>s now, but the
only options are making them all mutually exclusive (as existing players do)
or a weird kind of context menu where it's possible to enable and disable
tracks completely independently. Neither option is great, but as a user I
would almost certainly prefer all tracks being mutually exclusive and
requiring scripts to enable several at once.
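
As a rough illustration of the mutually exclusive behaviour I'd prefer, here
is a minimal Python sketch. The Track and TrackMenu names are made up for
this example and are not part of the spec or of any browser API.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    label: str
    enabled: bool = False


@dataclass
class TrackMenu:
    """Hypothetical UA-side menu in which all tracks are mutually exclusive."""
    tracks: list[Track] = field(default_factory=list)

    def select(self, label: str) -> None:
        # Enabling one track from the menu disables every other track.
        if not any(t.label == label for t in self.tracks):
            raise KeyError(label)
        for t in self.tracks:
            t.enabled = (t.label == label)

    def enabled_labels(self) -> list[str]:
        return [t.label for t in self.tracks if t.enabled]


menu = TrackMenu([Track("English"), Track("Swedish"), Track("Commentary")])
menu.select("English")
menu.select("Swedish")  # English is switched off automatically
# A script that wants several tracks at once sets the flags directly.
menu.tracks[2].enabled = True
```

The menu only ever leaves one track on, while a script can still turn on
more than one by bypassing it.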

> On Fri, 6 Aug 2010, Philip Jägenstedt wrote:
>>
>> I'm not particularly fond of the current voice markup, mainly for 2
>> reasons:
>>
>> First, a cue can only have 1 voice, which makes it impossible to style
>> cues spoken/sung simultaneously by 2 or more voices. There's a karaoke
>> example of this in
>> <http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_voices>
>
> That's just two cues.

I'm not sure what you're saying. The male singer's cues are in blue, the  
female singer's are in red and the part sung together is in green. Are you  
saying that the last cue should be made into two cues, or something else?

>> I would prefer if voices could be mixed, as such:
>>
>> 00:01.000 --> 00:02.000
>> <1> Speaker 1
>>
>> 00:03.000 --> 00:04.000
>> <2> Speaker 2
>>
>> 00:05.000 --> 00:06.000
>> <1><2> Speaker 1+2
>
> What's the use case?

To use a different style for the cues that are sung together, so that you  
know when it's your turn to sing. I hope we can throw away the numerical  
voices, continued below...
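
To make the mixed-voice idea concrete, here is a small Python sketch that
reads several leading voice tags off a cue. It only follows the syntax of my
example above; it is not the parsing algorithm from the spec, and
split_voices and style_key are hypothetical helper names.

```python
import re

# Matches one numeric voice tag such as "<1>" (syntax from the example
# above, not from the spec).
_VOICE = re.compile(r"<(\d+)>")


def split_voices(cue_text: str) -> tuple[list[str], str]:
    """Return (voices, remaining text) for a cue that starts with voice tags."""
    voices = []
    pos = 0
    while (m := _VOICE.match(cue_text, pos)):
        voices.append(m.group(1))
        pos = m.end()
    return voices, cue_text[pos:].strip()


def style_key(voices: list[str]) -> str:
    # Hypothetical key a stylesheet could hook into: "1", "2" or "1+2".
    return "+".join(sorted(voices))


voices, text = split_voices("<1><2> Speaker 1+2")
```

A cue sung by both voices then gets its own key, distinct from the keys of
the cues sung by either voice alone.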

>> Second, it makes it impossible to target a smaller part of the cue for
>> styling. We have <i> and <b>, but there are also cases where part of the
>> cue should be in a different color, see
>> <http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_colors>
>
> Well you can always restyle <i> or <b>.

That would be quite an abuse of <i> and <b> and would give bogus  
italics/bold text in standalone players.

>> If one allows multiple voices, it's not hard to predict that people will
>> start using magic numbers just to work around this, which would both be
>> wrong semantically and ugly to look at:
>>
>> 00:01.000 --> 00:02.000
>> <1> I like <1234>blue</1234> words.
>>
>> They'd then target 1234 with CSS to color it blue.
>>
>> I'm not sure of the best solution. I'd quite like the ability to use
>> arbitrary voices, e.g. to use the names/initials of the speaker rather
>> than a number, or to use e.g. <shouting> in combination with CSS :before
>> { content: 'Shouting: ' } or similar to adapt the display for different
>> audiences (accessibility, basically).
>
> Yeah, there are some difficult-to-satisfy constraints here. On the one
> hand having a predefined set of voices leads to better semantics,
> usability for authors, and accessibility; on the other hand we need
> something open-ended because we can't think of everything. We also have to
> make sure we don't enable voices to conflict with future tag names, so
> whatever we do that's open-ended would have to use a specific syntax (like
> being all numbers, which is what I currently have). I'm not sure how to
> improve on what we have now, but it's certainly not perfect.
>
>
> On Wed, 11 Aug 2010, Philip Jägenstedt wrote:
>>
>> What should numerical voices be replaced with? Personally I'd much
>> rather write <philip> and <silvia> to mark up a conversation between us
>> two, as I think it'd be quite hard to keep track of the numbers if
>> editing subtitles with many different speakers.
>
> We could say that a custom voice has to start with some punctuation or
> other, say <:philip>?

Yes, that would be better than numerical voices IMO. Unless there's a very  
good reason for making voices always apply to the whole cue, could we not  
use the same parsing for voices and other tags (i, b, ruby, rt)?

Ideally, the CSS extensions
(http://wiki.whatwg.org/wiki/Timed_tracks#CSS_extensions) should also work
the same for voices and tags, so that the normal child selectors would work.
Something like video::cue(narrator > i) to style the following cue:

00:01.000 --> 00:02.000
<narrator><i>The story begins

I'm not sure what constraints CSS syntax puts on the prefix for custom  
voices, is : safe? Other options might be <@philip> (Twitter style) or  
<-philip> (vendor prefix style).
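
Here is a hedged Python sketch of what "same parsing for voices and other
tags" could mean in practice: every tag, voice or not, goes onto one stack,
and a child selector is checked against that stack. The tokenizer is a
simplification I made up for illustration (stray "<" characters are simply
dropped); it is neither the spec's parser nor real CSS selector matching.

```python
import re

# A tag like <narrator>, </i> or <:philip>, or a run of plain text.
_TOKEN = re.compile(r"<(/?)([^<>\s]+)>|([^<]+)")


def text_runs(cue_text: str) -> list[tuple[tuple[str, ...], str]]:
    """Return (open-tag path, text) pairs, treating voices like any other tag."""
    stack: list[str] = []
    runs = []
    for closing, name, text in _TOKEN.findall(cue_text):
        if text:
            runs.append((tuple(stack), text))
        elif closing:
            # Close the nearest matching open tag, if there is one.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)
    return runs


def matches_child(path: tuple[str, ...], parent: str, child: str) -> bool:
    # True when the innermost tag is `child` and its direct parent is
    # `parent`, which is what a selector like "narrator > i" asks for.
    return len(path) >= 2 and path[-2:] == (parent, child)


runs = text_runs("<narrator><i>The story begins")
```

With this approach the "blue words" case needs no magic numbers either, since
an inner tag simply shows up as a deeper entry in the path.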

> On Tue, 24 Aug 2010, Philip Jägenstedt wrote:
>>
>> Here's the SRT research I promised:
>> http://blog.foolip.org/2010/08/20/srt-research/
>
> Awesome! Thanks for this.
>
> Addressing points in the same order:
>
>  - charset: resolved by introducing a charset override.

Oh well, that's better than sniffing the encoding or trusting Content-Type  
I guess.

>  - blank lines not separating cues: I couldn't find a client that
>    supported missing the blank line, so I didn't support that. It's a
>    small number of files, and a small number of cues within those files,
>    I presume, so I'm not too worried.

Indeed, I couldn't find one either. The players I tested instead rendered
the timing line and following cue text together with the previous cue,
just like a WebSRT implementation would. What we could do to slightly
improve the situation is to make --> invalid in the cue text, so that
validators could warn about this. That would require adding a &gt; escape
for >, so I'm not sure it's worth it. Perhaps validators could warn about
it regardless of the spec.
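
A validator warning of this kind could be sketched roughly as follows in
Python. The function name and the simplified view of the file format (cues
separated by blank lines, any line containing "-->" counts as a timing line)
are my assumptions for the example, not anything the spec defines.

```python
def find_arrow_in_cue_text(srt_text: str) -> list[int]:
    """Return 1-based line numbers where "-->" appears inside cue text."""
    suspicious = []
    in_cue_text = False  # True once the current cue's timing line was seen
    for lineno, line in enumerate(srt_text.splitlines(), start=1):
        if not line.strip():
            in_cue_text = False  # a blank line ends the cue
        elif "-->" in line:
            if in_cue_text:
                # A timing line swallowed into the previous cue's text.
                suspicious.append(lineno)
            in_cue_text = True
    return suspicious


sample = "\n".join([
    "1",
    "00:00:01,000 --> 00:00:02,000",
    "Hello",
    "2",                              # the blank line before this cue is missing
    "00:00:03,000 --> 00:00:04,000",
    "World",
    "",
    "3",
    "00:00:05,000 --> 00:00:06,000",
    "Bye",
])
warnings = find_arrow_in_cue_text(sample)
```

Only the timing line that follows the missing blank line is reported; the
well-formed third cue is left alone.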

>  - overlapping cues: supporting these is pretty important, so files with
>    overlapping cues will just have some weird artefacts on playback.

OK, tools to fix SRT timings already exist, so I guess this is manageable.
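
For what it's worth, finding the overlaps is the easy part. Below is a
minimal Python sketch, not taken from any existing tool; it assumes cues are
already parsed into (start, end) pairs in seconds and treats the end time as
exclusive.

```python
def overlapping_pairs(cues: list[tuple[float, float]]) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, of cues whose time ranges overlap.

    Each cue is (start, end) in seconds; the end time is treated as
    exclusive, which is an assumption of this sketch.
    """
    order = sorted(range(len(cues)), key=lambda i: cues[i][0])
    pairs = []
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if cues[j][0] >= cues[i][1]:
                break  # sorted by start, so no later cue can overlap cue i
            pairs.append((min(i, j), max(i, j)))
    return pairs


cues = [(1.0, 2.0), (1.5, 3.0), (3.0, 4.0)]
overlaps = overlapping_pairs(cues)
```

Whether a reported pair is an authoring mistake or an intentional overlap is
something only the author can decide.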

> The remaining data is interesting but seems to be consistent with our
> expectations before WebSRT was specced.

Right.

> On Wed, 25 Aug 2010, Philip Jägenstedt wrote:
>>
>> "The tasks queued by the fetching algorithm on the networking task
>> source to process the data as it is being fetched must examine the
>> resource's Content Type metadata, once it is available, if it ever is.
>> If no Content Type metadata is ever available, or if the type is not
>> recognised as a timed track format, then the resource's format must be
>> assumed to be unsupported (this causes the load to fail, as described
>> below)."
>>
>> In other words, browsers should have a whitelist of supported text track
>> formats, just like they should for audio and video formats. (Note though
>> that Safari and Chrome ignore the MIME type for audio/video and will
>> likely continue to do so.)
>>
>> It seems to me that a side-effect of this is that it will be impossible to
>> test <track> on a local file system, as there's no MIME type and
>> browsers aren't allowed to sniff. Surely this can't be the intention,
>> Hixie?
>
> Local file systems generally use extensions to declare file types (at
> least, on Windows and Mac OS X).

> On Wed, 25 Aug 2010, Philip Jägenstedt wrote:
>>
>> The main reason to care about the MIME type is some kind of "doing the
>> right thing" by not letting people get away with misconfigured servers.
>> Sometimes I feel it's just a waste of everyone's time though, it would
>> generally be less work for both browsers and authors to not bother.
>
> Agreed. Not sure what to do for WebSRT though, since there's no good way
> to recognise a WebSRT file as opposed to some other format.

In a <track> context, ignoring Content-Type is certainly the simplest option
and removes the need to require any specific file extension for local use.
Sniffing isn't really an issue since in a top-level context you can't do
much of anything interesting with SRT except display it as text (which
text/plain would achieve).

-- 
Philip Jägenstedt
Core Developer
Opera Software