[whatwg] WebVTT feedback (was Re: Video feedback)

Wed Jun 8 03:20:22 PDT 2011

On Wed, Jun 8, 2011 at 6:39 PM, Philip Jägenstedt <philipj at opera.com> wrote:
> On Wed, 08 Jun 2011 02:54:45 +0200, Silvia Pfeiffer
> <silviapfeiffer1 at gmail.com> wrote:
>
>> Hi Philip, all,
>>
>> On Tue, Jun 7, 2011 at 8:12 PM, Philip Jägenstedt <philipj at opera.com>
>> wrote:
>>>
>>> On Sat, 04 Jun 2011 17:05:55 +0200, Silvia Pfeiffer
>>> <silviapfeiffer1 at gmail.com> wrote:
>>>
>>>>> On Mon, 3 Jan 2011, Philip J盲genstedt wrote:
>>>
>>> Silvia, is your mail client a bit funny with character encodings? (The
>>> UTF-8
>>> representation of U+00E4 is the same as the GBK representation of
>>> U+76F2.)
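The byte-level coincidence described above can be reproduced with a short Python sketch. It assumes only that the name was sent as UTF-8 and decoded somewhere along the way as GBK; which component actually did that is not known from this thread.

```python
# Reproduce the garbling: encode the name as UTF-8, then (wrongly) decode as GBK.
name = "Philip Jägenstedt"
utf8_bytes = name.encode("utf-8")      # "ä" (U+00E4) becomes the bytes C3 A4
garbled = utf8_bytes.decode("gbk")     # GBK reads C3 A4 as one character, U+76F2

# The mistake is reversible here because no bytes were lost on the way.
repaired = garbled.encode("gbk").decode("utf-8")

print(garbled)
print(repaired)
```

The round trip only works when every misread byte pair happens to be a valid GBK sequence, which is the case for this name.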
>>
>> I'm using GMAIL, so if there is anything wrong, you'll have to report
>> it to Google. ;-)
>> Checking back, I actually received your name in Ian's email with that
>> funny encoding. I'm not sure whether it's Gmail's fault for interpreting
>> it this way or whether some information in the email headers was lost
>> during delivery.
>>
>>
>>>>>> > > * The "bad cue" handling is stricter than it should be. After
>>>>>> > > collecting an id, the next line must be a timestamp line.
>>>>>> > > Otherwise,
>>>>>> > > we skip everything until a blank line, so in the following the
>>>>>> > > parser would jump to "bad cue" on line "2" and skip the whole cue.
>>>>>> > >
>>>>>> > > 1
>>>>>> > > 2
>>>>>> > > 00:00:00.000 --> 00:00:01.000
>>>>>> > > Bla
>>>>>> > >
>>>>>> > > This doesn't match what most existing SRT parsers do, as they
>>>>>> > > simply
>>>>>> > > look for timing lines and ignore everything else. If we really
>>>>>> > > need
>>>>>> > > to collect the id instead of ignoring it like everyone else, this
>>>>>> > > should be more robust, so that a valid timing line always begins a
>>>>>> > > new cue. Personally, I'd prefer if it is simply ignored and that
>>>>>> > > we
>>>>>> > > use some form of in-cue markup for styling hooks.
>>>>>> >
>>>>>> > The IDs are useful for referencing cues from script, so I haven't
>>>>>> > removed them. I've also left the parsing as is for when neither the
>>>>>> > first nor second line is a timing line, since that gives us a lot of
>>>>>> > headroom for future extensions (we can do anything so long as the
>>>>>> > second line doesn't start with a timestamp and "-->" and another
>>>>>> > timestamp).
>>>>>>
>>>>>> In the case of feeding future extensions to current parsers, it's way
>>>>>> better fallback behavior to simply ignore the unrecognized second line
>>>>>> than to discard the entire cue. The current behavior seems
>>>>>> unnecessarily
>>>>>> strict and makes the parser more complicated than it needs to be. My
>>>>>> preference is just ignore anything preceding the timing line, but even
>>>>>> if we must have IDs it can still be made simpler and more robust than
>>>>>> what is currently spec'ed.
>>>>>
>>>>> If we just ignore content until we hit a line that happens to look like
>>>>> a
>>>>> timing line, then we are much more constrained in what we can do in the
>>>>> future. For example, we couldn't introduce a "comment block" syntax,
>>>>> since
>>>>> any comment containing a timing line wouldn't be ignored. On the other
>>>>> hand if we keep the syntax as it is now, we can introduce a comment
>>>>> block
>>>>> just by having its first line include a "-->" but not have it match the
>>>>> timestamp syntax, e.g. by having it be "--> COMMENT" or some such.
>>>>>
>>>>> Looking at the parser more closely, I don't really see how doing
>>>>> anything
>>>>> more complex than skipping the block entirely would be simpler than
>>>>> what
>>>>> we have now, anyway.
>>>>
>>>> Yes, I think that can work. The pattern of a line with "-->" without
>>>> time markers is currently ignored, so we can introduce something with
>>>> it for special content like comments, style and default.
>>>
>>> This seems to have been Ian's assumption, but it's not what the spec
>>> says.
>>> Follow the steps in
>>>
>>> http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#parsing-0
>>>
>>> 32. If line contains the three-character substring "-->" (U+002D
>>> HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then jump
>>> to
>>> the step labeled timings below.
>>>
>>> 40. Timings: Collect WebVTT cue timings and settings from line, using cue
>>> for the results. If that fails, jump to the step labeled bad cue.
>>>
>>> 54. Bad cue: Discard cue.
>>>
>>> (Followed by a loop to skip until the next empty line.)
>>>
>>> The effect is that any line containing "-->" that is not a timing
>>> line
>>> causes everything up to the next blank line to be ignored.
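The behaviour described in the quoted steps can be sketched roughly as follows. This is a simplified illustration, not the spec algorithm: the function name `parse_cues`, the `TIMING` pattern (hours required, no cue settings) and the dictionary layout are all assumptions made for the example, and header handling is left out.

```python
import re

# Assumed, simplified timing pattern: hh:mm:ss.ttt --> hh:mm:ss.ttt
TIMING = re.compile(
    r"^\s*(\d{2,}:\d{2}:\d{2}\.\d{3})\s+-->\s+(\d{2,}:\d{2}:\d{2}\.\d{3})"
)

def parse_cues(text):
    """Return the cues found in text, discarding any 'bad cue' block."""
    lines = text.split("\n")
    cues = []
    i, n = 0, len(lines)
    while i < n:
        if lines[i].strip() == "":      # skip blank lines between blocks
            i += 1
            continue
        cue_id = None
        line = lines[i]
        if "-->" not in line:           # first line is treated as an id
            cue_id = line
            i += 1
            line = lines[i] if i < n else ""
        # Only a line containing "-->" is tried as a timing line.
        match = TIMING.match(line) if "-->" in line else None
        if match is None:
            # Bad cue: discard it and skip until the next blank line.
            while i < n and lines[i].strip() != "":
                i += 1
            continue
        i += 1
        payload = []
        while i < n and lines[i].strip() != "":
            payload.append(lines[i])
            i += 1
        cues.append({"id": cue_id, "start": match.group(1),
                     "end": match.group(2), "text": "\n".join(payload)})
    return cues

sample = "1\n2\n00:00:00.000 --> 00:00:01.000\nBla\n\n"
commented = "--> COMMENT\nhidden\n\n1\n00:00:02.000 --> 00:00:03.000\nHello\n"
print(parse_cues(sample))
print(parse_cues(commented))
```

With this sketch the "1 / 2 / timing" block from earlier in the thread is dropped as a whole, while a block starting with "--> COMMENT" is skipped and the cue after it is still parsed.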
>>
>>
>> Yes, that's what I expect. Therefore we can create such cues in the
>> file format right now and the browsers as they currently work will
>> ignore such content. In future, they can be extended to actually do
>> something sensible with it. Isn't that what "is currently ignored"
>> means? It doesn't break the parser - the parser just skips over it. Am
>> I missing something?
>
> OK, I guess we're talking about slightly different things. It is possible to
> add a syntax to comment out entire cues using something with "-->", so if
> that's all we want, that's fine.
>
>>>>>> * Voice synthesis of e.g. mixed English/French captions. Given that
>>>>>> this
>>>>>> would only be useful to people who know both languages, it seems not
>>>>>> worth complicating the format for.
>>>>>
>>>>> Agreed on all fronts.
>>>>
>>>> I disagree with the third case. Many people speak more than one
>>>> language and even if they don't speak the language that is in use in a
>>>> cue, it is still bad to render it using the wrong language model,
>>>> in particular if it is rendered by a screen reader. We really need a
>>>> mechanism to attach a language marker to a cue segment.
>>>
>>> It's not needed for the rendering of French vs English, is it? It is
>>> theoretically useful for CJK, but as I've said before it seems to be more
>>> common to transliterate the foreign script in these cases.
>>
>> I think it is needed. Think about a screen reader reading out that
>> text. Unless the screen reader knows which language model to load -
>> and that it now has to change from a French voicing to an English
>> voicing - it cannot present it accurately.
>>
>> Language markers are not just a matter of choosing the right character
>> encoding, but also the right TTS model.
>>
>>
>>>>>> Do you have any examples of real-world subtitles/captions that would
>>>>>> benefit from more fine-grained language information?
>>>>>
>>>>> This kind of information would indeed be useful.
>>>>
>>>> Note that I'm not so much worried about captions and subtitles here,
>>>> but rather worried about audio descriptions as rendered from cue text
>>>> descriptions.
>>>
>>> When would one want these descriptions to be multi-language?
>>
>> When they are describing something that is inherently multi-cultural.
>> For example, the name of a restaurant which is in French, while the
>> describer language is English.
>
> Does this kind of thing currently work with screen readers? Non-French
> people speaking English don't switch to proper French pronunciation when
> saying something like "I'm really into film noir" or "The general assumed
> political power through a coup d'etat", so screen reader users actually want
> what? If one doesn't know French, it seems like it would be harder to
> understand.

I can only go by the feedback that I got from Janina and she said she
is thoroughly upset that screen readers are given text without
language markers and can only read them in one language rather than
with the correct pronunciation of the right language.

Of course, if an author doesn't want to change the language model,
they would not add a <lang> tag around the text. But if they want to
change the language model, there is currently no way to do so.
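To make the idea concrete, here is a rough Python sketch of how a consumer might split cue text into per-language segments before handing them to a TTS engine. The markup form `<lang fr>...</lang>`, the function name `split_by_language` and the example sentence are assumptions for illustration only; no such syntax is defined yet, and nesting is not handled.

```python
import re

# Hypothetical markup: <lang CODE>text</lang>, not nested.
LANG_SPAN = re.compile(r"<lang\s+([A-Za-z-]+)>(.*?)</lang>", re.DOTALL)

def split_by_language(cue_text, default_lang):
    """Return (language, text) pairs in document order."""
    segments = []
    pos = 0
    for m in LANG_SPAN.finditer(cue_text):
        if m.start() > pos:
            # Text before the span keeps the cue's default language.
            segments.append((default_lang, cue_text[pos:m.start()]))
        segments.append((m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(cue_text):
        segments.append((default_lang, cue_text[pos:]))
    return segments

cue = "They walk into <lang fr>Le Petit Bistro</lang> and sit down."
for lang, text in split_by_language(cue, "en"):
    print(lang, repr(text))
```

An author who does not want the voice to change simply leaves the markup out, in which case the whole cue comes back as a single segment in the default language.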

> For languages further removed from English I'm fairly certain no English
> speaker would want to hear the original pronunciation. Imagine pronouncing
> "Mexico" in Spanish or "Beijing" in Mandarin Chinese in the middle of an
> English text... I'm certain it would confuse people more than help them
> understand.

It really depends. I'd rather give the author the option. Right now
they have no choice.

Regards,
Silvia.