[whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Mon Aug 9 07:04:49 PDT 2010

On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> Hi Philip,
>
> On Sat, Aug 7, 2010 at 1:50 AM, Philip Jägenstedt <philipj at opera.com>  
> wrote:
>
>> * there is a possibility to provide script that just affects the
>>> time-synchronized text resource
>>>
>>
>> I agree that some metadata would be useful, more on that below. I'm not
>> sure why we would want to run scripts inside the text document, though,  
>> when
>> that can be accomplished by using the TimedTrack API from the containing
>> page.
>
>
>
> Scripts inside a timed text document would only be useful for  
> applications
> that use the track not in conjunction with a Web page.

Do you mean that media players could include a JavaScript engine just for  
supporting scripts in WebSRT? Not to say that it can't happen, but it  
seems a bit unlikely.

>> 2. There is a natural mapping of WebSRT into in-band text tracks.
>>> Each cue naturally maps into an encoding page (just like a WMML cue  
>>> does,
>>> too). But in WebSRT, because the setup information is not brought in a
>>> hierarchical element surrounding all cues, it is easier to just chuck
>>> anything that comes before the first cue into an encoding header page.  
>>> For
>>> WMML, this problem can be solved, but it is less natural.
>>>
>>
>> I really like the idea of letting everything before the first timestamp  
>> in
>> WebSRT be interpreted as the header. I'd want to use it like this:
>>
>> # author: Fan Subber
>> # voices: <1> Boy
>> #         <2> Girl
>>
>> 01:23:45.678 --> 01:23:46.789
>> <1> Hello
>>
>> 01:23:48.910 --> 01:23:49.101
>> <2> Hello
>>
>> It's not critical that the format of the header be machine-readable,  
>> but we
>> could of course make up a key-value syntax, use JSON, or something else.
>
>
>
> I disagree. I think it's absolutely necessary that the format of the  
> header
> be machine-readable. Just like EXIF in images is machine readable or ID3  
> in
> MP3 is machine-readable. It would be counter-productive not to have it
> machine-readable, in particular useless to archiving and media management
> solutions.

OK, so maybe key-values?

Author: Fan Subber
Voice: <1> Boy
Voice: <2> Girl

01:23:45.678 --> 01:23:46.789
<1> Hello

This looks a bit like HTTP headers. (I'm not sure I'd actually want to  
allow multiple occurrences of the same key; in practice that seems to  
result in inconsistencies in how people mark up multiple authors.)
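
A minimal sketch of how such a key-value header could be read, assuming
that everything before the first timing line is header and that each
header line has the form "Key: value". The function name parse_header and
the timing regular expression are illustrative assumptions, not part of
any WebSRT draft. Repeated keys are collected into lists here only to
show the "Voice:" example above; rejecting duplicates would be just as
easy.

```python
import re

# Timing line as used in the examples above, e.g.
# "01:23:45.678 --> 01:23:46.789". This pattern is an assumption.
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")


def parse_header(text):
    """Collect "Key: value" lines that appear before the first timing line."""
    header = {}
    for line in text.splitlines():
        if TIMING.match(line):
            break  # first cue reached; the header ends here
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # Keep every occurrence of a key, in file order.
            header.setdefault(key.strip(), []).append(value.strip())
    return header


sample = """Author: Fan Subber
Voice: <1> Boy
Voice: <2> Girl

01:23:45.678 --> 01:23:46.789
<1> Hello
"""

print(parse_header(sample))
```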

>> I'm not sure of the best solution. I'd quite like the ability to use
>> arbitrary voices, e.g. to use the names/initials of the speaker rather  
>> than
>> a number, or to use e.g. <shouting> in combination with CSS :before {
>> content: 'Shouting: ' } or similar to adapt the display for different
>> audiences (accessibility, basically).
>
>
>
> I agree. I think we can go back to using <span> and @class and @id and  
> that
> would solve it all.

I guess this is in support of Henri's proposal of parsing the cue using  
the HTML fragment parser (same as innerHTML)? That would be easy to  
implement, but how do we then mark up speakers? Using <span  
class="narrator"></span> around each cue is very verbose. HTML isn't very  
good for marking up dialog, which is quite a limitation when dealing with  
subtitles...

>>> * there is no language specification for a WebSRT resource; while this
>>> will
>>> not be a problem when used in conjunction with a <track> element, it  
>>> still
>>> is a problem when the resource is used just by itself, in particular  
>>> as a
>>> hint for font selection and speech synthesis.
>>>
>>
>> The language inside the WebSRT file wouldn't end up being used for  
>> anything
>> by a browser, as it needs to know the language before downloading it to  
>> know
>> whether or not to download it at all. Still, I'd like a header section  
>> in
>> WebSRT. I think the parser is already defined so that it would ignore
>> garbage before the first cue, so this is more a matter of making it  
>> legal
>> syntax.
>
>
> Not quite. Some metadata in the header can make sense to also expose to  
> the
> Web page.
>
> I agree that we need a structured header section in WebSRT.

Fair enough, we should revisit this when deciding on how to expose  
metadata in media resources in general.

>>  * there is no means to identify which parser is required in the cues  
>> (is
>>> it
>>> "plain text", "minimal markup", or "anything"?) and therefore it is not
>>> possible for an application to know how it should parse the cues.
>>>
>>
>> All the types that are actually for visual rendering are parsed in the  
>> same
>> way, aren't they? Of course there's no way for non-browsers to know that
>> metadata tracks aren't interesting to look at as subtitles, but I think
>> showing the user the garbage is a quicker way to communicate that the file  
>> isn't
>> for direct viewing than hiding the text or similar.
>
>
>
> The spec says that files of kind "descriptions" and "metadata" are not
> displayed. It seems though that the parsing section will try two  
> interfaces:
> HTML and plain. I think there is a disconnect there. If we already know  
> that
> it's not parsable in HTML, why even try?

I was confused. The parsing algorithm does the same thing regardless of  
what kind of text track it is dealing with. I guess what you're saying is  
that non-browser applications also need to know that something is e.g.  
chapter markers, so that they can display it appropriately?

I don't have a strong opinion, but repeating the same information both in  
the containing document and in the subtitle file means that one of them  
will be ignored by browsers. People will copy-paste the ignored one and it  
will end up being wrong a lot of the time.

>>  * there is no version number on the format, thus it will be difficult  
>> to
>>> introduce future changes.
>>>
>>
>> I think we shouldn't have a version number, for the same reason that CSS
>> and HTML don't really have versions. If we evolve the WebSRT spec, it  
>> should
>> be in a backwards-compatible way.
>
>
> CSS and HTML are structured formats where you ignore things that you  
> cannot
> interpret. But the parsing is fixed and extensions play within this  
> parsing
> framework. I have my doubts that this is possible with WebSRT. Already one
> extension that we are discussing here will break parsing: the  
> introduction
> of structured headers. Because there is no structured way of extending
> WebSRT, I believe the best way to communicate whether it is backwards
> compatible is through a version number. We can change the minor versions  
> if
> the compatibility is not broken - it communicates though what features  
> are
> being used - and we can change the major version if compatibility is  
> broken.

Similarly, I think that the WebSRT parser should be designed to ignore  
things that it doesn't recognize, in particular unknown voices (if we keep  
those). Requiring parsers to fail when the version number is increased  
makes it harder to introduce changes to the format, because you'll have to  
either break all existing implementations or provide one subtitle file for  
each version. (Having a version number but letting parsers ignore it is  
just weird, quite like in HTML.)

I filed a bug suggesting that voice is allowed to be an arbitrary string:  
<http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320> (From the point of  
view of the parser, it still wouldn't be valid syntax.)
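
As a rough illustration of treating the voice as an arbitrary string, the
sketch below splits a leading <...> tag off a cue and returns whatever
name it finds, without checking it against a fixed list. The name
split_voice and the regular expression are assumptions made for this
example; this is not the parsing algorithm from the spec.

```python
import re

# A leading "<name>" followed by the cue text. Any non-empty name without
# angle brackets is accepted (hypothetical rule for this sketch).
VOICE = re.compile(r"^<([^<>]+)>\s*(.*)$", re.DOTALL)


def split_voice(cue_text):
    """Return (voice, text); voice is None when the cue has no leading tag."""
    match = VOICE.match(cue_text)
    if match is None:
        return None, cue_text
    return match.group(1), match.group(2)


print(split_voice("<1> Hello"))
print(split_voice("<shouting> Watch out!"))
print(split_voice("Hello"))
```

A renderer built on this could style the voices it knows about and show
the text of all others unstyled, rather than rejecting the cue.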

>>  2. Break the SRT link.
>>>
>>
>>  * the mime type of WebSRT resources should be a different mime type to  
>> SRT
>>> files, since they are so fundamentally different; e.g. text/websrt
>>>
>>> * the file extension of WebSRT resources should be different from SRT
>>> files,
>>> e.g. wsrt
>>>
>>
>> I'm not sure if either of these would make a difference.
>
>
> Really? How do you propose that a media player identifies that it cannot
> parse a WebSRT file that has random metadata in it when it is called .srt
> and provided under the same mime type as SRT files? Or a transcoding
> pipeline that relies on srt files just being plain old simple SRT. It  
> breaks
> expectations with users, with developers and with software.

I think it's unlikely that people will offer download links to SRT files  
that aren't useful outside of the page, so random metadata isn't likely to  
reach end users or applications by accident. Also, most media frameworks  
rely mainly on sniffing, so even a file that uses lots of WebSRT-only  
features is quite likely going to be detected as SRT anyway. At least in  
GStreamer, the file extension is given quite little weight in guessing the  
type and MIME isn't used at all (because the sniffing code doesn't know  
anything about HTTP). Finally, seeing random metadata displayed on screen  
is about as good an indication that the file is "broken" as the  
application failing to recognize the file completely.
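
To illustrate what content sniffing of this kind might look like, here is
a minimal sketch that ignores file extension and MIME type and only
searches the start of the data for a timing line. It is not GStreamer's
actual typefinding code; the function name looks_like_srt, the 4096-byte
window and the pattern are all assumptions.

```python
import re

# A timing line with either "," (classic SRT) or "." (as in the examples
# in this thread) before the milliseconds.
ARROW = re.compile(
    rb"\d{2}:\d{2}:\d{2}[,.]\d{3} --> \d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def looks_like_srt(data, window=4096):
    """Guess from the first bytes alone whether the data is SRT-like."""
    return ARROW.search(data[:window]) is not None


classic = b"1\n00:00:01,000 --> 00:00:02,000\nHi\n"
with_header = b"Author: Fan Subber\n\n01:23:45.678 --> 01:23:46.789\n<1> Hello\n"

print(looks_like_srt(classic))
print(looks_like_srt(with_header))
print(looks_like_srt(b"<html><body>not subtitles</body></html>"))
```

Under these assumptions a file with a metadata header is still detected
as SRT, which is the behaviour described above.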

On the other hand, keeping the same extension and (unregistered) MIME type  
as SRT has plenty of benefits, such as immediately being able to use  
existing SRT files in browsers without changing their file extension or  
MIME type.

>>  4. Make full use of CSS
>>>
>>> In the current form, WebSRT only makes limited use of existing CSS. I  
>>> see
>>> particularly the following limitations:
>>>
>>> * no use of the positioning functionality is made and instead a new  
>>> means
>>> of
>>> positioning is introduced; it would be nicer to just have this reuse  
>>> CSS
>>> functionality. It would also avoid having to repeat the positioning
>>> information on every single cue.
>>>
>>
>> I agree, the positioning syntax isn't something I'm happy about with
>> WebSRT. I think treating everything that follows the timestamp to be CSS
>> that applies to the whole cue would be better.
>
>
> Or taking the positioning stuff out of WebSRT and moving it to an  
> external
> CSS file as is done with formatting would make it much simpler.

Ah, that would be great. It's quite likely that there will only be 1 or 2  
different positions in the whole file, which you don't want to repeat on  
each and every cue.

>>  * there is no definition of the "canvas" dimensions that the cues are
>>> prepared for (width/height) and expected to work with other than  
>>> saying it
>>> is the video dimensions - but these can change and the proportions  
>>> should
>>> be
>>> changed with that
>>>
>>
>> I'm not sure what you're saying here. Should the subtitle file be
>> hard-coded to a particular size? In the quite peculiar case where the  
>> same
>> subtitles really don't work at two different resolutions, couldn't we  
>> just
>> have two files? In what cases would this be needed?
>
>
> Most subtitles will be created with a specific width and height in mind.  
> For
> example, the width in characters relies on the video canvas having at  
> least
> that size and the number of lines used usually refers to a lower third  
> of a
> video - where that is too small, it might cover the whole video. So, my
> proposal is not to hard-code the subtitles to a particular size, but to  
> put
> the minimum width and height that are being used for the creation of the
> subtitles into the file. Then, the file can be scaled below or above this
> size to adjust to the actual available space.

In practice, does this mean scaling font-size by  
width_actual/width_intended or similar? Personally, I prefer subtitles to  
be something like 20 screen pixels regardless of video size, as that is  
readable. Making them bigger hides more of the video, while making them  
smaller makes them hard to read. But I guess we could let the CSS media  
query min-width and similar be evaluated against the size of the  
containing video element, to make it possible anyway.
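
For concreteness, the scaling asked about above could be sketched as
follows. The function name and parameters are hypothetical, and this is
only one possible reading of the proposal: the font size grows or shrinks
in proportion to width_actual/width_intended.

```python
def scaled_font_size(base_px, width_intended, width_actual):
    """Scale a font size by width_actual / width_intended."""
    if width_intended <= 0:
        raise ValueError("width_intended must be positive")
    return base_px * width_actual / width_intended


# Subtitles authored at 20px for a 640px-wide video:
print(scaled_font_size(20, 640, 1280))  # video shown twice as wide
print(scaled_font_size(20, 640, 320))   # video shown half as wide
```

The alternative preferred above, a fixed size of about 20 screen pixels,
corresponds to ignoring both widths and always returning base_px.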

-- 
Philip Jägenstedt
Core Developer
Opera Software