[whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Wed Aug 25 02:20:55 PDT 2010

On Wed, 25 Aug 2010 09:16:56 +0200, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> On Tue, Aug 24, 2010 at 8:49 PM, Philip Jägenstedt
> <philipj at opera.com> wrote:
>
>> On Tue, 24 Aug 2010 04:32:21 +0200, Silvia Pfeiffer
>> <silviapfeiffer1 at gmail.com> wrote:
>>
>>> On Mon, Aug 23, 2010 at 6:55 PM, Philip Jägenstedt
>>> <philipj at opera.com> wrote:
>>>
>>>>>> Aside: WebSRT can't contain binary data, only UTF-8 encoded text.
>>>>>
>>>>> It sure can. Just base-64 encode it. I'm not saying it's a good thing,
>>>>> but if somebody really has an urge...
>>>>>
>>>> Sure, this would be a metadata track. Sites have no reason to offer
>>>> download links to it, and if anyone gets hold of such a file it would
>>>> quickly be evident that it's useless.
>>>>
>>>
>>>
>>> After a user has seen the crap on screen. I'm just saying: it's a legal
>>> WebSRT file and really not compatible with any existing infrastructure
>>> for SRT.
>>>
>>
>> A fair point. The alternatives I can see are (1) using an incompatible
>> format so that the user sees nothing or (2) adding a header that
>> indicates that the track is metadata.
>>
>> In order to tell the user to stop wasting their time with this file, I
>> think (1) is clearly worse. (2) is absolutely an option, but it will
>> only make a difference to software that understands this header, and if
>> the header is optional it will likely often be omitted. A dialog saying
>> "this is a metadata track, you can't watch it" is slightly friendlier
>> than a screen full of crap, but they are both pretty effective at
>> getting the message across.
>
>
>
> Yeah, I'm totally for adding a hint as to what format is in the cue.
> Then, a WebSRT file can be identified as to what it contains.

OK, but note that a browser would ignore this and trust what <track kind>  
says. I wouldn't want the kind to change after the external track is  
loaded; it would make the UI confusing if a captions track disappeared  
from the menu as soon as it was loaded because it internally claims to be  
metadata.

>>>>>> If we define WebSRT in a way that can handle >99% of existing content
>>>>>> and degrade gracefully (enough) when using new features in old
>>>>>> software, it seems reasonable to do. If lots of software developers
>>>>>> cry foul, then perhaps we should reconsider. It seems to me, though,
>>>>>> that actually researching and defining a good algorithm for parsing
>>>>>> SRT would be of use to others than just browsers.
>>>>>
>>>>> How is that different from moving away from SRT? If everyone has to
>>>>> change their parsing of SRT to accommodate a new spec, then that is a
>>>>> new format.
>>>>>
>>>> Not everyone has to change their parsers immediately, many will
>>>> continue to work. However, if someone wants to support SRT in a
>>>> compatible way, it's very helpful to have a spec, assuming that WebSRT
>>>> is actually compatible enough with existing SRT content.
>>>>
>>>> This is quite similar to HTML4 vs HTML5. There are lots of mostly
>>>> compatible HTML parsers, but HTML5 defines a single parsing algorithm,
>>>> and slow convergence towards that is a good thing.
>>>>
>>>>
>>> No, no, no! It is not at all similar to HTML4 and HTML5. A Web browser
>>> cannot suddenly stop working for a Web page, just because it has some
>>> extra functionality in it. Thus, the HTML format has been developed
>>> such that it can be extended without breaking existing stuff. We can
>>> guarantee that no browser will break because that is the way in which
>>> the format has been specified.
>>>
>>> No such thing has happened for SRT and there is simply no way to
>>> guarantee that all new WebSRT files will work in all existing SRT
>>> software, because SRT has not been specified as an extensible format
>>> and because there is no agreement between all parties that have
>>> implemented SRT support as to how extensions should be made.
>>>
>>> We can introduce such a thing for WebSRT, but we cannot claim it for
>>> SRT.
>>>
>>
>> You are right, existing SRT parsers are probably far less interoperable
>> than HTML parsers were before HTML5.
>>
>> Existing content demands that SRT parsers handle at least <i>, <b>,
>> <font> and <u> in some manner, even if it is by ignoring them. Any
>> parsers that treat SRT as plain text don't even work with today's
>> content, so I don't think they should be considered at all.
>
>
> You've just defined what SRT is. I would actually define SRT as the plain
> text format and the <i>, <b>, <font> and <u> markup as extensions.

Perhaps SRT was originally plain text, but for a very long time now, files  
with the .srt extension have contained markup; more than 50% do in the  
OpenSubtitles sample data. With nothing to differentiate the plain text  
and markup formats, there is effectively only one format, no matter what  
we choose to call it.
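
To make "handle in some manner, even if it is by ignoring it" concrete, here is a minimal Python sketch of a tolerant cue-text filter. It is my own illustration, not taken from any existing player: the function names and the regular expression are assumptions, and the tag list is just the four tags mentioned above.

```python
import re

# The four tags that existing SRT content is said to use in this thread.
KNOWN_TAGS = ("i", "b", "u", "font")

# Matches an opening or closing tag for one of the known names,
# including any attributes (e.g. <font color="red">).
_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(KNOWN_TAGS), re.IGNORECASE)


def contains_known_markup(cue_text: str) -> bool:
    """Return True if the cue text uses any of the known tags."""
    return _TAG_RE.search(cue_text) is not None


def strip_known_markup(cue_text: str) -> str:
    """Ignore the known tags by removing them and keeping the text between them."""
    return _TAG_RE.sub("", cue_text)
```

A filter like this leaves unknown constructs such as <1> or <ruby> in the displayed text, which is exactly the kind of behaviour difference between players discussed below.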

>> The question, then, is if parsers that handle the mentioned markup also
>> ignore <1>, <ruby> and <rt>. I haven't tested it, but I assume that some
>> will ignore them and some won't. What percentage of the media player
>> market would have to handle this correctly for these extensions to be
>> OK, in your opinion?
>
>
> If a single one breaks, it would be bad IMO because the expectations of
> the users of that software will be broken even if it may just be a small
> percentage of users and we have no influence on the upgrade path of that
> software - in particular if it is proprietary.

Neither a new file extension, a new MIME type nor a header is enough to  
stop some implementations from treating it as SRT and breaking. The only  
remaining option, AFAICT, is making the format fundamentally incompatible  
with SRT. Is it worth it?

>>>> If the SRT ecosystem is so fragile that it cannot tolerate any
>>>> extension whatsoever, then we should stay far away from it. It just
>>>> seems that's not the case.
>>>>
>>>
>>> How do we know that everyone that uses SRT now really wants to use
>>> WebSRT instead and wants to take part in the new ecosystem that we are
>>> introducing? We make some pretty big assumptions about what everyone
>>> who is not a Web browser vendor wants to do with SRT. That doesn't make
>>> the existing SRT ecosystem fragile - but it makes it an existing
>>> environment that needs to be respected.
>>>
>>
>> At this point, what is your recommendation? The following ideas have
>> been on the table:
>>
>> * Change the file extension to something other than .srt.
>>
>> I don't have an opinion, browsers ignore the file extension anyway.
>>
>
> Yes, I think we should definitely have a new file extension.

I'll leave this to others to decide, but since browsers have no concept of  
file extensions, just using .srt will work. If the format is SRT-like it's  
likely at least some files will use .srt in practice.

>> * Change the MIME type to something other than text/srt.
>>
>> I doubt it makes any difference, as most software that deals with SRT
>> today has no concept of MIME types. No matter what, I'd want exactly 1
>> MIME type or alternatively make browsers ignore the MIME type
>> completely.
>>
>
> You're right in that existing SRT software probably doesn't deal much
> with an SRT MIME type. Right now text/x-srt or text/srt is sometimes used
> for SRT files, but often text/plain is also in use and more likely from a
> Web server. Since this is the space where Web browsers play, I am not
> overly fussed, though I think logically text/websrt makes more sense with
> a .wsrt extension. Then, also SRT files can be served as text/websrt to
> allow them to take part in the WebSRT infrastructure if indeed they will
> continue to be valid WebSRT files.

Is there anything you expect would break if WebSRT files were served as  
text/srt?

> Incidentally, is it a problem if WebSRT files are served as text/plain,
> i.e. will the browser not identify them as subtitle files?

<http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#sourcing-out-of-band-timed-tracks>

Step 4 says:

"The tasks queued by the fetching algorithm on the networking task source  
to process the data as it is being fetched must examine the resource's  
Content Type metadata, once it is available, if it ever is. If no Content  
Type metadata is ever available, or if the type is not recognised as a  
timed track format, then the resource's format must be assumed to be  
unsupported (this causes the load to fail, as described below)."

In other words, browsers should have a whitelist of supported text track  
formats, just like they should for audio and video formats. (Note though  
that Safari and Chrome ignore the MIME type for audio/video and will  
likely continue to do so.)

It seems to me that a side-effect of this is that it will be impossible to  
test <track> on a local file system, as there's no MIME type and browsers  
aren't allowed to sniff. Surely this can't be the intention, Hixie?
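
For illustration, the whitelist check that step 4 implies could look roughly like the Python sketch below. The set of types is hypothetical (no MIME type has been settled in this thread), and the function name is mine, not the spec's.

```python
# Hypothetical whitelist: the actual MIME type for WebSRT is undecided,
# so both candidates from this thread are listed purely as an assumption.
SUPPORTED_TRACK_TYPES = {"text/srt", "text/websrt"}


def is_supported_track_type(content_type):
    """Decide from Content-Type metadata alone whether a track may load.

    `content_type` is the raw header value, or None when no Content-Type
    metadata is available (as with a file on a local file system).
    """
    if content_type is None:
        # No metadata and no sniffing allowed: treated as unsupported.
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = content_type.split(";", 1)[0].strip().lower()
    return essence in SUPPORTED_TRACK_TYPES
```

With these assumptions, is_supported_track_type("text/plain") and is_supported_track_type(None) both return False, which is the text/plain and local-file problem described above.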

>> * Add a header to WebSRT to make it uniquely identifiable.
>>
>> The header would have to be mandatory and browsers would have to reject
>> files that don't have it. Such files would be compatible with some
>> existing software and break some, depending on how they sniff. We could
>> also put metadata in such a header.
>>
>
> Yes, I think we need to introduce a header. Maybe we can hide all the
> structure in what SRT recognizes as comments (i.e. start the lines with
> ";"). But I believe we need some hints like the @profile to identify the
> type of the cues and the <link> to link to a style sheet, and we need
> metadata like the <meta> element of HTML headers.

I had no idea that semicolon was used for comments in SRT, is this usage  
widespread? Does it work in most players?

>> * Make something deliberately incompatible with SRT.
>>
>> It doesn't make a big difference to browsers implementing the format.
>> We'd be replacing something that mostly works in existing players with
>> something that never works.
>>
>
> That was the idea of WMML and I took that path because I thought it
> would be advantageous for other Web applications, such as those built on
> libxml2, expat, PHP's SimpleXML, pyexpat for Python, Nokogiri for Ruby
> etc. But I really like the idea of WebSRT to allow arbitrary metadata in
> the cues without having to put it into CDATA sections.
>
> I don't mind creating a format that is still somewhat compatible with
> SRT. We don't have to force incompatibility - but we should also not have
> it restrict us. In either case, it is a new format.

I'm not trying to be annoying, but this seems to clash with your  
preference to not break any existing software. Anything that resembles SRT  
*will* be treated as SRT in some existing players.

>> Here's the SRT research I promised:
>> http://blog.foolip.org/2010/08/20/srt-research/
>
>
> That is awesome work. I knew that most SRT files didn't use UTF-8, but I
> didn't know that we would make such a large percentage of files that are
> currently parsed by SRT software incompatible. It is good data to have.

Indeed, this makes reusing existing content highly unlikely outside of the  
ASCII-speaking world. Realistically, though, that wouldn't be very common  
anyway, especially not in the case of subtitles for movies and TV series  
like most of the content on OpenSubtitles. The benefit that remains is  
being able to use the same subtitle files in browsers and standalone  
players without conversion.
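
As a rough illustration of why legacy encodings matter here, a strict UTF-8 check in Python is sketched below. The function name is hypothetical and this is not the method used for the research linked above; it only shows that non-ASCII text saved in a legacy single-byte encoding typically fails to decode as UTF-8, while pure ASCII files pass.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw file contents decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the bytes of "Jägenstedt" encoded as Latin-1 are rejected, while the same name encoded as UTF-8, or any plain ASCII cue, is accepted.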

-- 
Philip Jägenstedt
Core Developer
Opera Software