[whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Wed Aug 25 00:16:56 PDT 2010

On Tue, Aug 24, 2010 at 8:49 PM, Philip Jägenstedt <philipj at opera.com> wrote:

> On Tue, 24 Aug 2010 04:32:21 +0200, Silvia Pfeiffer <
> silviapfeiffer1 at gmail.com> wrote:
>
>> On Mon, Aug 23, 2010 at 6:55 PM, Philip Jägenstedt <philipj at opera.com>
>> wrote:
>>
>>>>> Aside: WebSRT can't contain binary data, only UTF-8 encoded text.
>>>>
>>>> It sure can. Just base64-encode it. I'm not saying it's a good thing, but
>>>> if somebody really has an urge...
>>>>
>>>>
>>> Sure, this would be a metadata track. Sites have no reason to offer
>>> download links to it, and if anyone gets hold of such a file it would
>>> quickly be evident that it's useless.
>>>
>>
>>
>> After a user has seen the crap on screen. I'm just saying: it's a legal
>> WebSRT file and really not compatible with any existing infrastructure for
>> SRT.
>>
>
> A fair point. The alternatives I can see are (1) using an incompatible
> format so that the user sees nothing or (2) adding a header that indicates
> that the track is metadata.
>
> In order to tell the user to stop wasting their time with this file, I
> think (1) is clearly worse. (2) is absolutely an option, but it will only
> make a difference to software that understands this header and if the header
> is optional it will likely often be omitted. A dialog saying "this is a
> metadata track, you can't watch it" is slightly friendlier than a screen
> full of crap, but they are both pretty effective at getting the message
> across.

Yeah, I'm totally for adding a hint as to what format is in the cue. Then, a
WebSRT file can be identified as to what it contains.
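
To make the base64 point concrete, here is a minimal sketch of how binary data could end up in an SRT-shaped cue. The helper name `make_cue` and the sample bytes are made up for illustration; nothing here is taken from the WebSRT draft.

```python
import base64

def make_cue(index, start, end, payload):
    """Build one SRT-style cue whose text is base64-encoded binary data."""
    text = base64.b64encode(payload).decode("ascii")
    return f"{index}\n{start} --> {end}\n{text}\n"

# Hypothetical binary metadata; a legacy player would show the encoded
# text on screen as if it were a subtitle.
payload = b"\x00\x01\xfe\xff"
cue = make_cue(1, "00:00:01,000", "00:00:02,000", payload)
print(cue)
```

The resulting text is structurally an ordinary cue, which is why a hint about the cue format would be needed to tell it apart from subtitles.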

>>>>> If we define WebSRT in a way that can handle >99% of existing content and
>>>>> degrade gracefully (enough) when using new features in old software, it
>>>>> seems reasonable to do. If lots of software developers cry foul, then
>>>>> perhaps we should reconsider. It seems to me, though, that actually
>>>>> researching and defining a good algorithm for parsing SRT would be of
>>>>> use to others than just browsers.
>>>>>
>>>>>
>>>> How is that different from moving away from SRT? If everyone has to change
>>>> their parsing of SRT to accommodate a new spec, then that is a new
>>>> format.
>>>>
>>>>
>>> Not everyone has to change their parsers immediately; many will continue
>>> to work. However, if someone wants to support SRT in a compatible way, it's
>>> very helpful to have a spec, assuming that WebSRT is actually compatible
>>> enough with existing SRT content.
>>>
>>> This is quite similar to HTML4 vs HTML5. There are lots of mostly
>>> compatible HTML parsers, but HTML5 defines a single parsing algorithm,
>>> and slow convergence towards that is a good thing.
>>>
>>>
>> No, no, no! It is not at all similar to HTML4 and HTML5. A Web browser
>> cannot suddenly stop working for a Web page, just because it has some
>> extra functionality in it. Thus, the HTML format has been developed such
>> that it can be extended without breaking existing stuff. We can guarantee
>> that no browser will break because that is the way in which the format has
>> been specified.
>>
>> No such thing has happened for SRT and there is simply no way to guarantee
>> that all new WebSRT files will work in all existing SRT software, because
>> SRT has not been specified as an extensible format and because there is no
>> agreement between all parties that have implemented SRT support as to how
>> extensions should be made.
>>
>> We can introduce such a thing for WebSRT, but we cannot claim it for SRT.
>>
>
> You are right, existing SRT parsers are probably far less interoperable
> than HTML parsers were before HTML5.
>
> Existing content demands that SRT parsers handle at least <i>, <b>, <font>
> and <u> in some manner, even if it is by ignoring them. Any parsers that treat
> SRT as plain text don't even work with today's content, so I don't think they
> should be considered at all.

You've just defined what SRT is. I would actually define SRT as the plain
text format and the <i>, <b>, <font> and <u> markup as extensions.

> The question, then, is if parsers that handle the mentioned markup also
> ignore <1>, <ruby> and <rt>. I haven't tested it, but I assume that some
> will ignore it and some won't. How many percent of the media player market
> would have to handle this correctly for these extensions to be OK, in your
> opinion?

If even a single one breaks, that would be bad IMO: the expectations of that
software's users would be broken, even if they are only a small percentage of
users, and we have no influence on the upgrade path of that software - in
particular if it is proprietary.
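
To illustrate what "breaks" could look like, here is a rough sketch of two hypothetical player behaviours. Neither is modelled on any real player, and the regular expression is a simplification rather than a proper SRT parser.

```python
import re

# Tags that existing SRT content already uses, per the discussion above.
KNOWN_TAGS = {"i", "b", "u", "font"}
TAG_RE = re.compile(r"</?([A-Za-z0-9]+)[^>]*>")

def strip_known_tags(text):
    """Hypothetical player that drops only the tags it knows about."""
    def repl(match):
        return "" if match.group(1).lower() in KNOWN_TAGS else match.group(0)
    return TAG_RE.sub(repl, text)

def strip_all_tags(text):
    """Hypothetical player that drops anything that looks like a tag."""
    return TAG_RE.sub("", text)

cue_text = "<i>Hello</i> <1>from <ruby>漢<rt>kan</rt></ruby>"
print(strip_known_tags(cue_text))  # new tags leak through to the screen
print(strip_all_tags(cue_text))    # new tags are silently ignored
```

A player of the first kind would show the raw `<1>`, `<ruby>` and `<rt>` markup to its users, which is the breakage I am worried about.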

>
>>> If the SRT ecosystem is so fragile that it cannot tolerate any extension
>>> whatsoever, then we should stay far away from it. It just seems that's
>>> not the case.
>>>
>>
>>
>> How do we know that everyone that uses SRT now really wants to use WebSRT
>> instead and wants to take part in the new ecosystem that we are
>> introducing? We make some pretty big assumptions about what everyone who is
>> not a Web browser vendor wants to do with SRT. That doesn't make the
>> existing SRT ecosystem fragile - but it makes it an existing environment
>> that needs to be respected.
>>
>
> At this point, what is your recommendation? The following ideas have been
> on the table:
>
> * Change the file extension to something other than .srt.
>
> I don't have an opinion, browsers ignore the file extension anyway.
>

Yes, I think we should definitely have a new file extension.

> * Change the MIME type to something other than text/srt.
>
> I doubt it makes any difference, as most software that deals with SRT today
> has no concept of MIME types. No matter what, I'd want exactly 1 MIME type
> or, alternatively, make browsers ignore the MIME type completely.
>

You're right that existing SRT software probably doesn't deal much with an
SRT MIME type. Right now text/x-srt or text/srt is sometimes used for SRT
files, but text/plain is often used too, and is more likely what a Web
server sends. Since this is the space where Web browsers play, I am not overly
fussed, though I think logically text/websrt makes more sense with a .wsrt
extension. Then SRT files can also be served as text/websrt to allow them
to take part in the WebSRT infrastructure, if indeed they continue to be
valid WebSRT files.

Incidentally, is it a problem if WebSRT files are served as text/plain, i.e.
will the browser then fail to identify them as subtitle files?

> * Add a header to WebSRT to make it uniquely identifiable.
>
> The header would have to be mandatory and browsers would have to reject
> files that don't have it. Such files would be compatible with some existing
> software and break some, depending on how they sniff. We could also put
> metadata in such a header.
>

Yes, I think we need to introduce a header. Maybe we can hide all the
structure in what SRT recognizes as comments (i.e. start the lines with ";").
But I believe we need some hints like the @profile to identify the type of
the cues and the <link> to link to a style sheet, and we need metadata like
the <meta> element of HTML headers.
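
As a rough sketch of the comment-based idea: the `; key: value` syntax and the `@profile` and `link` keys below are invented for illustration and are not a proposal for the actual syntax. It also assumes, without having verified it, that existing players skip lines starting with ";".

```python
def parse_header(text):
    """Collect 'key: value' pairs from leading ';' lines (hypothetical syntax).

    Returns the header dict and the remaining text, which would be the cues.
    """
    header = {}
    lines = text.splitlines()
    body_start = len(lines)
    for i, line in enumerate(lines):
        if not line.startswith(";"):
            body_start = i  # first non-comment line ends the header
            break
        key, sep, value = line[1:].partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return header, "\n".join(lines[body_start:])

sample = """; @profile: metadata
; link: captions.css

1
00:00:01,000 --> 00:00:02,000
Hello
"""
header, body = parse_header(sample)
print(header)
```

A browser could then read the `@profile` value before deciding whether to render the cues or hand them to script.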

> * Make something deliberately incompatible with SRT.
>
> It doesn't make a big difference to browsers implementing the format. We'd
> be replacing something that mostly works in existing players with something
> that never works.
>

That was the idea of WMML, and I took that path because I thought it would be
advantageous for other Web applications, such as those built on libxml2, expat,
PHP's SimpleXML, pyexpat for Python, Nokogiri for Ruby etc. But I really
like the idea of WebSRT allowing arbitrary metadata in the cues without
having to put it into CDATA sections.

I don't mind creating a format that is still somewhat compatible with SRT.
We don't have to force incompatibility - but we should also not let
compatibility restrict us. In either case, it is a new format.

> Here's the SRT research I promised:
> http://blog.foolip.org/2010/08/20/srt-research/

That is awesome work. I knew that most SRT files didn't use UTF-8, but I
didn't know that we would make such a large percentage of the files that are
currently parsed by SRT software incompatible. It is good data to have.
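
For illustration, a minimal sketch of what a tool that wants to cope with both kinds of files might do: try strict UTF-8 first and fall back to a legacy encoding. The function name and the `cp1252` fallback are assumptions for the example, not something taken from the research above.

```python
def decode_subtitles(raw, fallback="cp1252"):
    """Decode subtitle bytes as UTF-8, falling back to a legacy encoding.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; a real tool would need per-language guessing.
        return raw.decode(fallback, errors="replace"), fallback

print(decode_subtitles("Jägenstedt".encode("utf-8")))
print(decode_subtitles("Jägenstedt".encode("cp1252")))
```

A UTF-8-only WebSRT parser would have no such fallback, which is where the incompatibility with existing files comes from.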

Cheers,
Silvia.