[whatwg] Fwd: Discussing WebSRT and alternatives/improvements
Philip Jägenstedt
philipj at opera.com
Thu Aug 26 03:06:09 PDT 2010
On Wed, 25 Aug 2010 17:40:08 +0200, Silvia Pfeiffer
<silviapfeiffer1 at gmail.com> wrote:
>>>>>> At this point, what is your recommendation? The following ideas have
>>>>>> been on the table:
>>>>>>
>>>>>> * Change the file extension to something other than .srt.
>>>>>>
>>>>>> I don't have an opinion, browsers ignore the file extension anyway.
>>>>>>
>>>>>>
>>>>>> Yes, I think we should definitely have a new file extension.
>>>>>
>>>>>
>>>> I'll leave this to others to decide, but since browsers have no concept
>>>> of file extensions, just using .srt will work. If the format is SRT-like
>>>> it's likely at least some files will use .srt in practice.
>>>>
>>>
>>>
>>> All SRT files in practice use the .srt extension - it is typically how
>>> these formats are identified by applications. Just because *nix ignores
>>> file extensions mostly for identifying file types doesn't mean that
>>> applications do. Again, I believe strongly that re-using the same file
>>> extension is the one biggest pain we can inflict on the community.
>>>
>>
>> As shown above, several popular (?) media players ignore or give little
>> weight to the file extension.
>
>
> I don't think that's a fair sample - as I said, on Linux and on the
> command-line things are different. I have a GUI mplayer here and it reacts
> like VLC - doesn't let me open .wsrt files. The vast majority of
> applications on Windows and the Mac make their decision on whether they
> support files based on the file extension.
That the file selection dialogs are filtered by file extension doesn't
mean that applications don't sniff the content. In fact, MPlayer, VLC and
Totem will happily load and use an SRT file even if it is called foo.smi,
even though SAMI is a completely incompatible format. In other words, they
sniff the content as being SRT. The reason that they rely on sniffing is
likely that many files use the wrong file extension (the files in my
OpenSubtitles batch have no extensions, so I have no statistics on this).
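To illustrate what I mean by sniffing, here is a minimal Python sketch that decides from the content alone whether something looks like SRT. The regular expression, the looks_like_srt name and the 50-line limit are my own assumptions for illustration; this is not how MPlayer, VLC or Totem actually implement it.

```python
import re

# A timing line such as "00:00:01,000 --> 00:00:04,000". Both comma and
# period are accepted before the milliseconds. This pattern is an assumption
# for illustration, not taken from any player's source code.
TIMING = re.compile(
    r"^\s*\d+:\d{2}:\d{2}[,.]\d{1,3}\s*-->\s*\d+:\d{2}:\d{2}[,.]\d{1,3}"
)

def looks_like_srt(text, max_lines=50):
    """Return True if a timing line appears in the first max_lines lines."""
    for line in text.splitlines()[:max_lines]:
        if TIMING.match(line):
            return True
    return False

srt = "1\n00:00:01,000 --> 00:00:04,000\nHello\n"
sami = "<SAMI>\n<BODY>\n<SYNC Start=1000><P>Hello\n</BODY>\n</SAMI>\n"

print(looks_like_srt(srt))                        # True
print(looks_like_srt("leading garbage\n" + srt))  # True
print(looks_like_srt(sami))                       # False
```

Note that the file name never enters into it, and that a line of leading garbage (or a header) does not stop the content from being detected as SRT.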
Again, if we want to avoid exposing existing SRT parsers to WebSRT syntax,
then the format needs to be more incompatible. File extensions will get
changed, popular players rely on sniffing, some ignore leading garbage, and
headers can simply be removed by naive conversion tools.
> Assuming we pick the same file extension and we now have a new application
> that only supports WebSRT parsing, we will make a large bunch of existing
> valid SRT files invalid - not only those that are not in UTF-8, but also
> those with <font>..</font> and <u>...</u>. I do wonder if the text between
> the <font> start and end element and inside the <u>..</u> may even get
> removed because of lack of support for these.
I've seen no application that removes everything between tags it doesn't
recognize. The only things I've seen happen are treating the tags as plain
text or ignoring them, much like a browser does with HTML.
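The two behaviours can be sketched in Python as follows. The set of known tags, the regular expression and the render function are hypothetical and only for illustration; in neither mode is the text between the tags dropped.

```python
import re

# Tags this hypothetical renderer understands (an assumption).
KNOWN = {"b", "i"}
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")

def render(cue, unknown="ignore"):
    """Handle unknown tags in one of the two ways seen in practice.

    unknown="ignore": drop the unknown tags, keep the text between them.
    unknown="text":   leave the unknown tags in place as literal text.
    Known tags are left untouched for later styling in both modes.
    """
    def repl(match):
        if match.group(1).lower() in KNOWN:
            return match.group(0)
        return "" if unknown == "ignore" else match.group(0)
    return TAG.sub(repl, cue)

cue = '<font color="red">red</font> and <u>under</u> <i>it</i>'
print(render(cue, "ignore"))  # red and under <i>it</i>
print(render(cue, "text") == cue)  # True
```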
>> * Add a header to WebSRT to make it uniquely identifiable.
>>>>
>>>>>
>>>>>> The header would have to be mandatory and browsers would have to
>>>>>> reject files that don't have it. Such files would be compatible with
>>>>>> some existing software and break some, depending on how they sniff.
>>>>>> We could also put metadata in such a header.
>>>>>>
>>>>>>
>>>>> Yes, I think we need to introduce a header. Maybe we can hide all the
>>>>> structure in what SRT recognizes as comments (i.e. start the lines with
>>>>> ";"). But I believe we need some hints like the @profile to identify
>>>>> the type of the cues and the <link> to link to a style sheet, and we
>>>>> need metadata like the <meta> element of HTML headers.
>>>>>
>>>>>
>>>> I had no idea that semicolon was used for comments in SRT, is this usage
>>>> widespread? Does it work in most players?
>>>>
>>>
>>>
>>> I thought it was, but maybe it was just introduced for WebSRT. It is not
>>> tested in Hixie's SRT research[2]. Can you take a quick look through your
>>> SRT file collection if there are any? I'm probably wrong about this
>>> seeing as it's not mentioned in the wiki page for SRT [3].
>>>
>>> [2] http://wiki.whatwg.org/wiki/SRT_research
>>> [3] http://en.wikipedia.org/wiki/SubRip
>>>
>>
>> OK, I grepped the 10000 files. Only 15 had any lines beginning with a
>> semicolon, and by manual inspection it doesn't look like any of them are
>> clearly intended as comments (it's hard to tell, all are in foreign
>> languages). None of them were at the very beginning of the file.
>
>
> Ah, that actually makes for another incompatibility of WebSRT and SRT: such
> lines are regarded as comments in WebSRT when they probably aren't in SRT.
I can't find anything about this when searching for "comment" and
"semicolon" in the spec. Are you sure you're not thinking of some other
format than WebSRT?
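For reference, the semicolon check quoted above amounts to roughly the following, written here as a Python sketch rather than the grep I actually ran. The semicolon_lines name and the sample text are made up for illustration.

```python
def semicolon_lines(text):
    """Return (line_number, line) pairs for lines that start with ';'."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(";")
    ]

# Made-up sample, not taken from the OpenSubtitles batch.
sample = "1\n00:00:01,000 --> 00:00:02,000\n; not obviously a comment\n"
print(semicolon_lines(sample))  # [(3, '; not obviously a comment')]
```

The line numbers make it easy to see whether any such line sits at the very beginning of a file, which is where a comment-style header would have to be.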
> It seems increasingly that the only thing that WebSRT and SRT still have in
> common is the "-->" character sequence. As a friend of mine in a11y
> recently said: "I was hoping to never have to stare at "-->" ever again..."
> We could indeed go all the way and define a much more different format,
> though I don't think it will create implementations as quickly as an
> SRT-based but changed format.
I would prefer if we follow one of two paths:
1. Let WebSRT be maximally compatible with SRT, making it a "retro-spec"
of existing SRT use with extensions that cause as little breakage as
possible in the ecosystem.
2. Make something incompatible and rid ourselves of all legacy
constraints. For example, there would be no need to accept both period and
comma as a separator between seconds and milliseconds.
I can't see any insurmountable issues with option 1, but would want to
hear from actual media player developers, not just our guesses of what
they might think. Option 2 would also be fine. Something in between, where
we try to make it a little bit incompatible in order to make people aware
that there *might* be some compatibility issues, is not something I'm
interested in.
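To make the separator example from option 2 concrete, here is a minimal Python sketch of the two kinds of timestamp parser. The patterns, the parse_timestamp name and the choice of period for the strict variant are assumptions for illustration only, not anything from the spec.

```python
import re

# Option 1: accept both separators, as existing SRT files use both.
LENIENT = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")
# Option 2: exactly one separator. Picking the period is arbitrary here.
STRICT = re.compile(r"(\d+):(\d{2}):(\d{2})\.(\d{3})")

def parse_timestamp(value, pattern):
    """Return the timestamp in milliseconds, or raise ValueError."""
    match = pattern.fullmatch(value)
    if match is None:
        raise ValueError(f"bad timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis

print(parse_timestamp("00:01:02,500", LENIENT))  # 62500
print(parse_timestamp("00:01:02.500", LENIENT))  # 62500
print(parse_timestamp("00:01:02.500", STRICT))   # 62500
try:
    parse_timestamp("00:01:02,500", STRICT)
except ValueError as error:
    print(error)  # bad timestamp: '00:01:02,500'
```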
--
Philip Jägenstedt
Core Developer
Opera Software