[whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Sat Aug 7 00:57:39 PDT 2010

Hi Philip,

On Sat, Aug 7, 2010 at 1:50 AM, Philip Jägenstedt <philipj at opera.com> wrote:

> If @profile should have any influence on the parser it sounds like this
> isn't actually XML at all. In particular, the "HTML" would have to be
> well-formed XML, but would still end up in the null namespace.

Yeah, you are right - I suppose I was trying to imitate the flexibility of
WebSRT there with an "anything" option.

> I guess simply cloning the child nodes of <cue> and changing their
> namespace to  before inserting them into an iframe-like document might work,
> but would be quite odd, I think you'll agree.
>

Yes, it's no different to WebSRT in that respect.

> * there is a possibility to provide script that just affects the
>> time-synchronized text resource
>>
>
> I agree that some metadata would be useful, more on that below. I'm not
> sure why we would want to run scripts inside the text document, though, when
> that can be accomplished by using the TimedTrack API from the containing
> page.

Scripts inside a timed text document would only be useful for applications
that use the track not in conjunction with a Web page.

>
>  The <cue> elements have a start and end time attribute and contain
>> innerHTML, thus there is already parsing code available in Web browsers to
>> deal with this content. Any Web content can be introduced into a <cue> and
>> the Web browsers will already be able to render it.
>>
>
> Yes, but if the HTML parser can't be used for all of WMML, it makes the
> parser quite odd, being neither XML nor HTML. I think that realistically the
> best way to make an XML-like format is to simply use XML.

OK. Then everything that's not supposed to be parsed inside a <cue> would be
escaped. I guess that works, too.
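
As a rough sketch of that escaping approach (the start/end attribute names
and the time values are assumptions for illustration, not taken from any
WMML draft), the cue payload can be escaped on the way in, so that an XML
parser only ever sees text inside the <cue>:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def make_cue(start: str, end: str, payload: str) -> str:
    """Serialise one cue; the payload is escaped so XML treats it as text."""
    # Attribute names "start" and "end" are hypothetical.
    return f'<cue start="{start}" end="{end}">{escape(payload)}</cue>'


cue_xml = make_cue("00:00:01.000", "00:00:02.500", "<b>Hello</b> & welcome")
cue = ET.fromstring(cue_xml)

# The XML parser finds no child elements; the markup survives as plain text
# that an application could later hand to an HTML parser if it wants to.
print(len(cue), cue.text)
```

The cost is that the cue content is opaque to XML tooling until it is
unescaped and parsed a second time.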

>
> 2. There is a natural mapping of WebSRT into in-band text tracks.
>> Each cue naturally maps into an encoding page (just like a WMML cue does,
>> too). But in WebSRT, because the setup information is not brought in a
>> hierarchical element surrounding all cues, it is easier to just chuck
>> anything that comes before the first cue into an encoding header page. For
>> WMML, this problem can be solved, but it is less natural.
>>
>
> I really like the idea of letting everything before the first timestamp in
> WebSRT be interpreted as the header. I'd want to use it like this:
>
> # author: Fan Subber
> # voices: <1> Boy
> #         <2> Girl
>
> 01:23:45.678 --> 01:23:46.789
> <1> Hello
>
> 01:23:48.910 --> 01:23:49.101
> <2> Hello
>
> It's not critical that the format of the header be machine-readable, but we
> could of course make up a key-value syntax, use JSON, or something else.

I disagree. I think it's absolutely necessary that the format of the header
be machine-readable. Just like EXIF in images is machine readable or ID3 in
MP3 is machine-readable. It would be counter-productive not to have it
machine-readable, in particular useless to archiving and media management
solutions.
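
To make the machine-readable variant concrete, here is a minimal sketch
that reads the "# key: value" header from the example above. The syntax
itself (a leading "#", continuation lines, the timestamp pattern that ends
the header) is only an assumption based on that example, not a defined
format:

```python
import re

# A cue timing line such as "01:23:45.678 --> 01:23:46.789" ends the header.
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")
# A header field line such as "# author: Fan Subber".
FIELD = re.compile(r"^#\s*(\w+):\s*(.*)$")


def parse_header(text: str) -> dict:
    """Collect key/value pairs from everything before the first cue."""
    header = {}
    key = None
    for line in text.splitlines():
        if TIMING.match(line):
            break  # first cue reached, header is over
        match = FIELD.match(line)
        if match:
            key = match.group(1)
            header[key] = [match.group(2)]
        elif line.startswith("#") and key is not None:
            # Continuation line: belongs to the most recent key.
            header[key].append(line[1:].strip())
    return header


sample = """\
# author: Fan Subber
# voices: <1> Boy
#         <2> Girl

01:23:45.678 --> 01:23:46.789
<1> Hello
"""

header = parse_header(sample)
print(header)
```

An archiving tool could index the result without having to understand the
cues at all.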

>
>
> I'm not sure of the best solution. I'd quite like the ability to use
> arbitrary voices, e.g. to use the names/initials of the speaker rather than
> a number, or to use e.g. <shouting> in combination with CSS :before {
> content: 'Shouting: ' } or similar to adapt the display for different
> audiences (accessibility, basically).

I agree. I think we can go back to using <span> and @class and @id and that
would solve it all.

>
>  4. It's a light-weight format in that it is not very verbose.
>> It is nice for hand-authoring if you don't have to write so much. This is
>> particularly true for the simple case. E.g. if new-lines that you author are
>> automatically kept as newlines when interpreted. The drawbacks here are that
>> as soon as you include more complicated markup into the cues (e.g. HTML
>> markup or an SVG image), you're not allowed to put empty lines into it
>> because they have a special meaning. So, while it is true that the number of
>> characters for WebSRT will always be less than for any markup-based format,
>> this may be really annoying in any of the cases that need more than plain
>> text.
>>
>
> It would be easy to just let the parser consume all lines until the next
> timestamp, but do you really want to separate two lines with a blank line?
> If the two lines aren't really related, one could instead have two cues with
> different vertical positioning.

In marked-up content for readability I would at least not want every newline
to impose a new display line. But I suppose since it's of kind "metadata"
anyway, that wouldn't happen. So, I see - it's not such a big issue.
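
The "consume all lines until the next timestamp" idea could look roughly
like the following sketch. It is not the parsing algorithm from the spec:
it assumes cues have no line numbers or identifiers, and it uses the same
timestamp pattern as the example earlier in this thread:

```python
import re

TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})"
)


def split_cues(text: str) -> list:
    """Start a new cue at every timing line; keep blank lines in between."""
    cues = []
    current = None
    for line in text.splitlines():
        match = TIMING.match(line)
        if match:
            current = {"start": match.group(1), "end": match.group(2), "lines": []}
            cues.append(current)
        elif current is not None:
            current["lines"].append(line)
    for cue in cues:
        # Blank lines just before the next timestamp are separators, not content.
        while cue["lines"] and not cue["lines"][-1].strip():
            cue["lines"].pop()
    return cues


sample = (
    "00:00:01.000 --> 00:00:02.000\n"
    "<svg>\n"
    "\n"
    "</svg>\n"
    "\n"
    "00:00:03.000 --> 00:00:04.000\n"
    "Hello\n"
)
cues = split_cues(sample)
print(cues[0]["lines"], cues[1]["lines"])
```

With this rule an empty line inside markup no longer terminates the cue,
at the price of making optional line numbers ambiguous.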

>
>
>  Point 2 is possible in WMML through "encoding" all outer markup in a
>> header
>> and the cues in the data packets.
>>
>
> To be clear, this would be a new codec type for the container, since I'm
> not aware of any that allow stating that the cue text is HTML. The same is
> true of WebSRT, muxing it into e.g. WebM would require the ability to
> express the kind from <track kind="captions"> (although in practice such
> metadata in binary files ends up almost always being incorrect).

All text tracks that are encoded into a binary container will be regarded as
a new codec type. Unless e.g. WebSRT or WMML can be mapped onto e.g. Kate
for encoding in Ogg, or onto 3GPP for encoding into MPEG-4, or onto QTText
for encoding into QuickTime.

>
>  Point 3 is also possible in WMML through the use of the @class attribute
>> on
>> cues.
>>
>
> I'd want this or something like it in WebSRT.

Cool. The whole point of this exercise is to identify improvement needs for
WebSRT. ;-)

>
>
>> * there is no language specification for a WebSRT resource; while this
>> will
>> not be a problem when used in conjunction with a <track> element, it still
>> is a problem when the resource is used just by itself, in particular as a
>> hint for font selection and speech synthesis.
>>
>
> The language inside the WebSRT file wouldn't end up being used for anything
> by a browser, as it needs to know the language before downloading it to know
> whether or not to download it at all. Still, I'd like a header section in
> WebSRT. I think the parser is already defined so that it would ignore
> garbage before the first cue, so this is more a matter of making it legal
> syntax.

Not quite. Some metadata in the header can make sense to also expose to the
Web page.

I agree that we need a structured header section in WebSRT.

>
>  * there is no magic identifier for a WebSRT resource, i.e. what the <wmml>
>> element is for WMML. This makes it almost impossible to create a program
>> to
>> tell what file type this is, in particular since we have made the line
>> numbers optional. We could use "-->" as an indicator, but it's not a good
>> signature.
>>
>
> If it's more important than ease-of-authoring, we could require WebSRT
> files to begin with a magic string and require browsers to reject them
> otherwise. I don't support this though, there's not much benefit.

It's a hint that is useful beyond the browser. For example, command-line
tools that identify file types use such magic strings.
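
A sketch of what such a tool has to do with and without a signature. The
magic string "WEBSRT" is purely hypothetical, since no magic identifier
has been defined, and the return labels are made up for this example:

```python
def sniff_text_track(data: bytes, magic: bytes = b"WEBSRT") -> str:
    """Guess the type of a text track from its first bytes."""
    head = data[:512]
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]  # skip a UTF-8 byte order mark
    if head.startswith(magic):
        return "websrt"  # hypothetical signature: unambiguous
    if b"-->" in head:
        return "maybe-srt-or-websrt"  # weak heuristic only
    return "unknown"


print(sniff_text_track(b"WEBSRT\n\n00:00:01.000 --> 00:00:02.000\nHi\n"))
print(sniff_text_track(b"1\n00:00:01,000 --> 00:00:02,000\nHi\n"))
print(sniff_text_track(b"<!-- an HTML comment -->"))
```

The last call shows why "-->" alone is a poor signature: an HTML comment
matches it just as well as a cue timing line does.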

>
>  * there is no means to identify which parser is required in the cues (is
>> it
>> "plain text", "minimal markup", or "anything"?) and therefore it is not
>> possible for an application to know how it should parse the cues.
>>
>
> All the types that are actually for visual rendering are parsed in the same
> way, aren't they? Of course there's no way for non-browsers to know that
> metadata tracks aren't interesting to look at as subtitles, but I think
> showing the user the garbage is a quicker way to communicate that the file isn't
> for direct viewing than hiding the text or similar.

The spec says that files of kind "descriptions" and "metadata" are not
displayed. It seems though that the parsing section will try two interfaces:
HTML and plain. I think there is a disconnect there. If we already know that
it's not parsable in HTML, why even try?

>
>  * there is no version number on the format, thus it will be difficult to
>> introduce future changes.
>>
>
> I think we shouldn't have a version number, for the same reason that CSS
> and HTML don't really have versions. If we evolve the WebSRT spec, it should
> be in a backwards-compatible way.

CSS and HTML are structured formats where you ignore things that you cannot
interpret. But the parsing is fixed and extensions play within this parsing
framework. I have my doubts that is possible with WebSRT. Already one
extension that we are discussing here will break parsing: the introduction
of structured headers. Because there is no structured way of extending
WebSRT, I believe the best way to communicate whether it is backwards
compatible is through a version number. We can change the minor versions if
the compatibility is not broken - it communicates though what features are
being used - and we can change the major version if compatibility is broken.
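
The major/minor rule can be sketched as follows. The "major.minor" string
format and the parser version (1, 2) are assumptions for illustration;
nothing like this exists in the WebSRT draft:

```python
def check_version(file_version: str, parser_version=(1, 2)) -> str:
    """Compare a file's 'major.minor' version with what the parser implements."""
    major, minor = (int(part) for part in file_version.split("."))
    if major != parser_version[0]:
        # A major bump signals that compatibility is broken.
        return "incompatible"
    if minor > parser_version[1]:
        # A minor bump keeps parsing intact but may use unknown features.
        return "compatible, newer features may be ignored"
    return "compatible"


for version in ("1.0", "1.5", "2.0"):
    print(version, "->", check_version(version))
```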

>
>  2. Break the SRT link.
>>
>
>  * the mime type of WebSRT resources should be a different mime type to SRT
>> files, since they are so fundamentally different; e.g. text/websrt
>>
>> * the file extension of WebSRT resources should be different from SRT
>> files,
>> e.g. wsrt
>>
>
> I'm not sure if either of these would make a difference.

Really? How do you propose that a media player identifies that it cannot
parse a WebSRT file that has random metadata in it when it is called .srt
and provided under the same mime type as SRT files? Or a transcoding
pipeline that relies on srt files just being plain old simple SRT? It breaks
expectations with users, with developers and with software.

>
>  4. Make full use of CSS
>>
>> In the current form, WebSRT only makes limited use of existing CSS. I see
>> particularly the following limitations:
>>
>> * no use of the positioning functionality is made and instead a new means
>> of
>> positioning is introduced; it would be nicer to just have this reuse CSS
>> functionality. It would also avoid having to repeat the positioning
>> information on every single cue.
>>
>
> I agree, the positioning syntax isn't something I'm happy about with
> WebSRT. I think treating everything that follows the timestamp to be CSS
> that applies to the whole cue would be better.

Or taking the positioning stuff out of WebSRT and moving it to an external
CSS file as is done with formatting would make it much simpler.

>
>  * there is no definition of the "canvas" dimensions that the cues are
>> prepared for (width/height) and expected to work with other than saying it
>> is the video dimensions - but these can change and the proportions should
>> be
>> changed with that
>>
>
> I'm not sure what you're saying here. Should the subtitle file be
> hard-coded to a particular size? In the quite peculiar case where the same
> subtitles really don't work at two different resolutions, couldn't we just
> have two files? In what cases would this be needed?

Most subtitles will be created with a specific width and height in mind. For
example, the width in characters relies on the video canvas having at least
that size and the number of lines used usually refers to a lower third of a
video - where that is too small, it might cover the whole video. So, my
proposal is not to hard-code the subtitles to a particular size, but to put
the minimum width and height that are being used for the creation of the
subtitles into the file. Then, the file can be scaled below or above this
size to adjust to the actual available space.
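
A sketch of the scaling this would allow. The function names, the 640x360
authoring size and the 20px font size are made-up examples, and using one
uniform factor is just one possible policy:

```python
def scale_factor(min_w: int, min_h: int, actual_w: int, actual_h: int) -> float:
    """Uniform factor that fits the authored canvas into the actual video area."""
    return min(actual_w / min_w, actual_h / min_h)


def scaled_font_px(base_px: float, authored: tuple, actual: tuple) -> float:
    """Scale a font size chosen for the authored canvas to the actual one."""
    return base_px * scale_factor(authored[0], authored[1], actual[0], actual[1])


authored = (640, 360)  # minimum width/height stored in the subtitle file
print(scaled_font_px(20, authored, (320, 180)))   # smaller player
print(scaled_font_px(20, authored, (1280, 720)))  # larger player
```

Taking the smaller of the two ratios keeps the text inside the video even
when the aspect ratio of the player differs from the authored one.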

>  IN SUMMARY
>>
>> Having proposed an XML-based format, it would be good to understand reasons
>> for why it is not a good idea and why a plain text format that has no
>> structure other than that provided through newlines and start/end time
>> should be better and more extensible.
>>
>> Also, if we really are to go with WebSRT, I am looking for a discussion on
>> those suggested improvements.
>>
>
> Thanks, lots of good suggestions and feedback. To sum it up, I wouldn't be
> opposed to an XML format as such, but it seems that WMML isn't quite XML.
> WebSRT also has its problems, of course...
>

Yeah, I'm not sure an optimal format exists (as is always reality). I used
WMML mostly as an experiment to see and understand the differences so we can
make some informed design decisions when going either way. In a lot of
aspects they have to deal with the same problems though.

Cheers,
Silvia.