[whatwg] Discussing WebSRT and alternatives/improvements

Tue Aug 24 23:36:09 PDT 2010

On Wed, Aug 25, 2010 at 12:21 AM, Henri Sivonen <hsivonen at iki.fi> wrote:

> On Aug 5, 2010, at 18:01, Silvia Pfeiffer wrote:
>
> > I developed WMML as an XML-based caption format that will not have the
> problems that have been pointed out for DFXP/TTML, namely: there are no
> namespaces, it doesn't use XSL-FO but instead fully reuses CSS, and it
> supports innerHTML markup in the cues instead of inventing its own markup.
> Check out the examples at
> https://wiki.mozilla.org/Accessibility/Video_Text_Format .
>
> The wiki page says it's *not* an XML format.
>

Yes, it's not XML, it's just XML-based - similar to how RSS is XML-based. In
fact, after introducing the flexibility that WebSRT has for cues it's even
less XML, which is a problem.

The main point about the experiment was to find out if there is an advantage
of doing a full XML format. I saw some advantages, but mostly I was
concerned about re-using the innerHTML parser for cues, which is where the
further discussion on this has gone in the meantime. Also, I found some
things that WebSRT doesn't accommodate yet, which I think we need to add.

I haven't heard from anyone yet wanting an XML format. I thought that having
existing XML parsers (e.g. libxml2, expat, pyexpat for Python, Nokogiri in
Ruby, etc.) be able to parse WMML would be an advantage, but this doesn't
seem to be the case. So I haven't much pursued WMML other than as an
experiment to find out what was still missing in WebSRT.

> > * a @profile attribute which specifies the format used in the cues and
> thus the parser that should be chosen, including "plainText",
> "minimalMarkup", "innerHTML", "JSON", "any" (other formats can be developed)
>
> That looks like excessive complexity without solidly documented need or
> processing model.
>

How would the Web browser or in fact any parsing application know what to do
with the cues? This is actually a question for WebSRT. Unless there is a
hint as to how to parse the stuff in the cue, it would need to do something
like "content sniffing" to find out if it's "JSON" or "plain text" or
"minimal markup". Right now, the hint for how to parse the cue in WebSRT
comes from the track @kind attribute. That is not helpful for a stand-alone
application. That's why I proposed introduction of a @profile attribute
(feel free to choose a different name if that's what is confusing).
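
A minimal sketch of the difference between an explicit hint and sniffing, in
Python. The function names and the dispatch table are made up for
illustration; only the profile names "plainText" and "JSON" come from the
proposal above, and the markup profiles are left out to keep it short.

```python
import json


def parse_plain_text(cue):
    # Plain text cues are passed through untouched.
    return cue


def parse_json(cue):
    # JSON cues are decoded into Python objects.
    return json.loads(cue)


# Hypothetical mapping from a @profile value to a cue parser.
PARSERS = {
    "plainText": parse_plain_text,
    "JSON": parse_json,
}


def parse_cue(cue, profile=None):
    """Parse one cue, using the profile hint when there is one."""
    if profile is not None:
        if profile not in PARSERS:
            raise ValueError("unsupported profile: %r" % profile)
        return PARSERS[profile](cue)
    # No hint available: fall back to sniffing, which can guess wrong.
    try:
        return json.loads(cue)
    except ValueError:
        return cue


# Without a hint, a plain text cue that happens to look like JSON is misread.
sniffed = parse_cue("true")
hinted = parse_cue("true", profile="plainText")
```

Here `sniffed` ends up as a boolean while `hinted` stays the string the
author wrote, which is the kind of ambiguity a stand-alone application would
face without a hint in the file.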

> > WMML doesn't have a <body> element, but instead has a <cuelist> element.
> It was important not to reuse <body> in order to allow only <cue> elements
> inside the main part of the WMML resource.
>
> Why couldn't the <cuelist> element be called <body> (and the <wmml> element
> <html>) with a conformance rule that <body> is only permitted to contain
> <cue> elements (that could be spelled <div>...)? Then you could use the HTML
> parsing algorithm on the whole file (if only the standalone case is
> supported--not muxing into a media file). Even if had an element called
> <cuelist>, you'd need to define error handling for the case where there are
> non-<cue> children.
>

Would you call those HTML files then? What about the need for timing
information on the <div> elements and the requirement to exclude any markup
that is not <div> in <body> when used as a caption format? That would
require both a restriction and an extension to HTML to make such files
useful as time-synchronized text formats. I don't think that can be resolved
without creating a new format, even if it reuses lots from HTML.

> If you made those respellings, we'd be back to Timed Divs, but see
> http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2010-July/027283.html
> for my argument against that proposal (and logically any proposal that's
> effectively the same with trivial renaming).
>

Yeah, I thought those objections should be spelled out a bit more - I don't
think we yet have a sufficient understanding of these. Maybe this won't be
necessary in this group, but when we take it to the W3C the discussion will
probably return.

> > This makes it a document that can also easily be encapsulated in binary
> media resources such as WebM, Ogg or MPEG-4 because each cue is essentially
> a "codec data page" associated with a given timeline, while anything in the
> root and head element are "codec headers". In this way, the hierarchical
> document structure is easily flattened.
>
> As discussed in
> http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2010-July/027283.html ,
> muxing into a media resource places constraints on the processing model.
> For sanity, the rendering-time processing model should be the same for the
> muxed case and the standalone case. The logical consequences of the
> constraints imposed by muxing would make the necessary processing model
> violate the author expectations (that a markup document is available as a
> full document tree for selector matching) suggested by the WMML syntax in
> the standalone case.
>

It may be sufficient to just disallow neighbor selectors in CSS for cues to
overcome this. Any other selectors would still work.
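
A rough sketch of how a conformance checker might flag such selectors, in
Python. The function name is hypothetical, the scan is deliberately
simplified (it ignores CSS escapes and comments), and it only looks for the
sibling combinators `+` and `~` outside of brackets, parentheses and
strings.

```python
def uses_sibling_combinator(selector):
    """Return True if the selector contains a '+' or '~' combinator."""
    depth = 0     # nesting level of [...] and (...)
    quote = None  # the quote character of the string we are inside, if any
    for ch in selector:
        if quote is not None:
            if ch == quote:
                quote = None
            continue
        if ch in "\"'":
            quote = ch
        elif ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif ch in "+~" and depth == 0:
            # At top level these characters can only be sibling combinators.
            return True
    return False


# Selectors a cue style sheet might contain, and whether they would be rejected.
checks = {
    "cue > b": uses_sibling_combinator("cue > b"),
    "cue + cue": uses_sibling_combinator("cue + cue"),
    "cue ~ cue": uses_sibling_combinator("cue ~ cue"),
    "b[class~=narrator]": uses_sibling_combinator("b[class~=narrator]"),
}
```

Only the two selectors that relate one cue to a neighboring cue are flagged;
the descendant, child and attribute selectors pass.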

> > 2. There is a natural mapping of WebSRT into in-band text tracks.
> > Each cue naturally maps into an encoding page (just like a WMML cue does,
> too). But in WebSRT, because the setup information is not brought in a
> hierarchical element surrounding all cues, it is easier to just chuck
> anything that comes before the first cue into an encoding header page. For
> WMML, this problem can be solved, but it is less natural.
>
> Worse, the effects of the "less natural" part violate reasonable author
> expectations of what the tree that participates in selector matching is.
>

No, I don't think that's generally the case. A CSS selector in the header
will always work, except where a rule makes the style for a cue (or
something in that cue) depend on another cue. We have to disallow such
cross-cue dependencies. But everything else would still work.

> > 3. I am not too sure, but the "voice" markup may be useful.
> > At this point I do wonder whether it has any further use than a @class
> attribute has in normal markup, but the idea of providing some semantic
> information about the content in cues is interesting.
>
> I'm rather unconvinced by the voice markup as well. As far as I can tell,
> the voice markup is syntactic sugar for class for practical purposes. (I
> don't give much value to arguments that voices are more semantic than
> classes if the practical purpose is to achieve visual effects for caption
> rendering.) Common translation subtitling works just fine without voice
> identification and (based on information in this thread) the original .srt
> doesn't have voices.
>
> If voices are really needed for captioning use cases, I think it makes
> sense to balance the rarity of that need within the captioning sphere with
> the complexity of introducing syntactic sugar over the class attribute and
> the class selector. Do all captions use voice identification? Many? Only
> some? If only some captions (compared to the set of all captions--not to the
> set of captions plus subtitles) use voice identification, perhaps class is
> good enough. If nearly all captions use voice identification, sugaring might
> have merit.
>

In caption formats where a voice identification is available it is only used
in the way that classes are used - to achieve visual effects.

> > * there is no possibility to add file-wide metadata to WebSRT; things
> about authoring and usage rights as well as information about the media
> resource that the file relates to should be kept within the file. Almost all
> subtitle and caption formats have the possibility for such metadata and we
> know from image, music and video resources how important it is to have the
> ability to keep such metadata inside the resource.
>
> Generic metadata mechanisms are a slippery slope into a rathole, so it
> would be good not to go there. (One minute you have something like Dublin
> Core that looks simple enough and the next minute you have RDF, ontologies,
> etc., etc.)
>

Yes, and I don't have a problem with that. People have a need and turning
our back on this need is not a good way to solve it. Re-using existing
solutions that we have already come up with for Dublin Core and microformats
would be much better IMHO. It is better to direct how it should be done than
to leave it to users whose need hasn't been met and who will find a way
anyway (even if it means sticking RDF into comments in WebSRT...).

> > * there is no means to identify which parser is required in the cues (is
> it "plain text", "minimal markup", or "anything"?) and therefore it is not
> possible for an application to know how it should parse the cues.
>
> I think it would be reasonable to always use the HTML fragment parsing
> algorithm for cues and require authors who just want something plain
> text-ish to escape < and &.
>

There is also WebSRT metadata text that is allowed in cues, which is
absolutely anything - could be base64 binary or JSON or some other
structured markup. Can we really throw the innerHTML parser at all of that?

> On Aug 10, 2010, at 12:49, Philip Jägenstedt wrote:
>
> > An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>"
> and "<00:01:30>". Without having read the HTML parsing algorithm I guess
> that elements need to begin with a letter or similar. So, it's not possible
> to (ab)use the HTML parser to handle inner timestamps or numerical voices,
> we'd have to replace those with something else, probably more verbose.
>
>
> Given that voices (according to this thread; I didn't check) are a Hixie
> invention rather than an original feature of SRT, the <1> syntax doesn't
> have to be that way for compat. Instead, something that works in HTML
> without parser modifications could be used.
>
> As for <00:01:30>, normal subtitles and, AFAIK, normal captions don't need
> time-based revelation of parts of the cue. (I'm not considering anime
> fansubbing "normal"; I mean TV, DVD and movie theater subtitles. Also,
> continuous revelation of live captioning isn't relevant to the <00:01:30>
> feature.) Since the <00:01:30> isn't essential for making the feature set of
> HTML5 accessible, I think the <00:01:30> feature for karaoke and anime
> fansubbing should be considered for removal if it happens to have any
> undesirable properties--and not working without HTML parser modifications is
> such an undesirable property.
>

You're also excluding roll-on captions then, which are a feature of live
broadcasting.

I think it might be possible to use CSS transitions (or even animations) for
this. I am particularly looking at transition-delay and transition-duration
on an opacity property, where the transition-delay would be relative to the
start of the cue. Then it would be possible to use it on <span
style="..."></span>. This is a very verbose solution, but maybe a
work-around for a special situation.
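
To make the arithmetic concrete, here is a small Python sketch that turns
inner timestamps into delays relative to the cue start and writes them into
span styles. The helper names and the 0.2 second fade are assumptions for
illustration. The style alone does not start a transition; it assumes
something changes the opacity of the spans when the cue becomes active.

```python
from html import escape


def parse_timestamp(ts):
    """Convert 'HH:MM:SS' (seconds may be fractional) to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def reveal_spans(cue_start, parts, duration=0.2):
    """Build one <span> per (timestamp, text) pair.

    The transition-delay of each span is its timestamp minus the cue start.
    """
    start = parse_timestamp(cue_start)
    spans = []
    for ts, text in parts:
        delay = parse_timestamp(ts) - start
        style = (
            "transition-property: opacity; "
            "transition-duration: %gs; "
            "transition-delay: %gs" % (duration, delay)
        )
        spans.append('<span style="%s">%s</span>' % (style, escape(text)))
    return "".join(spans)


# A cue starting at 00:01:28 whose second half appears at 00:01:30.
markup = reveal_spans("00:01:28", [("00:01:28", "Hello "), ("00:01:30", "world")])
```

The first span gets a delay of 0s and the second a delay of 2s, which shows
how verbose this gets compared to a single inline timestamp.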

I don't mind the <t at=00:01:30>...</t> solution either if it is possible
within the confines of an innerHTML parser.
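
A simplified illustration of why the element form fits an HTML parser while
the bare timestamp does not, in Python. This is not the HTML parsing
algorithm, only a sketch of one of its rules: a "<" opens a tag only when it
is followed by an ASCII letter (optionally after a "/"), and is otherwise
treated as text. The function name is made up, and comments and ">" inside
quoted attribute values are ignored.

```python
import re

# '<' or '</' followed by an ASCII letter, then everything up to the next '>'.
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def split_cue(text):
    """Split cue text into ('tag', ...) and ('text', ...) tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("tag", match.group()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


# The WebSRT-style syntax stays one run of text ...
websrt_style = split_cue("<1>Hi <00:01:30>there")
# ... while the element form is recognised as tags around text.
element_style = split_cue("Hi <t at=00:01:30>there</t>")
```

With the first input everything comes back as a single text token, matching
what Philip describes; with the second the <t> start and end tags are
recognised without any change to the tag rule.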

Cheers,
Silvia.