[whatwg] sic element, was: Re: Exposing spelling/grammar suggestions in contentEditable

Martin Janecke whatwg.org at kaor.in
Sat Apr 30 12:42:48 PDT 2011

I've been convinced that there isn't enough need for a <sic> element to justify introducing one, mostly by Tab Atkins Jr. (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-December/029585.html) and Benjamin Hawkes-Lewis (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-December/029586.html).


However, I disagree with some points made in the latter e-mail, and I'll address them below. Since I now agree that the proposed element doesn't need to be introduced, the following is largely irrelevant, though.

On 31.12.2010 at 17:30, Benjamin Hawkes-Lewis wrote:

> On Fri, Dec 31, 2010 at 3:17 PM, Martin Janecke <whatwg.org at kaor.in> wrote:
>> On 30.12.2010 at 22:49, Benjamin Hawkes-Lewis wrote:
> [snip]
>>> 1. What problem(s) does indicating where mistakes have been reproduced
>>> solve?
>> I understand the question in this context as a concrete formulation of
>> questions such as "What problem(s) does meta data solve? What problem(s) does
>> semantic markup solve?"
> Not really. Semantic markup is a tool HTML uses to solve problems. The sort of
> problem statements we're looking for are things more like this. End-users need
> to find information within complicated pages quickly. By marking up headings
> semantically we allow users to scan the page visually, or select a heading from
> a list, or jump to the next heading with a shortcut key.

I see. Nevertheless, I don't really understand the necessity to explain why one would indicate where mistakes have been reproduced in a quoted text. It's a common thing to do and sometimes it is actually important to do it. I think we all know why text is sic'ed, don't we?

I do understand the necessity to discuss whether a new HTML element is the best approach to the problem, though.

>> Apart from informing human readers about the correct reproduction of a
>> misspelled word, an HTML <sic> would indicate the same to web applications.
>> Think of a search engine which, as one factor of its ranking algorithm,
>> considers orthography and grammar in a page as a quality factor. The search
>> engine could be made to ignore (reasonably few) <sic>-marked errors in such
>> an algorithm; i.e. not let <sic>-marked errors rank the page lower.
> Would search engines benefit from markup for this?
> Seems to me it would be fairly easy for a search engine to spot plain text
> "[sic]" and act accordingly.

Sic'ing text is a concept independent of the human language the sic'ed text is written in. Writing "[sic]" in plain text is common in many languages, but it is not universal. I'm sure it's expressed differently in Japanese, for example. Recognizing one element is easier than recognizing the many different ways this is expressed in different languages. Also, computer programs could act more precisely on an explicitly sic'ed range of text ("the quick <sic>born phox</sic> jumps over ...") than on a single plain text "[sic]", which just indicates a reproduced error somewhere before it ("the quick born phox [sic] jumps over ...").
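
To illustrate the point about explicit ranges, here is a minimal Python sketch. It assumes a hypothetical <sic> element (which is not part of HTML) and uses only the standard library's html.parser module. The names SicCollector and split_sic are made up for this illustration, and a real search engine would certainly work differently.

```python
from html.parser import HTMLParser


class SicCollector(HTMLParser):
    """Separate text inside hypothetical <sic> elements from all other text."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <sic> elements are currently open
        self.sic_ranges = []  # one string per outermost sic'ed range
        self.other_text = []  # text chunks found outside any <sic>

    def handle_starttag(self, tag, attrs):
        if tag == "sic":
            self.depth += 1
            if self.depth == 1:
                # Start a new range only for the outermost <sic>.
                self.sic_ranges.append("")

    def handle_endtag(self, tag):
        if tag == "sic" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.sic_ranges[-1] += data
        else:
            self.other_text.append(data)


def split_sic(html):
    """Return (list of sic'ed ranges, remaining text) for an HTML fragment."""
    parser = SicCollector()
    parser.feed(html)
    parser.close()
    return parser.sic_ranges, "".join(parser.other_text)


ranges, rest = split_sic("the quick <sic>born phox</sic> jumps over")
print(ranges)  # the exact range that was marked as intentionally reproduced
print(rest)    # the text a spelling-based quality check would still look at
```

With the element, a program gets the exact start and end of the reproduced error. With the plain text variant "the quick born phox [sic] jumps over", the same function returns no ranges at all, and a program would have to guess how far back the "[sic]" reaches.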

> [...] In either case, I think the effect of all this on rankings would in practice be
> so small that it wouldn't be worth the costs to add a feature to HTML to support
> it. Search engine vendor testimony to the opposite would be very useful here.

I agree. There doesn't seem to be much interest in it.

>> I think <sic> is a more HTMLish solution than a plain text "[sic]" -- just
>> like <ul><li>...<li>... styled with list-style-type:decimal is more HTMLish
>> than <div>1. ...<div>2. ...
> I think that's like arguing "<sentence>The cat sat on the mat</sentence>" is
> more "HTMLish" than "The cat sat on the mat." ;)

I disagree. But your comment made me think more about what makes <sic> and <ol> (sorry, I meant <ol> instead of <ul> for a numbered list, of course!) better HTML, while I'd reject <sentence>.

Both examples, "<div>1. ...<div>2. ..." and " … [sic]", convey textual information (represented by "..." here), and both also include meta information: in the first example an intentional order of items, in the second an intentional (mis)spelling that reproduces an original source without changes. In both examples this meta information is hard-coded as additional plain text in one specific style, although other styles are perfectly possible: for example, people also use "a. … b. …" to convey an order and "(sic!)" or "[<i>sic</i>]" to convey intentional (mis)spellings in English texts. You'll find other styles in other languages to convey the same meta information. But note that the meta information itself is not language-specific; only the way of conveying it is.

Furthermore, in the case of "[sic]", its language-specific style doesn't depend on the language of the quote it is placed in, but on the language of the context of the quote. For example, a webpage in Chinese that quotes an English text would style <sic> in a Chinese way within the English text, because the <sic> isn't part of the original English text but information added by the Chinese publisher for a Chinese audience.

What I consider more "HTMLish" is to encode the meta information, which itself depends on neither a specific style nor a natural language, with a semantic HTML element, and to leave the presentation to a styling language such as CSS.

In your example, on the other hand, you're encoding what is expressed by the full stop with a <sentence> element. I'd argue that the concept of a sentence is itself language-specific, in this case specific to the English language, and that the full stop is a defined part of the grammar of this language. I don't think adding "[sic]" or "1. … 2. …" is part of English grammar. You'll find those in style guides.

Something distantly related to <sentence> -- but more semantic, less syntactic -- has its element in HTML: <p>

>> The plain text string "[sic]" doesn't indicate where the start of the
>> "[sic]"ed part of text is. That means it provides less information than
>> <sic>...</sic>.
>> "[sic]" can't be handled with @media and CSS in general.
> Why does it need to be? Is applying different styling to indicate mistakes in
> the original actually a common publisher need (unlike being able to style
> headings or block quotations, for example)?

Including the plaintext "[sic]" in a text, without meaning to change the textual information of the original text, *is* applying a special style to indicate mistakes in the original source. That's indeed a common publisher need. The introduction of a <sic> element would allow publishers to handle as a style what has always been a style.

>> Note that you can very well style <sic> as "[sic]" with CSS, if that's the
>> form of presentation you prefer: sic:after {content:" [sic] "}
> You can. However, that loses the linguistic information that "sic" is a Latin
> word, which is (theoretically) important for correct pronunciation by speaking
> agents.

It's actually a recognized English word, too (http://en.wikipedia.org/wiki/Sic#Etymology_and_usage_history). I'm not familiar with speaking agents, but I assume they are able to recognize elements such as <em> and <strong>? They would then also be able to interpret <sic>, and possibly very differently from a hard-coded "[sic]". For example, a speaking agent could pause for a second and change its tone to indicate that "sic" isn't part of the original text, instead of saying "opening bracket sic closing bracket". I think that speaking agents are actually a good example of where such an element could bring benefits, theoretically. In practice, I haven't seen any sign of interest in this.

>>> 4. It seems like "sic" would be a very rarely used feature. Why do we need
>>> to include it in the small, core HTML vocabulary rather than an RDF
>>> vocabulary imported into HTML via annotations like RDFa, microdata, or
>>> microformats?
>> Extensions such as microformats are less widely known and probably always
>> will be.
> Agreed. This seems appropriate for features that would be rarely used.

I disagree. Let me explain this by means of an analogy.

A circulatory shock is a rare medical problem for any single person. Still, I wouldn't say it is appropriate for only a few people to know how to give first aid in cases of shock. If only a few people knew how to handle shocks, they would rarely be treated correctly. Although shocks are rare, they are a widespread phenomenon: they can happen anywhere.

On the other hand, occurrences of seasickness are less widespread, although they are much more common in certain places (i.e. at sea). It's definitely not necessary for everyone to know how to treat seasickness. It's sufficient if doctors and people who travel by boat know.

"sic" is much more like a treatment for circulatory shocks in this respect. It's a seldom needed feature (compared to <p> or <title>, for example), but the need for it is nevertheless widespread. Introducing a rarely taught method for handling such cases will certainly lead to many cases not being handled with that method.

> If it turns out people are inclined to use the feature (e.g. if a microformat
> or whatever gains surprisingly common currency), we can reconsider adding it to
> the core vocabulary.

I'd agree with you if nobody indicated correctly reproduced errors yet; then we'd have to see whether anyone is interested in it at all. But the usage of plaintext "[sic]" or hidden comments conveying the same information *already* shows that people are inclined to use something to indicate correctly reproduced errors. Asking them to switch to something more complicated first, and then, only if enough of them have adopted the new complicated method, asking them to switch again to a simpler method, seems weird.

> And you're citing examples
> where people want to make a visible indication of an error, but proposing an
> element that you expect not to have a visible indication, so the element would
> not solve their problem, so they mostly wouldn't use it.

I was actually arguing for a semantic element. The element could be styled to be visible or styled to be invisible. I gave examples where one would most probably use the element without it being visible to the common reader, as well as examples where it would be styled to be visible. In my opinion, visibility is a question of style and would therefore have to be handled by CSS or other styling languages.

