[whatwg] sic element, was: Re: Exposing spelling/grammar suggestions in contentEditable

Fri Dec 31 07:17:05 PST 2010

On 30 Dec 2010, at 22:49, Benjamin Hawkes-Lewis wrote:

> On Thu, Dec 30, 2010 at 8:55 PM, Martin Janecke <whatwg.org at kaor.in> wrote:
>> I don't think <mark> is appropriate for what I meant.
>> 
>> I as the publisher usually don't mean[1] to point a readers attention at spelling errors by someone I quote, I just want to be able to add semantic markup that identifies a part of text as deliberately published just the way it is published.
> 
> Indicating where mistakes have been reproduced in transcribed or
> quoted text seems like a different usage than Charles's application of
> marking mistakes in editable text for potential correction by the
> end-user.

Indeed. My reply, which Hixie referred to, addressed one aspect of a different, more general proposal by Charles:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-November/029228.html

> 1. What problem(s) does indicating where mistakes have been reproduced solve?

I read this question as a concrete instance of more general ones, such as "What problem(s) does metadata solve?" or "What problem(s) does semantic markup solve?" Both carry additional information about a text, and so they solve the problem of that information not being available. Is the additional information worthwhile in this particular case? I think so. Marking reproduced errors is common in plain text ("[sic]") and even in spoken language. It's found in scientific papers as well as in respected newspapers.

Apart from informing human readers that a misspelled word has been reproduced faithfully, an HTML <sic> would indicate the same to web applications. Think of a search engine that treats the orthography and grammar of a page as one quality factor in its ranking algorithm. Such an algorithm could be made to ignore (reasonably few) <sic>-marked errors, i.e. not let <sic>-marked errors rank the page lower.
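
To make this concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: <sic> is the hypothetical element proposed here, KNOWN_WORDS is a toy word list standing in for a real spell checker, and the names SicAwareErrorCounter and count_unmarked_errors are made up for the example. I don't claim any search engine works this way.

```python
import re
from html.parser import HTMLParser

# Toy dictionary standing in for a real spell checker (assumption).
KNOWN_WORDS = {"the", "house", "of", "representatives", "shall", "choose",
               "their", "speaker", "and", "other", "officers"}


class SicAwareErrorCounter(HTMLParser):
    """Collects unknown words, separating those inside the hypothetical <sic>."""

    def __init__(self):
        super().__init__()
        self.sic_depth = 0   # greater than 0 while inside <sic>...</sic>
        self.errors = []     # unknown words outside <sic>
        self.excused = []    # unknown words inside <sic>

    def handle_starttag(self, tag, attrs):
        if tag == "sic":
            self.sic_depth += 1

    def handle_endtag(self, tag):
        if tag == "sic" and self.sic_depth > 0:
            self.sic_depth -= 1

    def handle_data(self, data):
        for word in re.findall(r"[A-Za-z]+", data):
            if word.lower() in KNOWN_WORDS:
                continue
            # Unknown word: excuse it only if it was marked with <sic>.
            (self.excused if self.sic_depth else self.errors).append(word)


def count_unmarked_errors(html):
    """Return (errors, excused) for the given HTML fragment."""
    parser = SicAwareErrorCounter()
    parser.feed(html)
    parser.close()
    return parser.errors, parser.excused


marked = ("The House of Representatives shall <sic>chuse</sic> "
          "their Speaker and other Officers")
unmarked = ("The House of Representatives shall chuse "
            "their Speaker and other Officers")

print(count_unmarked_errors(marked))    # the misspelling is excused
print(count_unmarked_errors(unmarked))  # the misspelling counts against the page
```

A ranking algorithm could then penalize a page only for the words in the first list, perhaps with a cap on how many excused words it tolerates.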

> 2. What other solutions to this problem might there be?

As you suggested: Use a plain text string "[sic]" after the reproduced error.
As you suggested: Use some kind of microformat or related technologies.
Use a <span> with an unstandardized title or class.
Use an HTML comment.
However, I think these solutions are inferior, as explained below.

> 3. What's the advantage of using markup to do this rather than visible
> text like deadtree.

Sorry, I don't understand "deadtree". Is this an idiom?

> What's wrong with "The House of Representatives
> shall chuse [<span lang="la">sic</span>] their Speaker and other
> Officers"?

In many cases there's nothing wrong with a visible "[sic]". It has been used successfully for decades and will continue to be. There's also nothing wrong with plain text in general; it has been used successfully for centuries and will continue to be. Nor is there anything wrong with books that use presentation-oriented markup, e.g. italics for emphasis. They have been printed successfully for centuries and will continue to be.

What is wrong with "Cats [emphasized] are cute animals" or "<span style='font-style:italic'>Cats</span> are cute animals" or "<span class='emphasized'>Cats</span> are cute animals" instead of "<em>Cats</em> are cute animals"? I don't think there's anything really wrong with any of these, but apparently people agreed that it's good to use a standardized markup language for markup, that semantic markup is a good thing and that simple markup is a good thing. <sic> in an HTML page would be simple, semantic and consistent HTML.

I think <sic> is a more HTMLish solution than a plain text "[sic]" -- just like <ul><li>...<li>... styled with list-style-type:decimal is more HTMLish than <div>1. ...<div>2. ...

The plain text string "[sic]" doesn't indicate where the "[sic]"ed part of the text starts. That means it provides less information than <sic>...</sic>.
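
A small Python sketch may illustrate the difference. It assumes the hypothetical <sic> element; the example sentence, the helper names and the "take the single preceding word" heuristic are all invented for this illustration. With markup a program recovers the exact extent of the reproduced text, while with a plain "[sic]" it can only guess.

```python
import re
from html.parser import HTMLParser


class SicSpanExtractor(HTMLParser):
    """Collects the text enclosed by each hypothetical <sic> element."""

    def __init__(self):
        super().__init__()
        self.in_sic = False
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "sic":
            self.in_sic = True
            self.spans.append("")  # start a new span

    def handle_endtag(self, tag):
        if tag == "sic":
            self.in_sic = False

    def handle_data(self, data):
        if self.in_sic:
            self.spans[-1] += data


def spans_from_markup(html):
    """Return the exact text of every <sic>...</sic> span."""
    parser = SicSpanExtractor()
    parser.feed(html)
    parser.close()
    return parser.spans


def guess_from_plain_text(text):
    """Heuristic (assumption): take the single word before each "[sic]"."""
    return re.findall(r"(\S+)\s*\[sic\]", text)


print(spans_from_markup("He wrote <sic>could of went</sic> home."))
print(guess_from_plain_text("He wrote could of went [sic] home."))
```

In this example the markup yields the whole phrase, whereas the plain-text heuristic finds only the last word of it.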

"[sic]" can't be handled with @media and CSS in general.

Note that you can very well style <sic> as "[sic]" with CSS, if that's the form of presentation you prefer:
sic:after {content:" [sic] "}

"[sic]" is hardly used in full quotes/transcriptions, although the advantages of using "[sic]" in short quotes apply to full quotes too. For example, here's a short quote that uses "[sic]" visibly:
http://en.wikipedia.org/wiki/Article_One_of_the_United_States_Constitution#Clause_5:_Speaker_and_other_officers.3B_Impeachment
And here's a transcription that doesn't use "[sic]" in the same place, although its publisher also considered it important to indicate that the original source is reproduced faithfully. You can tell by looking at the wiki markup source, where he added a comment stating that fact:
http://en.wikisource.org/wiki/Constitution_of_the_United_States_of_America#Section_2
Numerous "[sic]"s in a text are annoying. They put too much emphasis on errors and are easily misread as ridiculing someone's orthography, even though that is often not intended. Also, readers use full-text quotes for various purposes, e.g. printing out a poem and pinning it to a wall just like a painting. Printed "[sic]"s are not desirable there, as they are not part of the art. An unobtrusive <sic> would preserve the advantages of "[sic]" without its disadvantages in full quotes. It carries its information even if made invisible to the common reader. Unlike HTML comments, which are also invisible, <sic> is semantic, can easily be made visible, and isn't stripped by processing scripts without good reason.

> 4. It seems like "sic" would be a very rarely used feature. Why do we
> need to include it in the small, core HTML vocabulary rather than an
> RDF vocabulary imported into HTML via annotations like RDFa,
> microdata, or microformats?

<sic> would be a natural enhancement in the tradition of <blockquote>, <q> and <cite>.

HTML is a widely taught and learned language.
Indicating where mistakes have been reproduced deliberately is a widely known and widely (though not very frequently) practiced convention, even in spoken language and plain text.

Extensions such as microformats are less widely known and probably always will be. Because they build upon languages such as HTML, people won't learn microformats without learning the underlying language, but many people will learn the underlying language without learning microformats. Microformats are great for solving very specific problems, and people seeking to solve such problems will happily dig into them. But indicating where mistakes have been reproduced deliberately isn't a special-interest topic or application. It's a very basic thing to do; it can come up wherever quoting occurs. Almost every blogger does it. People on discussion boards quote each other all the time. Newspapers do it, scientific papers do it.

Thanks
Martin