[whatwg] sic element

Mon May 2 15:26:09 PDT 2011

On Thu, 30 Dec 2010, Martin Janecke wrote:
> Am 30.12.2010 um 02:47 schrieb Ian Hickson:
> > On Tue, 30 Nov 2010, Martin Janecke wrote:
> >> 
> >> I support this idea and I'd certainly use it. For example, I'm 
> >> currently copying an old rhyme book to hypertext and would love to 
> >> mark historically correct (but now incorrect) spelling, spelling 
> >> intentionally done wrong for better rhyming (yes, people did this in 
> >> the past) and unintentional errors from the book semantically. I 
> >> think it is important to note where those errors are done intentionally 
> >> (by me, the publisher of the web page) in contrast to errors 
> >> accidentally added by me that differ from the copied book.
> > 
> > <mark> is the element for this purpose.
> 
> I don't think <mark> is appropriate for what I meant.
> 
> I as the publisher usually don't mean[1] to point a reader's attention at 
> spelling errors by someone I quote, I just want to be able to add 
> semantic markup that identifies a part of text as deliberately published 
> just the way it is published. Here's an example of a webpage quoting the 
> US constitution 
> http://en.wikisource.org/wiki/Constitution_of_the_United_States_of_America#Section_2:
> 
> "The House of Representatives shall chuse their Speaker and other 
> Officers"
> 
> I'd like to be able to code this as
> 
> "The House of Representatives shall <sic>chuse</sic> their Speaker and 
> other Officers"
> 
> to record that I intentionally wrote "chuse", not "choose", as "chuse" 
> is exactly what the constitution says.

Ah, I see. I misunderstood your original use case; I thought you meant 
that you wanted to bring these historical artefacts to the reader's 
attention, not that you wanted to just mark them as intentional.

On Fri, 31 Dec 2010, Martin Janecke wrote:
> 
> I understand the question in this context as a concrete formulation of 
> questions such as "What problem(s) does meta data solve? What problem(s) 
> does semantic markup solve?" They carry additional information about a 
> text. They solve the problem of not having this information available. 
> Is the additional information worthwhile in this special case? I think 
> so. It's common in plain text ("[sic]") and even spoken language. It's 
> found in scientific papers as well as in respected newspapers.

This suggests "[sic]" might be sufficient to solve the problem of not 
having the information available.

In cases where you want to note this but not make it visible to the reader 
unless they study it carefully, maybe "<span title="sic">...</span>" would 
be better than "[sic]"; that also solves the problem of not having the 
information available.
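
As a rough illustration that the title-based markup keeps the information 
recoverable by tools, here is a minimal sketch using Python's standard 
html.parser module. The names SicCollector and sic_marked_text are 
hypothetical, and the title="sic" convention is only the one suggested 
above, not an established practice:

```python
from html.parser import HTMLParser


class SicCollector(HTMLParser):
    """Collect text inside <span title="sic">...</span> (assumed convention)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one bool per open <span>: True if title="sic"
        self.marked = []  # text fragments found inside sic-marked spans

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append(dict(attrs).get("title") == "sic")

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Record text while at least one enclosing span is sic-marked.
        if any(self.stack):
            self.marked.append(data)


def sic_marked_text(html):
    """Return the text fragments the author marked as intentional."""
    collector = SicCollector()
    collector.feed(html)
    collector.close()
    return collector.marked
```

Feeding it the constitution example with "chuse" wrapped in such a span 
would yield that one word, which a tool could then skip when checking 
spelling.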

> Think of a search engine, which, as one factor of their ranking 
> algorithm, considers orthography and grammar in a page as quality 
> factor. The search engine could be made to ignore (reasonably few) 
> <sic>-marked errors in such an algorithm; i.e. not let <sic>-marked 
> errors rank the page lower.

Should a search engine have that problem, we can consider it, but if it's 
just a theoretical problem at the moment it's best not to solve it.

> > What's wrong with "The House of Representatives shall chuse [<span 
> > lang="la">sic</span>] their Speaker and other Officers"?
> 
> In many cases there's nothing wrong with a visible "[sic]". It has 
> successfully been done for decades. And it will be in future. There's 
> also nothing wrong with plain text in general; it has been used 
> successfully for centuries and will be in future. There's nothing wrong 
> with books that use presentation oriented markup either, e.g. italics 
> when emphasizing. They have been printed successfully for centuries and 
> will be in future.
> 
> What is wrong with "Cats [emphasized] are cute animals" or "<span 
> style='font-style:italic'>Cats</span> are cute animals" or "<span 
> class='emphasized'>Cats</span> are cute animals" instead of 
> "<em>Cats</em> are cute animals"?

Well, the first is unfamiliar to readers as a typographic style, the second 
is media-specific and so wouldn't work for e.g. speech synthesis, whereas 
<em> would, and the third requires the author to additionally provide some 
CSS to convey the semantic, which is problematic since the CSS layer is 
intended to be optional.

> The plain text string "[sic]" doesn't indicate where the start of the 
> "[sic]"ed part of text is. That means it provides less information than 
> <sic>...</sic>.

Is this a real problem? Surely most people would easily be able to 
determine that the scope of [sic] is simply the previous "error".
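
The same "previous word" rule is simple enough to sketch in code. This is 
only an illustration under the assumption that the scope is a single word 
directly before the marker; the names SIC_PATTERN and sic_scopes are 
hypothetical, and a multi-word error would not be captured:

```python
import re

# A word (letters, with optional internal apostrophes or hyphens),
# optional whitespace, then a literal "[sic]" marker.
SIC_PATTERN = re.compile(
    r"([^\W\d_]+(?:['-][^\W\d_]+)*)\s*\[sic\]",
    re.IGNORECASE,
)


def sic_scopes(text):
    """Return each word that is immediately followed by "[sic]"."""
    return SIC_PATTERN.findall(text)
```

Applied to 'shall chuse [sic] their Speaker', this picks out "chuse" as 
the scope of the marker.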

> "[sic]" can't be handled with @media and CSS in general.

Why is this a problem?

> "[sic]" is hardly used in full quotes/transcriptions, although the 
> advantages of using "[sic]" in short quotes apply to full quotes too. 
> For example, here's a short quote that uses "[sic]" visibly:
> http://en.wikipedia.org/wiki/Article_One_of_the_United_States_Constitution#Clause_5:_Speaker_and_other_officers.3B_Impeachment
> And here's a transcription that doesn't use "[sic]" in the same place 
> although its publisher considered it important to indicate the correct 
> reproduction of the original source in some way as well, as you can tell 
> by looking into the wiki markup source code, where he added a comment 
> stating the fact:
> http://en.wikisource.org/wiki/Constitution_of_the_United_States_of_America#Section_2

Both of these are possible in HTML (inline "[sic]" and a comment 
respectively).

> Having "[sic]" numerous times in a text seems to be annoying. It puts 
> too much emphasis on errors. It is easily misunderstood as ridiculing 
> someone's orthography though often not intended. Also, readers use full 
> text quotes for various purposes, e.g. printing a piece of poetic art 
> out and pinning it to a wall just like a painting. Printed "[sic]"s are 
> not desirable there, as they are not part of the art. An unobtrusive 
> <sic> would preserve the advantages of "[sic]" without its disadvantages 
> in full quotes. It carries its information even if made invisible to the 
> common reader. Unlike HTML comments, which are also invisible, <sic> is 
> semantic, can be easily made visible, and isn't stripped by processing 
> scripts without good reason.

Why is the information being missing in this case a problem?

On Sat, 30 Apr 2011, Martin Janecke wrote:
>
> I've been convinced that there's not enough need for a <sic> element 
> to introduce one, mostly by
> 
> Tab Atkins Jr. 
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-December/029585.html 
> and Benjamin Hawkes-Lewis 
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-December/029586.html

Great! :-)

> Am 31.12.2010 um 17:30 schrieb Benjamin Hawkes-Lewis:
> > 
> > Semantic markup is a tool HTML uses to solve problems. The sort of 
> > problem statements we're looking for are things more like this. 
> > End-users need to find information within complicated pages quickly. 
> > By marking up headings semantically we allow users to scan the page 
> > visually, or select a heading from a list, or jump to the next heading 
> > with a shortcut key.
> 
> I see. Nevertheless, I don't really understand the necessity to explain 
> why one would indicate where mistakes have been reproduced in a quoted 
> text. It's a common thing to do and sometimes it is actually important 
> to do it. I think we all know why text is sic'ed, don't we?

Generally we like to make sure we explicitly determine what problem we're 
solving because more often than not we have thought it was obvious but 
later found we all had different ideas and ended up designing something 
that simply doesn't work.

I think we use "[sic]" as a way for one human to tell another human that 
they are aware that the text has a mistake but that keeping the mistake 
was intentional, so that the other human won't tell the first human to fix 
the problem. For this, plaintext "[sic]" seems to solve the problem quite 
adequately.

> >> Apart from informing human readers about the correct reproduction of 
> >> a misspelled word, a HTML <sic> would indicate the same to web 
> >> applications. Think of a search engine, which, as one factor of their 
> >> ranking algorithm, considers orthography and grammar in a page as 
> >> quality factor. The search engine could be made to ignore (reasonably 
> >> few) <sic>-marked errors in such an algorithm; i.e. not let 
> >> <sic>-marked errors rank the page lower.
> > 
> > Would search engines benefit from markup for this?
> > 
> > Seems to me it would be fairly easy for a search engine to spot plain 
> > text "[sic]" and act accordingly.
> 
> Sic'ing text is a concept independent of the human language the sic'ed 
> text is written in. Writing "[sic]" in plain text is – although common 
> in many languages – not used universally in all languages. I'm sure it's 
> expressed differently in Japanese, for example. Recognizing an element 
> is easier than recognizing many different ways to express this in 
> different languages.

Before we can determine that for sure, we would need to know what those 
other mechanisms are, or at least we need to determine that there are 
enough such mechanisms that it makes more sense to introduce a new element 
than just use the existing mechanisms.

> >> The plain text string "[sic]" doesn't indicate where the start of the 
> >> "[sic]"ed part of text is. That means it provides less information 
> >> than <sic>...</sic>.
> >> 
> >> "[sic]" can't be handled with @media and CSS in general.
> > 
> > Why does it need to be? Is applying different styling to indicate 
> > mistakes in the original actually a common publisher need (unlike 
> > being able to style headings or block quotations, for example)?
> 
> Including the plaintext "[sic]" in a text – without meaning to change 
> the textual information of the original text – *is* applying a special 
> style to indicate mistakes in the original source. That's indeed a 
> common publisher need. The introduction of a <sic> element would allow 
> publishers to handle as a style what always has been a style.

It's no more a "style" than ending a question with a "?" or putting an 
aside in parentheses (like this). I'd say it's not even as stylistic as 
quote marks, something for which we have the <q> element now, but for 
which I think the general consensus is that we would have been better off 
not bothering with an element at all.

> >>> 4. It seems like "sic" would be a very rarely used feature. Why do 
> >>> we need to include it in the small, core HTML vocabulary rather than 
> >>> an RDF vocabulary imported into HTML via annotations like RDFa, 
> >>> microdata, or microformats?
> > 
> >> Extensions such as microformats are less widely known and probably 
> >> always will be.
> > 
> > Agreed. This seems appropriate for features that would be rarely used.
> 
> I disagree. Let me explain this by means of an analogy.
> 
> A circulatory shock is a rare medical problem for a single person. I 
> wouldn't say it was appropriate that only few people knew how to do 
> first aid in cases of shocks, though. If only few people knew how to 
> handle shocks, they'd hardly be treated correctly. Although shocks are 
> rare, they are a widespread phenomenon.
> 
> On the other hand, occurrences of seasickness are less widespread, 
> although much less rare in certain areas (i.e. at sea). It's definitely 
> not necessary for everyone to know how to treat seasickness. It's 
> sufficient if doctors and people who travel by boat know.
> 
> "sic" is much more like a treatment for circulatory shocks in this 
> respect. It's a seldom needed feature (compared to <p> or <title> for 
> example) but is nevertheless widespread. Introducing a hardly taught 
> method for handling them will certainly lead to many cases not being 
> treated with the method.

I don't think use of "[sic]" is widespread. We should certainly do more 
research if this was a key point in this argument, but I would be very 
surprised if most people used "[sic]".

> > If it turns out people are inclined to use the feature (e.g. if a 
> > microformat or whatever gains surprisingly common currency), we can 
> > reconsider adding it to the core vocabulary.
> 
> I'd agree with you if nobody would indicate correctly reproduced errors 
> yet -- then we'd have to see if anyone is interested in it at all. But 
> the usage of plaintext "[sic]" or hidden comments conveying the same 
> information *already* shows that people are inclined to use something to 
> indicate correctly reproduced errors.

How often, though?

> Asking them to change to use something that is more complicated first, 
> and iff enough of them have adapted to the new complicated method, ask 
> them to change again to a simpler method, seems weird.

Generally speaking, such features are only used when there is no simpler 
solution. When there's a simpler solution, we just use it. :-)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'