[html5] Usefulness of language annotations

Mon Aug 11 10:38:20 PDT 2014

2014-08-11 20:09, Jens O. Meiert wrote:
> Hence I want to fish for arguments here: How useful are language
> annotations via @lang?

They are of rather limited usefulness in practice. There has been much 
talk about them in theory for several years, but it has mostly remained 
theory. They are recommended in WCAG, but few actual benefits have been 
cited.

I think HTML specifications should just describe @lang as it has been 
defined, as declarative markup, without recommending or discouraging 
its use. The reason is that its real usefulness is a separate issue that 
varies according to what browsers, search engines, and other software 
actually do with it. And this probably depends on things external to 
HTML specifications.

> 1) Do user agents, including assistive technology, use this
> information in a way that is *actually* relevant and meaningful to the
> user?

They do, to some extent.

> 2) Isn’t, or shouldn’t, language determination primarily be made a
> user agent, and not a developer responsibility?

At the logical level, specifying content language is the author’s (or 
“developer’s”) responsibility. To take this to the extreme, consider an 
HTML document where the only text content is the word “hat”. The entire 
meaning depends on the intended language. If you declare, say, lang=sv, 
the content means “hate”; if lang=de, it means “has”. If this sounds 
contrived, consider an HTML document consisting just of an image and its 
caption, which can be very short.
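
To make the "hat" case concrete, here is a toy sketch in Python. The 
glossary holds only the two readings mentioned above, and the names 
GLOSSES and candidates are invented for this illustration; no real 
software is implied to work this way.

```python
# Toy glossary keyed by (language, word). Only the two readings from the
# example above are included; the names here are hypothetical.
GLOSSES = {
    ("sv", "hat"): "hate",
    ("de", "hat"): "has",
}


def candidates(word, lang=None):
    """Return the possible English glosses for word.

    With a declared language there is at most one entry; without it,
    every language that has the word contributes a candidate.
    """
    return sorted(
        gloss
        for (entry_lang, entry_word), gloss in GLOSSES.items()
        if entry_word == word and (lang is None or entry_lang == lang)
    )
```

With a language given, candidates("hat", "sv") yields a single reading; 
with none, both readings come back and nothing in the content itself 
can pick between them.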

In practice, browsers don’t try to determine content language from the 
content itself. Some search engines do. I think it is well known that 
Google ignores @lang, because it is so often just wrong (e.g., lang="en" 
emitted by authoring software, with no regard to the actual content 
language), and can usually guess the language from a sentence or two 
pretty well. It sometimes makes mistakes, e.g. taking Norwegian for 
Danish or vice versa, or Slovak for Czech or vice versa, but the 
important thing is that it works for the vast majority of cases.

> 3) Does it matter at all?

For a document as a whole, automatic language guessing works 
sufficiently well. When it does not (e.g. in a speech browser with no 
such guessing), the user needs to select the reading mode manually. 
Inconvenient, but usually not a big issue.

The specific situation where @lang might matter is change of language in 
a multilingual document. Language guessing is easily misled if there are 
short quotations in other languages. You can see this if you use 
Microsoft Word (which has a good language guesser) and write in 
different languages in the same document, with language guessing 
enabled. Word usually guesses right, except for short fragments in 
another language, and it may misjudge the exact location of the 
language change. So language markup could help. The problem is that it is very 
tedious to produce, and the potential gain is rather small now, and in 
the foreseeable future.
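
For illustration, this is roughly what software would have to do to 
make use of such markup: split the text into runs, each tagged with the 
language in effect, where an element without its own lang inherits its 
parent's. It is a minimal sketch using only Python's standard 
html.parser; the names LangRuns and lang_runs are made up here, and 
details such as xml:lang, HTTP headers and meta elements are ignored.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LangRuns(HTMLParser):
    """Collect (language, text) runs; "" stands for unknown language."""

    def __init__(self):
        super().__init__()
        self.stack = [("", "")]  # (tag, effective language) per open element
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        lang = dict(attrs).get("lang")
        # No lang attribute (or one without a value): inherit from parent.
        effective = lang if lang is not None else self.stack[-1][1]
        self.stack.append((tag, effective))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.runs.append((self.stack[-1][1], text))


def lang_runs(html):
    """Return the list of (language, text) runs found in html."""
    parser = LangRuns()
    parser.feed(html)
    parser.close()
    return parser.runs
```

Given '<p lang="en">She wrote <q lang="sv">hat</q> in the margin.</p>', 
the short quotation comes out as its own run tagged sv, between two 
runs tagged en. That one-word fragment is exactly the kind a guesser 
cannot be expected to get right.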

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/