[whatwg] Hyphenation
Henri Sivonen
hsivonen at iki.fi
Tue Jan 9 10:22:28 PST 2007
On Jan 9, 2007, at 01:02, Øistein E. Andersen wrote:
> In summary, hyphenation is a hard problem: breaking points cannot in
> general be established algorithmically; hyphenation dictionaries are
> not always available and typically do not contain long/rare/complex
> words (the ones that really need to be hyphenated); furthermore,
> distinct words may be spelt identically, but still need to be
> hyphenated differently; and several languages require spelling
> changes when words are hyphenated ([3] mentions Dutch, German (alte
> Rechtschreibung), Spanish, Norwegian, Swedish and Hungarian).
My initial thoughts:
* Prince seems to be doing exactly the right thing: control overall
hyphenation with CSS, honor soft hyphens and support TeX-compatible
language-specific dictionaries.
* The Swedish and Dutch examples given in this thread seem to be
addressable with language-specific dictionaries.
* Not knowing Dutch, I guess from the example that the diaeresis
in Dutch has the same meaning as in French (indicating that vowels
don't form a diphthong). If this is the case, the interaction of the
diaeresis with hyphenation may even be a generalizable rule that
could be hard-coded in Dutch-aware hyphenating browsers. Is it a
generalizable rule?
* Knowing a bit of Swedish, I really have a hard time taking seriously
the notion of Swedish requiring new markup to be introduced to HTML.
The sky won't fall if a browser doesn't know how to hyphenate Swedish
chewing gum in the absence of a hyphenation dictionary. (Besides, it
looks like the Swedish rule is generalizable so that a hyphenator
wouldn't even need a list of all possible compound words but a
dictionary of simple words that can be part of a compound would
suffice.)
* Not having a language-specific dictionary available in a browser
doesn't make things worse than the status quo, so it isn't that big a
deal.
* Hand-coders wouldn't bother to type hyphenation data for
everything every time. (TeX users run the typesetting step themselves
whereas HTML is rendered elsewhere. TeX users only tend to
micromanage the words that they see didn't typeset nicely.)
* It is unlikely that authoring tools would opt to dump their
hyphenation data into documents, even if that data were already in
whatever format was required.
* All the languages cited as requiring spelling changes are written
using the Latin script. The Latin script has a long cultural
tradition of adapting to writing technology: from chiseled marble to
quills to movable type to typewriters to computer displays.
Therefore, I don't find it unreasonable to suggest adapting to the
limitations of the medium here.
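
To make the parenthetical about Swedish above concrete, here is a minimal sketch of a hyphenator that works from a list of simple words rather than a list of every compound. It assumes the rule is that a compound is written with only two identical consonants where its parts together have three, and that the dropped consonant is restored at the hyphen (the chewing gum case: "tuggummi" breaks as "tugg-gummi"). The function name, the tiny word list and the restriction to two-part compounds are all assumptions made for illustration, not taken from Prince, TeX or any browser.

```python
# Hypothetical, deliberately tiny list of simple words that can be
# part of a compound. A real list would be far larger.
SIMPLE_WORDS = {"tugg", "gummi", "natt", "buss", "station"}

VOWELS = set("aeiouyåäö")


def compound_break_points(word, simple_words=SIMPLE_WORDS):
    """Return the hyphenated forms of a two-part compound.

    Only breaks at the joint between two known simple words are
    considered; breaks inside a simple word are out of scope here.
    """
    results = []
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]

        # Plain split: both halves are known simple words.
        if head in simple_words and tail in simple_words:
            results.append(f"{head}-{tail}")

        # Restored-consonant split: the head ends in a doubled
        # consonant, and the tail becomes a known word once that
        # consonant is put back in front of it.
        if (
            len(head) >= 2
            and head[-1] == head[-2]
            and head[-1] not in VOWELS
        ):
            restored_tail = head[-1] + tail
            if head in simple_words and restored_tail in simple_words:
                results.append(f"{head}-{restored_tail}")
    return results


if __name__ == "__main__":
    for w in ("tuggummi", "busstation", "nattbuss"):
        print(w, compound_break_points(w))
```

The sketch only illustrates that the spelling change can be derived from the simple words themselves; it says nothing about which break points are typographically desirable.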
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/