[whatwg] Hyphenation

Øistein E. Andersen html5 at xn--istein-9xa.com
Mon Jan 8 15:02:54 PST 2007


Hyphenation does not seem to have been discussed on this list so far, and I think
it should be.


General discussion:
    [1] http://www.w3.org/International/O-HTML-hyphenation.html

Old proposal:
    [2] http://www.nada.kth.se/i18n/html/hyph.html

Babel (LaTeX i18n package) documentation:
    [3] ftp://tug.ctan.org/pub/tex-archive/macros/latex/required/babel/user.pdf

Unicode Technical Report #14 -- Line Breaking Properties:
    [4] http://www.unicode.org/reports/tr14/tr14-6.html


In summary, hyphenation is a hard problem: breaking points cannot in general
be established algorithmically; hyphenation dictionaries are not always available
and typically do not contain long/rare/complex words (the ones that really
need to be hyphenated); furthermore, distinct words may be spelt identically,
but still need to be hyphenated differently; and several languages require spelling
changes when words are hyphenated ([3] mentions Dutch, German (alte
Rechtschreibung), Spanish, Norwegian, Swedish and Hungarian).
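To illustrate the point about identically spelt words, here is a minimal sketch of my own (not taken from any of the references) of a hyphenation dictionary keyed on more than the spelling. The part-of-speech tags, the `HYPHENATION` table and the `hyphenate` helper are all assumptions made for the example; the English noun/verb pair `record' is used only because its break points are well known.

```python
# Hypothetical hyphenation dictionary. The key includes a part-of-speech
# tag because the spelling alone does not determine the break points.
HYPHENATION = {
    ("record", "noun"): ["rec", "ord"],
    ("record", "verb"): ["re", "cord"],
}


def hyphenate(word, pos):
    """Return `word` with all known break points shown as hyphens.

    Words missing from the dictionary are returned unbroken rather than
    guessed at, which mirrors the gap described above for long or rare words.
    """
    parts = HYPHENATION.get((word, pos))
    return word if parts is None else "-".join(parts)
```

A lookup by spelling alone could return only one of the two answers, so a user agent would need either linguistic analysis or a hint from the author to choose between them.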

The controversy surrounding the meaning of the soft hyphen, &shy; (U+00AD), is
probably over, although Opera currently seems not to render this character in
accordance with Unicode (IE7 and Safari seem to do the right thing; Firefox does
not hyphenate at all).
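As a rough illustration of the behaviour I would expect for U+00AD in Latin script (invisible unless a line break is taken at that point), here is a small sketch. The function name `wrap_word` and the greedy strategy are my own assumptions; real line breaking is of course more involved.

```python
SHY = "\u00ad"  # SOFT HYPHEN


def wrap_word(word, width):
    """Greedily break one word at soft hyphens into lines of at most `width`.

    A soft hyphen is dropped when no break is taken there and shown as a
    visible "-" when the line ends at it. A single piece longer than
    `width` is left to overflow; this sketch does not break inside pieces.
    """
    pieces = word.split(SHY)
    lines = []
    current = ""
    for i, piece in enumerate(pieces):
        is_last = i == len(pieces) - 1
        # Reserve one column for the hyphen that would be needed if the
        # line had to be broken right after this piece.
        needed = len(current) + len(piece) + (0 if is_last else 1)
        if current and needed > width:
            lines.append(current + "-")
            current = piece
        else:
            current += piece
    lines.append(current)
    return lines
```

For example, `wrap_word("hy\u00adphen\u00ada\u00adtion", 7)` gives `["hyphen-", "ation"]`, while a width of 100 gives the single line `["hyphenation"]` with no visible hyphen.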

[4] contains the following passage:
> SHY is rendered invisibly and has no width, except at a line break. The
> rendering of the soft hyphen depends on the script. For the Latin script
> it is rendered as a hyphen, however, some languages require a change
> in spelling surrounding an optional hyphen, if it occurs at a line break.
> For example in Swedish the word “tuggummi” changes to “tugg-gummi”
> when hyphenated.

It is not clear to me how this last point is supposed to be implemented in
practice, however. (It is certainly  n o t  the case that `gg' should be
hyphenated `gg-g' in  a l l  Swedish words.)

The proposal [2] suggests the addition of a new <hyph> element, modelled after
TeX's \discretionary command (with a possibly superfluous addition), which makes
it possible to specify which characters to render before and after a line break
if the word is broken.
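To make the \discretionary model concrete, here is a minimal sketch of the three-part idea (text before the break, text after the break, text when there is no break). The class and function names are my own assumptions and say nothing about how <hyph> in [2] is actually spelt out; the German example is the old-orthography `ck' case covered by [3].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discretionary:
    pre: str      # ends the first line if the break is taken
    post: str     # starts the next line if the break is taken
    nobreak: str  # shown in place when the break is not taken


def render(items, break_at=None):
    """Render a word given as a list of strings and Discretionary items.

    `break_at` is the index in `items` of the discretionary at which the
    line is broken, or None to render the word unbroken.
    """
    lines = [""]
    for i, item in enumerate(items):
        if isinstance(item, str):
            lines[-1] += item
        elif i == break_at:
            lines[-1] += item.pre
            lines.append(item.post)
        else:
            lines[-1] += item.nobreak
    return lines


# Swedish: "tuggummi" becomes "tugg-" / "gummi" when broken.
tuggummi = ["tugg", Discretionary(pre="-", post="g", nobreak=""), "ummi"]
# German (alte Rechtschreibung): "Zucker" becomes "Zuk-" / "ker".
zucker = ["Zu", Discretionary(pre="k-", post="k", nobreak="ck"), "er"]
```

Here `render(tuggummi)` gives `["tuggummi"]` and `render(tuggummi, 1)` gives `["tugg-", "gummi"]`. The spelling change is attached to one particular word by its author, so no general rule about `gg' is implied.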

Currently, hyphenation and justification are scarce on the Web, and the average
blogger hardly misses these features. If, however, writing books in HTML
(as mentioned on this list) is to become commonplace, these issues must be
dealt with somehow, and explicit markup seems to be unavoidable at least in some
cases.


I hope this can lead to a fruitful discussion.

-- 
Øistein E. Andersen


