[whatwg] Hyphenation

Thu Jan 11 05:00:16 PST 2007

Thanks for all the interesting comments so far.

On 9 Jan 2007, at 10:23AM, Anne van Kesteren wrote:

> "[...] simple cases could also be handled with a `soft hyphen' (&shy;), if browsers
> would only support it." which is of course not an excuse to go around and introduce
> a new element!

Indeed. Browser support has improved since that document was written, though.
Today, all major browsers except Firefox support the soft hyphen, and the 
purpose of a new element would be to enable more complex cases to be handled
properly, not to replace the soft hyphen.

On 9 Jan 2007, at 1:3PM, Leons Petrazickis wrote:

> Hyphenation is a presentational problem. [...] We should
> avoid embedding presentational hyphenation tags in the actual text.

Yes, if possible. The verb record is supposed to be hyphenated re-cord,
whilst the correct hyphenation of the noun is rec-ord. For this reason, TeX
never hyphenates record (unless the author writes rec\-ord or re\-cord).

This problem may be more common in other languages, but expecting authors
to hard-code hyphenation points in particular words is probably futile.

> I would suggest that the first priority is getting a naive hyphenator into browsers.

This would probably have to be language-specific, though. See comments on Prince
below.

> [To handle special cases,] I would suggest a hyphenation dictionary in the
> <head> of the document.

Not a bad idea. The problem is that the words requiring special attention will
depend on the particular «naïve» algorithm implemented, i.e., the browser...

On 9 Jan 2007, at 1:15PM, Alexey Feldgendler wrote:

> In some typographical traditions, non-full-justified text is sometimes
> hyphenated.

In the mechanical-typewriter era, a typist would certainly choose to hyphenate when
the bell sounded in the middle of a long word.

On 9 Jan 2007, at 1:37PM, Håkon Wium Lie wrote:

> Prince6 (www.princexml.com) supports these properties:
> 
>   hyphenate: none | auto
>   hyphenate-dictionary: none | url(...)
>   hyphenate-before: <int>
>   hyphenate-after: <int>
>   hyphenate-lines: none | <int>

From http://www.princexml.com/howcome/2006/p6/p6demo2.html:

> Prince can read the hyphenation format pioneered by TeX and reused by many
> other applications. OpenOffice hosts a number of hyphenation dictionaries that
> are reusable in Prince6.

This is a great step forward. I hope something along these lines will find its way
into desktop browsers as well.

It should be noted, though, that — unless I have misunderstood something —
the `hyphenation dictionaries' are really patterns from which hyphenation
points can be computed. The particular method used in TeX was discovered by
Frank M. Liang about 25 years ago and implemented in TeX soon thereafter.
According to the TeXbook, the original US-English patterns find about 90%
of the hyphenation points given in a dictionary or about 95% of the permissible
hyphenation points in a typical text (where common words are more frequent)
without making any mistakes.
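
To make the pattern idea concrete, here is a minimal sketch of pattern-based
hyphenation in the style of Liang's method. It is an illustration, not TeX's
actual implementation: the function names are made up, the nine patterns are
the ones usually quoted for the word `hyphenation', and the minimum fragment
lengths (2 letters before a break, 3 after) are assumed defaults. Digits in a
pattern score the position between letters; the highest score wins, and an odd
score permits a break.

```python
# Illustrative subset only: a real pattern file contains thousands of entries.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and position scores.

    The score list has len(letters) + 1 entries: one before each letter
    and one after the last letter.
    """
    letters = ""
    scores = [0]
    for ch in pattern:
        if ch.isdigit():
            scores[-1] = int(ch)
        else:
            letters += ch
            scores.append(0)
    return letters, scores


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the word with '-' inserted at every permitted break."""
    parsed = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."      # '.' marks the word boundaries
    best = [0] * (len(padded) + 1)         # best[i]: score before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            scores = parsed.get(padded[start:end])
            if scores:
                for offset, score in enumerate(scores):
                    best[start + offset] = max(best[start + offset], score)
    pieces, last = [], 0
    # Word offset i (between word[i-1] and word[i]) corresponds to best[i + 1].
    for i in range(left_min, len(word) - right_min + 1):
        if best[i + 1] % 2 == 1:           # odd score: break allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("hyphenation", PATTERNS))  # hy-phen-ation
```

With only these nine patterns, every other word comes out unhyphenated, which
is the safe failure mode: missing patterns lose break points but do not
introduce wrong ones.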

This is, however, only one part of TeX's hyphenation system. The next level is a
hyphenation exception dictionary, a list of fully hyphenated words that would not
otherwise be hyphenated correctly. (Plain) TeX contains a list of fourteen words
including `present' (which cannot be hyphenated without knowing whether it is a
noun or a verb, so TeX does not try) and `ta-ble' (a common word that would
otherwise not be hyphenated at all), and the author can add words at any time
using the \hyphenation command.
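
The way an exception list sits in front of the patterns can be sketched as
follows. The names load_exceptions and break_points are hypothetical, and the
fallback argument stands in for whatever pattern-based routine is available;
only the two entries `present' and `ta-ble' are taken from the list above.

```python
def load_exceptions(entries):
    """Map each unhyphenated word to its list of permitted break offsets.

    Entries are whitespace-separated words with '-' at every allowed break,
    e.g. "ta-ble present". A word without '-' must never be broken.
    """
    table = {}
    for entry in entries.split():
        parts = entry.lower().split("-")
        offsets, position = [], 0
        for part in parts[:-1]:
            position += len(part)
            offsets.append(position)
        table["".join(parts)] = offsets
    return table


def break_points(word, exceptions, fallback):
    """Consult the exception list first; use the patterns only on a miss."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return fallback(word)


exceptions = load_exceptions("ta-ble present")
print(break_points("table", exceptions, lambda w: []))     # [2]
print(break_points("present", exceptions, lambda w: [3]))  # []
```

The second call shows the important property: an entry with no hyphens
overrides the patterns and suppresses hyphenation altogether.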

In addition to this, hyphenation can be indicated locally. This is needed in order to
hyphenate words like rec-ord/re-cord and is the only level that deals with
spelling changes. If The New Yorker were using TeX and wanted preëmptive to
hyphenate as pre-emptive, this rule could not be incorporated into either the
patterns or the exception dictionary. From an i18n perspective, the patterns and
(at the very least) the exception dictionary ought to allow not only insertion of
hyphens, but also spelling changes to be specified. The examples given so far in
this thread may not be convincing, but if it is true that l·l should in general hyphenate
as l-l in Catalan, this certainly is an important problem for that language, and there
are probably many similar issues in other languages that we just do not know about.

It seems that Prince currently uses TeX patterns, but no exception dictionary,
and allows local encoding of hyphenation points (&shy;), but not spelling changes.

There are a few additional caveats. For instance, it is not entirely obvious what
should be considered to be a `word' or which characters should be allowed in a
`word' (given that only `words' can be hyphenated using this kind of algorithm).
TeX uses `category codes' to define letters, and Unicode's character classes
give a good approximation, but they cannot be redefined to deal with specific
issues. In Italian, for instance, dell'opera should be hyphenated dell'o-
pera, but opera should not be hyphenated o-pera. (The particular example may
be wrong, but the principle is correct.) Unless the apostrophe is
considered to be a `letter' (a constituent of a `word'), correct patterns do not
help, as `dell'opera' will not be considered as one unit during hyphenation-point
look-up.
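
A small sketch shows why the definition of a `letter' matters. The function
split_words and its extra_letters parameter are made up for illustration;
Python's str.isalpha() is used here as a stand-in for a fixed Unicode letter
class, and extra_letters plays the role of a per-language redefinition.

```python
def split_words(text, extra_letters=""):
    """Return maximal runs of letters, the units a hyphenator would look up.

    Characters in extra_letters are treated as letters in addition to
    whatever str.isalpha() accepts.
    """
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in extra_letters:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("dell'opera"))       # ['dell', 'opera']
print(split_words("dell'opera", "'"))  # ["dell'opera"]
```

Only in the second case does the hyphenator see `dell'opera' as one unit, so
only then can a pattern spanning the apostrophe ever match.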

Another example worth mentioning is that Polish and a few other languages
apparently require a hyphenated word like xxx-yyy to be hyphenated xxx-
-yyy (with an extra hyphen carried over). A truly flexible system would allow one
to specify, e.g., which non-letters to treat as part of words and which to give
special treatment. (As we all know, TeX hyphenates xxx-yyy as xxx-
yyy; in addition, the hyphen prohibits xxx and yyy from being hyphenated,
which may or may not be suitable depending on, e.g., column width.)
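
The two behaviours at an explicit hyphen can be sketched as a single function
with a language-dependent switch. Both the function name and the repeat_hyphen
flag are hypothetical; the sketch only handles a break at the first hyphen.

```python
def break_at_hyphen(word, repeat_hyphen=False):
    """Split an explicitly hyphenated word at its first hyphen.

    Returns (end of first line, start of next line). With repeat_hyphen
    set, the hyphen is carried over to the start of the next line as well.
    """
    head, sep, tail = word.partition("-")
    if not sep:
        return word, ""          # no hyphen: nothing to break here
    if repeat_hyphen:
        tail = "-" + tail
    return head + "-", tail


print(break_at_hyphen("xxx-yyy"))                      # ('xxx-', 'yyy')
print(break_at_hyphen("xxx-yyy", repeat_hyphen=True))  # ('xxx-', '-yyy')
```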

How does Prince deal with these issues?

On 9 Jan 2007, at 6:22PM, Henri Sivonen wrote:

> * Prince seems to be doing exactly the right thing: control overall hyphenation
> with CSS, honor soft hyphens and support TeX-compatible language-specific
> dictionaries.

Yes, although certain details could be improved slightly to work better with
foreign languages.

> * The Swedish and Dutch examples given in this thread seem to be addressable
> with language-specific dictionaries.

They would be with a suitable pattern and/or dictionary format.

> the interaction of the diaeresis with hyphenation may even be a generalizable rule
> that could be hard-coded in Dutch-aware hyphenating browsers.

Hard-coding such details, as opposed to defining a proper format that allows
such things to be specified easily, strikes me as a bad idea.

> it looks like the Swedish rule is generalizable so that a hyphenator wouldn't even
> need a list of all possible compound words but a dictionary of simple words that
> can be part of a compound would suffice.

Well, yes, but (after verification) `tuggummi' is really composed of the verb
`tugga' (with a final -a) and the noun `gummi'. The -a ending disappears
in the compound, and `tugggummi' turns into `tuggummi' because triple
consonants are not allowed. Hard-coding this level of detail into browsers
is probably not ideal. Moreover, a German (alte Rechtschreibung) word like `Bettuch'
should be hyphenated `Bet-tuch' or `Bett-tuch' depending on the intended meaning.
(I do not argue that such very particular cases require a new HTML element to be
added immediately.)
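
One possible shape for a format that can express spelling changes is a break
record that replaces a span of the unbroken word with separate pre-break and
post-break text. The Break class and apply_break function below are purely
illustrative, and the sketch assumes that `tuggummi' is meant to break as
tugg-gummi, restoring the consonant dropped in the compound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Break:
    index: int   # offset in the unbroken word where the replaced span starts
    length: int  # number of characters of the unbroken word that are replaced
    pre: str     # text ending the first line (including the hyphen)
    post: str    # text starting the next line


def apply_break(word, brk):
    """Return (first line fragment, second line fragment) for one break."""
    first = word[:brk.index] + brk.pre
    second = brk.post + word[brk.index + brk.length:]
    return first, second


# An ordinary break is the special case length=0, pre="-", post="".
print(apply_break("Bettuch", Break(3, 0, "-", "")))     # ('Bet-', 'tuch')
# Spelling changes: the span "tt" or "gg" is rewritten on both sides.
print(apply_break("Bettuch", Break(2, 2, "tt-", "t")))  # ('Bett-', 'tuch')
print(apply_break("tuggummi", Break(2, 2, "gg-", "g"))) # ('tugg-', 'gummi')
```

Choosing between the two `Bettuch' records still requires knowing the intended
meaning, so a record like this belongs in an exception list or in the document
itself rather than in the patterns.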

> * Not having a language-specific dictionary available in a browser doesn't make
> things worse than the status quo, so it isn't that big a deal.

Hyphenating using `generic' (US English) rules would actually be worse than
abstention, but this is probably not what you mean.

> * Hand-coders wouldn't bother to type hyphenation data for everything every time.
> * It is unlikely that authoring tools would opt to dump their hyphenation data
> in documents

I suppose so. An external format would be preferable.

> * All the languages cited as requiring spelling changes are written using the Latin script.

This may well be due to my and others' cultural bias.

On 11 Jan 2007, at 12:50AM, Sander Tekelenburg wrote:

> FWIW, my feeling is that it would be best if there'd be a defined format for
> hyphenation rules, and browsers would accept such description files as a
> plug-in.

On 11 Jan 2007, at 1:19AM, Håkon Wium Lie wrote:

> This format exists. It was pioneered by TeX and is now widely used by
> other applications.

You seem to be referring to TeX's hyphenation patterns, which are only one
(important) part of TeX's hyphenation system. The missing parts need to be
defined somehow, and a certain generalisation would be welcome, as discussed
above.

-- 
Øistein E. Andersen