[whatwg] several messages about HTML5 -- authors' tools

ddailey ddailey at zoominternet.net
Thu Feb 22 05:14:25 PST 2007


Interesting thread (including various sub-ravels thereof).

Suppose, in a semantically charged but markup-impoverished medium such as 
the textual narrative (constituting the majority of the content of the web 
as we know it), we seek to build a word processor that generates not only 
the surface structure (the sentences and paragraphs) but the semantic 
structure as well. How do we minimize the author's effort? Authors will not 
want to write both their utterances and the translation of those utterances 
into semantic tags -- it is simply too labor-intensive (unless we care more 
about form than substance and choose to purge ill-formed ideas from the 
human corpus).

Rather, we may seek a word processor that deduces semantics from authors' 
expressions. Yeah, without full-blown AI (which has been a while in coming, 
now), 40% (or so) of such deductions will be incorrect. But suppose that, 
following the creation of a sentence, a paragraph, or a larger chunk of 
text, semantically enabled software were to pose to the author a finite 
(and hopefully small) number of "deductive disambiguators"?

"Dear author, did you mean to imply that AI will indeed, be arriving soon?"

"Dear author, who exactly does 'we' refer to in the above paragraph -- I'm 
sorry, but I see no people mentioned before?"
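
To make the workflow concrete, here is a rough Python sketch of such a 
disambiguator loop. Everything in it is invented for illustration -- the 
"detector" is a trivial string test standing in for a real parser and 
inference engine -- but it shows the shape of the interaction: deduce, ask, 
record.

    # A minimal sketch of the "deductive disambiguator" loop described
    # above. The ambiguity detector is a placeholder; a real system would
    # plug in an actual parser and inference engine here.

    def find_ambiguities(paragraph):
        """Return the (hopefully small) list of questions the engine
        cannot resolve on its own, e.g. unresolved referents."""
        questions = []
        if " we " in paragraph:
            questions.append(
                "Who exactly does 'we' refer to in this paragraph?")
        return questions

    def annotate(paragraph):
        """Pose each question to the author; attach answers as tags."""
        tags = {}
        for question in find_ambiguities(paragraph):
            tags[question] = input(question + " ")  # the author's expert intervention
        return paragraph, tags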

Using a relatively simple inference engine (SFOL + set theory + predicate 
calculus + arithmetic + time + causality + modal logic) coupled with 
thesauri and parsers (all available client-side these days), and (most 
importantly) the author's expert intervention, I rather suspect that the 
40% of incorrect deductions could be brought down to 8% at an additional 
cost of 20% in authorial time investment. With the current software that 
most folks use, and requiring authors to generate their own semantics, I 
think we might expect to achieve 5% spurious deduction with a 400% 
additional investment of authors' time. The cost-benefit ratio is just too 
high with current desktop tools.
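
Working the arithmetic out (a back-of-the-envelope calculation, using only 
the speculative percentages above):

    # Rough cost-benefit arithmetic for the two approaches, using the
    # admittedly speculative figures from the paragraph above.
    assisted = {"errors_removed": 40 - 8, "extra_time": 20}   # deduction + author review
    manual = {"errors_removed": 40 - 5, "extra_time": 400}    # hand-written semantics

    for name, d in (("assisted", assisted), ("manual", manual)):
        ratio = d["extra_time"] / d["errors_removed"]
        print(f"{name}: {ratio:.2f}% extra time per point of error removed")

    # assisted: 0.62%/point; manual: 11.43%/point -- roughly an 18-fold
    # difference in cost per unit of benefit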

In semantically impoverished (not in the evocative space they engender, but 
in the surface expression of their utterances) but markup-rich environments 
such as SVG, the generation of a parallel semantic substrate is going to be 
a lot more difficult, but maybe that's why we have things like sXBL: to 
allow semantics to be imported from other disciplines.

That's one approach. Another is to build a semantic expression system for 
which we abandon our native languages and agree to write in a semantic 
shorthand (with lots of parentheses, by the way). For even one language, the 
task of finding a minimal set of semantic primitives (from its monolingual 
dictionary) is NP-complete, but if we seek such a shorthand to span the 
space of human semantics, it may take longer to bring into existence than AI 
itself. The different language families I have looked at probably share a 
core semantics of only about 20% of the expressive space of any one language 
by itself. The nice thing about such languages is that people from different 
linguistic backgrounds can all read the same text; the hassle is that it's 
hard to translate ordinary expressions into them.
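
The NP-completeness claim is easiest to see if you treat each candidate 
primitive as the set of dictionary senses it can help define: choosing the 
fewest primitives that define everything is then minimum set cover. A toy 
Python sketch (the dictionary below is entirely made up) shows the standard 
greedy approximation:

    # Treating each candidate primitive as the set of senses it helps
    # define turns "minimal primitive set" into minimum set cover
    # (NP-complete). The toy dictionary is invented; greedy yields an
    # approximation, not a guaranteed-minimal answer.
    dictionary = {
        "move":  {"walk", "run", "travel"},
        "go":    {"travel", "leave", "depart"},  # overlaps make choices non-obvious
        "cause": {"kill", "break", "make"},
        "want":  {"wish", "seek", "desire"},
    }

    uncovered = set().union(*dictionary.values())
    primitives = []
    while uncovered:
        best = max(dictionary, key=lambda p: len(dictionary[p] & uncovered))
        primitives.append(best)
        uncovered -= dictionary[best]
    print(primitives)  # a greedy, not necessarily minimal, primitive set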

cheers,
David Dailey

----- Original Message ----- 
From: "Elliotte Harold" <elharo at metalab.unc.edu>
To: "Ian Hickson" <ian at hixie.ch>
Cc: <whatwg at lists.whatwg.org>; "Vlad Alexander (xhtml.com)" 
<vlad.alexander at xhtml.com>
Sent: Wednesday, February 21, 2007 4:34 PM
Subject: Re: [whatwg] several messages about HTML5


> Ian Hickson wrote:
>
>> The original reason I got involved in this work is that I realised that 
>> the human race has written literally billions of electronic documents, 
>> but without ever actually saying how they should be processed.
>
> That's a feature, not a bug.
>
>> If, in a thousand years, someone found a trove of HTML documents and 
>> decided they
>> would write an HTML browser to view them, they couldn't do it! Even with 
>> the existing HTML specs -- HTML4, SGML, DOM2 HTML, etc -- a perfect 
>> implementation couldn't render the vast majority of documents as they 
>> were originally intended.
>>
>
> Authorial intent is a myth. Documents don't have to be rendered like the 
> author intended, nor should we expect them to be. We don't read Homer 
> like Homer intended, but we still read him, well more than a thousand 
> years later. (For one thing, Homer actually intended that people listen 
> to the poems, not read them.)
>
> This is not to say that I don't think it's useful to define a standard 
> tree structure for documents. It is useful. However the benefit of this 
> exercise is not in maintaining authorial intent. That's tilting at 
> windmills, and will never succeed no matter what we do.
>
> -- 
> Elliotte Rusty Harold  elharo at metalab.unc.edu
> Java I/O 2nd Edition Just Published!
> http://www.cafeaulait.org/books/javaio2/
> http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/
>
> 




