[whatwg] Semantic styling languages in the guise of HTML attributes.
jg307 at cam.ac.uk
Wed Dec 27 16:58:28 PST 2006
Mike Schinkel wrote:
> Matthew Paul Thomas wrote:
>> On Dec 22, 2006, at 3:23 AM, Benjamin Hawkes-Lewis wrote:
>>> Henri Sivonen wrote:
>>>> Also, it seems to me that the usefulness of non-heuristic machine
>>>> consumption of semantic roles of things like dialogs, names of
>>>> vessels, biological taxonomical names, quotations, etc. has been
>>>> vastly exaggerated.
>>> I'm not entirely sure what "non-heuristic machine consumption" is,
>> An example of non-heuristic machine consumption is where
>> Google Glossary thinks: "In an HTML 3.2 or earlier document
>> containing the code '<dl><dt>foo</dt> <dd>bar</dd></dl>',
>> 'bar' is a definition of 'foo'". (It probably thinks the same
>> about HTML 4 documents, too, which is applying a small
>> "ignore that nonsense about dialogues" heuristic.)
>> An example of heuristic machine consumption is where Google Glossary
>> thinks: "In an HTML document containing the code
>> '<p><b>foo:</b> bar</p>', 'bar' is probably a definition of
>> 'foo', especially if the page has several consecutive
>> paragraphs with that structure and different bold text."
>> Non-heuristic machine consumption fails when semantic
>> elements are abused, and becomes impractical when elements have
>> multiple popular meanings (examples of the latter include
>> <dl> in HTML 4, and <p> in HTML 5). Heuristic machine
>> consumption fails occasionally by the very nature of
>> heuristics (examples currently include
>> <http://www.google.com/search?q=define:author> and
> The origin of this thread was my request for adding attributes to all
> elements to support microformat-like semantic markup. Based on the context
> of your reply, it seems you are agreeing with Matthew Raymond in his
> assertion that using microformat-like semantic markup is A Bad Thing(tm). Am
> I understanding your position correctly? (If I'm not, please forgive me.)
Actually, IMHO mpt's point is far broader and consequentially more
important than the confines of the original thread. The point, as I
understand it, is that machine analysis of "semantic" markup fails if
the markup construct is (ab)used in so many different ways that the
interpretation of any particular fragment is no longer unambiguous. This
is a sort of "heat death" of the original semantics; as the use of an
element becomes increasingly disordered (i.e. higher entropy), it
becomes impossible to extract any useful information from the use of
that element. This is critical in the proper design of semantic markup
languages because one wishes to stave off the heat death as long as
possible so that, as far as possible, UAs can perform useful functions
based on the information in the markup (e.g. render it to a medium for
which the content was not explicitly designed). Obviously I don't know
how to achieve this but there are a few things to consider:
* Have enough elements. If there are obvious holes that people can't
fill with existing elements used properly, they will reuse existing
elements in new ways so increasing their entropy.
* Don't have too many elements: If there are too many elements people
won't understand them all and will reuse existing elements in the
"wrong" way, so increasing their entropy.
* Make the semantics of elements well defined: Start the elements in a
"low entropy" i.e. highly ordered state. Make it obvious how the element
is intended to be used (and restrict the valid uses to ones that can be
discriminated by machine) so that fewer people accidentally abuse it.
* Have some "high entropy" elements. This is the counterintuitive one.
The goal, remember, is to extract as much information as possible from
the semantically well-defined elements. However, in many situations
there will not be a relevant element to use, the publishing setup will
not be optimized for selecting the correct semantic element (think
WYSIWYG editors), or the author will not be sufficiently familiar with
the language semantics to make a well-informed choice about the right
element to use. In this case providing (and encouraging the use of!) a
set of high entropy "bit-bucket" elements that are semantically
meaningless is very beneficial because they prevent the entropy
increase associated with the abuse of the semantic elements. The
increasing misuse of <em> as a "more semantic" <i> is an example of what
happens when this policy is not followed.
* Allow easy extensions. Having an extension mechanism for those who
need more functionality is one way to stop the abuse of existing
elements. This has to be sufficiently easy to use that it can be
widely adopted but powerful enough that it can replicate all the
semantic features of the host language.
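To make the analogy slightly more concrete, here is a minimal sketch that
treats the "entropy" of an element as the Shannon entropy of the ways
authors actually use it. The function name, the usage labels and the counts
are all invented for illustration; none of this is measured data, and
Shannon entropy is only one plausible way to put a number on "disorder".

```python
import math
from collections import Counter


def usage_entropy(observed_uses):
    """Shannon entropy, in bits, of the distribution of observed uses.

    `observed_uses` is a list of labels, one per occurrence of the element,
    naming what the author meant by it (hypothetical survey data).
    """
    counts = Counter(observed_uses)
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical samples: an <em> used almost only for emphasis, and an <em>
# that has drifted into being a "more semantic" <i>.
em_well_used = ["emphasis"] * 95 + ["italic styling"] * 5
em_abused = (["emphasis"] * 40 + ["italic styling"] * 40
             + ["ship name"] * 10 + ["taxonomic name"] * 10)

print("well used:", usage_entropy(em_well_used))
print("abused:   ", usage_entropy(em_abused))
```

The more evenly the occurrences are spread across different meanings, the
higher the number, and the less a UA can conclude from seeing the element.
A "bit-bucket" element would score high by design, which is harmless
because nobody tries to extract meaning from it.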
This post was brought to you by the society for dodgy physical analogies
concocted in the middle of the night.
 Or, if you like, "Entropy death". Of course, this has nothing to do
with real physical entropy but a lot to do with the common association
between the second law of thermodynamics and the concept of disorder.