[whatwg] Semantic styling languages in the guise of HTML attributes.

Wed Dec 27 16:58:28 PST 2006

Mike Schinkel wrote:
> Matthew Paul Thomas wrote:
>   
>> On Dec 22, 2006, at 3:23 AM, Benjamin Hawkes-Lewis wrote:
>>     
>>> Henri Sivonen wrote:
>>> ...
>>>       
>>>> Also, it seems to me that the usefulness of non-heuristic machine 
>>>> consumption of semantic roles of things like dialogs, names of 
>>>> vessels, biological taxonomical names, quotations, etc. has been 
>>>> vastly exaggerated.
>>>>         
>>> I'm not entirely sure what "non-heuristic machine consumption" is,
>>>       
>> An example of non-heuristic machine consumption is where 
>> Google Glossary thinks: "In an HTML 3.2 or earlier document 
>> containing the code '<dl><dt>foo<dt> <dd>bar</dd></dl>', 
>> 'bar' is a definition of 'foo'". (It probably thinks the same 
>> about HTML 4 documents, too, which is applying a small 
>> "ignore that nonsense about dialogues" heuristic.)
>>
>> An example of heuristic machine consumption is where Google Glossary
>> thinks: "In an HTML document containing the code 
>> '<p><b>foo:</b> bar</p>', 'bar' is probably a definition of 
>> 'foo', especially if the page has several consecutive 
>> paragraphs with that structure and different bold text."
>>
>> Non-heuristic machine consumption fails when semantic 
>> elements are abused, and becomes impractical when elements have 
>> multiple popular meanings (examples of the latter include 
>> <dl> in HTML 4, and <p> in HTML 5). Heuristic machine 
>> consumption fails occasionally by the very nature of 
>> heuristics (examples currently include 
>> <http://www.google.com/search?q=define:author> and
>> <http://www.google.com/search?q=define:editor>.)
>>     
>
> The origin of this thread was my request for adding attributes to all
> elements to support microformat-like semantic markup. Based on the context
> of your reply, it seems you are agreeing with Matthew Raymond in his
> assertion that using microformat-like semantic markup is A Bad Thing(tm). Am
> I understanding your position correctly? (If I'm not, please forgive me.)
>   
Actually, IMHO mpt's point is far broader and consequently more 
important than the confines of the original thread. The point, as I 
understand it, is that machine analysis of "semantic" markup fails if 
the markup construct is (ab)used in so many different ways that the 
interpretation of any particular fragment is no longer unambiguous. This 
is a sort of "heat[1] death" of the original semantics; as the use of an 
element becomes increasingly disordered (i.e. higher entropy), it 
becomes impossible to extract any useful information from the use of 
that element. This is critical in the proper design of semantic markup 
languages because one wishes to stave off the heat death as long as 
possible so that, as far as possible, UAs can perform useful functions 
based on the information in the markup (e.g. render it to a medium for 
which the content was not explicitly designed). Obviously I don't know 
how to achieve this, but there are a few things to consider:

* Have enough elements. If there are obvious holes that people can't 
fill with existing elements used properly, they will reuse existing 
elements in new ways, so increasing their entropy.

* Don't have too many elements: If there are too many elements, people 
won't understand them all and will reuse existing elements in the 
"wrong" way, so increasing their entropy.

* Make the semantics of elements well defined: Start the elements in a 
"low entropy" i.e. highly ordered state. Make it obvious how the element 
is intended to be used (and restrict the valid uses to ones that can be 
discriminated by machine) so that fewer people accidentally abuse it.

* Have some "high entropy" elements. This is the counterintuitive one. 
The goal, remember, is to extract as much information as possible from 
the semantically well-defined elements. However, in many situations 
there will not be a relevant element to use, the publishing setup will 
not be optimized for selecting the correct semantic element (think 
WYSIWYG editors), or the author will not be sufficiently familiar with 
the language semantics to make a well-informed choice about the right 
element to use. In this case providing (and encouraging the use of!) a 
set of high entropy "bit-bucket" elements that are semantically 
meaningless is very beneficial because they prevent the entropy 
increase associated with the abuse of the semantic elements. The 
increasing misuse of <em> as a "more semantic" <i> is an example of what 
happens when this policy is not followed.

* Allow easy extensions. Having an extension mechanism for those who 
need more functionality is one way to stop the abuse of existing 
elements. This has to be sufficiently easy to use that it can be 
widely adopted but powerful enough that it can replicate all the 
semantic features of the host language.
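
To put a number on the analogy, here is a minimal sketch that treats the 
"entropy" of an element as the Shannon entropy of the ways authors 
actually use it. The function name, the usage labels and the counts are 
all made up for illustration (nobody has measured these), and Shannon 
entropy is only one possible stand-in for "disorder".

```python
import math
from collections import Counter


def usage_entropy(observed_uses):
    """Shannon entropy, in bits, of the distribution of uses of one element.

    `observed_uses` is a list of labels, one per occurrence of the element,
    saying what the author meant by it (a hypothetical hand-labelled sample).
    """
    counts = Counter(observed_uses)
    total = sum(counts.values())
    # Sum p * log2(1/p) over every distinct use; an empty sample gives 0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Invented samples, not real survey data.
well_behaved = ["definition"] * 8                   # always used one way
abused = ["emphasis"] * 4 + ["italic-styling"] * 4  # split between two uses

low = usage_entropy(well_behaved)
high = usage_entropy(abused)
print(low, high)
```

An element that is always used one way scores 0 bits, so a UA can read 
its meaning straight off the tag. One split evenly between two uses 
scores 1 bit, and the tag alone no longer tells the UA which meaning was 
intended.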

This post was brought to you by the society for dodgy physical analogies 
concocted in the middle of the night.

[1] Or, if you like, "Entropy death". Of course, this has nothing to do 
with real physical entropy but a lot to do with the common association 
between the second law of thermodynamics and the concept of disorder.