[html5] Identifying HTML 5 documents? (vs. alternate flavors)

Fri Feb 8 01:44:49 PST 2008

On Feb 4, 2008, at 18:39, Jim Correia wrote:
> On Feb 4, 2008, at 11:24 AM, Henri Sivonen wrote:
>> On Feb 4, 2008, at 17:28, Jim Correia wrote:
>>
>>> I know there has been some discussion about this on the forum. But
>>> after having read through the draft spec and the FAQ, I'm still a
>>> little unclear about how I can auto-detect that a document is using
>>> HTML 5.
>>
>> The short answer is that HTML5 by design tries to discourage you  
>> from trying to do that.
>
> I can understand that discouraging user-agents from doing this might  
> be a good thing. At the same time, it appears to make life more  
> difficult for those of us who produce authoring tools which must  
> support legacy formats alongside HTML 5.

If the spec had a centrally-prescribed way for authoring tools to do  
spec versioning, people would be tempted to suggest all sorts of  
version-based conditional behavior in browsers.

>>> (Or more precisely, that the author of the document intended
>>> it to be conformant to HTML 5.)
>>
>> HTML5 is designed so that this doesn't need to be asserted to the  
>> other party when sending HTML5 content to a consuming client. In  
>> the case of an author who is conformance checking his own stuff (as  
>> opposed to communicating with another party), the theory goes that  
>> the author simply chooses to use a tool that only supports HTML5  
>> or that is configured to support HTML5.
>>
>> This might be a bit inconvenient if during a transition period the  
>> author also wants to target legacy flavors of HTML in some of his  
>> authoring.
>
> This is exactly the situation I need to solve. Taking the pragmatic  
> point of view, a document author may be producing new content in  
> HTML 5, and continue to produce/maintain legacy content targeting  
> some other flavor of HTML. This may be as part of the same  
> site/directory tree or not. In any case, they desire the tool to just do  
> the right thing, without having to explicitly configure it when  
> working with mixed content.

Some XML editors solve the issue of opting for particular checking on  
a per-document basis by allowing the user to include an  
editor-specific processing instruction in the XML prolog. This doesn't  
work for HTML5, though, since the text/html syntax doesn't have  
processing instructions.

The closest thing currently conforming in text/html would be putting a  
modeline in a comment *after* the doctype. (Putting it before the  
doctype would interfere with doctype sniffing in IE.) However, for the  
time being, you could use the HTML5 doctype as the switch for .html  
files and defer the issue until later.
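
To make the "doctype as the switch" idea concrete, here is a minimal
sketch in Python. The function name guess_flavor and the two return
labels are made up for illustration, and the pattern assumes the tool
only needs to recognize the short doctype at the very start of the
file (optionally after a BOM and whitespace). Anything else is lumped
together as legacy.

```python
import re

# Assumption: only the short doctype counts as the HTML5 switch. It is
# matched case-insensitively at the start of the text, optionally after
# a byte order mark and whitespace.
HTML5_DOCTYPE = re.compile(r"^\ufeff?\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def guess_flavor(text):
    """Return 'html5' if the text starts with the short doctype, else 'legacy'."""
    return "html5" if HTML5_DOCTYPE.match(text) else "legacy"
```

A doctype that carries a public identifier does not match, so such a
file would fall through to whatever legacy checking the tool already
does.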

I suppose we could add a modeline attribute on the root element,  
provided that its content is a non-standard, tool-specific  
configuration identifier, so that general consuming apps cannot  
perform mode switching on it.
http://lists.w3.org/Archives/Public/public-html/2007JanMar/0433.html

>>> (We may be talking about a single document, or traversing a
>>> directory tree and processing all documents in the tree. In either
>>> case, the document type should be auto detected.)
>>
>> Wouldn't that kind of approach fail to detect that a set of  
>> documents isn't fully HTML5-compliant if a document in the set is  
>> autodetected as non-HTML5 and passes checks as whatever it was  
>> detected as?
>
> I'm not sure I understand the question.

Suppose I want to see if the .html files in a directory hierarchy are  
HTML5-compliant. If the documents can declare themselves as non-HTML5  
and avoid being checked as HTML5, I get the wrong answer.
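
A rough Python sketch of the check I have in mind, under the
assumption that the short doctype is the only marker the tool looks
at. The function name and the dict-of-texts input are hypothetical; a
real tool would read the files from disk. The point is that documents
that opt out get reported instead of being quietly checked as
something else.

```python
import re

# Assumption: a document counts as HTML5 only if it starts with the
# short doctype (case-insensitive, leading whitespace allowed).
HTML5_DOCTYPE = re.compile(r"^\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def non_html5_documents(documents):
    """Return the sorted names of documents lacking the short doctype.

    `documents` maps a file name to the text of that file.
    """
    return sorted(
        name for name, text in documents.items()
        if not HTML5_DOCTYPE.match(text)
    )
```

If the returned list is non-empty, the answer to "is this whole tree
HTML5?" is no, regardless of whether those documents pass as some
other flavor.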

>>> For HTML syntax, the shortened form of the doctype, "<!DOCTYPE HTML>",
>>> is required. This is sufficiently different from all previous doctypes
>>> that it can be mapped to HTML 5. But since there is no version
>>> information included in the doctype, what happens when the successor
>>> to HTML 5 comes out?
>>
>> When the successor of HTML5 comes out, authors are supposed to  
>> create content according to the requirements of the successor and  
>> no longer according to HTML5.
>>
>> This assumes, of course, that whoever defines the successor of  
>> HTML5 defines the successor reasonably, so that conforming HTML5  
>> documents remain conforming and mean the same thing according to  
>> the successor. The obvious problem with that assumption is that so  
>> far definers of HTML flavors have had a tendency to deprecate or  
>> obsolete features. We can hope that the definers of the successors  
>> of HTML5 don't seek to deprecate or obsolete anything unless the  
>> deprecated or obsoleted bit is so harmful that telling every author  
>> that their documents no longer conform is of paramount importance.
>
> Even in the case that HTML5 is fully upwards compatible with the  
> successor, there may still be reasons for conformance checking  
> against HTML5. (They may be valid technical reasons that we can't  
> foresee now, but will in the future. Or they may just be an  
> organization's standard practices, in which case the authors and  
> thus the tools need to support those standard practices.)

If there are issues we don't foresee now but we see when the successor  
of HTML5 is being defined, we can make the successor have a  
distinguishing feature at that time.

>>> For XHTML syntax, the doctype is to be omitted. In this situation,
>>> how should I autodetect that we are using XHTML 5 as opposed to some
>>> other flavor?
>>
>> By design, you shouldn't. Validator.nu defaults to XHTML5 + SVG 1.1  
>> + MathML 2.0 for application/xhtml+xml. I suggest doing the same  
>> for .xhtml (assuming that the tool in question is a text editor  
>> operating on local files): defaulting to the latest Web-relevant  
>> compound document format combination supported by the checker.
>
> Again, this is a similar problem to HTML5. Without a heuristic that  
> says "XHTML syntax, no doctype, probably XHTML 5", it seems like  
> there isn't a good way to infer an author's intent when the document  
> lives in a tree of documents targeting various specifications.

Other XML editors solve this using an editor-specific PI.
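
For illustration, a small Python sketch of how a tool might read such
a PI. The target name "example-editor" and the convention that the PI
data is simply a profile name are invented here and are not the syntax
of any particular editor. When no PI is present, the sketch falls back
to XHTML5, in line with defaulting to the latest format the checker
supports.

```python
from xml.dom import minidom


def checking_profile(xml_text, target="example-editor", default="xhtml5"):
    """Return the profile named by a top-level PI, or the default.

    Hypothetical convention: <?example-editor PROFILE?> placed before
    the root element selects the checking profile for the document.
    """
    doc = minidom.parseString(xml_text)
    # Only PIs that are direct children of the document are considered,
    # which covers PIs in the prolog.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == target):
            return node.data.strip()
    return default
```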

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/