[html5] Identifying HTML 5 documents? (vs. alternate flavors)

Mon Feb 4 08:39:23 PST 2008

On Feb 4, 2008, at 11:24 AM, Henri Sivonen wrote:

> On Feb 4, 2008, at 17:28, Jim Correia wrote:
>
>> I know there has been some discussion about this on the forum. But
>> after having read through the draft spec and the FAQ, I'm still a
>> little unclear about how I can auto-detect that a document is using
>> HTML 5.
>
> The short answer is that HTML5 by design tries to discourage you  
> from trying to do that.

I can understand that discouraging user-agents from doing this might  
be a good thing. At the same time, it appears to make life more  
difficult for those of us who produce authoring tools which must  
support legacy formats alongside HTML 5.

>> (Or more precisely, that the author of the document intended
>> it to be conformant to HTML 5.)
>
> HTML5 is designed so that this doesn't need to be asserted to the  
> other party when sending HTML5 content to a consuming client. In the  
> case of an author who is conformance checking his own stuff (as  
> opposed to communicating with another party), the theory goes that  
> the author simply chooses to use a tool that only supports HTML5 or  
> that is configured to support HTML5.
>
> This might be a bit inconvenient if during a transition period the  
> author also wants to target legacy flavors of HTML in some of his  
> authoring.

This is exactly the situation I need to solve. Taking the pragmatic  
point of view, a document author may be producing new content in HTML  
5, and continue to produce/maintain legacy content targeting some  
other flavor of HTML. This may be as part of the same site/directory  
tree or not. In any case, they desire the tool to just do the right  
thing, without having to explicitly configure it when working with  
mixed content.

>> (We may be talking about a single
>> document, or traversing a directory tree and processing all documents
>> in the tree. In either case, the document type should be
>> auto-detected.)
>
> Wouldn't that kind of approach fail to detect that a set of  
> documents isn't fully HTML5-compliant if a document in the set is  
> autodetected as non-HTML5 and passes checks as whatever it was  
> detected as?

I'm not sure I understand the question.

>> For HTML syntax, the shortened form of the doctype "<!DOCTYPE HTML>" is
>> required. This is sufficiently different from all previous doctypes
>> that it can be mapped to HTML 5. But since there is no version
>> information included in the doctype, what happens when the successor
>> to HTML 5 comes out?
>
> When the successor of HTML5 comes out, authors are supposed to  
> create content according to the requirements of the successor and no  
> longer according to HTML5.
>
> This assumes, of course, that whoever defines the successor of HTML5  
> defines the successor reasonably, so that conforming HTML5 documents  
> remain conforming and mean the same thing according to the  
> successor. The obvious problem with that assumption is that so far  
> definers of HTML flavors have had a tendency to deprecate or  
> obsolete features. We can hope that the definers of the successors  
> of HTML5 don't seek to deprecate or obsolete anything unless the  
> deprecated or obsoleted bit is so harmful that telling every author  
> that their documents no longer conform is of paramount importance.

Even if HTML5 is fully upward compatible with its successor, there  
may still be reasons for conformance checking against HTML5. (They  
may be valid technical reasons that we can't foresee now but will in  
the future. Or they may just be an organization's standard practices,  
in which case the authors, and thus the tools, need to support those  
standard practices.)

>> For XHTML syntax, the doctype is to be omitted. In this situation,
>> how should I autodetect that we are using XHTML 5 as opposed to some
>> other flavor?
>
> By design, you shouldn't. Validator.nu defaults to XHTML5 + SVG 1.1  
> + MathML 2.0 for application/xhtml+xml. I suggest doing the same  
> for .xhtml (assuming that the tool in question is a text editor  
> operating on local files): defaulting to the latest Web-relevant  
> compound document format combination supported by the checker.

Again, this is a similar problem to HTML5. Without a heuristic that  
says "XHTML syntax, no doctype, probably XHTML 5", it seems like  
there isn't a good way to infer an author's intent when the document  
lives in a tree of documents targeting various specifications.
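
To make the kind of heuristic I have in mind concrete, here is a rough  
sketch in Python. It is only an illustration: the function name, the  
labels, and the is_xml flag (which a tool might set from the file  
extension or MIME type) are all made up for this example, and the  
"no doctype means XHTML 5" branch is exactly the guess discussed  
above, not something the draft spec defines.

```python
import re

# Hypothetical labels for this sketch; none of these names come from a spec.
HTML5 = "html5"
XHTML5_GUESS = "xhtml5 (guess)"
LEGACY = "legacy"
UNKNOWN = "unknown"

# An optional XML declaration at the very start of the document.
_XML_DECL = re.compile(r"\s*<\?xml\b.*?\?>", re.DOTALL)
# Whitespace and comments that may precede the doctype.
_LEADING = re.compile(r"\A(?:\s+|<!--.*?-->)*", re.DOTALL)
# A doctype declaration; group 1 is everything after the keyword.
_DOCTYPE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)


def guess_flavor(text, is_xml=False):
    """Guess which flavor the author probably intended (heuristic only)."""
    text = text.lstrip("\ufeff")  # drop a byte order mark if present
    if is_xml:
        decl = _XML_DECL.match(text)
        if decl:
            text = text[decl.end():]
    text = _LEADING.sub("", text)

    doctype = _DOCTYPE.match(text)
    if doctype is None:
        # Assumption: XML syntax with no doctype is "probably XHTML 5".
        return XHTML5_GUESS if is_xml else UNKNOWN

    # Normalize whitespace and case before comparing.
    body = " ".join(doctype.group(1).split()).lower()
    if body == "html":
        return HTML5  # the short form, <!DOCTYPE HTML>
    return LEGACY     # any doctype carrying a public or system identifier


if __name__ == "__main__":
    print(guess_flavor("<!DOCTYPE HTML>\n<html></html>"))
    print(guess_flavor('<html xmlns="http://www.w3.org/1999/xhtml"/>', is_xml=True))
```

Note that nothing here can tell HTML 5 apart from a future successor  
that keeps the same doctype, which is the problem raised earlier in  
this thread.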

- Jim