[whatwg] text/html for html and xhtml (Was: Supporting MathML and SVG in text/html, and related topics)

Wed Apr 16 23:25:46 PDT 2008

On 17/04/2008, William F Hammond <hammond at csc.albany.edu> wrote:
>  Previously:
>
>   Yes, but the point is, once a user agent begins to sniff, there's no
>   rational excuse for it not to recognize compliant xhtml+(mathml|svg).

Yes there is. Live content relies on even perfectly well-formed XHTML
getting the HTML behaviours of CSS and the DOM. It also relies on
script and style elements having CDATA content, as in HTML, rather
than the #PCDATA content they have in XML. Thus scripts and style
sheets would be given an incompatible parsing that changes the meaning
of '&', '<' and XML comments within scripts, just to take one example.
That is, a script in a document that is both well-formed, valid XML
and proper HTML may end up with entirely different textual content
depending on which parser reads it. (The subset of live XHTML content
that uses embedded scripts which are also well-formed XML without
explicit CDATA wrapping is very small, though.)
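
A minimal sketch of that difference, assuming Python's standard-library
html.parser and xml.etree.ElementTree as stand-ins for a browser's HTML
and XML parsers (no browser uses either; the document string and the
ScriptText class are made up for illustration):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical document: well-formed XML that could also be sent as text/html.
DOC = (
    "<html><head><script>if (a &lt; b) x = 1;</script></head>"
    "<body></body></html>"
)


class ScriptText(HTMLParser):
    """Collects the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several pieces, so accumulate it.
        if self.in_script:
            self.chunks.append(data)


html_side = ScriptText()
html_side.feed(DOC)
html_side.close()
script_as_html = "".join(html_side.chunks)

# The XML parser expands the entity reference before any script engine
# would see the text.
script_as_xml = ET.fromstring(DOC).find(".//script").text

print("as HTML:", script_as_html)
print("as XML: ", script_as_xml)
```

Read as HTML, the script keeps the literal characters "&lt;"; read as
XML, the same bytes become "a < b". Content written for one parser
changes meaning under the other.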

>   >> What obstacles to this exist?
>   >
>   > The Web.
>
>   Really!?!

Really.

>  And then:
>
>  >>> The Web.
>  >>
>  >> Really!?!
>  >
>  > Yes, see for instance:
>  >
>  >    http://lists.w3.org/Archives/Public/public-html/2007Aug/1248.html
>
>  Taylor's comment is mainly about what happens when a user agent
>  confuses tag soup with good xhtml.
>
>  It is a different question how a user agent decides what it is looking
>  at.
>
>  Whether there is one mimetype or two, erroneous content will need
>  handling.  The experiment begun around 2001 of "punishing" bad
>  documents in application/xhtml+xml seems to have led to that mime type
>  not being much used.

We don't know how big a factor the draconianness of XML parsing really
is. The fact is, the single biggest consumer of those documents has
not begun supporting XHTML yet. Internet Explorer supports HTML and
XML but not the XHTML namespace in XML, nor the XHTML content type.
This alone makes everybody reluctant to serve application/xhtml+xml.
Sure, there are other complications from the XML draconianness than
this, but my point is that these are all compounded, so it's hard to
tell how effectively they have been put to the test. If you could run
the test again with Internet Explorer's non-support taken out of the
equation, then you would be able to say something about it. As it is
currently, you can't know either way.

>  So user agents need to learn how to recognize the good and the bad
>  in both mimetypes.
>
>  Otherwise you have Gresham's Law: the bad documents will drive out the
>  good.
>
>  The logical way to go might be this:
>
>  If it has a preamble beginning with "^<?xml " or a sensible
>  xhtml DOCTYPE declaration or a first element "<html xmlns=...>",
>  then handle it as xhtml unless and until it proves to be non-compliant
>  xhtml (e.g, not well-formed xml, unquoted attributes, munged handling
>  of xml namespaces, ...).  At the point it proves to be bad xhtml reload
>  it and treat it as "regular" html.

Doesn't work. We need DOM and CSS treatment as in HTML, not as in
XHTML, to be compatible with live content for those circumstances too.

>  So most bogus xhtml will then be 1 or 2 seconds slower than good xhtml.
>  Astute content providers will notice that and then do something about it.
>  It provides a feedback mechanism for making the web become better.

So, you argue that a document with an XHTML structure as text/html
should change semantics in ways that will affect functionality,
behaviour and presentation because of e.g. a single unescaped
ampersand in a URI or a single character that is invalid in the
declared encoding?
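
To make the ampersand case concrete, here is a rough sketch that
assumes Python's standard-library xml.etree.ElementTree and html.parser
as stand-ins for a browser's XML and HTML parsers (they are not what
any browser ships; the document and the HrefCollector class are
hypothetical):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical page whose only flaw is a bare '&' in a query string.
DOC = '<html><body><a href="page?a=1&b=2">link</a></body></html>'


class HrefCollector(HTMLParser):
    """Records every href attribute value the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


# XML side: the bare '&' is a well-formedness error, so parsing stops.
try:
    ET.fromstring(DOC)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# HTML side: the same bytes parse and the link survives intact.
collector = HrefCollector()
collector.feed(DOC)
collector.close()

print("well-formed XML:", xml_ok)
print("hrefs seen as HTML:", collector.hrefs)
```

Under the proposed fallback, that one character would decide which of
the two parsers, and so which DOM and CSS behaviour, the whole page
gets.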

My opinion:
Any feedback mechanism that directly hurts the user and only
indirectly hurts the publisher, as opposed to a feedback mechanism
that directly notifies the publisher, is totally backwards. Fail
early. Compile time is better than run time because a compile-time
failure is instantly obvious to the programmer: the build isn't
compiling, so there's no working but buggy build to give users. The analogy for web
content is that you should fail at publishing time instead of viewing
time if possible, because then you HAVE to correct your documents
before you can serve them to the user.

If you want to serve XML to users on the web, you should make sure
your tools cannot possibly serve malformed XML, by making absolutely
certain that the content has correct encoding (any defaulting must
confirm that the content actually conforms to the default encoding),
has a specified content type (defaulting is acceptable for fragments
here, but e.g. uploading raw files should require specifying the type)
and is a well formed fragment or document at publishing time, loudly
rejecting any content that is malformed. (Under publishing I
include all sources: design templates, content producers, information
from the database, advertisements, comments, trackbacks etc.)
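
A publishing-time gate along those lines could look roughly like this.
It is a sketch, not a complete tool: it assumes Python's standard
library, checks whole documents only (fragments would need wrapping in
a root element first), leaves the content type check out, and the
names PublishError, check_before_publish and is_publishable are
invented for the example.

```python
import xml.etree.ElementTree as ET


class PublishError(ValueError):
    """Raised when content must not be served as XML."""


def check_before_publish(raw: bytes, encoding: str = "utf-8") -> None:
    """Reject content that is mis-encoded or not well-formed XML."""
    try:
        # Strict decoding: a defaulted encoding still has to match the bytes.
        text = raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise PublishError(f"not valid {encoding}: {exc}") from exc
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        raise PublishError(f"not well-formed XML: {exc}") from exc


def is_publishable(raw: bytes) -> bool:
    """Convenience wrapper that reports the verdict as a boolean."""
    try:
        check_before_publish(raw)
    except PublishError:
        return False
    return True


# Hypothetical inputs from different sources feeding one page.
samples = {
    "template": b"<p>ok &amp; fine</p>",
    "comment with bare ampersand": b"<p>a & b</p>",
    "latin-1 bytes claimed as utf-8": b"<p>caf\xe9</p>",
}
for source, raw in samples.items():
    print(source, "->", "accept" if is_publishable(raw) else "reject")
```

The point of raising instead of repairing is that the person or tool
submitting the content sees the failure, not the reader.
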
-- 
David "liorean" Andersson