[whatwg] several messages about XML syntax and HTML5

Mon Dec 4 08:11:09 PST 2006

On Dec 4, 2006, at 2:55, Ian Hickson wrote:

> I've been having a lot of trouble following this discussion, because I
> can't work out what it is that is being asked for. There seem to be
> multiple discussions going on, and it isn't clear to me that everybody
> really knows what they are arguing for or against.

This discussion is pretty confusing indeed.

> I've changed the spec to allow a (meaningless) "xmlns" attribute on the
> root <html> element, for the same reasons /> is allowed on void elements
> now. I don't think it's a particularly useful thing, but I'm curious to
> see what people think. (Like anything in the spec, we might remove it in
> due course, based on real world experiences with the spec.)

I think that'll be useful.

>>> Well, SVG itself would arguably be bad because it is poor from a
>>> semantic standpoint.
>>
>> HTML is poor from a semantic standpoint.
>
> HTML is actually pretty rich, all things considered. SVG, on the other
> hand, is media-specific and presentational.

<div> and <span> are poor from a semantic standpoint. They're still  
useful for a variety of reasons and I see no one arguing they should  
be excluded. I'm not saying SVG should or should not be added to  
HTML, but I'm pretty sure inline SVG is useful too.

> On Sat, 2 Dec 2006, Mike Schinkel wrote:
>>
>> But please take into consideration that almost nobody writes web pages
>> using a DOM; they write web pages using text editors and dynamically
>> using string concatenation. As such there is great value for users in
>> having them be as similar as possible. If they converge, it will
>> accelerate chaos on the web.
>
> With the addition of xmlns="" (see above), they are now as close as
> possible, I believe.

That's probably all that can be done on the HTML side. But would  
something bad happen if you were to make html:lang valid within XHTML?

> On Sat, 2 Dec 2006, Elliotte Harold wrote:
>
>> What I don't understand is why some members of this working group are so
>> dead set on actively preventing HTML from being XML. The non-draconian
>> error handling I understand. But why are you disappointed that <!DOCTYPE
>> html> is well-formed XML? Why the active hostility to well-formedness?

I was initially disappointed that <!DOCTYPE html> is well-formed
because I thought it would make it possible to differentiate HTML from
XHTML documents unambiguously (since XHTML documents couldn't have it).
That said, I now think it's probably irrelevant.

The two formats are not the same, but for various reasons many people
have been trying to find common ground ever since XHTML was invented.
The result is a lot of HTML documents which are wrongly identified as
XHTML (because they're not even well-formed XML). So I think dropping
the HTML/XHTML identification string altogether is the right thing to
do; it's meaningless anyway because a lot of authors are careless.
Let's use the media type instead, the real thing browsers use to
differentiate the two, and force people to make things well-formed if
they want it called XHTML by the validator.

> What I'm "hostile" towards is the fiction that you can take an XML parser
> and attempt to parse an HTML document. The two formats aren't the same,
> using the wrong parser is simply that, wrong.

I don't think many people really think this. I think those who say so
do it because they've been using some subset of HTML which is
compatible with XML for their own documents, so *their* HTML
documents can be sent through an XML parser. But I'm pretty sure
people on this list realise that this doesn't apply to the general case.
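
To make the distinction concrete, here is a minimal sketch in Python. The
two sample documents are made up for illustration and are not taken from
this thread. The first sticks to an XML-compatible subset of HTML; the
second is ordinary HTML with an unclosed <p> and a bare <br>.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents (assumptions, not from the thread).
XML_COMPATIBLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Test</title></head>'
    '<body><p>Hello<br/>world</p></body>'
    '</html>'
)
TYPICAL_HTML = (
    '<html><head><title>Test</title></head>'
    '<body><p>Hello<br>world</body></html>'
)


def parses_as_xml(markup):
    """Return True if an XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(XML_COMPATIBLE))
print(parses_as_xml(TYPICAL_HTML))
```

The XML parser accepts the first document and rejects the second, which
is all the claim above amounts to: it holds for a chosen subset, not for
HTML in general.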

>    http://wiki.whatwg.org/wiki/HTML_vs._XHTML#Differences_Between_HTML_and_XHTML

Nice resource. That could prove very handy.

>> The other half could be addressed by one little box in the corner of
>> Firefox's status bar that's a smiley face if the page is valid, and a
>> frown if it isn't.
>
> A browser that shipped with a frowny face showing on 93% of pages would do
> very badly in usability studies (and thus very badly in the market).

I just want to point out, in case someone is interested, that there is
actually a browser like this for the Mac: iCab [1].

  [1]: http://icab.de/

> In the Web Apps 1.0 world, an HTTP message whose headers say text/html is
> an HTML document, regardless of what sequence of bytes the body of the
> message actually says. An HTTP message whose headers say text/xml, or use
> some other XML MIME type, is an XML document. It's the MIME type that
> decides how it is processed. If it is processed as an HTML document, then
> it _is_ an HTML document, possibly with errors. So says the spec.

I just want to say I like this definition very much.
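
As a rough sketch of that rule (in Python, not anything from the spec
itself), the function below picks a processing mode from the Content-Type
header alone and never looks at the body. The function name, the set of
XML types and the "+xml" suffix check are my own assumptions for
illustration.

```python
from email.message import Message

# Assumed list of XML media types for this sketch; the "+xml" suffix
# check below is likewise an illustrative assumption.
XML_TYPES = {"text/xml", "application/xml"}


def processing_mode(content_type_header):
    """Pick "html", "xml" or "other" from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and lowercases
    # the type/subtype pair.
    mime = msg.get_content_type()
    if mime == "text/html":
        return "html"
    if mime in XML_TYPES or mime.endswith("+xml"):
        return "xml"
    return "other"


print(processing_mode("text/html; charset=utf-8"))
print(processing_mode("application/xhtml+xml"))
```

Under this sketch a well-formed XHTML body sent as text/html is still
handled as HTML, since the body never enters the decision.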

> On Sat, 2 Dec 2006, Michel Fortin wrote:
>>
>> Having two markups poses the same problem as having two incompatible HD
>> DVD formats. Browsers do (or will) accept both formats, so as long as
>> the media type is known it'll work fine for them. But what about every
>> other piece of software in the middle that does not talk directly to the
>> browser?
>>
>> That's the real difficulty when dealing with HTML and XHTML: the choice
>> isn't really about tools, it's a choice between two incompatible
>> exchange formats. That's the reason why I think it's compelling to have a
>> common subset between HTML and XHTML. If you can output something valid
>> for both HTML and XHTML at the same time, then you don't have to worry
>> about what format is supported on the other end.
>
> The problem is that the common subset would be just that -- a subset. The
> common subset of HTML and XHTML has very few useful features!

I don't see that as a problem. But before arguing whether the subset is
or isn't too tiny to be useful, shouldn't we first define what the
subset actually is?

The only features of HTML I see that are not supported by the subset
are <base> vs. xml:base, and specifying the encoding within the file,
because one uses <?xml encoding=""?> and the other uses <meta
http-equiv=""> (but the encoding can still be set as a media type
parameter). Setting the language is not in the valid part of the
subset, but if you don't care about validity on the XHTML side you
could just use html:lang, so I'll put it in the functional part of the
subset.

That doesn't leave much of HTML that can't be expressed by the  
subset. Am I missing something? Which useful features aren't part of  
the subset?

There are of course some differences in CSS and in scripting too, but  
that's nothing that can't be worked around.
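
To give an idea of what a document in that subset looks like, here is a
minimal sketch in Python. The sample document is made up, and the standard
library's html.parser is only a tokenizer standing in for a real browser
HTML parser, so take it as an illustration under those assumptions rather
than a conformance test.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical document restricted to the common subset: xmlns on <html>,
# every element explicitly closed, "/>" on the void <br> element.
POLYGLOT = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Subset test</title></head>'
    '<body><p>First line<br />second line</p></body>'
    '</html>'
)


class TagCollector(HTMLParser):
    """Record start tags in the order the HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_seen_as_html(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def tags_seen_as_xml(markup):
    # ElementTree names elements "{namespace}local"; keep the local part.
    return [el.tag.split("}", 1)[-1] for el in ET.fromstring(markup).iter()]


print(tags_seen_as_html(POLYGLOT))
print(tags_seen_as_xml(POLYGLOT))
```

Both functions report the same sequence of elements for this document,
which is the property a producer writing to the subset relies on when it
doesn't know which parser sits on the other end.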

> On Sat, 2 Dec 2006, Elliotte Harold wrote:
>
>> James Graham wrote:
>>
>>> Well I think you're hugely mistaken. Any model without support for
>>> error recovery is not suitable for hand authoring (and only marginally
>>> suitable for machine authoring).
>>
>> You mean like almost every programming language ever invented? When's
>> the last time you saw error recovery in a C compiler?
>
> JavaScript is a more apt example, since it's used on the Web... and it has
> error recovery all over the place.

I think Elliotte's point is nonetheless valid. There are languages
pretty lax about syntax, and there are languages pretty strict about
syntax; people can hand-write both kinds. The only important thing is
that they check that it works (by compiling, or by testing the page in
a browser with the same parser that will handle it once served) and
that they are able to figure out what's wrong and fix it.

> On Sun, 3 Dec 2006, Elliotte Harold wrote:
>>
>> WordPress allows angle brackets. However I almost never use them. Instead I
>> use its markdown format. Most other users do the same, I think. [...]
>>
>> I suspect the others you mention are similar. I don't ever remember using
>> angle brackets on Blogger, but it's been a while.
>
> It would be better to have hard data to work with, rather than having to
> rely on our opinions of this. My own research does not suggest that most
> authors use tools. That over three quarters of pages have major syntactic
> errors leads me to suspect that tools are not going to save the syntax.

I concur with Ian here. Leaving comments on blogs and elsewhere often
requires me to add links as HTML. That doesn't mean that the blog
software won't fix any incorrect markup I've sent, however.

I'd add that even if it were true that hand authoring is a distinct
minority, do you have any idea how often people want to bypass
custom syntaxes and write raw HTML? I'd say pretty often. There's a
reason why Textile, Markdown and some other lightweight markup
languages (as Wikipedia calls them) have means to do so.

Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/