[whatwg] several messages about XML syntax and HTML5
Michel Fortin
michel.fortin at michelf.com
Mon Dec 4 08:11:09 PST 2006
On Dec 4, 2006, at 2:55, Ian Hickson wrote:
> I've been having a lot of trouble following this discussion, because I
> can't work out what it is that is being asked for. There seem to be
> multiple discussions going on, and it isn't clear to me that everybody
> really knows what they are arguing for or against.
This discussion is pretty confusing indeed.
> I've changed the spec to allow a (meaningless) "xmlns" attribute on the
> root <html> element, for the same reasons /> is allowed on void elements
> now. I don't think it's a particularly useful thing, but I'm curious to
> see what people think. (Like anything in the spec, we might remove it in
> due course, based on real world experiences with the spec.)
I think that'll be useful.
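If I read the change right, something like the following sketch would now be conforming, the attribute carrying no meaning beyond matching what an XHTML serializer emits (the URI being the usual XHTML namespace):

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Example</title>
    </head>
    <body>
      <p>The xmlns attribute on the root element is now tolerated.</p>
    </body>
    </html>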
>>> Well, SVG itself would arguably be bad because it is poor from a
>>> semantic standpoint.
>>
>> HTML is poor from a semantic standpoint.
>
> HTML is actually pretty rich, all things considered. SVG, on the other
> hand, is media-specific and presentational.
<div> and <span> are poor from a semantic standpoint. They're still
useful for a variety of reasons and I see no one arguing they should
be excluded. I'm not saying SVG should or should not be added to
HTML, but I'm pretty sure inline SVG is useful too.
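Just to illustrate what inline use might look like (this is hypothetical markup, not something the current draft defines), I'm thinking of things like a small graphic dropped right into a paragraph:

    <p>Sales went
      <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20"
           viewBox="0 0 20 20">
        <path d="M 2 18 L 18 2" stroke="green" fill="none"/>
      </svg>
    up this quarter.</p>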
> On Sat, 2 Dec 2006, Mike Schinkel wrote:
>>
>> But please take into consideration that almost nobody writes web pages
>> using a DOM; they write web pages using text editors and dynamically
>> using string concatenation. As such there is great value for users in
>> having them be as similar as possible. If they converge, it will
>> accelerate chaos on the web.
>
> With the addition of xmlns="" (see above), they are now as close as
> possible, I believe.
That's probably all that can be done on the HTML side. But would
something bad happen if you were to make html:lang valid within XHTML?
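If I recall correctly, XHTML 1.0 already allows the lang attribute alongside xml:lang (it's XHTML 1.1 that dropped it), so a document aiming at both serializations could simply carry the two:

    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">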
> On Sat, 2 Dec 2006, Elliotte Harold wrote:
>
>> What I don't understand is why some members of this working group are so
>> dead set on actively preventing HTML from being XML. The non-draconian
>> error handling I understand. But why are you disappointed that
>> <!DOCTYPE html> is well-formed XML? Why the active hostility to
>> well-formedness?
I was initially disappointed that <!DOCTYPE html> is well-formed
because I thought it would allow differentiating HTML from XHTML
documents unambiguously (since XHTML documents couldn't have it).
That said, I now think it's probably irrelevant.
The two formats are not the same, but many people have been trying to
find common ground, for various reasons, ever since XHTML was invented.
The result is a lot of HTML documents which are wrongly identified as
XHTML (because they're not even well-formed XML). So I think dropping
the HTML/XHTML identification string altogether is the right thing to
do; it's meaningless anyway because a lot of authors are careless.
Let's use the media type instead, the real thing browsers use to
differentiate the two, and force people to make things well-formed if
they want the validator to call it XHTML.
> What I'm "hostile" towards is the fiction that you can take an XML parser
> and attempt to parse an HTML document. The two formats aren't the same,
> using the wrong parser is simply that, wrong.
I don't think many people really believe this. I think those who say it
do so because they've been using, for their own documents, some subset
of HTML which is compatible with XML, so *their* HTML documents can be
sent through an XML parser. But I'm pretty sure people on this list
realise that this doesn't apply to the general case.
> http://wiki.whatwg.org/wiki/HTML_vs._XHTML#Differences_Between_HTML_and_XHTML
Nice resource. That could prove very handy.
>> The other half could be addressed by one little box in the corner of
>> Firefox's status bar that's a smiley face if the page is valid, and a
>> frown if it isn't.
>
> A browser that shipped with a frowny face showing on 93% of pages would do
> very badly in usability studies (and thus very badly in the market).
I just want to point out in case someone is interested that there is
actually a browser like this for the Mac: iCab [1].
[1]: http://icab.de/
> In the Web Apps 1.0 world, an HTTP message whose headers say text/html is
> an HTML document, regardless of what sequence of bytes the body of the
> message actually say. An HTTP message whose headers say text/xml, or use
> some other XML MIME type, is an XML document. It's the MIME type that
> decides how it is processed. If it is processed as an HTML document, then
> it _is_ an HTML document, possibly with errors. So says the spec.
I just want to say I like this definition very much.
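In other words, the exact same bytes become two different documents depending only on the header they're served with; roughly:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    (body is parsed as HTML; errors are recovered from)

    HTTP/1.1 200 OK
    Content-Type: application/xhtml+xml; charset=utf-8
    (body is parsed as XML; a well-formedness error is fatal)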
> On Sat, 2 Dec 2006, Michel Fortin wrote:
>>
>> Having two markups poses the same problem as having two incompatible
>> HD DVD formats. Browsers do (or will) accept both formats, so as long as
>> the media type is known it'll work fine for them. But what about every
>> other piece of software in the middle that does not talk directly to the
>> browser?
>>
>> That's the real difficulty when dealing with HTML and XHTML: the choice
>> isn't really about tools, it's a choice between two incompatible
>> exchange formats. That's the reason why I think it's compelling to have a
>> common subset between HTML and XHTML. If you can output something valid
>> for both HTML and XHTML at the same time, then you don't have to worry
>> about what format is supported on the other end.
>
> The problem is that the common subset would be just that -- a subset. The
> common subset of HTML and XHTML has very few useful features!
I don't see that as a problem. But before arguing about whether the
subset is or isn't too tiny to be useful, shouldn't we first define
what the subset actually is?
The only features of HTML I see that are not supported by the subset
are <base> vs. xml:base, and that you can't specify the encoding within
the file (because one uses <?xml encoding=""?> and the other uses <meta
http-equiv="">, but the encoding can still be set as a media type
parameter). Setting the language is not in the valid part of the
subset, but if you don't care about validity on the XHTML side you
could just use html:lang, so I'll put it in the functional part of the
subset.
That doesn't leave much of HTML that can't be expressed by the
subset. Am I missing something? Which useful features aren't part of
the subset?
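To make the discussion concrete, here is a rough sketch of what a document restricted to that functional subset might look like (encoding left to the Content-Type header, language given with both lang and xml:lang, end tags always explicit, and /> only on void elements):

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
      <title>Subset example</title>
    </head>
    <body>
      <p>An image: <img src="photo.jpg" alt="A photo" /></p>
      <hr />
    </body>
    </html>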
There are of course some differences in CSS and in scripting too, but
that's nothing that can't be worked around.
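On the scripting side, for instance, document.write() does nothing useful in an XML document, so a script meant for both serializations has to build nodes with DOM calls instead. A minimal sketch (the "output" id is just something I made up for the example):

    <script type="text/javascript">
    // Assumes an element with id="output" already exists in the page.
    // Works whether the document is parsed as HTML or as XHTML,
    // unlike document.write(), which is useless in XML mode.
    var p = document.createElement("p");
    p.appendChild(document.createTextNode("Generated by script"));
    document.getElementById("output").appendChild(p);
    </script>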
> On Sat, 2 Dec 2006, Elliotte Harold wrote:
>
>> James Graham wrote:
>>
>>> Well I think you're hugely mistaken. Any model without support for
>>> error recovery is not suitable for hand authoring (and only marginally
>>> suitable for machine authoring).
>>
>> You mean like almost every programming language ever invented? When's
>> the last time you saw error recovery in a C compiler?
>
> JavaScript is a more apt example, since it's used on the Web... and it has
> error recovery all over the place.
I think Elliotte's point is nonetheless valid. Some languages are
pretty lax about syntax, others are pretty strict; people can
hand-write both kinds. The only important thing is that authors check
that it works (by compiling, or by testing the page in the browser
with the same parser that will be used once it is served) and that
they are able to figure out what's wrong and fix it.
> On Sun, 3 Dec 2006, Elliotte Harold wrote:
>>
>> WordPress allows angle brackets. However I almost never use them. Instead
>> I use its markdown format. Most other users do the same, I think. [...]
>>
>> I suspect the others you mention are similar. I don't ever remember using
>> angle brackets on Blogger, but it's been a while.
>
> It would be better to have hard data to work with, rather than having to
> rely on our opinions of this. My own research does not suggest that most
> authors use tools. That over three quarters of pages have major syntactic
> errors leads me to suspect that tools are not going to save the syntax.
I concur with Ian here. Leaving comments on blogs and elsewhere often
requires me to add links as HTML. That doesn't mean, however, that the
blog software won't fix any incorrect markup I submit.
I'd add that even if it were true that hand authoring is a distinct
minority, do you have any idea how often people want to bypass custom
syntaxes and write raw HTML? I'd say pretty often. There's a reason
why Textile, Markdown and some other lightweight markup syntaxes (as
Wikipedia calls them) provide ways to do so.
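As an example of what I mean by bypassing the syntax, Markdown passes block-level HTML through untouched, so a post can mix the two freely whenever the lightweight syntax falls short (tables, for instance):

    Here is a regular *Markdown* paragraph.

    <table>
      <tr><td>This table is raw HTML,</td></tr>
      <tr><td>since Markdown has no table syntax of its own.</td></tr>
    </table>

    Back to Markdown text.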
Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/