[whatwg] Unsafe SGML minimizations

Thu Sep 8 09:03:26 PDT 2005

On Thu, 8 Sep 2005, Henri Sivonen wrote:
> On Sep 8, 2005, at 17:26, Ian Hickson wrote:
> 
> > On Thu, 8 Sep 2005, Henri Sivonen wrote:
> > > 
> > > I think the text/html flavor of HTML5 should not allow the following SGML
> > > minimization features (which are theoretically allowed in HTML 4), because
> > > each of them causes problems in at least one of Opera, Firefox and Safari.
> > > 
> > >  * <>
> > >  * </>
> > 
> > Agreed. Those should generate comment nodes, I think.
> 
> Opera, Firefox and Safari already interoperably handle <> as character data
> (equivalent to &lt;&gt;) and ignore </>.

That works too.

> > >  * tagc omission, i.e. <foo<bar>...</bar</foo>
> > 
> > Well we have to define what that does, and the most obvious error handling
> > behaviour here is to start the new tag. So effectively, I would say we
> > should have TAGC omission.
> 
> But it would still be an error as far as a conformance checker is 
> concerned, right?

I don't have an opinion on that either way. I guess it seems reasonable to 
make it an error. At this point I'm more worried about getting the UA 
rules down before worrying about what the author can or can't do.

> > >  * <foo/bar/
> > 
> > Agreed, sadly. That would be equivalent to something like <foo /bar/="">
> > (or something similar).
> 
> I think the HTML5 spec should allow TagSoup to be updated for HTML5 or an
> equivalent of TagSoup for HTML5 to be written. TagSoup guarantees to the
> application that it acts as if it was an XML parser parsing XHTML. Therefore,
> XML and, by extension, the SAX2 API contract restrict the attribute names to
> legal XML attribute names. If HTML5 required "/bar/" to be reported as an
> attribute name, TagSoup would have to violate that constraint and could not
> claim conformance.

I think it's pretty much guaranteed that HTML5's parsing model will be 
able to generate DOMs that can't be serialised to conformant XML syntax 
without data loss.

For example, the list of characters that must be recognised as part of an 
element or attribute name when hitting an unknown element or attribute is 
bigger than the list of characters XML allows. Similarly, a comment in 
HTML can contain the string "--" (assuming it comes in pairs), while an 
XML comment cannot. This latter example even affects conforming documents.
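
To make the two examples above concrete, here is a rough Python sketch. It
is only an illustration: the name pattern is a simplified, ASCII-only
approximation of the XML Name production (the real one admits many more
characters), and the helper names is_xml_name and can_serialize_comment are
made up for this example. The comment check relies on the fact that
Python's xml.dom.minidom refuses to write a comment containing "--".

```python
import re
from xml.dom import minidom

# Simplified, ASCII-only approximation of the XML Name production.
# This is an assumption for illustration; real XML allows more characters.
ASCII_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_xml_name(name):
    """Return True if name fits the simplified XML Name pattern."""
    return ASCII_XML_NAME.fullmatch(name) is not None


def can_serialize_comment(data):
    """Return True if minidom can write a comment holding this data."""
    doc = minidom.Document()
    node = doc.createComment(data)
    try:
        node.toxml()
    except ValueError:
        # minidom raises ValueError when the comment data contains "--".
        return False
    return True


if __name__ == "__main__":
    print(is_xml_name("bar"), is_xml_name("/bar/"))
    print(can_serialize_comment("plain"), can_serialize_comment("a -- b"))
```

A serializer in that position has to either rewrite the name or comment
(losing data) or emit something that is not well-formed XML.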

> > >  * attribute name omission (except for the well-known "boolean
> > > attributes")
> > 
> > Again, we have to define error handling. <foo bar baz> will probably just
> > be equivalent to <foo bar="" baz="">.
> 
> I have previously argued for <foo bar="bar" baz="baz"> in the 
> TagSoup-like scenario, because that would be the same as the treatment 
> required for the "boolean attributes".

That wouldn't be backwards compatible, IIRC.
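
The two candidate treatments of <foo bar baz> can be compared with a small
Python sketch. It leans on the standard html.parser module, which reports a
valueless attribute with the value None; everything else here (the
AttrCollector class, parse_attrs, and the two fill functions) is
hypothetical scaffolding for the comparison, not anything a spec or a
browser defines.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record (tag, attrs) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def parse_attrs(markup, fill):
    """Return the first start tag's attributes as a dict.

    html.parser reports valueless attributes as None; fill(name)
    decides what value to substitute for them.
    """
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    _tag, attrs = parser.tags[0]
    return {n: (fill(n) if v is None else v) for n, v in attrs}


def empty_value(name):
    """Treatment 1: <foo bar> behaves like <foo bar="">."""
    return ""


def name_as_value(name):
    """Treatment 2: <foo bar> behaves like <foo bar="bar">."""
    return name


if __name__ == "__main__":
    print(parse_attrs("<foo bar baz>", empty_value))
    print(parse_attrs("<foo bar baz>", name_as_value))
```

Under the first treatment the result is the same as parsing
<foo bar="" baz="">; under the second, each attribute carries its own name
as its value.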

I've been looking at misnested tags recently (hence my replying to this 
e-mail despite normally archiving the e-mails about HTML parsing so that I 
can get back to them when I start work on that part of the spec). I 
assume, based on the line of reasoning that you've been describing above, 
that you would agree with me that we should forgo compatibility with IE 
in the DOM it forms in response to markup such as:

   <body> <form> <div> </form> TEXT NODE </div> </body>

What IE does in this case is make the TEXT NODE's parent be the <div> and 
its previous sibling be the <form>.

What browsers do tends to vary; but with markup such as the above Firefox 
and Safari interoperate on saying that the </form> is ignored and the form 
instead continues up to the </body>. However, the exact opposite:

   <body> <div> <form> </div> TEXT NODE </form> </body>

...does not do the opposite in those browsers, despite (in IE) the DOM 
being equivalent to the previous case. Here, the </div> is not ignored, it 
implies the </form> and the TEXT NODE ends up a child of <body>.
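
As a rough sketch of the asymmetry, the toy tree builder below reproduces
the Firefox/Safari results for just these two inputs. The rule it uses (an
end tag for form is ignored unless the form is the innermost open element,
while any other end tag closes everything up to its matching start tag) is
an assumption chosen to fit the two cases described here, not a description
of what those browsers actually implement. The names Node, build and
ancestors_of_text are invented for the example.

```python
import re


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


# Start tag, end tag, or a run of text; attributes are not handled.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<]+)")


def build(markup):
    """Build a tree using the assumed rule described above."""
    root = Node("#document")
    stack = [root]
    for match in TOKEN.finditer(markup):
        closing, name, text = match.groups()
        if text is not None:
            if text.strip():
                Node("#text:" + text.strip(), stack[-1])
        elif not closing:
            stack.append(Node(name, stack[-1]))
        else:
            if name not in [n.name for n in stack]:
                continue  # stray end tag: nothing open to close
            if name == "form" and stack[-1].name != "form":
                continue  # assumed rule: </form> ignored while children are open
            # Close everything up to and including the matching element.
            while stack.pop().name != name:
                pass
    return root


def ancestors_of_text(root):
    """Return the ancestor names of the first text node found."""
    todo = [root]
    while todo:
        node = todo.pop()
        if node.name.startswith("#text"):
            chain = []
            while node.parent is not None:
                node = node.parent
                chain.append(node.name)
            return chain
        todo.extend(node.children)
    return None


if __name__ == "__main__":
    a = "<body> <form> <div> </form> TEXT NODE </div> </body>"
    b = "<body> <div> <form> </div> TEXT NODE </form> </body>"
    print(ancestors_of_text(build(a)))
    print(ancestors_of_text(build(b)))
```

With the first input the text node ends up inside the <div>, which is
inside the <form>; with the second it ends up directly under <body>. The IE
result, where the text node's parent is the <div> and its previous sibling
is the <form>, cannot be represented by this kind of stack-built tree at
all.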

Trying to work out all the various cases is giving me a headache...

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'