[whatwg] Unsafe SGML minimizations

Henri Sivonen hsivonen at iki.fi
Thu Sep 8 13:39:54 PDT 2005

On Sep 8, 2005, at 19:03, Ian Hickson wrote:
> On Thu, 8 Sep 2005, Henri Sivonen wrote:
>> On Sep 8, 2005, at 17:26, Ian Hickson wrote:
>>> On Thu, 8 Sep 2005, Henri Sivonen wrote:

>>>>  * tagc omission i.e. <foo<bar>...</bar</foo>
>>> Well we have to define what that does, and the most obvious error
>>> handling behaviour here is to start the new tag. So effectively, I
>>> would say we should have TAGC omission.
>> But it would still be an error as far as a conformance checker is
>> concerned, right?
> I don't have an opinion on that either way. I guess it seems
> reasonable to make it an error. At this point I'm more worried about
> getting the UA rules down before worrying about what the author can or
> can't do.

I view conformance checking as an authoring aid that is supposed to 
help authors make pages that work. Therefore, if there is syntactic 
sugar that is known to cause problems in real browsers, it would be 
helpful if a conformance checker flagged it as an error. To help 
conformance checker developers avoid having to endlessly defend their 
subjective judgment against people who want to keep their errors but 
argue them right ( http://diveintomark.org/archives/2004/08/16/specs ), 
it would be nice if such bad syntactic sugar was proclaimed 
non-conforming in the spec (even if unambiguous error handling was 

Tagc omission breaks in current Opera, which makes tagc omission bad 
syntactic sugar from the practical point of view.

>> I think the HTML5 spec should allow TagSoup to be updated for HTML5 or
>> an equivalent of TagSoup for HTML5 to be written. TagSoup guarantees to
>> the application that it acts as if it was an XML parser parsing XHTML.
>> Therefore, XML and, by extension, the SAX2 API contract restrict the
>> attribute names to legal XML attribute names. If HTML5 required "/bar/"
>> to be reported as an attribute name, TagSoup would have to violate that
>> constraint and could not claim conformance.
> I think it's pretty much guaranteed that HTML5's parsing model will be
> able to generate DOMs that can't be serialised to conformant XML syntax
> without data loss.

I am assuming that those situations do not arise if the document is 
conforming and the loss of details that are lost in XML c14n does not 
count as data loss. It would be very nice if you defined conformance in 
such a way that this assumption held true. :-)

> For example, the list of characters that must be recognised as part of
> an element or attribute name when hitting an unknown element or
> attribute is bigger than the list of characters XML allows.

For the purpose of conformance checking, I've gone the other way and 
limited names to ASCII. I think that's OK, because conforming names are 
ASCII. However, I expect that I will have to polish the code that looks 
for unquoted attribute values. (But I think conforming unquoted 
attribute values should not include values that weren't SGML-valid in 
HTML 4.)
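
A minimal sketch of what such an ASCII-only check could look like. The
patterns are an assumption based on the HTML 4 rules (names start with a
letter; name characters and unquoted attribute values are limited to
letters, digits, ".", "-", "_" and ":"); the function names are
hypothetical and are not taken from any actual checker.

```python
import re

# Assumed patterns, modelled on HTML 4's naming rules: a name starts with
# an ASCII letter and continues with letters, digits, ".", "-", "_", ":".
ASCII_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._:-]*")
# HTML 4 allows the same character set in unquoted attribute values.
UNQUOTED_VALUE = re.compile(r"[A-Za-z0-9._:-]+")


def is_ascii_name(name):
    """Return True if `name` is an ASCII-only element or attribute name."""
    return ASCII_NAME.fullmatch(name) is not None


def is_sgml_valid_unquoted_value(value):
    """Return True if `value` could appear unquoted under the HTML 4 rules."""
    return UNQUOTED_VALUE.fullmatch(value) is not None
```

Under these assumptions "href" passes as a name while "/bar/" does not,
and "10" may be left unquoted while "50%" may not.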

> Similarly, a comment in HTML can contain the string "--" (assuming it
> comes in pairs), while an XML comment cannot. This latter example even
> affects conforming documents.

From the HTML-as-SGML point of view, there are two comments in
<!-- foo --    -- bar --  >, so it would be quite appropriate to 
convert it into XML as
<!-- foo --><!-- bar -->. This reasoning does not quite work for 
faithfully converting HTML-as-soup.
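
The HTML-as-SGML reading can be sketched as follows. This is only an
illustration under stated assumptions: the input is a single declaration
that starts with "<!" and ends with ">", and contains nothing but
"--"-delimited comments separated by whitespace. The function names are
hypothetical.

```python
def split_sgml_comment_declaration(decl):
    """Return the text of each comment inside one SGML comment declaration."""
    if not (decl.startswith("<!") and decl.endswith(">")):
        raise ValueError("not a markup declaration")
    body = decl[2:-1]
    comments = []
    pos = 0
    while pos < len(body):
        if body[pos].isspace():
            # Whitespace between comments is skipped (leniently, also
            # before the first one).
            pos += 1
            continue
        if not body.startswith("--", pos):
            raise ValueError("expected '--' at offset %d" % pos)
        end = body.find("--", pos + 2)
        if end == -1:
            raise ValueError("unterminated comment")
        comments.append(body[pos + 2:end])
        pos = end + 2
    return comments


def to_xml_comments(decl):
    """Serialise each SGML comment as its own XML comment."""
    return "".join("<!--%s-->" % text
                   for text in split_sgml_comment_declaration(decl))
```

With the example above, to_xml_comments("<!-- foo --    -- bar --  >")
returns "<!-- foo --><!-- bar -->".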

I am dodging this issue by parsing as if HTML-as-SGML was the case here 
syntactically and not reporting comment parse events at all. Reporting 
comments to the app is optional in XML and Jing wouldn't want to listen 
to comment parse events anyway. (In fact, I think there'd be an 
architectural bug if it wanted.)

FWIW, Opera, Deer Park and Safari all represent this case differently 
in the DOM. Opera includes the "--" after "bar" in the value. Deer Park 
does not. Safari does not include comments in the DOM at all.

>>>>  * attribute name omission (except for the well-known "boolean
>>>> attributes")
>>> Again, we have to define error handling. <foo bar baz> will probably
>>> just be equivalent to <foo bar="" baz="">.
>> I have previously argued for <foo bar="bar" baz="baz"> in the
>> TagSoup-like scenario, because that would be the same as the treatment
>> required for the "boolean attributes".
> That wouldn't be backwards compatible, IIRC.

OK. I intend to just throw an error on non-boolean minimized attributes.
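
The three treatments discussed here can be put side by side in a small
sketch. The policy names ("empty", "name", "strict") and the function are
hypothetical, and the set of boolean attributes is the HTML 4.01 list as
recalled here; it should be checked against the DTD before being relied
on.

```python
# HTML 4.01 boolean attributes (assumed list; verify against the DTD).
BOOLEAN_ATTRIBUTES = frozenset({
    "checked", "compact", "declare", "defer", "disabled", "ismap",
    "multiple", "nohref", "noresize", "noshade", "nowrap", "readonly",
    "selected",
})


class ConformanceError(Exception):
    """Raised by the strict policy for a non-boolean minimized attribute."""


def resolve_minimized(name, policy="strict"):
    """Return (name, value) for an attribute written without a value."""
    lowered = name.lower()
    if lowered in BOOLEAN_ATTRIBUTES:
        # Well-known boolean attributes expand to name="name".
        return lowered, lowered
    if policy == "empty":
        # <foo bar> treated as <foo bar="">.
        return lowered, ""
    if policy == "name":
        # <foo bar> treated as <foo bar="bar">.
        return lowered, lowered
    if policy == "strict":
        # Conformance-checker behaviour: report an error and stop.
        raise ConformanceError("minimized non-boolean attribute: " + name)
    raise ValueError("unknown policy: " + policy)
```

For <foo bar baz>, the "empty" policy yields bar="" and baz="", the "name"
policy yields bar="bar" and baz="baz", and the "strict" policy raises on
the first attribute.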

> I've been looking at misnested tags recently (hence my replying to this
> e-mail despite normally archiving the e-mails about HTML parsing so that
> I can get back to them when I start work on that part of the spec). I
> assume, based on the line of reasoning that you've been describing
> above, that you would agree with me that we should forgo compatibility
> with IE in the DOM it forms in response to markup such as:
>    <body> <form> <div> </form> TEXT NODE </div> </body>
> What IE does in this case is make the TEXT NODE's parent be the <div>
> and its previous sibling be the <form>.
> What browsers do tends to vary; but with markup such as the above,
> Firefox and Safari interoperate on saying that the </form> is ignored
> and the form instead continues up to the </body>. However, the exact
> opposite:
>    <body> <div> <form> </div> TEXT NODE </form> </body>
> ...does not do the opposite in those browsers, despite (in IE) the DOM
> being equivalent to the previous case. Here, the </div> is not ignored,
> it implies the </form> and the TEXT NODE ends up a child of <body>.

I think it is reasonable to force the DOM into a tree, which 
necessarily means not doing what IE does in some cases.

Also, I think a conformance checker should only have to observe the top 
of the open element stack when deciding what to do with an end tag. 
That is, popping due to non-matching end tag would always be 
opportunistic (possibly leading to an error if a matching start is not 

However, I assume there may be non-conforming cases where browsers 
would want to peek deeper in the stack before deciding whether to 
discard a misnested end tag or pop until the start tag is found (i.e. 
only pop if the start was actually found when peeking deeper in the 
stack). Additional testing and/or reading of source would be needed for 
determining if such deep peeking is happening here or if popping the 
'form' on </div> is opportunistic. (But </form> apparently causes 
neither deep peeking nor opportunistic popping.)
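
The difference between the two strategies can be sketched with a plain
list as the open element stack. This is an illustration only: the token
format and all function names are hypothetical, and it does not model
what any particular browser does.

```python
def close_opportunistic(stack, name):
    """Pop until `name` has been popped; return False if the stack ran out."""
    while stack:
        if stack.pop() == name:
            return True
    return False


def close_with_peek(stack, name):
    """Pop through `name` only if it is open; otherwise discard the end tag."""
    if name not in stack:
        return False
    while stack.pop() != name:
        pass
    return True


def run(tokens, close):
    """Return (parent of the TEXT token, list of unmatched end tags)."""
    stack = []
    text_parent = None
    unmatched = []
    for token in tokens:
        if token == "TEXT":
            text_parent = stack[-1] if stack else None
        elif token.startswith("/"):
            if not close(stack, token[1:]):
                unmatched.append(token[1:])
        else:
            stack.append(token)
    return text_parent, unmatched
```

For the tokens body, div, form, /div, TEXT, /form, /body, both strategies
make the text a child of body. Deep peeking then discards the stray
/form and still matches /body, whereas opportunistic popping empties the
stack on /form and so also fails to match /body. Neither function
reproduces the </form> behaviour quoted above, where the end tag is
ignored even though a matching start tag is open.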

> Trying to work out all the various cases is giving me a headache...

Then I hope you sympathize with my selfish desire to get conformance 
checkers exempt from error recovery (i.e. allowing them to stop upon 
finding an error).

Henri Sivonen
hsivonen at iki.fi
