[whatwg] Unsafe SGML minimizations
Henri Sivonen
hsivonen at iki.fi
Thu Sep 8 13:39:54 PDT 2005
On Sep 8, 2005, at 19:03, Ian Hickson wrote:
> On Thu, 8 Sep 2005, Henri Sivonen wrote:
>> On Sep 8, 2005, at 17:26, Ian Hickson wrote:
>>
>>> On Thu, 8 Sep 2005, Henri Sivonen wrote:
>>>> * tagc omission ie. <foo<bar>...</bar</foo>
>>>
>>> Well we have to define what that does, and the most obvious error
>>> handling behaviour here is to start the new tag. So effectively, I
>>> would say we should have TAGC omission.
>>
>> But it would still be an error as far as a conformance checker is
>> concerned, right?
>
> I don't have an opinion on that either way. I guess it seems
> reasonable to make it an error. At this point I'm more worried about
> getting the UA rules down before worrying about what the author can
> or can't do.
I view conformance checking as an authoring aid that is supposed to
help authors make pages that work. Therefore, if there is syntactic
sugar that is known to cause problems in real browsers, it would be
helpful if a conformance checker flagged it as an error. To help
conformance checker developers avoid having to endlessly defend their
subjective judgment against people who want to keep their errors but
argue them right ( http://diveintomark.org/archives/2004/08/16/specs ),
it would be nice if such bad syntactic sugar was proclaimed
non-conforming in the spec (even if unambiguous error handling was
defined).
Tagc omission breaks in current Opera, which makes tagc omission bad
syntactic sugar from the practical point of view.
>> I think the HTML5 spec should allow TagSoup to be updated for HTML5
>> or an equivalent of TagSoup for HTML5 to be written. TagSoup
>> guarantees to the application that it acts as if it was an XML parser
>> parsing XHTML. Therefore, XML and, by extension, the SAX2 API contract
>> restrict the attribute names to legal XML attribute names. If HTML5
>> required "/bar/" to be reported as an attribute name, TagSoup would
>> have to violate that constraint and could not claim conformance.
>
> I think it's pretty much guaranteed that HTML5's parsing model will be
> able to generate DOMs that can't be serialised to conformant XML syntax
> without data loss.
I am assuming that those situations do not arise if the document is
conforming, and that losing the details that XML c14n discards does not
count as data loss. It would be very nice if you defined conformance in
such a way that this assumption held true. :-)
> For example, the list of characters that must be recognised as part of
> an element or attribute name when hitting an unknown element or
> attribute is bigger than the list of characters XML allows.
For the purpose of conformance checking, I've gone the other way and
limited names to ASCII. I think that's OK, because conforming names are
ASCII. However, I expect that I will have to polish the code that looks
for unquoted attribute values. (But I think conforming unquoted
attribute values should not include values that weren't SGML-valid in
HTML 4.)
> Similarly, a comment in
> HTML can contain the string "--" (assuming it comes in pairs), while an
> XML comment cannot. This latter example even affects conforming
> documents.
From the HTML-as-SGML point of view, there are two comments in
<!-- foo -- -- bar -- >, so it would be quite appropriate to
convert it into XML as
<!-- foo --><!-- bar -->. This reasoning does not quite work for
faithfully converting HTML-as-soup.
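To make the HTML-as-SGML reading concrete, here is a minimal Python sketch of that conversion. It assumes the input is a single, well-formed SGML comment declaration ("<!", then zero or more "-- ... --" comments separated by whitespace, then ">"). The function name sgml_comments_to_xml is hypothetical, and this is not how any existing parser is known to do it.

```python
def sgml_comments_to_xml(decl: str) -> str:
    """Rewrite one SGML comment declaration as a run of XML comments.

    Hypothetical helper: assumes `decl` is exactly one declaration of
    the form "<!" (comment, optional whitespace)* ">".
    """
    if not (decl.startswith("<!") and decl.endswith(">")):
        raise ValueError("not a markup declaration")
    body = decl[2:-1]  # everything between "<!" and ">"
    pos = 0
    out = []
    while True:
        # Skip whitespace that separates comments inside the declaration.
        while pos < len(body) and body[pos].isspace():
            pos += 1
        if pos == len(body):
            break
        if not body.startswith("--", pos):
            raise ValueError("expected '--' at offset %d" % pos)
        end = body.find("--", pos + 2)
        if end == -1:
            raise ValueError("unterminated comment")
        # Each "-- ... --" pair becomes its own XML comment.
        out.append("<!--" + body[pos + 2:end] + "-->")
        pos = end + 2
    return "".join(out)


print(sgml_comments_to_xml("<!-- foo -- -- bar -- >"))
```

The sketch rejects anything that is not SGML-valid instead of guessing, which is why it does not carry over to HTML-as-soup.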
I am dodging this issue by parsing as if HTML-as-SGML was the case here
syntactically and not reporting comment parse events at all. Reporting
comments to the app is optional in XML and Jing wouldn't want to listen
to comment parse events anyway. (In fact, I think it would be an
architectural bug if it wanted to.)
FWIW, Opera, Deer Park and Safari all represent this case differently
in the DOM. Opera includes the "--" after "bar" in the value. Deer Park
does not. Safari does not include comments in the DOM at all.
>>>> * attribute name omission (except for the well-known "boolean
>>>> attributes")
>>>
>>> Again, we have to define error handling. <foo bar baz> will probably
>>> just be equivalent to <foo bar="" baz="">.
>>
>> I have previously argued for <foo bar="bar" baz="baz"> in the
>> TagSoup-like scenario, because that would be the same as the treatment
>> required for the "boolean attributes".
>
> That wouldn't be backwards compatible, IIRC.
OK. I intend to just throw an error on non-boolean minimized attributes.
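As a rough illustration of that checker policy, the Python sketch below expands a minimized attribute only when its name is one of the HTML 4 boolean attributes and raises otherwise. The names expand_minimized and ConformanceError are hypothetical, and the attribute set is my reading of HTML 4.01, so treat it as an assumption to verify against the DTD.

```python
# Assumption: the boolean attributes of HTML 4.01, as read from its DTD.
HTML4_BOOLEAN_ATTRIBUTES = frozenset({
    "checked", "compact", "declare", "defer", "disabled", "ismap",
    "multiple", "nohref", "noresize", "noshade", "nowrap", "readonly",
    "selected",
})


class ConformanceError(Exception):
    """Hypothetical error type for a conformance checker."""


def expand_minimized(name: str) -> tuple[str, str]:
    """Return (name, value) for a minimized attribute, or raise.

    Boolean attributes get their own name as the value, e.g.
    checked -> checked="checked". Anything else is reported as an error.
    """
    lowered = name.lower()
    if lowered not in HTML4_BOOLEAN_ATTRIBUTES:
        raise ConformanceError(
            "minimized attribute %r is not a boolean attribute" % name)
    return lowered, lowered


print(expand_minimized("checked"))
```

This only covers the checker side. What a browser should do with <foo bar baz> is a separate question.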
> I've been looking at misnested tags recently (hence my replying to this
> e-mail despite normally archiving the e-mails about HTML parsing so
> that I can get back to them when I start work on that part of the
> spec). I assume, based on the line of reasoning that you've been
> describing above, that you would agree with me that we should forego
> compatibility with IE in the DOM it forms in response to markup such
> as:
>
> <body> <form> <div> </form> TEXT NODE </div> </body>
>
> What IE does in this case is make the TEXT NODE's parent be the <div>
> and its previous sibling be the <form>.
>
> What browsers do tends to vary; but with markup such as the above
> Firefox and Safari interoperate on saying that the </form> is ignored
> and the form instead continues up to the </body>. However, the exact
> opposite:
>
> <body> <div> <form> </div> TEXT NODE </form> </body>
>
> ...does not do the opposite in those browsers, despite (in IE) the DOM
> being equivalent to the previous case. Here, the </div> is not
> ignored, it implies the </form> and the TEXT NODE ends up a child of
> <body>.
I think it is reasonable to force the DOM into a tree, which
necessarily means not doing what IE does in some cases.
Also, I think a conformance checker should only have to observe the top
of the open element stack when deciding what to do with an end tag.
That is, popping due to a non-matching end tag would always be
opportunistic (possibly leading to an error if a matching start tag is
not found).
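One way to read "only observe the top of the stack" is sketched below in Python. On each step the checker pops the top element and compares it with the end tag; it never scans deeper before deciding. The names handle_end_tag and NestingError are hypothetical, and the set of elements with omissible end tags is a deliberately small, assumed subset of HTML 4.

```python
# Assumption: a small subset of HTML 4 elements whose end tag may be omitted.
OPTIONAL_END_TAG = frozenset({"p", "li", "dt", "dd", "option", "tr", "td", "th"})


class NestingError(Exception):
    """Hypothetical error type; the checker may simply stop here."""


def handle_end_tag(stack: list[str], name: str) -> None:
    """Process </name>, looking only at the top of `stack` on each step.

    Popping is opportunistic: the function never checks in advance
    whether a matching start tag exists further down the stack.
    """
    while stack:
        top = stack.pop()
        if top == name:
            return
        if top not in OPTIONAL_END_TAG:
            raise NestingError("</%s> seen while <%s> is open" % (name, top))
    raise NestingError("</%s> has no matching start tag" % name)


stack = ["body", "ul", "li"]
handle_end_tag(stack, "ul")  # implies </li>, then closes <ul>
print(stack)
```

With the stack body, form, div and the end tag </form>, this sketch raises immediately, which is the "stop upon finding an error" behaviour rather than any browser's recovery.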
However, I assume there may be non-conforming cases where browsers
would want to peek deeper in the stack before deciding whether to
discard a misnested end tag or pop until the start tag is found (i.e.
only pop if the start was actually found when peeking deeper in the
stack). Additional testing and/or reading of source would be needed for
determining if such deep peeking is happening here or if popping the
'form' on </div> is opportunistic. (But </form> apparently causes
neither deep peeking nor opportunistic popping.)
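The difference between the two recovery strategies can be sketched in Python as follows. Both function names are hypothetical, and neither is claimed to match what any browser actually does. They only illustrate "pop regardless" versus "pop only if the start tag is really on the stack".

```python
def close_opportunistic(stack: list[str], name: str) -> list[str]:
    """Pop until `name` has been popped, without checking first.

    If `name` was never open, this empties the whole stack.
    Returns the popped element names in pop order.
    """
    popped = []
    while stack:
        popped.append(stack.pop())
        if popped[-1] == name:
            break
    return popped


def close_with_peek(stack: list[str], name: str) -> list[str]:
    """Peek deeper first; discard the end tag if `name` is not open."""
    if name not in stack:
        return []  # stray end tag: leave the stack untouched
    popped = []
    while True:
        popped.append(stack.pop())
        if popped[-1] == name:
            return popped


# <body> <div> <form> </div> ... </form>
stack = ["body", "div", "form"]
print(close_with_peek(stack, "div"))   # </div> implies </form>
print(close_with_peek(stack, "form"))  # stray </form> is discarded
print(stack)
```

On the same stray </form>, close_opportunistic would pop body as well, which is the kind of damage deep peeking avoids.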
> Trying to work out all the various cases is giving me a headache...
Then I hope you sympathize with my selfish desire to get conformance
checkers exempt from error recovery (i.e. allowing them to stop upon
finding an error).
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/