[whatwg] Unsafe SGML minimizations
ian at hixie.ch
Fri Mar 10 16:08:09 PST 2006
On Thu, 8 Sep 2005, Henri Sivonen wrote:
> > I think it's pretty much guaranteed that HTML5's parsing model will be
> > able to generate DOMs that can't be serialised to conformant XML
> > syntax without data loss.
> I am assuming that those situations do not arise if the document is
> conforming and the loss of details that are lost in XML c14n does not
> count as data loss. It would be very nice if you defined conformance in
> such a way that this assumption held true. :-)
Yes, I think conformance will be defined such that a conforming HTML5
document can always be serialised to a conforming XHTML5 document. If that
ever turns out not to be the case, please raise the issue! I think this is
important because people use XML tools and then serialise to HTML, and vice
versa (e.g. with CMSes that store data in custom formats).
> > For example, the list of characters that must be recognised as part of
> > an element or attribute name when hitting an unknown element or
> > attribute is bigger than the list of characters XML allows.
> For the purpose of conformance checking, I've gone the other way and
> limited names to ASCII. I think that's OK, because conforming names are
> ASCII. However, I expect that I will have to polish the code that looks
> for unquoted attribute values. (But I think conforming unquoted
> attribute values should not include values that weren't SGML-valid in
> HTML 4.)
As specced, unquoted values can contain pretty much anything.
> > Similarly, a comment in HTML can contain the string "--" (assuming it
> > comes in pairs), while an XML comment cannot. This latter example even
> > affects conforming documents.
> From the HTML-as-SGML point of view, there are two comments in <!-- foo
> -- -- bar -- >, so it would be quite appropriate to convert it into XML
> as <!-- foo --><!-- bar -->. This reasoning does not quite work for
> faithfully converting HTML-as-soup.
That's certainly one way to handle it.
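
To make the quoted conversion concrete, here is a minimal sketch in Python.
It assumes the SGML reading described above (a comment declaration is "<!",
then zero or more "--"-delimited comments separated by whitespace, then ">").
The function names are made up for illustration and are not from any spec
or library, and this does not attempt to handle HTML-as-soup.

```python
def split_sgml_comments(declaration: str) -> list[str]:
    """Return the bodies of the comments inside one SGML comment declaration."""
    if not (declaration.startswith("<!") and declaration.endswith(">")):
        raise ValueError("not a comment declaration")
    body = declaration[2:-1]
    comments = []
    pos = 0
    while True:
        # Skip whitespace between comments (and before the closing ">").
        while pos < len(body) and body[pos].isspace():
            pos += 1
        if pos == len(body):
            break
        if not body.startswith("--", pos):
            raise ValueError("expected '--' at offset %d" % pos)
        end = body.find("--", pos + 2)
        if end == -1:
            raise ValueError("unterminated comment")
        comments.append(body[pos + 2:end])
        pos = end + 2
    return comments


def to_xml_comments(declaration: str) -> str:
    """Serialise each SGML comment as a separate XML comment."""
    return "".join("<!--" + c + "-->" for c in split_sgml_comments(declaration))


print(to_xml_comments("<!-- foo -- -- bar -- >"))
```

Because each body is cut at the first following "--", no body can itself
contain "--", which is what makes the output acceptable as XML comments.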
> > I've been looking at misnested tags recently (hence my replying to
> > this e-mail despite normally archiving the e-mails about HTML parsing
> > so that I can get back to them when I start work on that part of the
> > spec). I assume, based on the line of reasoning that you've been
> > describing above, that you would agree with me that we should forgo
> > compatibility with IE in the DOM it forms in response to markup such
> > as:
> > <body> <form> <div> </form> TEXT NODE </div> </body>
> > What IE does in this case is make the TEXT NODE's parent be the <div>
> > and its previous sibling be the <form>.
> > What browsers do tends to vary; but with markup such as the above
> > Firefox and Safari interoperate on saying that the </form> is ignored
> > and the form instead continues up to the </body>. However, the exact
> > opposite:
> > <body> <div> <form> </div> TEXT NODE </form> </body>
> > ...does not do the opposite in those browsers, despite (in IE) the DOM
> > being equivalent to the previous case. Here, the </div> is not
> > ignored, it implies the </form> and the TEXT NODE ends up a child of
> > <body>.
> I think it is reasonable to force the DOM into a tree, which necessarily
> means not doing what IE does in some cases.
Agreed. In the case above, I've gone with IE's closing of <form>, so the
rendering would be more IE-compatible, but the DOM is a tree.
> Also, I think a conformance checker should only have to observe the top
> of the open element stack when deciding what to do with an end tag. That
> is, popping due to non-matching end tag would always be opportunistic
> (possibly leading to an error if a matching start is not found).
Yeah, I think that, as the spec is defined, you can implement a conformance
checker that looks only at the top of the stack. But you'd only be able to
catch one error at a time.
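
As a rough illustration of that kind of checker, here is a sketch in Python.
It assumes the input has already been tokenised into hypothetical
("start", name), ("end", name) and ("text", data) tuples, and it ignores void
elements and optional tags entirely. It is not the algorithm in the spec.

```python
def first_nesting_error(tokens):
    """Return a message for the first nesting error found, or None."""
    stack = []
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            # Only the top of the stack is ever inspected.
            if not stack or stack[-1] != value:
                return "unexpected </%s>" % value
            stack.pop()
    if stack:
        return "unclosed <%s>" % stack[-1]
    return None


tokens = [
    ("start", "body"), ("start", "form"), ("start", "div"),
    ("end", "form"), ("text", "TEXT NODE"), ("end", "div"), ("end", "body"),
]
print(first_nesting_error(tokens))
```

The checker stops at the first mismatch because, without any recovery rule,
it has no way of knowing what the stack should look like afterwards.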
> However, I assume there may be non-conforming cases where browsers would
> want to peek deeper in the stack before deciding whether to discard a
> misnested end tag or pop until the start tag is found (i.e. only pop if
> the start was actually found when peeking deeper in the stack).
> Additional testing and/or reading of source would be needed for
> determining if such deep peeking is happening here or if popping the
> 'form' on </div> is opportunistic. (But </form> apparently causes
> neither deep peeking nor opportunistic popping.)
There are cases where you have to do surgery to the middle of the stack.
So yeah, full implementations would have a lot more work to do.
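
For contrast with a top-of-stack-only checker, here is a sketch in Python of
the simplest "deep peeking" rule mentioned in the quoted text: pop on an end
tag only if a matching start tag is somewhere in the stack, and otherwise
discard the end tag. The token tuples and the function name are hypothetical.
This is only an illustration of that one rule. It does no surgery on the
middle of the stack and is not what any browser or the spec actually does.

```python
def text_parents(tokens):
    """Return (text, parent element name) pairs under the deep-peeking rule."""
    stack = []
    parents = []
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
        elif kind == "text":
            parents.append((value, stack[-1] if stack else None))
        elif kind == "end":
            if value in stack:
                # Matching start tag found deeper down: pop up to and including it.
                while stack.pop() != value:
                    pass
            # Otherwise the end tag is misnested and is discarded.
    return parents


div_then_form = [
    ("start", "body"), ("start", "div"), ("start", "form"),
    ("end", "div"), ("text", "TEXT NODE"), ("end", "form"), ("end", "body"),
]
print(text_parents(div_then_form))
```

For the <div> <form> </div> case this puts the TEXT NODE under <body>, in line
with the behaviour described above, where </div> implies </form>. It does not
reproduce the reported Firefox and Safari handling of the <form> <div> </form>
case, where the </form> is ignored even though a <form> is on the stack.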
> > Trying to work out all the various cases is giving me a headache...
> Then I hope you sympathize with my selfish desire to get conformance
> checkers exempt from error recovery (i.e. allowing them to stop upon
> finding an error).
Hey, now that I've done the work, I want y'all to suffer too. :-P
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'