[whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state

Wed Apr 7 16:55:02 PDT 2010

On Wed, 3 Mar 2010, Brett Zamir wrote:
> On 3/2/2010 6:54 PM, Ian Hickson wrote:
> > On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:
> >    
> > > Briefly it seems that <? causes the parser to go into Bogus comment 
> > > state, which is fair enough. (I wouldn't really recommend that 
> > > anyone use processing instructions in HTML syntax anyway.) However 
> > > the parser comes out of that state at the first >. Because processing 
> > > instructions can contain > and terminate only at the two character 
> > > sequence ?> this could cause PI processing to terminate early and 
> > > leave a lot more error handling and a confused parser state in the 
> > > text yet to come.
> >
> > In HTML4, PIs ended at the first >, not at ?>. "<?target data>" is the 
> > syntax of PIs when the SGML options used by HTML4 are applied.
> > 
> > In any case, the parser in HTML5 is based on what browsers do, which 
> > is also to terminate at the first >. It's unlikely that we can change 
> > that, given backwards-compatibility needs.
> 
> Are there really a lot of folks out there depending on old HTML4-style 
> processing instructions not being broken?

Not knowingly, but I wouldn't at all be surprised if there were lots of 
pages that triggered this, yes. People rely on all kinds of weird things. 
(See for example the sample from Philip below.)

> Given that as I understand it such HTML4 processing instructions were 
> not even used by any standard at that time, and with XHTML 1.0+ 
> processing instructions bringing into practice the XML form, and 
> especially with all of the progress made in X/HTML5 on harmonizing HTML 
> and XHTML, I'd think that it'd really be ideal if this issue would not 
> get in the way (along with the unfortunate loss of external DTDs)...

In practice this issue shouldn't get in the way anyway, since PIs aren't 
allowed in HTML.

> As long as website creators have the freedom to be sloppy

Authors don't have the freedom to be sloppy.

> why not go a little further to make XML compatibility better?

XML compatibility isn't a goal. There is a minor goal of making it 
possible to transition easily from XHTML to HTML. PI-like syntax in XHTML 
is only used for two purposes:

 - the XML declaration, which can simply be removed when publishing HTML, 
   and which if not removed will just be ignored (since it never contains 
   a ">" character, so ending on the first ">" is fine).

 - the XML Stylesheet PI, which needs to be converted to a <link> element 
   anyway, so isn't a problem.
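
To illustrate why ending on the first ">" is harmless for the XML 
declaration, here is a rough sketch. It is not the spec's tokenizer: 
split_at_first_gt is a made-up helper that models only the one rule 
discussed above, namely that "<?" starts a bogus comment running to the 
first ">", and the sample strings are invented for the example.

```python
def split_at_first_gt(markup):
    """Hypothetical helper: split markup that begins with "<?" into
    (comment data, remaining input), ending at the first ">"."""
    if not markup.startswith("<?"):
        raise ValueError("expected markup starting with '<?'")
    end = markup.find(">")
    if end == -1:
        # No ">" anywhere: everything up to end of input is consumed.
        return markup[1:], ""
    # Comment data starts at the "?" and stops just before the ">".
    return markup[1:end], markup[end + 1:]


# An XML declaration never contains ">", so it is swallowed whole.
xml_decl = '<?xml version="1.0" encoding="UTF-8"?><html>'
decl_comment, decl_rest = split_at_first_gt(xml_decl)

# An XML-style PI with ">" in its data is cut short at that ">".
xml_style_pi = '<?target a > b ?><html>'
pi_comment, pi_rest = split_at_first_gt(xml_style_pi)
```

In the first case the remaining input is just "<html>". In the second, 
the tail " b ?>" is left over and gets treated as ordinary text, which is 
the early termination Elliotte described.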

> It'd be a whole lot more appealing to work in both environments out of 
> the box than deal with complex server-side conversion solutions...

I don't really understand why you would ever use a PI, to be honest.

On Wed, 3 Mar 2010, Philip Taylor wrote:
> 
> Yes, e.g. a load of pages like 
> http://www.forex.com.cn/html/2008-01/821561.htm (to pick one example at 
> random) say:
> 
>   <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />
> 
> and don't have the string "?>" anywhere.

Indeed.
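
A quick sketch of what is at stake for a page like that one. The 
consumed function is a made-up helper, not anything from the spec, and 
the body markup appended to Philip's line is invented for the example. 
It compares how much input a "<?" construct would swallow if it ended at 
the first ">" versus only at "?>".

```python
def consumed(markup, terminator):
    """Hypothetical helper: return the part of markup swallowed by a
    "<?" construct that ends at the first occurrence of terminator,
    or at end of input if the terminator never appears."""
    end = markup.find(terminator)
    if end == -1:
        return markup
    return markup[:end + len(terminator)]


pi = '<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />'
page = pi + "<p>Body text</p>"  # invented page content after the PI

html_rule = consumed(page, ">")   # what browsers do: stop at the first ">"
xml_rule = consumed(page, "?>")   # XML-style termination: stop only at "?>"
```

Under the first rule only the namespace line itself is consumed. Under 
the second, since the page never contains "?>", the whole rest of the 
page would be consumed along with it.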

On Fri, 5 Mar 2010, Brett Zamir wrote:
> 
> Ok, fair enough.  But while it is great that HTML5 seeks to be 
> transitional and backwards compatible, HTML5 (thankfully) already breaks 
> compatibility for the sake of XML compatibility (e.g., localName or 
> getElementsByTagNameNS).

This is actually just for implementation sanity, it's not about XML syntax 
compatibility.

> It seems to me that there should still be a role of eventually 
> transitioning into something more full-featured in a fundamental, 
> language-neutral way (e.g., supporting a fuller subset of XML's features 
> such as external entities and yes, XML-style processing instructions); 
> extensible, including the ability to include XML from other namespaces 
> which may also encourage or rely on using their own XML processing 
> instructions, for those who wish to experiment or supplement the HTML 
> standard behavior; and more harmonious and compatible with a simpler 
> syntax (i.e., XML's)--even if the more complex syntax is more prominent 
> and continues to be supported indefinitely.

People can use XML if they want, but I don't really see a path from 
today's HTML to a generic language that doesn't break backwards 
compatibility. If you're ok with breaking back-compat, though, there's no 
need to worry about HTML at all. Just use XHTML.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'