[whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state
ian at hixie.ch
Wed Apr 7 16:55:02 PDT 2010
On Wed, 3 Mar 2010, Brett Zamir wrote:
> On 3/2/2010 6:54 PM, Ian Hickson wrote:
> > On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:
> > > Briefly it seems that <? causes the parser to go into Bogus comment
> > > state, which is fair enough. (I wouldn't really recommend that
> > > anyone use processing instructions in HTML syntax anyway.) However
> > > the parser comes out of that state at the first >. Because processing
> > > instructions can contain > and terminate only at the two character
> > > sequence ?> this could cause PI processing to terminate early and
> > > leave a lot more error handling and a confused parser state in the
> > > text yet to come.
> > In HTML4, PIs ended at the first >, not at ?>. "<?target data>" is the
> > syntax of PIs when the SGML options used by HTML4 are applied.
> > In any case, the parser in HTML5 is based on what browsers do, which
> > is also to terminate at the first >. It's unlikely that we can change
> > that, given backwards-compatibility needs.
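
The behaviour discussed above can be sketched roughly as follows. This is an illustrative Python sketch, not the spec's tokenizer: the helper name bogus_comment is hypothetical, and it assumes the HTML5 rule that the comment data starts at the character that triggered the state (the "?") and runs up to, but not including, the first ">".

```python
def bogus_comment(text, start):
    """Sketch of the bogus comment state for a "<?" at text[start].

    Returns (comment_data, index_after_comment). Hypothetical helper,
    not part of any library.
    """
    if not text.startswith("<?", start):
        raise ValueError('expected "<?" at the given position')
    end = text.find(">", start)
    if end == -1:
        # No ">" before end of file: everything left becomes the comment.
        return text[start + 1:], len(text)
    # Data begins at the "?" and stops just before the first ">".
    return text[start + 1:end], end + 1


source = '<?pi a > b ?>tail'
data, resume = bogus_comment(source, 0)
print(repr(data))             # '?pi a '
print(repr(source[resume:]))  # ' b ?>tail'
```

In the example, the ">" inside the PI ends the comment early, and the remainder (including the "?>") is left to be parsed as ordinary text, which is the situation the original question describes.
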
> Are there really a lot of folks out there depending on old HTML4-style
> processing instructions not being broken?
Not knowingly, but I wouldn't at all be surprised if there were lots of
pages that triggered this, yes. People rely on all kinds of weird things.
(See for example the sample from Philip below.)
> Given that as I understand it such HTML4 processing instructions were
> not even used by any standard at that time, and with XHTML 1.0+
> processing instructions bringing into practice the XML form, and
> especially with all of the progress made in X/HTML5 on harmonizing HTML
> and XHTML, I'd think that it'd really be ideal if this issue would not
> get in the way (along with the unfortunate loss of external DTDs)...
In practice this issue shouldn't get in the way anyway, since PIs aren't
allowed in HTML.
> As long as website creators have the freedom to be sloppy
Authors don't have the freedom to be sloppy.
> why not go a little further to make XML compatibility better?
XML compatibility isn't a goal. There is a minor goal of making it
possible to transition easily from XHTML to HTML. PI-like syntax in XHTML
is only used for two purposes:
 - the XML declaration, which can simply be removed when publishing HTML,
   and which if not removed will just be ignored (since it never contains
   a ">" character, so ending on the first ">" is fine).

 - the XML Stylesheet PI, which needs to be converted to a <link> element
   anyway, so isn't a problem.
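
The two cases listed above could be handled along these lines when publishing XHTML as HTML. This is a minimal sketch in Python using regular expressions rather than a real XML parser; the function name strip_xhtml_pis and the sample document are made up for illustration. It assumes the stylesheet PI carries an href pseudo-attribute, and it returns the <link> elements separately because they belong inside <head>, not where the PI originally stood.

```python
import re

# XML declaration, only at the very start of the document.
XML_DECL = re.compile(r'\A<\?xml\s[^>]*\?>\s*')
# xml-stylesheet PI; its pseudo-attributes are captured as one string.
STYLESHEET_PI = re.compile(r'<\?xml-stylesheet\s+(.*?)\?>', re.DOTALL)
HREF = re.compile(r'href\s*=\s*(["\'])(.*?)\1')


def strip_xhtml_pis(source):
    """Return (source_without_pis, link_tags). Hypothetical helper."""
    source = XML_DECL.sub('', source, count=1)
    links = []

    def collect(match):
        href = HREF.search(match.group(1))
        if href is None:
            return match.group(0)  # leave anything unexpected untouched
        links.append('<link rel="stylesheet" href="%s">' % href.group(2))
        return ''

    return STYLESHEET_PI.sub(collect, source), links


doc = ('<?xml version="1.0" encoding="UTF-8"?>\n'
       '<?xml-stylesheet href="style.css" type="text/css"?>\n'
       '<html><head><title>t</title></head><body></body></html>')
cleaned, links = strip_xhtml_pis(doc)
print(links)  # ['<link rel="stylesheet" href="style.css">']
```

The caller is expected to place the returned <link> elements in the document's <head>.
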
> It'd be a whole lot more appealing to work in both environments out of
> the box than deal with complex server-side conversion solutions...
I don't really understand why you would ever use a PI to be honest.
On Wed, 3 Mar 2010, Philip Taylor wrote:
> Yes, e.g. a load of pages like
> http://www.forex.com.cn/html/2008-01/821561.htm (to pick one example at
> random) say:
> <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />
> and don't have the string "?>" anywhere.
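
To make the compatibility concern concrete, here is a small Python sketch comparing the two termination rules on the line quoted above. The helper pi_end and the trailing "<p>Visible text</p>" are hypothetical additions for illustration; only the "<?xml:namespace ... />" line comes from the cited page.

```python
sample = ('<?xml:namespace prefix = o ns = '
          '"urn:schemas-microsoft-com:office:office" />'
          '<p>Visible text</p>')


def pi_end(text, terminator):
    """Index just past the terminator, or len(text) if it never occurs."""
    pos = text.find(terminator, 2)  # skip the leading "<?"
    return len(text) if pos == -1 else pos + len(terminator)


html_end = pi_end(sample, '>')   # rule browsers use: stop at first ">"
xml_end = pi_end(sample, '?>')   # XML rule: stop only at "?>"

print(repr(sample[html_end:]))  # '<p>Visible text</p>'
print(repr(sample[xml_end:]))   # ''
```

Under the XML rule, a page that never contains "?>" would have everything after the "<?" swallowed, so the visible content would disappear.
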
On Fri, 5 Mar 2010, Brett Zamir wrote:
> Ok, fair enough. But while it is great that HTML5 seeks to be
> transitional and backwards compatible, HTML5 (thankfully) already breaks
> compatibility for the sake of XML compatibility (e.g., localName or
This is actually just for implementation sanity, it's not about XML syntax
> It seems to me that there should still be a role of eventually
> transitioning into something more full-featured in a fundamental,
> language-neutral way (e.g., supporting a fuller subset of XML's features
> such as external entities and yes, XML-style processing instructions);
> extensible, including the ability to include XML from other namespaces
> which may also encourage or rely on using their own XML processing
> instructions, for those who wish to experiment or supplement the HTML
> standard behavior; and more harmonious and compatible with a simpler
> syntax (i.e., XML's)--even if the more complex syntax is more prominent
> and continues to be supported indefinitely.
People can use XML if they want, but I don't really see a path from
today's HTML to a generic language that doesn't break backwards
compatibility. If you're ok with breaking back-compat, though, there's no
need to worry about HTML at all. Just use XHTML.
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'