[whatwg] HTML5 doctypes incompatible with XHR if named entities present

Aryeh Gregor Simetrical+w3c at gmail.com
Thu Nov 12 07:14:29 PST 2009

On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> I assume you meant "mostly" as in "most of the pages are well-formed", not
> "pages are mostly well-formed", since the latter is useless, right?
> I did a brief survey of obvious sites fitting those descriptions that I had
> in my browser history at the moment. . . .
> So either you're looking at a totally different dataset or "mostly" is a bit
> of a stretch....

I admit I didn't look closely.  At a guess, maybe the default
WordPress skin(s) are valid XHTML, but custom skins are very popular
for WordPress and those mostly aren't valid XHTML?  MediaWiki is
unreasonably difficult to reskin, so that's not much of a problem for
us . . .

> Sure.  0.01% of all websites is a "significant number".  I just think it's
> broken often enough, and easy enough to break by accident, that relying on
> it working for screen scraping is not likely to be happening on a wide
> scale....

You're probably right.

> Or stop using HTML named entities, yes.

That's not really a very good option, given the size of MediaWiki's
code base and the size of Wikipedia's database, and the ugliness of
trying to remember what &#160; is when reading the HTML source.  It
sounds like we're stuck with a legacy doctype if we don't want to
break screen-scrapers.
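
To illustrate the failure a screen-scraper would hit, here is a minimal sketch. It assumes the scraper feeds the page to a strict XML parser; Python's xml.etree stands in for the XML parsing done by XHR, and the helper name parses_as_xml is hypothetical. With only the HTML5 doctype there is no DTD declaring nbsp, so the named entity is undefined, while the numeric reference needs no declaration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for any well-formedness error, including undefined entities.
        return False


# The HTML5 doctype has no DTD reference, so it declares no entities.
HTML5_DOCTYPE = "<!DOCTYPE html>"

named = HTML5_DOCTYPE + "<html><body><p>a&nbsp;b</p></body></html>"
numeric = HTML5_DOCTYPE + "<html><body><p>a&#160;b</p></body></html>"

print(parses_as_xml(named))    # False: &nbsp; is undefined
print(parses_as_xml(numeric))  # True: numeric references need no DTD
```

This only shows the parser-level behaviour; whether a given scraper actually breaks depends on how it fetches and parses pages.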
