[whatwg] Article: Growing pains afflict HTML5 standardization
mike at mikewilcox.net
Mon Jul 12 07:43:14 PDT 2010
On Jul 12, 2010, at 8:39 AM, Nils Dagsson Moskopp wrote:
>> That's a little different. Google purposely uses unstandardized,
>> incorrect HTML in ways that still render in a browser in order to
>> make it more difficult for screen scrapers. They also "break it" in a
>> different way every week.
> Assuming this is true (which I find difficult to believe), wouldn't a
> screen scraper based on the HTML5 parsing algorithm defeat this
> purpose ?
Honestly, I don't know. But W3 defaulted to an HTML5 validator:
On Jul 12, 2010, at 7:55 AM, Julian Reschke wrote:
> Any evidence for this? And how do you know the reason for CNET isn't the same?
> VERY unconvinced.
Such skepticism! I really didn't expect that anything I said would be controversial. We all know Google is a black box, and I don't work for them, nor have I ever, so I can't testify that what I say is fact...
Google does not publish a full API for their search results, even though they publish almost everything else. That's because their search results are their golden egg; they can't give that away. And they need to protect it. I've tried scraping results with YQL and it failed. I made other attempts that had more success, only to find they did not work a week later.
Studying the validation results more closely, however, does not back up my original statement. The markup does look as I described, but the errors the validator reports are not the ones I expected. What I had in mind was, for the most part, unclosed tags, which I didn't know would validate. I knew they were allowed, but thought the validator would complain.
Besides protecting their API, Google will also scratch and claw to save every byte. They are the gold standard for a high-performance website. While this may or may not explain the things that don't validate, what it does say is that nothing coming from google.com is accidental.