<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>On Jul 12, 2010, at 8:39 AM, Nils Dagsson Moskopp wrote:</div><div><blockquote type="cite"><div><br><blockquote type="cite">That's a little different. Google purposely uses unstandardized,<br></blockquote><blockquote type="cite">incorrect HTML in ways that still render in a browser in order to<br></blockquote><blockquote type="cite">make it more difficult for screen scrapers. They also "break it" in a<br></blockquote><blockquote type="cite">different way every week.<br></blockquote><br>Assuming this is true (which I find difficult to believe), wouldn't a<br>screen scraper based on the HTML5 parsing algorithm defeat this<br>purpose ?<br></div></blockquote><br></div><div>Honestly, I don't know. But W3 defaulted to an HTML5 validator:</div><div><a href="http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fsearch%3Fsource%3Dig%26hl%3Den%26rlz%3D%26%3D%26q%3Dhtml5%26aq%3Df%26aqi%3D%26aql%3D%26oq%3D%26gs_rfai%3D&charset=%28detect+automatically%29&doctype=Inline&group=0">http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fsearch%3Fsource%3Dig%26hl%3Den%26rlz%3D%26%3D%26q%3Dhtml5%26aq%3Df%26aqi%3D%26aql%3D%26oq%3D%26gs_rfai%3D&charset=%28detect+automatically%29&doctype=Inline&group=0</a></div><div><br></div><div>On Jul 12, 2010, at 7:55 AM, Julian Reschke wrote:</div><blockquote type="cite"><div><font class="Apple-style-span" color="#000000"><br></font>Any evidence for this? And how do you know the reason for CNET isn't the same?<br><br>VERY unconvinced.<br></div></blockquote><br><div>Such skepticism! I really didn't expect that I said anything controversial. 
We all know Google is a black box, and I don't work for them, nor have I ever, so I can't testify that what I say is a fact...</div><div><br></div><div>Google does not publish a full API for their search results, in spite of the fact that they publish almost everything else. That's because their search results / API is their golden egg – they can't give that away, and they need to protect it. I've tried scraping results with YQL and it failed. I made other attempts that had more success, only to find they no longer worked a week later.</div><div><br></div><div>Studying the validation results more closely, however, does not back up my original statement – the markup does look as I described, but the errors the validator reports are not the ones I expected. What I had in mind was, for the most part, unclosed tags, which I didn't know validated. I knew they were allowed, but thought the validator would complain.</div><div><br></div><div>Besides protecting their API, Google will also scratch and claw to save every byte. They are the gold standard of a high-performance website. While this may or may not explain the things that don't validate, what it does say is that nothing coming from <a href="http://google.com">google.com</a> is accidental.</div><div><br></div><div>Mike</div></body></html>