[whatwg] hello list
Simon Pieters
zcorpan at hotmail.com
Sun Apr 16 13:58:27 PDT 2006
Hi,
From: "Serban Ghita" <serban.ghita at verasys.com>
>I have a web crawler that I am using for personal research. It crawls the
>entire site, finding all the links and creating a sitemap, and grabs some
>statistics. After a while I felt that I could do more than that, so I
>decided to make it parse HTML code and extract some statistics about tags.
You may be interested in Google's Web Authoring Statistics[1].
>For the moment I have created an array with all HTML tags (deprecated ones
>too), grouped by their structure type (block, inline, single; that's what I
>call them). I am parsing the HTML code using regular expressions, but when
>I searched the net, I saw lots of people saying: don't parse HTML using
>regex.
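You can't reliably parse HTML with regular expressions, because HTML's
parsing rules (optional tags, comments, quoted attribute values, error
recovery) are more complicated than a regular expression can express.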
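
To illustrate the point, here is a minimal sketch that counts start tags in
two ways: with a naive regular expression and with Python's standard-library
html.parser. The sample markup, the regex, and the names TagCounter,
count_with_regex and count_with_parser are made up for this example. Note
that html.parser is a tolerant tokenizer, not an implementation of the
parsing rules browsers follow.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample: a ">" inside a quoted attribute value, a tag inside
# a comment, and <p> elements whose end tags are omitted.
SAMPLE = """<p title="a > b">one
<!-- <div>commented out</div> -->
<p>two <br> three
"""

# Naive pattern (an assumption for this sketch): "<", a tag name, then
# anything up to the next ">".
NAIVE_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def count_with_regex(html):
    """Count start tags using the naive regular expression."""
    return Counter(name.lower() for name in NAIVE_TAG.findall(html))


class TagCounter(HTMLParser):
    """Count start tags reported by the tokenizer."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_with_parser(html):
    """Count start tags using html.parser."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    print("regex :", dict(count_with_regex(SAMPLE)))
    print("parser:", dict(count_with_parser(SAMPLE)))
```

The regex reports a div element that only exists inside a comment, while the
tokenizer skips the comment and reports no div at all. Both approaches find
the two p and one br start tags in this particular sample.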
>I studied a bit more, then I found the relation between the HTML
>document and the DTD (Document Type Definition) declaration. I've noticed
>that browsers rely on it (the public ones are cached, and the
>custom ones are fetched before the HTML document is parsed).
Actually, browsers don't parse DTDs at all for HTML. The doctype is only
used to choose a rendering mode (quirks or standards).
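
As a rough illustration that an HTML parser can treat the doctype as an
opaque token, here is a small sketch using Python's standard-library
html.parser (not a browser). The class name DoctypeRecorder, the helper
parse, and the DTD identifiers in the sample are hypothetical. The system
identifier points at a host that does not exist, and the document still
parses, because nothing is fetched.

```python
from html.parser import HTMLParser

# Hypothetical document: the public and system identifiers are invented,
# and example.invalid can never resolve.
DOC = (
    '<!DOCTYPE html PUBLIC "-//Example//DTD Hypothetical//EN" '
    '"http://example.invalid/hypothetical.dtd">'
    "<h1>Title</h1><p>Some text"
)


class DoctypeRecorder(HTMLParser):
    """Record the doctype declaration and the start tags that follow it."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.tags = []

    def handle_decl(self, decl):
        # The declaration arrives as a plain string; it is not interpreted.
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse(html):
    """Feed the markup to a DoctypeRecorder and return it."""
    recorder = DoctypeRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder


if __name__ == "__main__":
    result = parse(DOC)
    print("doctype:", result.doctype)
    print("tags   :", result.tags)
```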
>Can you point me to some documentation that explains the way a browser
>parses HTML documents, or the way it uses the DTD for interpreting
>the tags and their attributes?
It is specified in the Parsing section[2] of Web Applications 1.0.
[1] http://code.google.com/webstats/index.html
[2] http://whatwg.org/specs/web-apps/current-work/#parsing
Regards,
Simon Pieters