[whatwg] Test cases for parsing spec (Was: Re: Providing Better Tools)

Sam Ruby rubys at intertwingly.net
Thu Dec 7 03:10:59 PST 2006


Karl Dubost wrote:
> Sam,
> 
> On 6 Dec 2006, at 23:13, Sam Ruby wrote:
>> My original interest was to write a replacement for Python's SGMLLIB, 
>> i.e., one that was not based on the theoretical ideal of how SGML 
>> vocabularies work, but one based on the practical notion of how HTML 
>> actually is parsed.
> 
> I'm not sure sgmllib would be the best target, especially if it's used 
> in many other products. But maybe you are talking about a new library 
> altogether.
> 
>     http://docs.python.org/lib/module-sgmllib.html
>     8.2 sgmllib -- Simple SGML parser
> 
>     This module defines a class SGMLParser which serves as the basis for
>     parsing text files formatted in SGML (Standard Generalized Mark-up
>     Language). In fact, it does not provide a full SGML parser -- it only
>     parses SGML insofar as it is used by HTML, and the module only exists
>     as a base for the htmllib module. Another HTML parser which supports
>     XHTML and offers a somewhat different interface is available in the
>     HTMLParser module.
> 
> It seems a better candidate.
> 
>     http://docs.python.org/lib/module-HTMLParser.html
>     8.1 HTMLParser -- Simple HTML and XHTML parser
> 
>      New in version 2.2.
> 
>     This module defines a class HTMLParser which serves as the basis for
>     parsing text files formatted in HTML (HyperText Mark-up Language) and
>     XHTML. Unlike the parser in htmllib, this parser is not based on the
>     SGML parser in sgmllib.
> 
> I'm adding them to the list of HTML parsers.
> http://esw.w3.org/topic/HTMLAsSheAreSpoke

htmllib is based on sgmllib (and so shares some of the same issues), 
and it is also a bit draconian.  It is less suitable than sgmllib for 
consuming HTML as practiced.
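
To make "HTML as practiced" concrete, here is a minimal sketch, assuming 
Python 3, where the HTMLParser module quoted above is available as 
html.parser. The EventLogger class, the log_events helper, and the sample 
markup are made up for illustration; they are not part of any library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.events


if __name__ == "__main__":
    # Tag soup: two unclosed <p> elements and an unclosed <b>.
    for event in log_events("<p>one<p>two<b>bold</p>"):
        print(event)
```

The parser accepts this input without complaint, but it only reports the 
tags as written. It does not close the dangling <p> or <b> the way a 
browser would, and that recovery behaviour is the part a parser based on 
how HTML is actually parsed would have to supply.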

I was originally thinking about creating an htmllib2, much like there is 
a urllib2 (in the standard library) and an httplib2 (by Joe Gregorio).  
It now looks like it makes more sense to name it htmllib5, and 
potentially to join forces with others who (may) have similar interests.

- Sam Ruby


