[imps] HTML5 and libxml2

Edward Z. Yang edwardzyang at thewritingpot.com
Fri Apr 4 21:06:42 PDT 2008

As per the W3C 5 April 2008 working draft, elements not recognized by
HTML5 in body are still added to the DOM using the "A start tag token
not covered by the previous entries" rule. HTML5 does not specify any
validation mechanism to ensure that the element name has the form
stipulated for tag names [1], i.e. [A-Za-z-]+

Unfortunately, certain tag names cause libxml2 to choke, and HTML5
doesn't specify any way to:

1. Munge the name into something libxml2 finds acceptable
2. Ignore the tag as invalid
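
A minimal sketch of what option (1) could look like, in Python for
brevity. It assumes an ASCII-only approximation of the XML Name rules
(non-ASCII name characters are ignored, and ':' is excluded because it
is conventionally reserved for namespace prefixes). The function name
munge_tag_name and the "_" filler are hypothetical, and whether libxml2
accepts exactly this set is untested here.

```python
import re

# ASCII-only approximation of XML Name characters (assumption: non-ASCII
# name characters and ':' are deliberately left out).
_BAD_CHAR = re.compile(r"[^A-Za-z0-9._-]")
_GOOD_START = re.compile(r"[A-Za-z_]")


def munge_tag_name(name, filler="_"):
    """Hypothetical helper: coerce a tag name into the ASCII Name subset."""
    # Replace every character outside the allowed set with the filler.
    cleaned = _BAD_CHAR.sub(filler, name)
    # A name may not be empty or start with a digit, '.', or '-',
    # so prefix the filler in those cases.
    if not cleaned or not _GOOD_START.match(cleaned):
        cleaned = filler + cleaned
    return cleaned
```

Note that this mapping is lossy: two distinct input names can munge to
the same output, so it is only a stopgap for getting nodes into the tree.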

Without modifying the algorithms, (2) is not tenable, so I've been
looking at (1). However, HTML5's tag name stipulations appear to be too
restrictive: they do not allow digits, as seen in <h1> and friends, and
aren't even a subset of the allowed XML tag names [2] (XML specifies that
a hyphen cannot begin a tag name, and allows a greater variety of
punctuation and international characters).
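
To make the mismatch concrete, here is a rough comparison of the two
grammars in Python. XML_NAME_ASCII is only an ASCII approximation of the
XML Name rules (the real production also admits many non-ASCII
characters), and classify is a hypothetical helper name.

```python
import re

# The tag name form quoted above from the HTML5 draft.
HTML5_TAG_NAME = re.compile(r"[A-Za-z-]+")
# ASCII-only approximation of XML Name (assumption: non-ASCII ignored).
XML_NAME_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")


def classify(name):
    """Return (matches HTML5 form, matches ASCII XML Name form)."""
    return (
        HTML5_TAG_NAME.fullmatch(name) is not None,
        XML_NAME_ASCII.fullmatch(name) is not None,
    )
```

Under these assumptions, classify("h1") gives (False, True) and
classify("-foo") gives (True, False), so neither set contains the other.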

So, in short, due to underlying library limitations I can't put
arbitrary characters in a tag (which is what Firefox actually seems to
do), and I don't know exactly what characters I need to get rid of. Advice?

[1] http://www.w3.org/html/wg/html5/#tag-name
[2] http://www.w3.org/TR/REC-xml/#NT-Name

--
 Edward Z. Yang                        GnuPG: 0x869C48DA
 HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
 [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]