[imps] HTML5 and libxml2

Ian Hickson ian at hixie.ch
Fri Apr 4 22:34:24 PDT 2008


On Sat, 5 Apr 2008, Edward Z. Yang wrote:
> 
> HTML5 does not specify any validation mechanism by which to ensure the 
> element has the form stipulated by tag name, i.e. [A-Za-z-]+

That (erroneous, as it happens) paragraph is just describing a trend in 
the spec's tag names; it's not a conformance criterion of any kind. The 
conformance criterion is really just that the elements in the document 
have to be the elements defined by the spec.

You may find this post helpful in determining how to read the HTML5 spec:

   http://ln.hixie.ch/?start=1140242962&count=1


> Unfortunately, certain tag names cause libxml2 to choke, and HTML5 
> doesn't specify any way to:
> 
> 1. Munge the name into something libxml2 finds acceptable
> 2. Ignore the tag as invalid

Indeed, both of these behaviours would be non-conforming.

Can you change libxml2 to support more characters? Is there a real 
technical reason for the limitation, or is it just enforcing XML 
requirements?

The characters allowed in tag names are far from the only area where XML 
and HTML differ, so if it is just a matter of libxml2 enforcing XML's 
requirements, it will not work well.


> So, in short, due to underlying library limitations I can't put
> arbitrary characters in a tag (which is what Firefox actually seems to
> do), and I don't know exactly what characters I need to get rid of. Advice?

If you can't implement what the spec requires, then make sure to document 
the limitations clearly. Meanwhile, you can probably get away with 
replacing unusable characters with U+FFFD, or at a pinch, "_", so long as 
you still use the full tag names in the parser to determine which tags 
are open. However, make sure to document this as a conformance problem.
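
A minimal sketch of that workaround, in Python for brevity. Everything 
named here (munge_tag_name, OpenElements, the two regular expressions) is 
hypothetical and not part of libxml2 or the spec. The set of "usable" 
characters is an assumption: a deliberately conservative ASCII subset, 
not libxml2's actual rule. Whether U+FFFD itself is accepted depends on 
the library, so the replacement character is a parameter.

```python
import re

# Assumption: a conservative ASCII subset of what an XML library accepts
# in element names. Substitute the real rule of the library in use.
_START_OK = re.compile(r"[A-Za-z_]")
_REST_OK = re.compile(r"[A-Za-z0-9_.-]")


def munge_tag_name(name, replacement="\uFFFD"):
    """Replace each character the tree library cannot store."""
    out = []
    for i, ch in enumerate(name):
        allowed = _START_OK if i == 0 else _REST_OK
        out.append(ch if allowed.fullmatch(ch) else replacement)
    return "".join(out)


class OpenElements:
    """Stack of open elements that remembers the full, original names."""

    def __init__(self, replacement="\uFFFD"):
        self._replacement = replacement
        self._stack = []  # (original_name, munged_name) pairs

    def push(self, name):
        """Record the original name; return the name to give the library."""
        munged = munge_tag_name(name, self._replacement)
        self._stack.append((name, munged))
        return munged

    def has_open(self, name):
        """Match against original names only, never the munged ones."""
        return any(original == name for original, _ in self._stack)


stack = OpenElements(replacement="_")
stored = stack.push("foo<bar")   # the tree library would see "foo_bar"
still_open = stack.has_open("foo<bar")
confused = stack.has_open("foo_bar")
```

Because the stack compares original names, an end tag for "foo_bar" does 
not close an element that was opened as "foo<bar"; only the name handed 
to the tree library is altered.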

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


