[imps] HTML5 and libxml2
Henri Sivonen
hsivonen at iki.fi
Sat Apr 5 01:10:40 PDT 2008
On Apr 5, 2008, at 07:06, Edward Z. Yang wrote:
> Unfortunately, certain tag names cause libxml2 to choke, and HTML5
> doesn't specify any way to:
>
> 1. Munge the name into something libxml2 finds acceptable
> 2. Ignore the tag as invalid
>
> Without modifying the algorithms, (2) is not tenable, so I've been
> looking at (1).
[...]
> So, in short, due to underlying library limitations I can't put
> arbitrary characters in a tag (which is what Firefox actually seems to
> do), and I don't know exactly what characters I need to get rid of.
> Advice?
In the Validator.nu HTML parser, I've solved this by having three
available policies:
public enum XmlViolationPolicy {
    /**
     * Conform to HTML 5, allow XML 1.0 to be violated.
     */
    ALLOW,

    /**
     * Halt when something cannot be mapped to XML 1.0.
     */
    FATAL,

    /**
     * Be non-conforming and alter the infoset to fit
     * XML 1.0 when something would otherwise not be
     * mappable to XML 1.0.
     */
    ALTER_INFOSET
}
It seems like ALLOW isn't a possibility for libxml2.
With ALTER_INFOSET, tag tokens whose names do not match the NCName
production of Namespaces in XML 1.0 are ignored in the tokenizer. This
is non-conforming but works most of the time. (There are many more
similar situations; you can find them by searching for ALTER_INFOSET
in the source.)
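As a rough illustration of how the three policies could be applied to a
tag name, here is a minimal sketch. It is not the Validator.nu code: the
class XmlViolationSketch and the methods isAsciiNCName and filterTagName
are hypothetical names, and the NCName check is a deliberately simplified
ASCII-only approximation (the real production also admits many non-ASCII
characters).

```java
import java.util.Optional;

final class XmlViolationSketch {

    enum XmlViolationPolicy { ALLOW, FATAL, ALTER_INFOSET }

    // ASSUMPTION: ASCII-only approximation of NCName. A real check must
    // follow the Name production of XML 1.0 minus the colon.
    static boolean isAsciiNCName(String name) {
        if (name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean startOk = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z') || c == '_';
            boolean restOk = startOk
                    || (c >= '0' && c <= '9') || c == '-' || c == '.';
            if (i == 0 ? !startOk : !restOk) {
                return false;
            }
        }
        return true;
    }

    // Returns the name to emit, or empty if the tag token should be
    // ignored. Throws under FATAL when the name cannot be mapped.
    static Optional<String> filterTagName(String name,
                                          XmlViolationPolicy policy) {
        if (policy == XmlViolationPolicy.ALLOW || isAsciiNCName(name)) {
            return Optional.of(name);
        }
        if (policy == XmlViolationPolicy.FATAL) {
            throw new IllegalArgumentException("Not an NCName: " + name);
        }
        // ALTER_INFOSET: drop the offending tag token.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(
            filterTagName("foo:bar", XmlViolationPolicy.ALTER_INFOSET)
                .isPresent());
    }
}
```

Returning an empty Optional keeps "ignore the token" distinct from "emit
it unchanged", so the caller decides how to skip the tag.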
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/