[imps] HTML5 and libxml2

Henri Sivonen hsivonen at iki.fi
Sat Apr 5 01:10:40 PDT 2008


On Apr 5, 2008, at 07:06, Edward Z. Yang wrote:
> Unfortunately, certain tag names cause libxml2 to choke, and HTML5
> doesn't specify any way to:
>
> 1. Munge the name into something libxml2 finds acceptable
> 2. Ignore the tag as invalid
>
> Without modifying the algorithms, (2) is not tenable, so I've been
> looking at (1).
[...]
> So, in short, due to underlying library limitations I can't put
> arbitrary characters in a tag (which is what Firefox actually seems to
> do), and I don't know exactly what characters I need to get rid of.  
> Advice?

In the Validator.nu HTML parser, I've solved this by having three  
available policies:

public enum XmlViolationPolicy {
     /**
      * Conform to HTML 5, allow XML 1.0 to be violated.
      */
     ALLOW,

     /**
      * Halt when something cannot be mapped to XML 1.0.
      */
     FATAL,

     /**
      * Be non-conforming and alter the infoset to fit
      * XML 1.0 when something would otherwise not be
      * mappable to XML 1.0.
      */
     ALTER_INFOSET
}

It seems that ALLOW isn't an option with libxml2, since the library
rejects names that XML 1.0 does not permit.

With ALTER_INFOSET, tag tokens whose names do not match the NCName
production of Namespaces in XML 1.0 are ignored in the tokenizer. This
is non-conforming but works most of the time. (There are many more
similar situations, which you can find by searching for ALTER_INFOSET
in the source.)
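
To illustrate how the three policies could be applied to a tag name, here
is a minimal sketch. It is not the Validator.nu code: the class
TagNamePolicySketch and the methods isAsciiNcName and applyPolicy are
hypothetical names made up for this example, and the name check is a
deliberate simplification that only accepts the ASCII subset of NCName
(the real production also admits many non-ASCII characters).

```java
import java.util.Optional;

/** Redeclared from the enum quoted above so the sketch compiles on its own. */
enum XmlViolationPolicy { ALLOW, FATAL, ALTER_INFOSET }

/** Hypothetical helper; an illustration only, not the actual parser code. */
final class TagNamePolicySketch {

    /**
     * Simplified assumption: only the ASCII subset of NCName is accepted.
     * First character: letter or '_'. Later characters: also digit, '-' or '.'.
     * A colon is never allowed, because an NCName is a name without colons.
     */
    static boolean isAsciiNcName(String name) {
        if (name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean startOk = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || c == '_';
            boolean restOk = startOk
                    || (c >= '0' && c <= '9')
                    || c == '-'
                    || c == '.';
            if (i == 0 ? !startOk : !restOk) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the tag name to emit, or an empty Optional when the tag token
     * should be ignored (the ALTER_INFOSET behaviour described above).
     */
    static Optional<String> applyPolicy(String tagName, XmlViolationPolicy policy) {
        if (policy == XmlViolationPolicy.ALLOW || isAsciiNcName(tagName)) {
            return Optional.of(tagName);
        }
        if (policy == XmlViolationPolicy.FATAL) {
            throw new IllegalArgumentException(
                    "Tag name cannot be mapped to XML 1.0: " + tagName);
        }
        // ALTER_INFOSET: drop the token instead of passing a bad name on.
        return Optional.empty();
    }
}
```

With a libxml2-backed tree, a caller would presumably use only FATAL or
ALTER_INFOSET and skip the tag whenever applyPolicy returns an empty
Optional.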

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/




