[imps] HTML5 and libxml2

Edward Z. Yang edwardzyang at thewritingpot.com
Sat Apr 5 08:04:13 PDT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Hickson wrote:
> That (erroneous, as it happens) paragraph is just describing a trend
> in the spec's tag names, it's not a conformance criterion of any kind.
> 
Should I submit a patch fixing the error?

> The conformance criterion is really just that the elements in the
> document have to be the elements defined by the spec.

But the spec also defines behavior for elements that are outside of
the spec, i.e. an error condition. I'd appreciate it if the allowed tag
names were made a normative requirement for such elements.

> You may find this post helpful in determining how to read the HTML5
> spec:

Thanks. I've heard of RFC 2119 before, but I didn't realize that
statements of fact that don't use those keywords should not be
considered normative. Many of the W3C's specs explicitly state which
statements are normative and which are informative.

> Can you change libxml2 to support more characters? Is there a real 
> technical reason for the limitation, or is it just enforcing XML 
> requirements?

I've pinged the libxml2 list and should have an answer back soon.

> The characters allowed in tag names are by far not the only area
> where XML and HTML differ, so if it is just a matter of libxml2
> enforcing XML's requirements, it will not work well.

What are these differences, specifically?

> If you can't implement what the spec requires, then make sure to
> document the limitations clearly in your documentation. Meanwhile,
> you can probably get away with replacing unusable characters with
> U+FFFD,

Unfortunately, U+FFFD is an invalid character too. :-)

> or at a pinch, "_", so long as you still use the full tag names in
> the parser to determine which tags are open. However, make sure to
> document this as being a conformance problem in your documentation.

This might be tricky, but it occurs to me that as long as the
substitution is applied the same way to start and end tags,
<a@>t</a@> becomes <a_>t</a_>, which is equivalent. I will, of course,
document it.
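
A minimal sketch of that substitution, in Python, might look like the
following. It assumes an ASCII-only approximation of the XML name rules
(the real rules admit many non-ASCII characters), and
sanitize_tag_name is a hypothetical helper, not part of libxml2 or
HTML Purifier.

```python
import string

# Assumption: ASCII-only stand-in for the characters XML allows in names.
NAME_START = set(string.ascii_letters + "_")
NAME_REST = NAME_START | set(string.digits + ".-")


def sanitize_tag_name(name: str) -> str:
    """Replace every character that is not allowed at its position with "_"."""
    out = []
    for i, ch in enumerate(name):
        allowed = NAME_START if i == 0 else NAME_REST
        out.append(ch if ch in allowed else "_")
    return "".join(out)


# The same mapping is applied to start and end tags, so a matched pair
# stays matched after substitution.
print(sanitize_tag_name("a@"))  # start tag <a@> and end tag </a@> map alike

# Caveat: the mapping is not one-to-one, so distinct names can collide.
print(sanitize_tag_name("a@") == sanitize_tag_name("a#"))
```

Because distinct names such as a@ and a# collapse to the same
substitute, comparing the original names when deciding which tags are
open, and substituting only when the tree is built, avoids treating
<a@>t</a#> as a matched pair.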

Henri Sivonen:
> With ALTER_INFOSET, tag tokens that do not match Namespaces in XML
> 1.0 NCName are ignored in the tokenizer. This is non-conforming but
> works most of the time. (There are many more similar situations you
> can find by searching for ALTER_INFOSET in the source.)

This is what I had been considering with (2), but it looked like I'd
have to make multiple modifications to the algorithm to get that to
work. I would look at the source, but I can't seem to find it! All I
can find is the build script, and I don't have Python.
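
For comparison, the "ignore the token" approach described above could
be sketched roughly as below. This is only an illustration under
assumptions: the (kind, value) token shape and the function name are
hypothetical, the pattern is an ASCII-only stand-in for NCName, and it
is not taken from the actual ALTER_INFOSET code.

```python
import re

# Assumption: ASCII-only approximation of the Namespaces in XML NCName rule.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def drop_unrepresentable_tags(tokens):
    """Yield tokens, skipping start/end tags whose name is not an NCName.

    tokens is an iterable of (kind, value) pairs, where kind is one of
    "start", "end" or "text" (a hypothetical token shape).
    """
    for kind, value in tokens:
        if kind in ("start", "end") and not NCNAME_ASCII.fullmatch(value):
            continue  # non-conforming: the tag is silently dropped
        yield (kind, value)


stream = [("start", "p"), ("start", "a@"), ("text", "t"),
          ("end", "a@"), ("end", "p")]
print(list(drop_unrepresentable_tags(stream)))
```

The text inside the dropped element survives; only the tag tokens
themselves are discarded before tree construction sees them.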

- --
 Edward Z. Yang                        GnuPG: 0x869C48DA
 HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
 [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH95TtqTO+fYacSNoRAk+IAJ4gPLXGHbSuAsUQBaO2Fgu4XMm5WQCfbSd/
JAcnZflMEh0uxRbJ2gwww9E=
=t+U6
-----END PGP SIGNATURE-----


