[whatwg] HTML 5 parsing - not just for browsers?
dolphinling at myrealbox.com
Sun Feb 12 23:50:57 PST 2006
A site I use has recently had a number of holes in their filters. In the
time I've spent investigating and helping users to protect themselves
and the website to close the holes, it's become pretty obvious that
string-based filtering (i.e. disallowing certain strings of text) is
futile. It's impossible to anticipate every string that would need to be
blocked, and even if that were done, new ones would keep appearing.
A better way to filter user-uploaded HTML is to parse it, filter the DOM
by removing all elements and attributes not on a whitelist, and then
re-serialize it and output that. That way you can be assured that what's
output is (valid) proven-safe HTML with (in non-empty elements)
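
To make the parse / filter / re-serialize idea concrete, here is a rough
sketch in Python. It uses the standard library's html.parser, which is a
tolerant tokenizer and not the HTML 5 parsing algorithm, so it only
illustrates the shape of the approach. The whitelist, the list of allowed
URL schemes, and the names Sanitizer and sanitize are all made up for this
example. Tags that are not on the whitelist are dropped, and any text
inside them is kept as escaped text.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> set of allowed attribute names.
ALLOWED = {
    "p": set(), "b": set(), "i": set(), "em": set(),
    "strong": set(), "br": set(), "a": {"href"},
}
VOID = {"br"}  # elements that never get an end tag
SAFE_SCHEMES = ("http://", "https://", "mailto:")  # assumption for this sketch


def _clean_attrs(tag, attrs):
    """Serialize only whitelisted attributes, escaping their values."""
    kept = []
    for name, value in attrs:
        if name not in ALLOWED[tag]:
            continue
        value = value or ""
        # Refuse href values that do not start with a known-safe scheme.
        if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
            continue
        kept.append(f' {name}="{escape(value, quote=True)}"')
    return "".join(kept)


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the re-serialized output
        self.open_tags = []  # whitelisted elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself
        self.out.append(f"<{tag}{_clean_attrs(tag, attrs)}>")
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or non-whitelisted end tag
        # Close everything opened since the matching start tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        # Text is always re-escaped, never copied through verbatim.
        self.out.append(escape(data, quote=False))


def sanitize(fragment):
    parser = Sanitizer()
    parser.feed(fragment)
    parser.close()
    # Close any elements the input left open.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(sanitize('<p onclick="x()">Hi <b>there</p>'))
```

The point of the design is that nothing from the input reaches the output
unless it was parsed, checked against the whitelist, and written out again
by the serializer. A real filter would want the HTML 5 tree construction
rules so that it sees the same tree a browser would.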
So, will the HTML 5 parsing section be of use here? Will it be of use to
things other than browsers? Are small differences needed because
what's being parsed is a document fragment instead of a document? And
when it's re-serialized, how closely will today's browsers' interpretations
of the original and the new markup match?