[whatwg] The <iframe> element and sandboxing ideas

Sat Jul 26 00:05:53 PDT 2008

Warning: This is going to be a little bit of an HTML Purifier
evangelising post.

Frode Børli wrote:
> Yeah, I thought about that also. Then we have more complex attributes
> such as style='font-family: expression(a+5);'... So your
> sanitizer must also parse CSS properly - including unescaping
> entities.

HTML Purifier handles this by decoding all entities (hexadecimal,
decimal and named) before it processes the HTML. Output text is always
in UTF-8 and thus never has entities.
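
As a rough illustration of why decoding first removes the ambiguity,
here is a minimal Python sketch. It uses the standard library's
html.unescape and is not HTML Purifier's actual (PHP) code; the
function name decode_entities and the sample strings are made up for
the example.

```python
import html


def decode_entities(text: str) -> str:
    """Decode hex, decimal and named character references to characters."""
    return html.unescape(text)


# Three spellings of the same attribute value; only the escaping of "("
# differs between them.
variants = [
    "font-family: expression&#x28;a+5);",  # hexadecimal reference
    "font-family: expression&#40;a+5);",   # decimal reference
    "font-family: expression&lpar;a+5);",  # named reference
]

# After decoding, every variant collapses to one and the same string,
# so later checks only ever see plain characters.
decoded = {decode_entities(v) for v in variants}
```

Once everything is plain characters, a filter has a single form to
inspect instead of one form per encoding.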

Also, note that a character reference such as &#40; is HTML escaping,
not CSS escaping. CSS has its own escaping syntax, and HTML Purifier
handles that too.
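
To make the distinction concrete: CSS escapes are a backslash followed
by one to six hex digits (with one optional trailing whitespace
character), or a backslash followed by a literal character. The Python
sketch below decodes them under that simplified reading of the CSS
syntax. It is only an illustration, not how HTML Purifier implements
it, and decode_css_escapes is a hypothetical name.

```python
import re

# Backslash + 1-6 hex digits (+ one optional whitespace character),
# or backslash + any single character other than a newline.
_CSS_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})[ \t\n\r\f]?|\\([^\n\r\f])")


def decode_css_escapes(css: str) -> str:
    """Replace CSS backslash escapes with the characters they stand for."""

    def replace(match):
        hex_digits, literal = match.group(1), match.group(2)
        if hex_digits is None:
            # "\(" style escape: the character simply stands for itself.
            return literal
        code_point = int(hex_digits, 16)
        # Zero, surrogates and out-of-range values become U+FFFD.
        if (
            code_point == 0
            or code_point > 0x10FFFF
            or 0xD800 <= code_point <= 0xDFFF
        ):
            return "\ufffd"
        return chr(code_point)

    return _CSS_ESCAPE.sub(replace, css)
```

For example, decode_css_escapes(r"e\78pression\28 a+5)") gives
"expression(a+5)", which is why a sanitizer has to decode CSS escapes
separately from HTML entities before it looks at the property value.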

> For all I know - a future invention may introduce a new method of
> encoding entities also, so your sanitizer must support all future
> entity encodings.

I'm not sure what you mean by this, but since entities are converted to
characters up front, this is not a problem.

> Of course we can skip supporting the style attribute - but there are
> not many other ways to style content in XHTML.

The style attribute is supported.

> A bank wants an HTML-messaging system where the customer can write
> HTML-based messages to customer support through the online banking
> system. Customer support personnel have access to perform transactions
> worth millions of dollars through the intranet web interface (where
> they also receive HTML-based messages from customers).

A few problems with this theoretical situation:

1. Why does the bank need an HTML messaging system?
2. Why is this system on the same domain as the intranet web interface?
3. Why do customer support personnel have access to the transaction
interface?

But whatever, it's not really relevant to the topic at hand.

> Security depends on a perfect sanitizer. Would you sell your
> sanitizer to this bank without any disclaimers, and say that your
> sanitizer will be valid for eternity and for all browsers that the
> bank decides to use internally in the future?

Well, it's an open-source sanitizer. But that aside, if I were selling
them a support contract, I would not say "valid for eternity". However,
I would be confident that a bug in the sanitizer is far more likely than
a future browser breaking it. The reason is the principle of
backwards compatibility: my sanitizer only allows HTML/CSS that has
well-defined behavior in all current browsers.
colspan="expr(3+4)" is theoretically valid and safe HTML, but it doesn't
have well-defined behavior across browsers, so it is sanitized out.
colspan="4" is well-defined, valid and safe, and unless a browser
decides 4 is a magic number that should trigger the execution of
JavaScript code in a nearby node, it will stay safe.
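
The colspan case can be sketched as an allowlist check: accept only
values whose meaning every browser agrees on, and drop the rest. The
Python below is a minimal sketch of that idea, assuming "well-defined"
means a plain positive integer. It is not HTML Purifier's actual
validation code, and sanitize_colspan is a hypothetical name.

```python
import re
from typing import Optional

# Only ASCII digits, no leading zero, no sign, no units, no expressions.
_POSITIVE_INTEGER = re.compile(r"[1-9][0-9]*")


def sanitize_colspan(value: str) -> Optional[str]:
    """Return the colspan value if it is a plain positive integer, else None.

    None means "remove the attribute": anything the pattern does not
    recognise is rejected instead of being passed through on the hope
    that it is harmless.
    """
    candidate = value.strip()
    if _POSITIVE_INTEGER.fullmatch(candidate):
        return candidate
    return None
```

With this check, sanitize_colspan("4") returns "4" and
sanitize_colspan("expr(3+4)") returns None. The point of the design is
that a new browser feature can only cause trouble if it changes the
meaning of an already well-defined value.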

> Today I would not allow HTML-based messages since I could never be
> sure enough that the sanitizer was perfect.

I encourage you to try out HTML Purifier <http://htmlpurifier.org>. It's
certainly not perfect (we've had a total of two security problems with
the core code, three if you count a Shift_JIS-related vulnerability, and
four if you count an XSS vulnerability in a testing script for the
library), but I hope it comes close.