[whatwg] Sandboxing to accommodate user generated content.

Tue Feb 17 17:26:32 PST 2009

On Tue, 17 Jun 2008, Frode Børli wrote:
> 
> A major challenge for many web developers is validating "untrusted" content
> such as the message body of a blog comment. Unless the developer has a
> flawless and future proof algorithm for ensuring that the message body does
> not contain any script, web developers have to resort to text only - or
> bbCode-style markup languages to allow users to post text content with
> richer formatting. [...]
> 
> Another problem which makes future proofing this type of security
> difficult is that standards evolve. A few years ago you could safely
> allow users to apply
> css-styles to tags. [...]

In general, using whitelisting together with a real parser and serialiser 
combination, e.g. what html5lib does now, allows one to have a pretty 
secure and future-proof sanitiser.

> One solution:
> 
> <htmlarea>User generated content</htmlarea>
> 
> No scripts would ever be allowed to be executed inside this tag. 
> Malicious users could potentially submit "</htmlarea> unsafe content 
> <htmlarea>" and get around this. There are as I can see it two solutions 
> to this:
> 
> User generated content inside the tag must be escaped using html 
> entities (but still rendered as html by the user agent), or the author 
> must prevent users from submitting the string "</htmlarea>" and all 
> possible variations of the tag.
> 
> If the first solution is used, then browsers should display a strong 
> security warning if unescaped content is seen between htmlarea-tags on a 
> website (to educate web developers).

HTML5 now has something similar to this:

   <iframe sandbox src="data:text/html;base64,..."></iframe>

...where "..." is the sanitised user-provided content, base64-encoded.
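
As a rough illustration of how a server might build that markup, here is a 
minimal Python sketch. The function name sandboxed_iframe is hypothetical, 
the charset parameter in the data: URL is my own addition so that non-ASCII 
comments decode predictably, and the input is assumed to have been 
sanitised already.

```python
import base64
from html import escape


def sandboxed_iframe(sanitised_html: str) -> str:
    """Wrap already-sanitised markup in a sandboxed iframe via a data: URL.

    Assumption: the caller has sanitised the content beforehand. This
    function only handles the encoding, it does no filtering itself.
    """
    # Base64 output only uses A-Z, a-z, 0-9, '+', '/' and '=', so the
    # user's markup can never terminate the attribute or the element.
    payload = base64.b64encode(sanitised_html.encode("utf-8")).decode("ascii")
    src = "data:text/html;charset=utf-8;base64," + payload
    # Escaping the attribute value is belt and braces here.
    return '<iframe sandbox src="' + escape(src, quote=True) + '"></iframe>'


if __name__ == "__main__":
    print(sandboxed_iframe("<p>Nice post!</p>"))
```

Because every comment becomes an attribute value in the page itself, a page 
listing many comments still needs only the one request.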

On Tue, 17 Jun 2008, Frode Børli wrote:
> 
> In the discussions I find that backward compatibility is absolutely the 
> most important issue. Second is that it must be easy for web developers 
> to use the features.
> 
> The suggested solution of using an attribute on an <iframe> element for 
> storing the user generated content has several problems:
> 
> 1: The use of src= as a fallback means that style information will be 
> lost and stylesheets must be loaded again.

The CSS can be embedded in the iframed snippets in the transition period; 
in the long term, the "seamless" attribute side-steps this issue.

> 2: The use of src= yields problems with iframe heights (since the 
> src-url must be hosted on another server javascript cannot fix this) and 
> HTML 4 browsers have no other method of adjusting the iframe height 
> according to the content.

The "seamless" attribute addresses this also, though admittedly there is 
no good short-term fix for this.

> 3: If you have a page that lists 60 comments on a blog, then the user 
> agent would have to contact the server 60 times to fetch each comment.

With data: URLs, all the comments can be included in the original request.

> 4: For the fallback method of using src= for HTML 4 browsers to actually 
> work, the fallback documents must be hosted on a separate domain name. 
> This again means that a website using HTTPS must purchase and maintain 
> two certificates.

Indeed, this is a problem with any solution that is intended to work with 
today's browsers without server-side sanitisation.

> If we add a new element <htmlarea></htmlarea>, old browsers will run 
> scripts, while new browsers will stop scripts and this is a major 
> problem.

Indeed.

> If HTML 5 browsers require everything between <htmlarea></htmlarea> to 
> be html entity escaped, that is < and > must be replaced with &lt; and 
> &gt; respectively. If this is not done, HTML 5 browsers will issue a 
> severe warning and refuse to display the page. Developers will quickly 
> learn.

How would the browser know when the </htmlarea> tag is the actual end tag 
or just something that the author forgot to escape?

> HTML 4 browsers will never run scripts (since it will only see plain 
> text). HTML 5 browsers will display rich text. It would be completely 
> secure for both HTML 4 and HTML 5 browsers.
>
> A simple Javascript could clean up the HTML markup for HTML 4 browsers.

Wouldn't that reintroduce the security bugs?

On Wed, 18 Jun 2008, Frode Børli wrote:
> 
> I have written a sanitizer for html and it is very difficult - 
> especially since browsers have undocumented bugs in their parsing.
> 
> Example: <div colspan=&
> style=font-family=expression(alert&#40"hacked&quot&#41&#41
> colspan=&>Red</div>

A sanitiser that did what I describe above would not be affected by this 
(or any other similar problem). Basically, you would have to parse the 
content using the HTML5 parser rules, and then reserialise the content, 
dropping any element or attribute or attribute value that is not 
explicitly whitelisted. It is critical that for every allowed attribute, 
the value be parsed using the relevant rules (e.g. CSS for style="", as a 
URL for href="", etc), and then the values therein reserialised in the 
same manner for that language (e.g. only serialising CSS properties that 
have whitelisted property values).

Yes, this is non-trivial, in that you basically have to implement 
everything you want to allow. But it's actually not as bad as you might 
think, because your parsers don't have to be perfect -- they need but 
support the allowed features, and reserialise them from first principles, 
without copying any strings from the original content.
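
To make the parse, whitelist and reserialise idea concrete, here is a 
minimal Python sketch. It is only an illustration under several 
assumptions: it uses the standard library's html.parser, which does not 
implement the HTML5 parser rules, so a real deployment would want an HTML5 
parser such as html5lib. The whitelists, the class name WhitelistSanitiser 
and the function sanitise are hypothetical, and only href values are 
handled (parsed as a URL and rebuilt), not style="".

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a"}  # hypothetical whitelist
ALLOWED_ATTRS = {"a": {"href"}}                      # per-element attributes
ALLOWED_SCHEMES = {"http", "https"}
DROP_CONTENT = {"script", "style"}                   # discard the text too


class WhitelistSanitiser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of the reserialised output
        self.open_tags = []   # whitelisted elements currently open
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag, keep its text content
        parts, seen = [tag], set()
        for name, value in attrs:
            if name in seen or name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            seen.add(name)
            clean = self.clean_url(value or "")
            if clean is not None:
                parts.append('%s="%s"' % (name, escape(clean, quote=True)))
        self.out.append("<" + " ".join(parts) + ">")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag not in self.open_tags:
            return
        # Close intervening elements so the output stays well nested.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            # Text is re-escaped rather than copied through verbatim.
            self.out.append(escape(data, quote=False))

    @staticmethod
    def clean_url(value):
        """Parse the value as a URL and rebuild it, or return None."""
        try:
            parts = urlsplit(value.strip())
        except ValueError:
            return None
        if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.netloc:
            return None
        return urlunsplit(parts)


def sanitise(untrusted: str) -> str:
    parser = WhitelistSanitiser()
    parser.feed(untrusted)
    parser.close()
    for tag in reversed(parser.open_tags):  # close anything left open
        parser.out.append("</%s>" % tag)
    return "".join(parser.out)
```

With this sketch, sanitise('<p onclick="x()">Hi <b>there</b></p>') gives 
'<p>Hi <b>there</b></p>': nothing outside the whitelists reaches the output, 
whatever the browser quirk it was meant to exploit.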

On Tue, 17 Jun 2008, Bob Auger wrote:
> 
> A solution I discovered for this problem (others I'm sure as well that 
> aren't speaking) borrows from the defenses of cross-site request forgery 
> (CSRF) where a non guessable token is used. Take the following example
> 
> <data id="GUID">
> </data>
> </data id="<GUID>">
> 
> GUID would be a temporary GUID value such as 
> 'F9968C5E-CEB2-4faa-B6BF-329BF39FA1E4' that would be tied to the user 
> session. An attacker would be unable to break out of a <data> tag due to 
> the fact that they couldn't guess the closing ID value. This is 
> something that could be built into a web framework (JSP tag/PHP 
> function/asp.net component) that could handle the token generation 
> portion to assist with adoption.

I considered this approach when adding <iframe sandbox>. The problem is 
that it relies on authors actually making up unpredictable unique IDs. In 
practice, I would not be surprised to find that many authors can't do 
this, and don't understand the implications of getting it wrong. For 
example, I'd expect to find that the GUIDs used in examples in the spec 
would be amongst the most commonly used GUIDs, leaving sites totally 
unprotected, and with the owner having a false sense of security.

It's also vulnerable to brute-forcing. If the attacker can guess roughly 
what the GUID might be (e.g. noticing that they all start the same way 
because they're all generated from a MAC address+time or something), then 
they can simply give a whole host of </data ....> tags and hope for the 
best.
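
To illustrate the predictability concern, here is a small Python sketch 
contrasting a time-and-MAC-based GUID with a token drawn from a 
cryptographic random source. The function names and the FIXED_NODE value 
are hypothetical; the node is pinned only to stand in for a server's MAC 
address. Even an unpredictable token does not help if authors copy one 
from an example, and it does nothing for the XML case.

```python
import secrets
import uuid

FIXED_NODE = 0x0123456789AB  # hypothetical MAC address, for illustration


def predictable_guid() -> str:
    """Version 1 UUID: built from a timestamp plus the node (MAC) value."""
    return str(uuid.uuid1(node=FIXED_NODE))


def unpredictable_token() -> str:
    """128 bits from the operating system's CSPRNG, as 32 hex digits."""
    return secrets.token_hex(16)


if __name__ == "__main__":
    a, b = predictable_guid(), predictable_guid()
    # The final 12 hex digits are the node, so they match on every GUID
    # from the same machine; an attacker only has to guess the rest.
    print(a[-12:] == b[-12:])
    print(unpredictable_token())
```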

Finally, changing the syntax of end tags only works for text/html and 
doesn't solve the problem for XML.

> To take this a step further there may be situations where user content 
> is reflected inside of HTML tags in the following manner such as '<a 
> href="<user generated value>">foo</a>'. For situations like this an 
> additional attribute (along the lines of what you propose) could be 
> added to this tag (or any tag for that matter) to instruct the browser 
> that no script/html can execute.
> 
> <a sandbox="true" href="javascript:alert(document.cookie)">asd</a>
> <a sandbox="true" href="<injected value>">asd</a>  (injected value  "
> onload="javascript:alert('wooot')" foo="bar)

Again, one has to whitelist data that comes from the user. Nothing will 
stop that. I don't think we want to start throwing sandbox attributes 
everywhere. Better to just put all such content in a single sandbox 
mechanism, IMHO.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'