[whatwg] Sandboxing to accommodate user generated content.

Tue Jun 17 13:11:32 PDT 2008

Frode Børli wrote:
> I have been reading up on past discussions on sandboxing content, and
> I feel that it is generally agreed on that there should be some
> mechanism for marking content as "user generated". The discussion
> mainly appears to be focused on implementation. Please read my
> implementation notes at the end of this message on how we can include
> this function safely for both HTML 4 and HTML 5 browsers, and still
> allow HTML 4 browsers to function properly.
> 
> My main arguments for having this feature (in one form or another) in
> the browser is:
> 
> - It is future proof. Changes to browsers (for example adding
> expression support to css) will never again require old sanitizers to
> be updated.

If the sanitiser uses a whitelist-based approach that forbids everything 
by default and then allows only known elements and attributes (and, in 
the case of the style attribute, only known properties and values that 
are safe), then the sanitiser would be just as future proof.
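
To make the whitelist idea concrete, here is a minimal sketch. The element, 
attribute and style lists are hypothetical, as is the { name, attrs } shape 
used to represent an already-parsed element; a real sanitiser would sit 
behind a proper HTML parser and choose its own lists.

```javascript
// Hypothetical allowlists: anything not listed here is dropped.
const ALLOWED_ELEMENTS = new Map([
  ["p", []],
  ["em", []],
  ["strong", []],
  ["a", ["href"]],
  ["span", ["style"]],
]);
const ALLOWED_STYLES = new Map([
  ["font-weight", /^(normal|bold)$/],
  ["text-align", /^(left|right|center)$/],
]);

// Keep only declarations whose property AND value are both known to be safe.
function filterStyle(style) {
  return style
    .split(";")
    .map((decl) => decl.split(":").map((part) => part.trim().toLowerCase()))
    .filter(([prop, value, ...rest]) => {
      const pattern = ALLOWED_STYLES.get(prop);
      return rest.length === 0 && value !== undefined &&
        pattern !== undefined && pattern.test(value);
    })
    .map(([prop, value]) => `${prop}: ${value}`)
    .join("; ");
}

// Returns a cleaned copy of the element, or null if the element is not allowed.
function sanitizeElement(element) {
  const name = element.name.toLowerCase();
  const allowedAttrs = ALLOWED_ELEMENTS.get(name);
  if (allowedAttrs === undefined) return null; // forbidden by default
  const attrs = {};
  for (const [key, value] of Object.entries(element.attrs)) {
    const attr = key.toLowerCase();
    if (!allowedAttrs.includes(attr)) continue; // unknown attribute: drop it
    if (attr === "style") {
      attrs.style = filterStyle(value);
    } else if (attr === "href") {
      // Only plain http(s) links; this rejects javascript: and friends.
      if (/^https?:\/\//i.test(value)) attrs.href = value;
    }
  }
  return { name, attrs };
}
```

With these lists, a style value such as 
"font-weight: bold; width: expression(alert(1))" is reduced to 
"font-weight: bold". A property that a browser adds later is rejected 
without the sanitiser ever having heard of it, because it is not on the list.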

> - It does not require much skill and effort from the web developer to
> safely sanitize user content.
> - Security bugs are fixed by browser vendors, and not by each web developer.

Note that sandboxing doesn't entirely remove the need for sanitising 
user generated content on the server; it's just an extra line of defence 
in case something slips through.

> The suggested solution of using an attribute on an <iframe> element
> for storing the user generated content has several problems;
> 
> 1: The use of src= as a fallback means that style information will be
> lost and stylesheets must be loaded again.

This is not a major problem.  If it uses the same stylesheet, which can 
be cached by the browser, then at worst it results in a 304 Not Modified 
response.

> 2: The use of src= yields problems with iframe heights (since the
> src-url must be hosted on another server javascript cannot fix this)
> and HTML 4 browsers have no other method of adjusting the iframe
> height according to the content.

In recent browsers that support cross-document messaging (Opera 9, 
Safari 3, Firefox 3 and IE 8), you could include a script within the 
comment page that calculates its own height and sends a message to the 
parent page with the info.  In older browsers, just set the height to a 
reasonable minimum and let the user scroll.  Sure, it's not perfect, but 
it's called graceful degradation.
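
A rough sketch of that height-reporting approach, assuming hypothetical 
origins (https://comments.example for the comment page and 
https://www.example for the parent) and a made-up "comment-height" message 
type. The two functions take their window, document and iframe as 
parameters, so the browser wiring is shown only in comments.

```javascript
// Runs in the comment page (inside the iframe): measure our own height
// and send it to the parent as a string message.
function reportHeight(doc, parentWindow, parentOrigin) {
  const height = doc.documentElement.scrollHeight;
  const message = JSON.stringify({ type: "comment-height", height });
  parentWindow.postMessage(message, parentOrigin);
}

// Runs in the parent page: resize the iframe, but only for well-formed
// messages that come from the expected origin. Returns true if applied.
function handleHeightMessage(event, iframe, commentOrigin) {
  if (event.origin !== commentOrigin) return false;
  let message;
  try {
    message = JSON.parse(event.data);
  } catch (e) {
    return false; // not JSON, ignore
  }
  if (!message || message.type !== "comment-height") return false;
  const height = Number(message.height);
  if (!Number.isFinite(height) || height <= 0) return false;
  iframe.style.height = Math.ceil(height) + "px";
  return true;
}

// Browser wiring (assumed origins, adjust to the real ones):
//   In the comment page:
//     window.addEventListener("load", () =>
//       reportHeight(document, window.parent, "https://www.example"));
//   In the parent page:
//     window.addEventListener("message", (e) =>
//       handleHeightMessage(e, iframe, "https://comments.example"));
```

Browsers without cross-document messaging never deliver the message, so 
the iframe simply keeps whatever minimum height the page gave it.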

> 3: If you have a page that lists 60 comments on a blog, then the user
> agent would have to contact the server 60 times to fetch each comment.
> This again means that perl/php scripts have to be invoked 60 times for
> one page view - that is 61 separate database connections and session
> initializations.

You could always concatenate all of the comments into a single file, 
reducing it down to 1 request.

> 4: For the fallback method of using src= for HTML 4 browsers to
> actually work, the fallback documents must be hosted on a separate
> domain name. This again means that a website using HTTPS must purchase
> and maintain two certificates.

I don't see that as a show stopper.

> My solution:
> 
> If we add a new element <htmlarea></htmlarea>, old browsers will run
> scripts, while new browsers will stop scripts and this is a major
> problem.
> 
> If HTML 5 browsers require everything between <htmlarea></htmlarea> to
> be html entity escaped, that is < and > must be replaced with &lt; and
> &gt; respectively. If this is not done, HTML 5 browsers will issue a
> severe warning and refuse to display the page. Developers will quickly
> learn.

Draconian error handling is something we really want to avoid, 
particularly when such an error can be triggered by failing to 
handle user generated content properly.

> HTML 4 browsers will never run scripts (since it will only see plain
> text). HTML 5 browsers will display rich text. It would be completely
> secure for both HTML 4 and HTML 5 browsers.
> 
> A simple Javascript could clean up the HTML markup for HTML 4 browsers..

In a separate mail, you wrote:
> <data>
> 
> <user supplied input>
> 
> </data>
> 
> Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4
> browsers will display html, while HTML 5 browsers will display
> correctly formatted code. A simple javascript like this (untested)
> would make the data tags readable for HTML 4 browsers:
> 
> var els = document.getElementsByTagName("DATA");
> for(e in els) els[e].innerHTML =
> els[e].innerHTML.replace(/<[^>]*>/g, "").replace(/\n/g,
> "<br>");

At first, I had no idea what that script was trying to do.  But AFAICT, 
you were trying to use this regex: /<[^>]*>/g, which would theoretically 
match "<foo>".  But, in this context, even with the corrected regex, the 
script is entirely useless.

It wouldn't work, for example, with <foo bar=">" baz="xxx">, because the 
first > inside the attribute value ends the match early.  More 
importantly, the inner HTML that you're running the regex on is supposed 
to have all < and > escaped, and so nothing would be matched anyway.
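
Both failure modes are easy to check in isolation. A small sketch, where 
stripTags is just a name chosen here for the regex replacement step of the 
quoted script:

```javascript
// The tag-stripping step on its own.
const stripTags = (html) => html.replace(/<[^>]*>/g, "");

// Simple tags are removed as intended.
const simple = stripTags("<foo>text</foo>");
// simple === "text"

// A ">" inside an attribute value ends the match early, so the tail of
// the tag is left behind in the output.
const quoted = stripTags('<foo bar=">" baz="xxx">');
// quoted === '" baz="xxx">'

// Properly escaped content contains no literal "<" at all, so the regex
// matches nothing and the string comes back unchanged.
const escaped = stripTags("&lt;b&gt;bold&lt;/b&gt;");
// escaped === "&lt;b&gt;bold&lt;/b&gt;"
```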

> A problem with this approach is that developers might forget to escape
> tags, therefore I think browsers should display a security warning
> message if the character < or > is encountered inside a <data> tag.

If a developer forgot to escape the markup at all, then a user could 
enter "</data><script>...</script>" and do anything they wanted.
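
For comparison, a minimal sketch of the escaping step that prevents this. 
escapeHtml and renderComment are hypothetical helper names, written in 
JavaScript only for illustration; the same idea applies in whatever 
language generates the page on the server.

```javascript
// Escape the characters that are significant in HTML text and attribute
// values. "&" must be handled first so existing entities are not doubled up
// by the later replacements.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap user input in the proposed <data> element, escaped.
function renderComment(userInput) {
  return "<data>" + escapeHtml(userInput) + "</data>";
}

const attack = "</data><script>alert(1)</script>";
const safeMarkup = renderComment(attack);
// safeMarkup ===
//   "<data>&lt;/data&gt;&lt;script&gt;alert(1)&lt;/script&gt;</data>"
```

Once escaped, the input can no longer close the <data> element, so the 
injected script stays inert text. Without the escapeHtml call, the same 
input would end the element early and the script would run.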

-- 
Lachlan Hunt - Opera Software
http://lachy.id.au/
http://www.opera.com/