[whatwg] Sandboxing to accommodate user generated content.

Tue Jun 17 15:12:03 PDT 2008

>> I have been reading up on past discussions on sandboxing content, and
>> I feel that it is generally agreed on that there should be some
>> mechanism for marking content as "user generated". The discussion
>> mainly appears to be focused on implementation. Please read my
>> implementation notes at the end of this message on how we can include
>> this function safely for both HTML 4 and HTML 5 browsers, and still
>> allow HTML 4 browsers to function properly.
>>
>> My main arguments for having this feature (in one form or another) in
>> the browser are:
>>
>> - It is future proof. Changes to browsers (for example adding
>> expression support to css) will never again require old sanitizers to
>> be updated.
>
> If the sanitiser uses a whitelist based approach that forbids everything by
> default, and then only allows known elements and attributes; and in the case
> of the style attribute, known properties and values that are safe, then that
> would also be the case.

I have written a sanitizer for html and it is very difficult -
especially since browsers have undocumented bugs in their parsing.

Example: <div colspan=&amp;
style=font-family&#61;expression&#40;alert&#40&quot;hacked&quot&#41&#41
colspan=&amp;>Red</div>

The proof that sanitizing HTML is difficult is the fact that no major
site even attempts it. Even Wikipedia uses some obscure wiki-language,
instead of implementing a wysiwyg editor.
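
For reference, the whitelist approach described in the quoted reply could
be sketched roughly as below. This is only a minimal illustration, assuming
the markup has already been parsed into a tree of plain objects (which is
exactly the step where the browser parsing quirks cause trouble). The node
shape ({tag, attrs, children}), the names sanitizeNode and sanitizeStyle,
and the contents of the allowlists are all hypothetical.

```javascript
// Hypothetical allowlists: everything not listed here is forbidden by default.
// Tag and attribute names are assumed to be lowercased by the parser.
const ALLOWED_ELEMENTS = new Set(["p", "div", "b", "i", "em", "strong"]);
const ALLOWED_ATTRIBUTES = new Set(["title"]);
// Each allowed CSS property maps to a pattern of values considered safe.
const ALLOWED_STYLES = new Map([
  ["color", /^(red|green|blue|black)$/],
  ["font-weight", /^(normal|bold)$/],
]);

// Keep only "property: value" declarations whose property and value
// are both on the allowlist; everything else is silently dropped.
function sanitizeStyle(style) {
  const kept = [];
  for (const declaration of String(style).split(";")) {
    const colon = declaration.indexOf(":");
    if (colon === -1) continue;
    const property = declaration.slice(0, colon).trim().toLowerCase();
    const value = declaration.slice(colon + 1).trim().toLowerCase();
    const pattern = ALLOWED_STYLES.get(property);
    if (pattern && pattern.test(value)) kept.push(property + ": " + value);
  }
  return kept.join("; ");
}

// Return a sanitized copy of the node, or null if the element is not allowed.
// Text nodes (plain strings) pass through unchanged; the serializer is
// assumed to escape them when writing the page.
function sanitizeNode(node) {
  if (typeof node === "string") return node;
  if (!ALLOWED_ELEMENTS.has(node.tag)) return null;
  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    if (name === "style") {
      const safe = sanitizeStyle(value);
      if (safe) attrs.style = safe;
    } else if (ALLOWED_ATTRIBUTES.has(name)) {
      attrs[name] = value;
    }
  }
  const children = (node.children || [])
    .map(sanitizeNode)
    .filter((child) => child !== null);
  return { tag: node.tag, attrs, children };
}
```

In this sketch a div carrying an onclick attribute and a style containing
expression(...) keeps only the allowlisted declarations, and a script child
is dropped altogether. The hard part, turning arbitrary tag soup into such
a tree the same way every browser does, is not covered here.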

> Note that sandboxing doesn't entirely remove the need for sanitising user
> generated content on the server, it's just an extra line of defence in case
> something slips through.

Of course. However, the sandbox feature in the browser will be fail-safe if
user generated content is escaped with &lt; and &gt; before being sent
to the browser - as long as the browser does not have bugs, of course.
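
The escaping step itself is small. A minimal sketch follows; the function
name escapeForSandbox is hypothetical, and escaping "&" as well is my own
addition here (without it, user input that already contains the text
"&lt;" would be indistinguishable from an escaped tag).

```javascript
// Escape user generated content before it is written into the page.
// "&" is replaced first so that the entities produced by the next two
// steps are not escaped a second time.
function escapeForSandbox(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}
```

After this step the content contains no literal < or > at all, so a browser
that knows nothing about the sandbox can only ever show it as plain text.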

>> The suggested solution of using an attribute on an <iframe> element
>> for storing the user generated content has several problems;
>> 1: The use of src= as a fallback means that style information will be
>> lost and stylesheets must be loaded again.
> This is not a major problem.  If it uses the same stylesheet, which can be
> cached by the browser, then at worst it results in a 304 Not Modified
> response.

Many small rivers...

>> 2: The use of src= yields problems with iframe heights (since the
>> src-url must be hosted on another server javascript cannot fix this)
>> and HTML 4 browsers have no other method of adjusting the iframe
>> height according to the content.
> In recent browsers that support cross-document messaging (Opera 9, Safari 3,
> Firefox 3 and IE 8), you could include a script within the comment page that
> calculates its own height and sends a message to the parent page with the
> info.  In older browsers, just set the height to a reasonable minimum and
> let the user scroll.  Sure, it's not perfect, but it's called graceful
> degradation.

Much more difficult to implement than a <sandbox></sandbox> mechanism
- and I do not see the point of giving more work to web developers when
it could be fixed so easily.
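
To show what the cross-document messaging route involves, here is a rough
sketch of both halves. It assumes a browser whose postMessage accepts
object messages; the function names, the "comment-height" message type and
the example origins are hypothetical. The window, document and iframe
objects are passed in as parameters only to keep the sketch self-contained.

```javascript
// Runs inside the framed comment page: measure the document and tell
// the parent page how tall it is.
function reportHeight(doc, parentWindow, parentOrigin) {
  const height = doc.documentElement.scrollHeight;
  parentWindow.postMessage({ type: "comment-height", height: height }, parentOrigin);
}

// Runs in the parent page: build a "message" event handler that resizes
// one iframe, ignoring messages from any other origin or window.
function makeHeightHandler(iframe, commentOrigin) {
  return function onMessage(event) {
    if (event.origin !== commentOrigin) return;
    if (event.source !== iframe.contentWindow) return;
    const data = event.data;
    if (!data || data.type !== "comment-height") return;
    const height = Number(data.height);
    if (!Number.isFinite(height) || height <= 0) return;
    iframe.style.height = Math.ceil(height) + "px";
  };
}

// Wiring in a real page (origins are placeholders):
//   reportHeight(document, window.parent, "https://www.example.com");
//   window.addEventListener("message",
//     makeHeightHandler(frameElement, "https://comments.example.com"));
```

That is two scripts on two hosts, plus origin checks, for every site that
wants correctly sized comments.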

>> 3: If you have a page that lists 60 comments on a blog, then the user
>> agent would have to contact the server 60 times to fetch each comment.
>> This again means that perl/php scripts have to be invoked 60 times for
>> one page view - that is 61 separate database connections and session
>> initializations.
> You could always concatenate all of the comments into a single file,
> reducing it down to 1 request.

No, you could not if, for example, you want people to be able to report
comments or vote on them - which in the Web 2.0 world requires scripting.

>> 4: For the fallback method of using src= for HTML 4 browsers to
>> actually work, the fallback documents must be hosted on a separate
>> domain name. This again means that a website using HTTPS must purchase
>> and maintain two certificates.
> I don't see that as a show stopper.

Well, I am not going to argue anymore. I have not heard anybody speak
in favour of a sandbox mechanism here or contribute anything
constructive. The only feedback has been that you could do it with
iframes, and if it looks ugly in HTML 4 browsers, then that is only
graceful degradation, so it is okay. Maybe the future is Flash and
Silverlight after all. We'll see.

>> If HTML 5 browsers require everything between <htmlarea></htmlarea> to
>> be html entity escaped, that is < and > must be replaced with &lt; and
>> &gt; respectively. If this is not done, HTML 5 browsers will issue a
>> severe warning and refuse to display the page. Developers will quickly
>> learn.
>
> Draconian error handling is something we really want to avoid, particularly
> when such an error can be triggered by failing to handle user generated
> content properly.

I see that argument. Maybe you have a suggestion for what should happen
if unescaped HTML is encountered then?

>> HTML 4 browsers will never run scripts (since it will only see plain
>> text). HTML 5 browsers will display rich text. It would be completely
>> secure for both HTML 4 and HTML 5 browsers.
>>
>> A simple JavaScript could clean up the HTML markup for HTML 4 browsers.
>
> In a separate mail, you wrote:
>> <data>
>> &lt;user supplied input&gt;
>> </data>
>>
>> Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4
>> browsers will display html, while HTML 5 browsers will display
>> correctly formatted code. A simple javascript like this (untested)
>> would make the data tags readable for HTML 4 browsers:
>>
>> var els = document.getElementsByTagName("DATA");
>> for(e in els) els[e].innerHTML =
>> els[e].innerHTML.replace(/<&#91;^>&#93;*>/g, "").replace(/\n/g,
>> "<br>");
>
> At first, I had no idea what that script was trying to do.  But AFAICT, you
> were trying to use this regex: /<[^>]*>/g, which would theoretically match
> "<foo>".  But, in this context, even with the corrected regex, the script is
> entirely useless.

Yes, but you are taking attention away from the point. No matter how
many bugs there are in my *untested* script, my method will be
completely safe in all browsers. Please comment on that instead.
Creating a working script would take me an hour max.

> It wouldn't work, for example, with <foo bar=">" baz="xxx">.  But also
> because the inner HTML that you're running the regex on is supposed to have
> all < and > escaped, and so nothing would be matched anyway.

This only means that if people insert invalid HTML then their message
will look ugly - it is not a security issue. If people insert actual
real life HTML then the script would generate nice looking plain text
without html markup (as long as the script is modified to use &lt; and
&gt; instead of < and > in the regexp).
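
A minimal sketch of that modified script, working on the escaped text. The
function name escapedMarkupToPlainText is hypothetical, and the commented
wiring at the end assumes the content sits in <data> elements as in my
earlier mail.

```javascript
// Turn escaped markup into readable plain text for browsers that do not
// know the sandbox element: drop everything that looks like an escaped
// tag, then turn line breaks into <br>.
function escapedMarkupToPlainText(escaped) {
  return String(escaped)
    .replace(/&lt;[\s\S]*?&gt;/g, "") // from "&lt;" to the nearest "&gt;"
    .replace(/\r?\n/g, "<br>");
}

// Wiring in a page (not run here):
//   for (const el of document.getElementsByTagName("data")) {
//     el.innerHTML = escapedMarkupToPlainText(el.innerHTML);
//   }
```

With the <foo bar=">" baz="xxx"> example, the escaped input loses
everything up to the first &gt; and the remainder, " baz="xxx"&gt;, stays
behind as visible text. Ugly, but the output never gains a literal < or >.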

>> A problem with this approach is that developers might forget to escape
>> tags, therefore I think browsers should display a security warning
>> message if the character < or > is encountered inside a <data> tag.
> If a developer forgot to escape the markup at all, then a user could enter
> "</data><script>...</script>" and do anything they wanted.

Yes, that is my point. That is why I want the sandbox to display a
severe security warning if the developer has forgotten to escape it.
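
The check itself is trivial. As a sketch (the function name is
hypothetical, and this only models the test a browser, or a developer's
own pre-flight check on the server, would apply to the raw text placed
between the tags):

```javascript
// True if the raw content still contains a literal "<" or ">", which
// means the developer forgot to escape it.
function containsUnescapedMarkup(rawContent) {
  return /[<>]/.test(String(rawContent));
}
```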

This method will be safe for all browsers that have ever existed and
that will ever exist in the future. If new features are introduced in
some future version of CSS or HTML - the sandbox is still there and
the applications created today do not need to have their sanitizers
updated, ever.