[whatwg] Sandboxing to accommodate user generated content.

Wed Jun 18 00:19:51 PDT 2008

Frode Børli wrote:
>>> I have been reading up on past discussions on sandboxing content, and
>>>
>>> My main arguments for having this feature (in one form or another) in
>>> the browser is:
>>>
>>> - It is future proof. Changes to browsers (for example adding
>>> expression support to css) will never again require old sanitizers to
>>> be updated.

Unless some braindead vendor adds a scripting-in-sandbox feature, which
would be just as braindead as unlimited expression support in CSS. You
cannot be future proof unless you trust all the players, including ALL
possible browser vendors.

>> If the sanitiser uses a whitelist based approach that forbids everything by
>> default, and then only allows known elements and attributes; and in the case
>> of the style attribute, known properties and values that are safe, then that
>> would also be the case.
> 
> I have written a sanitizer for html and it is very difficult -
> especially since browsers have undocumented bugs in their parsing.
> 
> Example: <div colspan=&
> style=font-family=expression(alert&#40"hacked&quot&#41&#41
> colspan=&>Red</div>

Every real sanitizer MUST parse the input and generate its internal DOM.
If you then generate a known-good serialization of that DOM, there's no
way your sanitizer would ever output such code. I, too, have written my
own simplified HTML parser that converts all unknown parts to data (that
is, it escapes all of the following characters: "<>&'). Just parse the
input into a DOM and only after that check it for safe content.

You cannot sanitize HTML using only string replacements without
generating a DOM. The whole DOM does not need to be in memory at once:
it is possible to process the input as a stream, handle one tag at a
time, and keep only a stack of open tag names in addition.
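
As a rough sketch of that streaming approach in Python: the standard
html.parser module does the tokenizing, a whitelist decides what gets
re-serialized, and everything else is escaped into data. The ALLOWED
table, the http(s)-only URL rule and the names StreamSanitizer and
sanitize() are assumptions made for illustration only. Each site has to
choose its own feature set, and this is not a vetted sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {"b": set(), "i": set(), "p": set(), "a": {"href"}}
# Assumed value policy for the only whitelisted attribute (href).
SAFE_URL_PREFIXES = ("http://", "https://")


class StreamSanitizer(HTMLParser):
    """Re-serialize whitelisted tags; turn all other markup into escaped text.

    Comments, doctypes and processing instructions are dropped, because
    the default HTMLParser handlers for them do nothing.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, always known-good or escaped
        self.open_tags = []  # stack of whitelisted tags that are still open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            # Unknown tag: emit its raw source text as data.
            self.out.append(escape(self.get_starttag_text()))
            return
        parts = [tag]
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # attribute not whitelisted for this tag
            value = value.strip()
            if value.lower().startswith(SAFE_URL_PREFIXES):
                parts.append('%s="%s"' % (name, escape(value)))
        self.out.append("<" + " ".join(parts) + ">")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            # Stray or unknown close tag becomes data.
            self.out.append(escape("</%s>" % tag))
            return
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(markup):
    parser = StreamSanitizer()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    while parser.open_tags:  # close whatever the input left open
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)
```

With this table, sanitize("Some <b>more</b> content <i>here</i>.")
returns the input unchanged, while markup outside the whitelist, such as
a <script> element or the <div ... expression(...)> example above, comes
out as escaped text. The output is always built from the whitelist or
passed through escape(), never copied from the input as markup.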

> The proof that sanitizing HTML is difficult is the fact that no major
> site even attempts it. Even Wikipedia uses some obscure wiki-language,
> instead of implementing a wysiwyg editor.

Wikipedia does sanitize HTML in the content. It does support its own
wiki-language in addition to HTML. For example, try to input the
following text as is in the Wikipedia sandbox page and press "Show preview":

***
>
> Example: <div colspan=&
> style=font-family=expression(alert&#40"hacked&quot&#41&#41
> colspan=&>Red</div>

Some <b>more</b> content <i>here</i>.
***

Works just fine. The content is sanitized and unrecognized parts are
converted to data. Correctly written parts are used as HTML tags.

Trust me, it's really not that hard. The hard part is to decide which
tags, which attributes and which attribute values you want to allow.
And you have to decide that by yourself - there's no magic silver
bullet safe feature set that is suitable for every usage and for every site.

If you don't want to go through all this trouble, do not try to allow
HTML or any other markup in user generated content unless you *really*
trust your users.

>> Note that sandboxing doesn't entirely remove the need for sanitising user
>> generated content on the server, it's just an extra line of defence in case
>> something slips through.
> 
> Of course. However, the sandbox feature in browser will be fail safe if
> user generated content is escaped with &lt; and &gt; before being sent
> to the browser - as long as the browser does not have bugs of course.

That's a pretty big "if". If the page author / server application
programmer is always able to escape content correctly, how much harder
is it to correctly escape and sanitize the content anyway?

All this sounds too much like magic_quotes in PHP...

>>> A problem with this approach is that developers might forget to escape
>>> tags, therefore I think browsers should display a security warning
>>> message if the character < or > is encountered inside a <data> tag.
>> If a developer forgot to escape the markup at all, then a user could enter
>> "</data><script>...</script>" and do anything they wanted.
> 
> Yes, that is my point. That is why I want the sandbox to display a
> severe security warning if the developer has forgotten to escape it.

Isn't that a bit too late? If the developer does not test his
application before release, what's the point of breaking the whole
site in the user's browser as a result? It will not guard against XSS,
because the user generated content can *first* be used to end the
sandbox and *then* be used to insert the XSS attack. The browser sees
only valid content in the sandbox and the site is still under XSS attack.
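
A minimal sketch of that failure, in Python. The <data> element is the
one proposed earlier in this thread; wrap_in_sandbox() and the payload
are made up for illustration and are not part of any real API.

```python
from html import escape


def wrap_in_sandbox(user_content, escape_content):
    """Hypothetical server-side template step for the proposed <data> element."""
    body = escape(user_content) if escape_content else user_content
    return "<data>" + body + "</data>"


# The attacker closes the sandbox, injects a script, then reopens the sandbox.
payload = "</data><script>alert(1)</script><data>"

unsafe = wrap_in_sandbox(payload, escape_content=False)
safe = wrap_in_sandbox(payload, escape_content=True)

print(unsafe)  # <data></data><script>alert(1)</script><data></data>
print(safe)    # one <data> element containing only escaped text
```

In the unescaped case both <data> elements are empty, so a check for
"<" or ">" inside the sandbox has nothing to warn about, and the script
sits outside of it. In the escaped case the payload is inert, but then
the escaping has already done the work.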

> This method will be safe for all browsers that have ever existed and
> that will ever exist in the future. If new features are introduced in
> some future version of CSS or HTML - the sandbox is still there and
> the applications created today do not need to have their sanitizers
> updated, ever.

That's a pretty bold claim! I guess that a similar claim could have been
made about CSS support before Microsoft added the "expression()" value
syntax.

Can *you* guarantee that a random browser vendor does not implement
anything stupid for the sandbox content in the future?

-- 
Mikko
