[whatwg] Sandboxing to accommodate user generated content.

Tue Jun 17 17:13:40 PDT 2008

The problem with the tag warning is that if </data> is the first token
inserted, there will be no warning, because the resulting code will be
valid.  So the key question remains: how do you tell an unescaped </data>
from the closing </data>?  And the warning, if applicable, will go to the
wrong person: to all readers instead of just one writer.
What is invalid about <img alt=">" src="next.png">?
It is not enough to sketch some JavaScript that looks all right and
correctly sifts out plain text for some test cases; you would have to prove
that it does the right thing in all cases.
Contrary to what you say, MediaWiki sanitizes HTML.  You can contribute to
Wikipedia without using their templates; the templates are there just to
make contributing easier.
It should be possible to keep all contributed content in one file, with
units identified as document fragments.  You still have one request per
unit, but all of them request the same data file.

-----Original Message-----
From: whatwg-bounces at lists.whatwg.org
[mailto:whatwg-bounces at lists.whatwg.org] On Behalf Of Frode Borli
Sent: Wednesday, June 18, 2008 12:12 AM
To: Lachlan Hunt
Cc: whatwg at lists.whatwg.org
Subject: Re: [whatwg] Sandboxing to accommodate user generated content.

>> I have been reading up on past discussions on sandboxing content, and
>> I feel that it is generally agreed on that there should be some
>> mechanism for marking content as "user generated". The discussion
>> mainly appears to be focused on implementation. Please read my
>> implementation notes at the end of this message on how we can include
>> this function safely for both HTML 4 and HTML 5 browsers, and still
>> allow HTML 4 browsers to function properly.
>>
>> My main arguments for having this feature (in one form or another) in
>> the browser is:
>>
>> - It is future proof. Changes to browsers (for example adding
>> expression support to css) will never again require old sanitizers to
>> be updated.
>
> If the sanitiser uses a whitelist based approach that forbids everything
> by default, and then only allows known elements and attributes; and in
> the case of the style attribute, known properties and values that are
> safe, then that would also be the case.
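
A minimal sketch of the whitelist approach described above, assuming the
input has already been parsed into a tree of plain objects (a string for
text, or { tag, attrs, children } for an element). The allowed tag and
attribute lists and all function names are hypothetical, chosen only for
illustration; parsing the markup safely is the hard part and is not shown.

```javascript
// Hypothetical whitelist: everything not listed here is forbidden by default.
const ALLOWED_TAGS = new Set(["b", "i", "em", "strong", "p", "a"]);
const ALLOWED_ATTRS = { a: new Set(["href"]) };

// Escape text so it can never be read back as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Only http(s) links pass; javascript:, data: and the rest are dropped.
function isSafeUrl(url) {
  return /^https?:\/\//i.test(String(url));
}

// node is either a string (text) or { tag, attrs, children }.
function sanitize(node) {
  if (typeof node === "string") return escapeHtml(node);
  const children = (node.children || []).map(sanitize).join("");
  const tag = String(node.tag).toLowerCase();
  // Unknown element: drop the element itself, keep its sanitized children.
  if (!ALLOWED_TAGS.has(tag)) return children;
  const allowed = ALLOWED_ATTRS[tag] || new Set();
  let attrs = "";
  for (const [name, value] of Object.entries(node.attrs || {})) {
    const key = name.toLowerCase();
    if (!allowed.has(key)) continue;                  // e.g. style, onclick
    if (key === "href" && !isSafeUrl(value)) continue;
    attrs += ` ${key}="${escapeHtml(value)}"`;
  }
  return `<${tag}${attrs}>${children}</${tag}>`;
}
```

Because the style attribute is simply not on the list, a new CSS feature
such as expression() cannot get through until someone adds it on purpose.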

I have written a sanitizer for html and it is very difficult -
especially since browsers have undocumented bugs in their parsing.

Example: <div colspan=&
style=font-family=expression(alert&#40"hacked&quot&#41&#41
colspan=&>Red</div>

The proof that sanitizing HTML is difficult is the fact that no major
site even attempts it. Even Wikipedia uses some obscure wiki-language,
instead of implementing a wysiwyg editor.

[snip]

>> 2: The use of src= yields problems with iframe heights (since the
>> src-url must be hosted on another server javascript cannot fix this)
>> and HTML 4 browsers have no other method of adjusting the iframe
>> height according to the content.
> In recent browsers that support cross-document messaging (Opera 9,
> Safari 3, Firefox 3 and IE 8), you could include a script within the
> comment page that calculates its own height and sends a message to the
> parent page with the info.  In older browsers, just set the height to a
> reasonable minimum and let the user scroll.  Sure, it's not perfect, but
> it's called graceful degradation.
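
The height-reporting idea quoted above might look roughly like this. It is a
sketch only: the function names, the "comment-height" message type and the
origins in the usage note are assumptions made for illustration, and it
assumes a current browser where postMessage accepts structured data.

```javascript
// Runs inside the framed comment page; `win` is that page's window object.
function reportHeight(win, parentOrigin) {
  const height = win.document.documentElement.scrollHeight;
  // Name the parent origin explicitly instead of broadcasting with "*".
  win.parent.postMessage({ type: "comment-height", height: height }, parentOrigin);
}

// Runs in the parent page; returns a handler for the "message" event.
function makeHeightHandler(iframe, commentOrigin) {
  return function (event) {
    if (event.origin !== commentOrigin) return;         // wrong sender origin
    if (event.source !== iframe.contentWindow) return;  // not this frame
    const data = event.data;
    if (!data || data.type !== "comment-height") return;
    const height = Number(data.height);
    if (!Number.isFinite(height) || height <= 0) return;
    iframe.style.height = Math.ceil(height) + "px";
  };
}
```

In the comment page one would call reportHeight(window,
"https://blog.example") after load, and in the parent page
window.addEventListener("message", makeHeightHandler(frame,
"https://comments.example")). Browsers without postMessage keep whatever
fixed height the iframe was given.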

Much more difficult to implement than a <sandbox></sandbox> mechanism
- and I do not see the point of giving more work to web developers when
it could be fixed so easily.

>> 3: If you have a page that lists 60 comments on a blog, then the user
>> agent would have to contact the server 60 times to fetch each comment.
>> This again means that perl/php scripts have to be invoked 60 times for
>> one page view - that is 61 separate database connections and session
>> initializations.
> You could always concatenate all of the comments into a single file,
> reducing it down to 1 request.

No you could not, if you for example want people to report comments or
give them votes - which in the Web 2.0 world requires scripting.

[snip]

>> If HTML 5 browsers require everything between <htmlarea></htmlarea> to
>> be html entity escaped, that is < and > must be replaced with &lt; and
>> &gt; respectively. If this is not done, HTML 5 browsers will issue a
>> severe warning and refuse to display the page. Developers will quickly
>> learn.
>
> Draconian error handling is something we really want to avoid,
> particularly when such an error can be triggered by failing to handle
> user generated content properly.

I see that argument. Maybe you have a suggestion for what should happen
if unescaped HTML is encountered, then?

>> HTML 4 browsers will never run scripts (since it will only see plain
>> text). HTML 5 browsers will display rich text. It would be completely
>> secure for both HTML 4 and HTML 5 browsers.
>>
>> A simple Javascript could clean up the HTML markup for HTML 4 browsers.
>
> In a separate mail, you wrote:
>> <data>
>> <user supplied input>
>> </data>
>>
>> Then this will be secure both for HTML 4 and HTML 5 browsers. HTML 4
>> browsers will display html, while HTML 5 browsers will display
>> correctly formatted code. A simple javascript like this (untested)
>> would make the data tags readable for HTML 4 browsers:
>>
>> var els = document.getElementsByTagName("DATA");
>> for(e in els) els[e].innerHTML =
>> els[e].innerHTML.replace(/<[^>]*>/g, "").replace(/\n/g,
>> "<br>");
>
> At first, I had no idea what that script was trying to do.  But AFAICT,
> you were trying to use this regex: /<[^>]*>/g, which would theoretically
> match "<foo>".  But, in this context, even with the corrected regex, the
> script is entirely useless.

Yes, but you are taking attention away from the point. No matter how
many bugs there are in my *untested* script, my method will be
completely safe in all browsers. Please comment on that instead.
Creating a working script would take me an hour max.

> It wouldn't work, for example, with <foo bar=">" baz="xxx">.  But also
> because the inner HTML that you're running the regex on is supposed to
> have all < and > escaped, and so nothing would be matched anyway.

This only means that if people insert invalid HTML then their message
will look ugly - it is not a security issue. If people insert actual
real life HTML then the script would generate nice looking plain text
without html markup (as long as the script is modified to use &lt; and
&gt; instead of < and > in the regexp).
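
A rough sketch of such a fallback, matching the escaped forms &lt; and &gt;
rather than literal angle brackets. This is not the script from the earlier
mail; the function name is hypothetical, and it assumes the string passed
in is the entity-escaped text as innerHTML would return it.

```javascript
// `escaped` is entity-escaped text, e.g. "&lt;b&gt;Red&lt;/b&gt;".
function stripEscapedTags(escaped) {
  return escaped
    .replace(/&lt;.*?&gt;/g, "")  // drop anything that looks like an escaped tag
    .replace(/\n/g, "<br>");      // keep the writer's line breaks
}

// Plain markup comes out as readable text:
//   stripEscapedTags("&lt;b&gt;Red&lt;/b&gt;\nline two")  ->  "Red<br>line two"
// A ">" inside an attribute value ends the match early and leaves debris:
//   stripEscapedTags('&lt;img alt="&gt;" src="next.png"&gt;')
//     ->  '" src="next.png"&gt;'
```

The second case is the <foo bar=">" baz="xxx"> objection in practice: the
output is ugly, but since nothing is unescaped it stays plain text.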

>> A problem with this approach is that developers might forget to escape
>> tags, therefore I think browsers should display a security warning
>> message if the character < or > is encountered inside a <data> tag.
> If a developer forgot to escape the markup at all, then a user could enter
> "</data><script>...</script>" and do anything they wanted.

Yes, that is my point. That is why I want the sandbox to display a
severe security warning if the developer has forgotten to escape it.
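
For reference, the escaping step the developer must not forget could be as
small as the sketch below. The function names are hypothetical and the
<data> element is the one proposed in this thread, not an existing feature.

```javascript
// Escape user input so that it cannot close the surrounding element.
function escapeForData(userInput) {
  return String(userInput)
    .replace(/&/g, "&amp;")   // must run first, or later entities get mangled
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap already-escaped input in the proposed <data> element.
function wrapInData(userInput) {
  return "<data>\n" + escapeForData(userInput) + "\n</data>";
}
```

With this in place, an input of "</data><script>alert(1)</script>" reaches
the page as text beginning with &lt;/data&gt;, so the only literal </data>
in the output is the one the server wrote.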

This method will be safe for all browsers that have ever existed and
that will ever exist in the future. If new features are introduced in
some future version of CSS or HTML - the sandbox is still there and
the applications created today do not need to have their sanitizers
updated, ever.