[whatwg] The problem of duplicate ID as a security issue

Tue Mar 14 04:03:42 PST 2006

On Tue, 14 Mar 2006 02:09:27 +0600, Mihai Sucan <mihai.sucan at gmail.com>  
wrote:

>> No, it's not really a change in getElementBy* functions. Because there  
>> have been no sandboxes before HTML 5, no one can really expect that
>> these functions treat sandbox elements the same as all other elements.  
>> Well, sandboxes are "security barriers" by their nature, so it seems,  
>> at least to me, quite natural to have getElementBy* functions stop at  
>> them.

> Yes... but there's a need to allow the parent document to control
> sandboxed content. Therefore, it needs a new parameter, for example:
> getElementById(string id, bool search_in_sandbox). Isn't that changing
> the getElementById function? Of course this is only one way; it could
> probably be done differently, without changing the function(s).

To access the nodes inside sandboxes, the script in the parent document  
can either "manually" traverse the DOM tree or do the following: first
find all relevant elements in the main document (starting from the root
node), then find all sandboxes with getElementsByTagName() (which doesn't
dive inside sandboxes, but is able to return the sandboxes themselves),
then continue recursively from each sandbox found. This involves somewhat
more coding work, but I expect that finding all matching elements across
sandbox boundaries will be a significantly more unusual task than finding  
elements in the parent document (outside sandboxes) or within a given  
sandbox.
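
A rough sketch of the recursive approach just described. It rests on assumptions: since <sandbox> is only a proposal, nodes are modeled as plain objects ({tag, id, children}) rather than real DOM nodes, and the helper names findInScope and findAcrossSandboxes are hypothetical.

```javascript
// Hypothetical model: a node is { tag, id, children }. Not a real DOM API.

// Collect nodes matching `pred` within one scope. The search does not dive
// inside sandboxes, but sandboxes found in the scope are returned themselves.
function findInScope(root, pred) {
  const matches = [];
  const sandboxes = [];
  const visit = (node) => {
    for (const child of node.children || []) {
      if (pred(child)) matches.push(child);
      if (child.tag === "sandbox") {
        sandboxes.push(child); // stop at the sandbox boundary
      } else {
        visit(child);
      }
    }
  };
  visit(root);
  return { matches, sandboxes };
}

// Search the current scope first, then continue recursively from each
// sandbox found in it.
function findAcrossSandboxes(root, pred) {
  const { matches, sandboxes } = findInScope(root, pred);
  for (const sandbox of sandboxes) {
    matches.push(...findAcrossSandboxes(sandbox, pred));
  }
  return matches;
}

const doc = {
  tag: "body", children: [
    { tag: "p", id: "intro", children: [] },
    { tag: "sandbox", children: [
      { tag: "p", id: "user-text", children: [] },
      { tag: "sandbox", children: [{ tag: "p", id: "nested", children: [] }] },
    ] },
  ],
};

const isP = (n) => n.tag === "p";
console.log(findInScope(doc, isP).matches.map((n) => n.id));   // only "intro"
console.log(findAcrossSandboxes(doc, isP).map((n) => n.id));   // all three ids
```

The common case (searching only the parent document, or only one given sandbox) needs just findInScope; the recursive wrapper is the extra work for the rarer cross-boundary search.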

>> Yes, I know, and I think it's wrong. The spec should make <strong>  
>> harmless, at least inside a sandbox.

> How can it do so? Disallowing IDs, class names, ...? Or by changing the  
> way getElement(s)By* work?

I hope that defining getElement(s)By* to not cross sandbox boundaries will
do the job.

>> CSS has properties that can be used to fit user-supplied content into a  
>> box and make it sit there quietly ("overflow: hidden" etc). The user  
>> can make whatever mess he wants of his own blog entry or whatever but  
>> it won't harm the rest of the page.

> I'm not sure this works in all cases. I haven't tested because I've  
> never been in the position of allowing such user-supplied content in  
> pages and "sandboxing" the user-styled content.

Anyway, even if there are cases when "sandbox {overflow: hidden}" is not
enough, the possible damage from misplaced content that visually
"jumps" out of the sandbox is an order of magnitude smaller than the damage
from the exploit shown in my original message. It's more important to
handle the latter.

A side note: it may help to specify a set of default styling rules for the  
sandbox element so that it doesn't allow visual leakage of content.

>>> The spec can't do much in these situations. Shall the spec provide a  
>>> way for CSS files to *not* be applied in <sandbox>ed content?

>> *:not(sandbox) p { text-align: left; }

> Yes, very interesting. I was aware of this, but I forgot about it.
>
> This would be better used coupled with a suggestion made in a thread  
> "styling the unstylable" (on www-style): style-blocks.

Sorry, I must have completely missed that thread... Can you give me the  
link?

>> Well, of course plain text is the safest. But many applications require  
>> formatting markup in user-supplied text. Some applications don't try to  
>> deal with the security pitfalls of HTML and invent their own markup  
>> syntax (e.g. BBcode). However, there are two things wrong about these:

> Many applications... the only ones I can currently think of are
> WYSIWYG editors, discussion forums and all those sites which provide  
> user-comments (blogs, image galleries, etc).

Wikis are a somewhat outstanding example. These traditionally use custom  
markup languages (mainly to make hyperlinking easier), but many of them,  
like MediaWiki, allow a subset of HTML as well. (MediaWiki uses the  
"whitelist" approach, but it seems to be at least theoretically vulnerable  
to the duplicate ID trick.)

> Most of all these applications, if not all, could allow the HTML  
> counter-parts (instead of inventing BBcode, or some other custom  
> markup), but removing all attributes except those allowed (white list,  
> not a black list of attributes). I'd say it would be easier to  
> implement, given the fact server-side technologies provide HTML and XML  
> parsers, hence the manipulation of "user documents" would be easier and  
> faster too (the parsers are usually much faster than unoptimized regular  
> expression matching, string parsing, ... coded by "average" web  
> authors). Removal of unallowed tags and attributes is trivial.

This is true, but there is a problem with the whitelisting approach: the  
set of elements and attributes isn't in one-to-one correspondence with the  
set of browser features. For example, one can't define a set of elements
and attributes which must be removed to prohibit scripting: it's not  
enough to just remove <script> elements and on* attributes, one must also  
check attributes which contain URIs to filter out "javascript:". (I know  
it's a bad example because one would need to convert javascript: to
safe-javascript: anyway, but you've got the idea, right?)
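
To illustrate the point, here is a minimal sketch of a whitelist cleaner. Everything in it is an assumption made for the example: the node model ({tag, attrs, children}, with strings as text nodes), the particular whitelists, and the names clean and isSafeUri. Note that the element and attribute whitelists alone would let the first link through; a separate rule for URI values is needed.

```javascript
// Hypothetical whitelists; a real application would choose its own.
const ALLOWED_TAGS = new Set(["p", "strong", "em", "a", "ul", "ol", "li"]);
const ALLOWED_ATTRS = new Set(["href", "title"]);
const URI_ATTRS = new Set(["href"]);
const SAFE_SCHEMES = new Set(["http:", "https:", "mailto:"]);

// Accept relative URIs and URIs whose scheme is on the safe list.
function isSafeUri(value) {
  // Browsers may tolerate whitespace and control characters inside a scheme,
  // so strip them before looking at it (a conservative assumption).
  const cleaned = value.replace(/[\u0000-\u0020]/g, "");
  const m = /^([a-z][a-z0-9+.-]*:)/i.exec(cleaned);
  if (!m) return true; // no scheme: treated as a relative URI
  return SAFE_SCHEMES.has(m[1].toLowerCase());
}

// Return a cleaned copy of the node, or null if the element is not allowed.
function clean(node) {
  if (typeof node === "string") return node; // text node
  if (!ALLOWED_TAGS.has(node.tag)) return null;
  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    const key = name.toLowerCase();
    if (!ALLOWED_ATTRS.has(key)) continue;                  // drops on* etc.
    if (URI_ATTRS.has(key) && !isSafeUri(value)) continue;  // drops javascript:
    attrs[key] = value;
  }
  const children = (node.children || []).map(clean).filter((c) => c !== null);
  return { tag: node.tag, attrs, children };
}

const input = { tag: "p", attrs: { onclick: "steal()" }, children: [
  { tag: "a", attrs: { href: "JaVa\tScript:steal()" }, children: ["click"] },
  { tag: "a", attrs: { href: "https://example.org/" }, children: ["ok"] },
  { tag: "script", attrs: {}, children: ["steal()"] },
] };

console.log(JSON.stringify(clean(input), null, 2));
```

Even this sketch only covers the script vectors its author happened to think of, which is exactly the weakness of the approach.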

While filtering the DOM tree with an HTML cleaner is easy, it approaches
the problem from a syntactic point of view, not a semantic one. It's more robust
to write something like <sandbox scripting="disallow"> to disallow all  
scripting within the sandbox, including any obscure or future flavors of  
scripts as well as those enabled by proprietary extensions (like MSIE's  
"expression()" in CSS). Browser developers know better than web
application developers what constitutes "all possible kinds of scripts".

Likewise, other browser features are better controlled explicitly ("I want
to disable all external content within this sandbox") than by filtering
the DOM tree, if only because not all new features, like new ways to load
external content, come with new elements or attributes which aren't on
the whitelist. Some features reuse existing syntax in elegant ways.

> Also, the aforementioned applications are not currently required to  
> allow user-supplied tags to contain IDs, class names, scripting and/or  
> styling.

IDs are useful to make anchors for navigation to sections of the page, and  
class names are useful to style the content in uniformity with the rest
of the site (for example, Wikipedia's skins define the class "wikitable"  
to make user tables look the same throughout the site). These two features  
are good for the web. Taking them away for security reasons would lower  
the quality of the web content. For example, if Wikipedia disallowed the  
class attribute, then each such table would have to bear physical  
formatting attached to it, which is a step backward.

Of course, comments on forums don't need these features. But I'm talking  
more of your "grade 2" applications.

> I know you are now thinking of WYSIWYG editors ("they must allow users  
> to style their documents"). True. These web applications must also  
> provide "WYSIWYG" editing capabilities for CSS, they can't expect  
> average Jane and Joe to know CSS. Therefore, the list of class names is  
> already known to the WYSIWYG editor, which can easily check the class=
> attribute to allow *just* those class names (using the aforementioned
> parsers and server/client-side DOM manipulation authors can easily limit  
> the list of class names allowed). All the same goes for IDs.

Seems reasonable, though for IDs to be used as navigation anchors there is  
some user inconvenience introduced.
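
The class-name check from the quoted paragraph might look something like the sketch below. The allowed list (apart from "wikitable", mentioned above) and the function name filterClassAttribute are made up for illustration.

```javascript
// Hypothetical list of class names the editor knows about.
const ALLOWED_CLASSES = new Set(["wikitable", "note", "quote"]);

// Keep only the whitelisted names from a class attribute value.
function filterClassAttribute(value) {
  return value
    .split(/\s+/)
    .filter((name) => ALLOWED_CLASSES.has(name))
    .join(" ");
}

console.log(filterClassAttribute("wikitable admin-panel  note"));
```

The same shape does not carry over to IDs used as anchors, because anchor names are chosen by the user and cannot be listed in advance.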

> As for scripting, if there's any user wanting to post his/her script in  
> a forum, then that's a problem. I wouldn't ever allow it (except  
> probably for research purposes, such as "how users act when they are  
> given all power" :) ).

Scripting isn't useful for forum posts, but it is useful in  
blogs/CMS/wikis, mainly because today's HTML sucks. People want things  
like collapsible sections, popup menus, tables with changeable sort order  
etc. (Some of these tasks won't require scripting according to WA1).

>> I've mentioned it in the original message. Though I find it too strict  
>> to strip all id and class attributes from user-supplied text. They  
>> usually do more good than bad.

> I don't. It's not too strict at all. I actually find it very loose to  
> allow these specific attributes. They should be allowed *only* when  
> there are real requirements (especially IDs).

Navigational anchors are a real use case for IDs.

Classes have many use cases, the primary being to use semantic rather than
presentational formatting. Another harmless but useful way to apply
classes is the so-called microformats (see http://microformats.org/).

>>> As Mikko said "allowing random user input with possibility to use user  
>>> supplied scripting is next to impossible to make secure".

>> That's what I'm trying to do, and I'm not yet convinced that it's  
>> impossible. This is a hard task but I believe it's what the web needs.

> Yes, this is good. Web-based viruses don't yet exist, but it's only a  
> matter of time.

Java applets have existed for many years, but there aren't any viruses
distributed this way. The framework for the Java applets is so  
well-defined that it's just not possible.

> Do you have any other ideas how to do so? In regards to the duplicate IDs
> issue.

Returning to the duplicate IDs, I think we should define some standard  
behavior for getElementById() when there is more than one element with the  
given ID. To lower the possible extent of duplicate ID attacks, I propose  
that getElementById() should throw an exception in that case. It's better  
to crash the script than to make it do what the attacker wants.

The above only applies to the case when both elements with duplicate IDs  
are in the same space (outside of any sandboxes, for instance). I still  
think that functions that find DOM nodes should not cross sandbox  
boundaries.
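
A minimal sketch of the proposed behavior, under the same assumptions as any illustration of <sandbox>: nodes are plain objects ({tag, id, children}), and getElementByIdStrict is a hypothetical name, not an existing DOM function.

```javascript
// Look up an ID within one scope. Throws if the ID occurs more than once in
// that scope; never looks inside a sandbox nested in the scope.
function getElementByIdStrict(scopeRoot, id) {
  let found = null;
  const visit = (node) => {
    for (const child of node.children || []) {
      if (child.id === id) {
        if (found !== null) {
          throw new Error(`Duplicate ID "${id}" in the same scope`);
        }
        found = child;
      }
      // The sandbox element itself belongs to this scope; its content doesn't.
      if (child.tag !== "sandbox") visit(child);
    }
  };
  visit(scopeRoot);
  return found;
}

// A duplicate hidden inside a sandbox is invisible to the parent's lookup.
const page = { tag: "body", children: [
  { tag: "form", id: "login", children: [] },
  { tag: "sandbox", children: [{ tag: "div", id: "login", children: [] }] },
] };
console.log(getElementByIdStrict(page, "login").tag); // "form"

// A duplicate in the same scope makes the lookup fail loudly.
const broken = { tag: "body", children: [
  { tag: "div", id: "x", children: [] },
  { tag: "span", id: "x", children: [] },
] };
try {
  getElementByIdStrict(broken, "x");
} catch (e) {
  console.log(e.message);
}
```

Detecting a duplicate means scanning the whole scope instead of stopping at the first match, which is a cost a real implementation would have to weigh.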

>> BTW, my original message shows an exploit which is possible even if the  
>> HTML cleaner doesn't allow scripts.

> Yes, true. I wasn't even talking about allowing user-supplied scripts.  
> That's not even on the horizon of average Jane and Joe, in any of their
> wildest dreams about cutting-edge WYSIWYG editors :) - or, at least,  
> that shouldn't be.

I think that Jane and Joe probably dream of dynamic menus and collapsible  
sections. But, I agree, this could be handled server-side by providing a  
predefined set of JavaScript "toys".

> - grade 1
> Easy to use, easy to make ones: for blog comments, image gallery  
> comments, even forums.
>
> Scripting: none
> Styling: none
> Tags: p, strong, em, h1-h6, ol, ul, dl, li, dd, dt, ... (and similar)
> Attributes: whatever is "innocent", except IDs and anything the authors  
> consider problematic, including, but not limited to: class and style.

The problem is that you can't always tell what is innocent. Is <a  
href="..."> innocent? Yes, unless it references a javascript: URI.

The key word here is: "...anything the AUTHORS consider problematic". You  
mean the web application authors. But I think that it should be rather the  
browser authors' concern.

> - grade 2
> Full-blown ones: for blog articles, CMSs, ...
>
> Scripting: none
> Styling: yes
> Tags and attributes: same as grade 1, with the exception that these must
> allow class and style attributes.

For these applications, user-supplied JavaScript is in high demand, and
that demand can't be met by a limited set of predefined JavaScript toys.

They also need IDs for navigational purposes.

> - grade 3
> Web authoring tools: similar to NVU, Dreamweaver, ...
>
> Scripts, styling, tags and attributes: everything.
>
> Security concerns regarding scripting are eliminated in grade 1 and  
> grade 2 WYSIWYG editors, because you can't really expect average Jane  
> and Joe to want to do scripting for their articles and pages in CMSs.
> If they'd want, they'd make their own site "by hand".

They probably don't want to do "scripting", they just want these  
interactive things like tables with changeable sort order. If they were  
given the ability to use scripts in their articles, they would find a nice  
JavaScript through a search engine and paste it on the site.

> P.S. You have sent the reply only to me. I suppose it's by mistake  
> (nothing personal was in it). I have sent my reply to your email back to  
> WHATWG (I expect your future replies to also do so - it's a public  
> discussion).

You're right, I've hit the wrong button. Thanks.

-- Opera M2 9.0 TP2 on Debian Linux 2.6.12-1-k7
* Origin: X-Man's Station at SW-Soft, Inc. [ICQ: 115226275]  
<alexey at feldgendler.ru>