[whatwg] The problem of duplicate ID as a security issue

Tue Mar 14 21:52:28 PST 2006

On Wed, 15 Mar 2006 02:36:51 +0600, Mihai Sucan <mihai.sucan at gmail.com>  
wrote:

>> To access the nodes inside sandboxes, the script in the parent document  
>> can either "manually" traverse the DOM tree or do the following: first  
>> find all relevant elements in the main document (starting from the root  
>> node), then find all sandboxes with getElementsByTagName() (which  
>> doesn't dive inside sandboxes, but is able to return the sandboxes  
>> themselves), then continue recursively from each sandbox found. This  
>> involves somewhat more coding work, but I expect that finding all  
>> matching elements across sandbox boundaries will be a significantly more  
>> unusual task than finding elements in the parent document (outside  
>> sandboxes) or within a given sandbox.

> Yes, I saw Ric's reply. A nice suggestion, but that implies <sandbox> is  
> a documentElement by itself, or is it a DOMSandbox needing to be defined?

Sandboxes are quite special things, so we'll need a DOMSandbox anyway. But  
instead of adding things like getElementById() to the DOMSandbox  
interface, I'd rather make the "fake document" which is visible from inside  
the sandbox a member of the sandbox itself. The call will look like  
sandbox.document.getElementById().
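
To make the two-step search from the quoted paragraph concrete, here is a rough sketch. It is not a real browser API: plain objects stand in for DOM nodes, and node(), searchScope() and searchAcrossSandboxes() are names I made up for illustration. The only assumption taken from the proposal is that a search in the parent scope returns sandbox elements themselves but does not descend into them.

```javascript
// Toy model: plain objects stand in for DOM nodes (hypothetical, not a browser API).
function node(tag, id, children = []) {
  return { tag, id, children };
}

// Search the descendants of scopeRoot, stopping at sandbox boundaries.
// Sandboxes are reported as elements but their contents are not entered.
function searchScope(scopeRoot, matches) {
  const found = [];
  const sandboxes = [];
  const visit = (n) => {
    for (const child of n.children) {
      if (matches(child)) found.push(child);
      if (child.tag === "sandbox") {
        sandboxes.push(child); // visible from outside, opaque inside
      } else {
        visit(child);
      }
    }
  };
  visit(scopeRoot);
  return { found, sandboxes };
}

// Cross-boundary search: search this scope, then continue from each sandbox found.
function searchAcrossSandboxes(scopeRoot, matches) {
  const { found, sandboxes } = searchScope(scopeRoot, matches);
  for (const sandbox of sandboxes) {
    found.push(...searchAcrossSandboxes(sandbox, matches));
  }
  return found;
}

const page = node("body", "page", [
  node("p", "intro"),
  node("sandbox", "comment-1", [
    node("p", "intro"), // duplicate ID, but confined to the sandbox
    node("sandbox", "nested", [node("p", "deep")]),
  ]),
]);

const isParagraph = (n) => n.tag === "p";
const outside = searchScope(page, isParagraph).found.map((n) => n.id);
const everywhere = searchAcrossSandboxes(page, isParagraph).map((n) => n.id);
console.log(outside);    // [ 'intro' ]
console.log(everywhere); // [ 'intro', 'intro', 'deep' ]
```

The common case (searching only the parent document) stays a single call; only the unusual cross-boundary case pays for the recursion.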

>> This is true, but there is a problem with the whitelisting approach:  
>> the set of elements and attributes isn't in one-to-one correspondence  
>> with the set of browser features. For example, one can't define a set  
>> of elements and attributes which must be removed to prohibit scripting:  
>> it's not enough to just remove <script> elements and on* attributes,  
>> one must also check attributes which contain URIs to filter out  
>> "javascript:". (I know it's a bad example because one would need to  
>> convert javascript: to safe-javascript: anyway, but you've got the idea,  
>> right?)
>>
>> While filtering the DOM tree by the HTML cleaner is easy, it approaches  
>> the problem from the syntax point of view, not semantic. It's more  
>> robust to write something like <sandbox scripting="disallow"> to  
>> disallow all scripting within the sandbox, including any obscure or  
>> future flavors of scripts as well as those enabled by proprietary  
>> extensions (like MSIE's "expression()" in CSS). Browser developers know  
>> better what makes "all possible kinds of scripts" than the web  
>> application developers.
>>
>> Likewise, other browser features are better controlled explicitly ("I  
>> want to disable all external content within this sandbox") than by  
>> filtering the DOM tree. At least because not all new features, like new  
>> ways to load external content, come with new elements or attributes  
>> which aren't on the whitelist. Some features reuse existing syntax in  
>> elegant ways.
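
A small sketch of why the syntactic approach leaks. The cleaner below is a deliberately naive, hypothetical one (naiveClean() and scriptUris() are illustrative names, and elements are plain objects, not real DOM nodes): it removes <script> elements and on* attributes and nothing else.

```javascript
// Hypothetical naive cleaner: drops <script> elements and on* attributes only.
function naiveClean(el) {
  const attrs = {};
  for (const [name, value] of Object.entries(el.attrs)) {
    if (!name.toLowerCase().startsWith("on")) attrs[name] = value;
  }
  const children = el.children
    .filter((c) => c.tag.toLowerCase() !== "script")
    .map((c) => naiveClean(c));
  return { tag: el.tag, attrs, children };
}

// Reports attributes whose value still carries a javascript: URI.
function scriptUris(el, hits = []) {
  for (const [name, value] of Object.entries(el.attrs)) {
    if (/^\s*javascript:/i.test(value)) hits.push(`${el.tag}[${name}]`);
  }
  el.children.forEach((c) => scriptUris(c, hits));
  return hits;
}

const userHtml = {
  tag: "div", attrs: {}, children: [
    { tag: "script", attrs: {}, children: [] },
    { tag: "img", attrs: { onerror: "steal()", src: "x.png" }, children: [] },
    { tag: "a", attrs: { href: "javascript:steal()" }, children: [] },
  ],
};

const cleaned = naiveClean(userHtml);
const leftovers = scriptUris(cleaned);
console.log(leftovers); // [ 'a[href]' ]
```

The cleaner did exactly what it was told, and a script still got through. Every new place where a script can hide needs another rule like this, which is why I'd rather state the intent once on the sandbox and let the browser enforce it.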

> Again, good point, but this is not entirely related to "duplicate ID as  
> a security issue". Meaning, you are advocating for the <sandbox>  
> element. That's something I also do, depending on the way it's going to be  
> defined (of course).

Yes, really. I actually gave the wrong subject to the thread. It should  
have been titled "Sandboxing can make contained HTML harmless in more ways  
than just isolating scripts".

> The <sandbox> element would make securing a web application from common  
> security holes and other pitfalls much easier and more elegant. Of course, it  
> would also solve the duplicate IDs issue.

Actually, now it seems the only solution to me because, as you say below,  
the behavior on duplicate IDs cannot be changed to a safe way without  
breaking backward compatibility.

> I have to somewhat disagree with this, because blogs, CMS and wiki  
> applications must provide the scripts, the "toys" in a WYSIWYG  
> environment. Those can be secured by the application authors in a proper  
> way, and user-scripts should not be allowed. Table sorting, popup menus  
> and similar are all toys. Does Wikipedia allow full-scripting access? I  
> believe they allow access to some toys only.

They don't provide any JavaScript toys. I hope they'll do it someday.

>> Returning to the duplicate IDs, I think we should define some standard  
>> behavior for getElementById() when there is more than one element with  
>> the given ID. To lower the possible extent of duplicate ID attacks, I  
>> propose that getElementById() should throw an exception in that case.  
>> It's better to crash the script than to make it do what the attacker  
>> wants.

> Bad idea. I've just worked with a guy on a web application done the  
> "industrial way" (as in "get it done ASAP, no matter how"). This was  
> done entirely with copy/pasted frameworks with Java on the server-side,  
> DOJO client-side and some more frameworks (5 to 10!!!). It was horrible:  
> many duplicate IDs, slowly loading ("web 2.0 ready with AJAX"), etc. I  
> was amazed it even worked :). The guy wasn't fully aware of the "behind  
> the scenes" (he didn't even see how badly the generated DOM looks in the  
> browser).
>
> Point is, web applications currently do rely on duplicate IDs support.  
> Throwing errors (thus breaking scripts) also badly breaks backwards  
> compatibility. That web application is not the only one having such  
> badly coded backend, it's one of many (look at most corporate web sites  
> done in "a snap" by "gurus").

Well, if browsers did throw exceptions on duplicate IDs, there wouldn't be  
any duplicate IDs in existing applications. The problem is that there are  
already such applications.

(A wild thought: maybe enforce ID uniqueness only for <!DOCTYPE html>?)
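
Roughly what I have in mind, as a sketch only. getElementByIdChecked() and its `strict` option are hypothetical names; `strict` stands for "the page opted in", for example by using <!DOCTYPE html>. The non-strict branch assumes that the first match wins, which is only one of the behaviors a browser might have today.

```javascript
// Toy lookup over a flat list of { id } records; not the real DOM API.
// `strict` is a hypothetical opt-in flag (e.g. implied by <!DOCTYPE html>).
function getElementByIdChecked(elements, id, { strict = false } = {}) {
  const hits = elements.filter((el) => el.id === id);
  if (strict && hits.length > 1) {
    // Better to stop the script than to hand it the attacker's element.
    throw new Error(`duplicate ID "${id}" (${hits.length} elements)`);
  }
  // Assumed legacy behavior: silently return the first match.
  return hits.length > 0 ? hits[0] : null;
}

const elements = [
  { id: "login", owner: "attacker comment" },
  { id: "login", owner: "site template" },
];

const legacy = getElementByIdChecked(elements, "login");
console.log(legacy.owner); // attacker comment

let strictError = null;
try {
  getElementByIdChecked(elements, "login", { strict: true });
} catch (e) {
  strictError = e.message;
}
console.log(strictError); // duplicate ID "login" (2 elements)
```

Existing pages never set the flag and keep working as badly as before; only pages that opt in get the exception.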

>> For these applications, user-supplied JavaScript is highly demanded,  
>> and it can't be fulfilled by a limited set of predefined JavaScript  
>> toys.
>>
>> They also need IDs for navigational purposes.

> Predefined toys are enough. It's almost useless to allow scripts to run  
> in a sandboxed "frame-like" environment: in your blog article, without  
> being able to interact with the page navigation (which is outside the  
> sandbox), and do other stuff.

Someone could post a JavaScript game in his blog, a horoscope calculator  
etc.

And, by the way, blog entries aren't the only place where sandboxing can  
be applied in blogs. For example, LiveJournal allows user-defined journal  
styles which are written by the users in a self-invented programming  
language which outputs HTML. That HTML goes through the HTML cleaner  
afterwards, of course. Many people would love to add dynamic menus, AJAX  
comment folding, etc. to their styles. This could be partly solved with a  
set of predefined "toys", but actually the entire LiveJournal styling  
system is about user-initiated development. Those with programming skills  
write new styles, and other users may take and use them.

-- Opera M2 9.0 TP2 on Debian Linux 2.6.12-1-k7
* Origin: X-Man's Station at SW-Soft, Inc. [ICQ: 115226275]  
<alexey at feldgendler.ru>