[whatwg] The problem of duplicate ID as a security issue

Wed May 30 22:28:58 PDT 2007

On Fri, 10 Mar 2006, Mihai Sucan wrote:
> On Fri, 10 Mar 2006, Alexey Feldgendler <alexey at feldgendler.ru> wrote:
> >
> > Another solution may be to define functions like getElementById(), 
> > getElementsByTagName() etc so that they don't cross sandbox boundaries 
> > during their recursive search, at least by default. (If the sandbox 
> > proposal makes it to the spec, of course.)

I don't see us using a sandboxing system that isn't based on browsing 
contexts, in which case this is moot.

> This is something I'd opt for. But ... this would be really bad, since 
> the spec would have to change the way getElementBy* functions work. It's 
> bad because you shouldn't make a spec that breaks other specs you rely 
> upon (this has probably already been done in this very spec).

True, we should avoid that where possible. (Sometimes, e.g. the MIME type 
sniffing stuff, we are constrained by the legacy implementations.)

On Mon, 13 Mar 2006, Mihai Sucan wrote:
> 
> Yes... but there's a need for allowing the parent document to control 
> sandboxed content. Therefore, it needs a new parameter, for example: 
> getElementById(string id, bool search_in_sandbox). Isn't that changing 
> the getElementById function? Of course this is only one way; it could 
> probably be done differently, without changing the function(s).

This presumably wouldn't be needed with browsing context based sandboxes.

> As for scripting, if there's any user wanting to post his/her script in 
> a forum, then that's a problem. I wouldn't ever allow it (except 
> probably for research purposes, such as "how users act when they are 
> given all power" :) ).

Indeed.

On Tue, 14 Mar 2006, Mihai Sucan wrote:
> 
> I've made a short "investigation" regarding how browsers behave with
> document.getElementById('a-duplicate-ID').
> 
> The page:
> http://www.robodesign.ro/_gunoaie/duplicate-ids.html
> 
> Take a close look at the source (I've provided comments) to understand
> what the "Click me" test does and what it shows. You'll see that the major
> browsers I've tested behave the same: as with a queue, the last node that
> sets the duplicate ID is also the node that's returned when you use
> the getElementById function.

This seems to be off the grid now. Is there a copy I can look at 
somewhere?

On Wed, 15 Mar 2006, Alexey Feldgendler wrote:
>
> Unfortunately we can't change it in a backwards-compatible way (though 
> we probably can define a stricter behavior for <!DOCTYPE html> only).

Generally we want to avoid adding any more processing modes.

On Tue, 14 Mar 2006, Alexey Feldgendler wrote:
> 
> This is true, but there is a problem with the whitelisting approach: the 
> set of elements and attributes isn't in one-to-one correspondence with 
> the set of browser features. For example, one can't define a set of 
> elements and attributes which must be removed to prohibit scripting: 
> it's not enough to just remove <script> elements and on* attributes, one 
> must also check attributes which contain URIs to filter out 
> "javascript:".

You must also white-list attribute values, indeed. And this would mean 
checking URI syntax (for instance) and whitelisting URI schemes.
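
To make the attribute-value point concrete, here is a minimal sketch of 
a scheme whitelist check in Python. The names (ALLOWED_SCHEMES, 
is_allowed_uri) and the policy itself are hypothetical, not anything 
from the spec or this thread. It also assumes the attribute value has 
already been entity-decoded by the HTML parser, and it is not a complete 
cleaner.

```python
import re

# Hypothetical policy: only these schemes may appear in URI attributes.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# A scheme is a letter followed by letters, digits, "+", "." or "-",
# terminated by a colon.
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

# Spaces and control characters, which some browsers ignore inside a
# scheme (e.g. "java\tscript:"), so they are removed before checking.
_IGNORED_RE = re.compile(r"[\x00-\x20]+")


def is_allowed_uri(value: str) -> bool:
    """Return True if the URI has no scheme or a whitelisted one."""
    cleaned = _IGNORED_RE.sub("", value)
    match = _SCHEME_RE.match(cleaned)
    if match is None:
        # No scheme: treated as a relative reference in this sketch.
        return True
    return match.group(1).lower() in ALLOWED_SCHEMES
```

Note that this whitelists rather than blacklists: "javascript:" is 
rejected because it is not on the list, and so is any scheme nobody 
has thought of yet.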

> While filtering the DOM tree by the HTML cleaner is easy, it approaches 
> the problem from a syntactic point of view, not a semantic one. It's more 
> robust to write something like <sandbox scripting="disallow"> to 
> disallow all scripting within the sandbox, including any obscure or 
> future flavors of scripts as well as those enabled by proprietary 
> extensions (like MSIE's "expression()" in CSS). Browser developers know 
> better what makes "all possible kinds of scripts" than the web 
> application developers.

Indeed, sandboxing (probably using <iframe>) is something we'll look at.

> Returning to the duplicate IDs, I think we should define some standard 
> behavior for getElementById() when there is more than one element with 
> the given ID. To lower the possible extent of duplicate ID attacks, I 
> propose that getElementById() should throw an exception in that case. 
> It's better to crash the script than to make it do what the attacker 
> wants.

We can't make it raise an exception; pages depend on this already.

I did some research a few months back, and in a sample of several billion 
documents, 13% had duplicate IDs. 13%!
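
For anyone who wants to check their own pages, a rough sketch of 
duplicate-ID detection using Python's standard html.parser follows. 
This is only an illustration, not the tooling used for the survey 
above; IdCollector and duplicate_ids are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Counts how often each id attribute value occurs in a document."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased
        # by the parser, and value is None for valueless attributes.
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(markup):
    """Return {id: count} for every id used more than once."""
    collector = IdCollector()
    collector.feed(markup)
    collector.close()
    return {key: n for key, n in collector.ids.items() if n > 1}
```

For example, duplicate_ids('<p id="a"></p><div id="a"></div>') reports 
that "a" is used twice.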

On Thu, 16 Mar 2006, Mihai Sucan wrote:
> > 
> > I don't.  getElementById is already defined and implemented to deal 
> > with duplicate IDs, there's no need to redefine it in a way that isn't 
> > backwards compatible with existing sites.
> 
> Yes, getElementById is already defined to deal with duplicate IDs by 
> returning null, in DOM Level 3 Core [1]. In DOM Level 2 Core [2], the 
> behaviour is explicitly undefined in this case ("behavior is not defined 
> if more than one element has this ID").
> 
> Yet, the implementations (major User Agents: Opera, Gecko, Konqueror and 
> IE) are the problem, actually. These do not return null, they return the 
> last node which set the ID. That's a problem with security implications, 
> as stated by Alexey in the message starting this thread.

Defining something that doesn't match what pages rely on clearly isn't 
going to work (since the browser vendors would just ignore us). DOM3 Core 
is being ignored, and we should change it.

On Thu, 16 Mar 2006, Mihai Sucan wrote:
> 
> True. Can it be changed? I believe not, since it's already a REC.

It can be changed. A REC that is ignored is worthless.

Various people wrote:
>
> [snip a lot of stuff that was in reply to other e-mails as opposed to 
> proposals for the spec]

I have omitted a bunch of stuff that didn't seem relevant. Please let me 
know if you think I skipped something in this thread that you wanted 
considered for the spec.

I've noted some things given in this thread for changes to DOM Core, 
in case that becomes something we need to look at.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'