From whatwg at adambarth.com Tue Dec 1 00:14:09 2009
From: whatwg at adambarth.com (Adam Barth)
Date: Tue, 1 Dec 2009 00:14:09 -0800
Subject: [whatwg] updateWithSanitizedHTML (was Re: innerStaticHTML)
In-Reply-To: <3690291C-51B2-4A21-AF3D-E704FB85B9DA@apple.com>
References: <7789133a0911301555k60e48190k24df1fb23ec60406@mail.gmail.com>
 <37DDCFB0-030F-450A-95E7-476535238D4F@apple.com>
 <7789133a0911301832h79199a60q7487007b28a38b7d@mail.gmail.com>
 <3690291C-51B2-4A21-AF3D-E704FB85B9DA@apple.com>
Message-ID: <7789133a0912010014n5c9c599eseee250ec4c532243@mail.gmail.com>

Your main point is well taken. There are some technical reasons why
tag whitelisting makes more sense for inline content. For example,
consider the case you mentioned on webkit-dev: @id. Inline, @id is
problematic because the ids exist in a per-frame namespace, whereas
they're harmless when the untrusted content has an entire iframe to
itself.

Of course, @id is not unique in this respect. For example,
<input type="password"> will likely get autofilled by the password
manager inline and @style can be used to draw all over the page without
an iframe's layout constraints.

That said, I'm not married to a design with a tag-level whitelist. Do
you have a specific alternative in mind?

Adam


On Mon, Nov 30, 2009 at 7:43 PM, Maciej Stachowiak wrote:
>
> On Nov 30, 2009, at 6:32 PM, Adam Barth wrote:
>
>> On Mon, Nov 30, 2009 at 5:43 PM, Maciej Stachowiak wrote:
>>>
>>> 1) It seems like this API is harder to use than a sandboxed iframe. To
>>> use it correctly, you need to determine a whitelist of safe elements
>>> and attributes; providing an explicit whitelist at least of tags is
>>> mandatory. With a sandboxed iframe, as a Web developer you can just ask
>>> the browser to turn off unsafe things and not worry about designing a
>>> security policy.
>>> Besides ease of use, there is also the concern that a server-side
>>> filtering whitelist may be buggy, and if you apply the same whitelist
>>> on the client side as backup instead of doing something high level like
>>> "disable scripting" then you are less likely to benefit from defense in
>>> depth, since you may just replicate the bug.
>>
>> I should follow up with folks in the ruby-on-rails community to see
>> how they view their sanitize API.  The one person I asked had a
>> positive opinion, but we should get a bigger sample size.
>
> For server-side sanitization, this kind of explicit API is pretty much
> the only thing you can do.
>
>>
>> I think updateWithSanitizedHTML has different use cases than @sandbox.
>> I think the killer applications for @sandbox are advertisements and
>> gadgets.  In those cases, the developer wants most of the browser's
>> functionality, but wants to turn off some dangerous stuff (like
>> plug-ins).  For updateWithSanitizedHTML, the killer application is
>> something like blog comments, where you basically want text with some
>> formatting tags (bold, italics, and maybe images depending on the
>> forum).
>
> I can imagine use cases where allowing very open-ended but script-free
> content is desirable. For example, consider a hosted blog service that
> wants to let blog authors write nearly arbitrary HTML, but without
> allowing script. @sandbox would not be a good solution for that use case.
> In general it does not seem sensible to me that the choice of tag
> whitelisting vs high-level feature whitelisting is tied to the choice of
> embedding content directly vs. creating a frame. Is there a technical
> reason these two choices have to be tied?
>
>>
>>> 2) It seems like this API loses one of the big benefits of sanitizing
>>> HTML in the browser implementation. Specifically, in theory it's safe
>>> to say "allow everything except any construct that would result in
>>> script/code running".
>>> You can't do that on the server side - blacklisting is not sound
>>> because you can't predict the capabilities of all browsers. But the
>>> browser can predict its own capabilities. Sandboxed iframes do allow
>>> for this.
>>
>> The benefit is that you know you're getting the right parsing.  You're
>> not going to be tripped up by
>
> It's true, this is a benefit. However, it seems like even if you
> whitelist tags, being able to say "no script" at a high level
>
>> Also, this API is useful in cases where you don't have a server to help
>> you sanitize your input.  One example I saw recently was a GreaseMonkey
>> script that wanted to add EXIF metadata to Flickr.  Basically, the
>> script grabbed the EXIF data from api.flickr.com and added it to the
>> current page.  Unfortunately, that meant I could use this GreaseMonkey
>> script to XSS Flickr by adding HTML to my EXIF metadata.  Sure, there
>> are other ways of solving the problem (I asked the developer to build
>> the DOM in memory and use innerText), but you want something simple
>> for these cases.
>
> If the EXIF metadata is supposed to be text-only, it seems like
> updateWithSanitizedHTML would not be easier to use than innerText, or in
> any way superior. For cases where it is actually desirable to allow some
> markup, it's not clear to me that giving explicit whitelists of what is
> allowed is the simple choice.
>
>>
>>> I think the benefits of filtering by tag/attribute/scheme for advanced
>>> experts are outweighed by these two disadvantages for basic use,
>>> compared to something simple like the original innerStaticHTML idea.
>>> Another possible alternative is to express how to sanitize at a higher
>>> level, using something similar to sandboxed iframe feature strings.
>>
>> If you think of @sandbox as being optimized for rich untrusted content
>> and updateWithSanitizedHTML as being optimized for poor untrusted
>> content, then you'll see that's what the API does already.
>> The feature string Slashdot wants for its comments is
>> ("a b strong i em", "href"), but another message board might want
>> something different.  For example, 4chan might want ("img", "src alt").
>> I don't think these require particularly advanced experts to
>> understand.
>
> updateWithSanitizedHTML and @sandbox both provide features that the other
> does not for reasons that do not seem technically necessary. For example,
> updateWithSanitizedHTML could easily have an "allow everything except
> script" mode, and @sandbox could easily allow per-tag whitelisting. Then
> the choice would be between the resource cost of a frame, and the
> sandboxing features that it's impractical to provide without a frame
> (limiting content to a bounding box while still allowing styling,
> allowing script without affecting the containing content, etc).
>
>>
>>> Here's a problem that exists with both this API and also
>>> innerStaticHTML:
>>>
>>> 3) There is no secure and efficient way to append sanitized contents to
>>> an element that already has children. This may result in authors
>>> appending with innerHTML += (inefficient and insecure!) or
>>> insertAdjacentHTML() (efficient but still insecure!). I'm willing to
>>> concede that use cases other than "replace existing contents" and
>>> "append to existing contents" are fairly exotic.
>>
>> Maybe we need insertAdjacentSanitizedHTML instead or in addition.  ;)
>
> Perhaps. The verb "update" is generic enough that it could handle
> different kinds of mutations with flags, but perhaps that means it is too
> vague for a security-sensitive API.
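To make the tag and attribute whitelists quoted in this message concrete, such as ("a b strong i em", "href"), here is a rough sketch of what filtering by such a pair could mean. It is not the proposed API: `sanitizeTree` and the plain-object node shape are hypothetical, the input is assumed to be already parsed, and dropping a disallowed element together with its subtree is only one possible policy. It also does not check URL schemes, so a `javascript:` href would pass; the tag/attribute/scheme filtering mentioned in the thread would need that as well.

```javascript
// Hypothetical sketch, not the proposed updateWithSanitizedHTML API.
// Nodes are plain objects rather than DOM nodes:
//   text node:    { text: "..." }
//   element node: { tag: "b", attrs: { href: "..." }, children: [...] }
// Tag and attribute names are assumed to be lowercase already.
function sanitizeTree(nodes, allowedTags, allowedAttrs) {
  const tags = new Set(allowedTags.split(" "));
  const attrs = new Set(allowedAttrs.split(" "));
  const out = [];
  for (const node of nodes) {
    if ("text" in node) {
      // Text is always kept.
      out.push({ text: node.text });
      continue;
    }
    if (!tags.has(node.tag)) {
      // Assumption: a disallowed element is dropped with its whole subtree.
      continue;
    }
    // Keep only whitelisted attributes.
    const keptAttrs = {};
    for (const [name, value] of Object.entries(node.attrs || {})) {
      if (attrs.has(name)) keptAttrs[name] = value;
    }
    out.push({
      tag: node.tag,
      attrs: keptAttrs,
      children: sanitizeTree(node.children || [], allowedTags, allowedAttrs),
    });
  }
  return out;
}

// Usage with a made-up comment and the Slashdot-style whitelist.
const comment = [
  { text: "Nice " },
  { tag: "b", attrs: { onclick: "evil()" }, children: [{ text: "post" }] },
  { tag: "script", attrs: {}, children: [{ text: "evil()" }] },
  {
    tag: "a",
    attrs: { href: "http://example.com/", style: "position:fixed" },
    children: [{ text: "link" }],
  },
];
const clean = sanitizeTree(comment, "a b strong i em", "href");
console.log(JSON.stringify(clean, null, 2));
```

In this sketch the `script` element disappears, `onclick` and `style` are stripped, and the `href` survives.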
>
> Regards,
> Maciej
>
>

From ian at hixie.ch Tue Dec 1 01:36:51 2009
From: ian at hixie.ch (Ian Hickson)
Date: Tue, 1 Dec 2009 09:36:51 +0000 (UTC)
Subject: [whatwg] Web Workers: Worker.onmessage
In-Reply-To:
References:
Message-ID:

On Wed, 11 Nov 2009, Simon Pieters wrote:
> On Wed, 11 Nov 2009 16:05:53 +0100, Simon Pieters wrote:
> >
> > Shouldn't setting onmessage on a Worker object enable the port message
> > queue?
> >
> > Currently step 8 of the "run a worker" algorithm enables the port
> > message queue for the WorkerGlobalObjectScope side, but it is never
> > enabled when going in the other direction, if I'm reading the spec
> > correctly.
>
> Hmm. Actually, step 12 and 13 of the Worker constructor enable the port
> message queue for both the inside and outside ports. Why does the "run a
> worker" algorithm enable one of them again? Isn't it too early to enable
> the port message queues before the worker has run and set 'onmessage'?

I suppose that if the worker is slow to start up, it could conceivably
receive a message before the event loop exists, which would lead to
poorly defined behaviour.

I've removed the inner port opening from the constructor's algorithm.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

From ian at hixie.ch Tue Dec 1 02:28:10 2009
From: ian at hixie.ch (Ian Hickson)
Date: Tue, 1 Dec 2009 10:28:10 +0000 (UTC)
Subject: [whatwg] [Web workers] An attribute describing the "best" number of
 worker to invoke in a delegation use case
In-Reply-To: <4AFCA9DB.5000103@mit.edu>
References: <4AFB7EB5.7010001@enseirb-matmeca.fr>
 <4AFB81E0.7000100@mit.edu>
 <4AFC4AB5.7020606@enseirb-matmeca.fr>
 <4AFC66A7.2010404@mit.edu>
 <4AFC72D1.9020904@enseirb-matmeca.fr>
 <4AFC7F8A.1050703@mit.edu>
 <4AFCA74D.80905@enseirb-matmeca.fr>
 <4AFCA9DB.5000103@mit.edu>
Message-ID:

On Wed, 11 Nov 2009, David Bruant wrote:
>
> This is a new proposal taking into account the feedback I received to
> the "[WebWorkers] About the delegation example" message.
>
> In the delegation example of the WebWorker spec, we can see this line:
> "var num_workers = 10;"
>
> My concern is about the arbitrariness of the "10".

I agree that it's suboptimal. However, I think realistically a good
implementation of parallel work would need some sort of dynamic
performance tuning, continuously slowly ramping up the number of workers
while it increases throughput, and when throughput decreases, switching
to reducing the number of workers until throughput increases again. That
would probably be too complicated to show in an example in the spec.

> My proposal is to add an attribute to the navigator object to write this:
> "var num_workers = navigator.optimalWorkerNumber;
> var items_per_worker = 10000000/num_workers;" (uneven dividing can
> easily be solved)
> (the name "optimalWorkerNumber" is not good, but I will use it in the
> rest of this e-mail)

optimalWorkerNumber is a function of time and of the algorithm that the
worker implements. I don't think it would solve the problem.
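The dynamic tuning described above (ramp the worker count up while throughput improves, back off when it drops) could be sketched roughly as follows. This is only one possible reading of that description, not anything from the spec: `tuneWorkerCount`, its state object, and the `maxWorkers` cap are hypothetical, and the throughput figures in the usage example are made up. Measuring throughput and actually starting or stopping workers is left out.

```javascript
// Hypothetical helper, not part of any spec: one step of a simple
// hill-climbing tuner for the number of workers.
//
// state: { count, direction, lastThroughput }
//   count          - number of workers currently in use
//   direction      - +1 while ramping up, -1 while ramping down
//   lastThroughput - throughput measured at the previous step, or null
// throughput: throughput measured while running with state.count workers
function tuneWorkerCount(state, throughput, maxWorkers = 64) {
  let direction = state.direction;
  // If the last change made throughput worse, reverse direction.
  if (state.lastThroughput !== null && throughput < state.lastThroughput) {
    direction = -direction;
  }
  // Move one worker at a time, staying within [1, maxWorkers].
  const count = Math.min(maxWorkers, Math.max(1, state.count + direction));
  return { count, direction, lastThroughput: throughput };
}

// Usage with made-up throughput measurements (items per second).
let state = { count: 1, direction: 1, lastThroughput: null };
const measurements = [100, 190, 260, 250, 262];
const counts = [];
for (const measured of measurements) {
  state = tuneWorkerCount(state, measured);
  counts.push(state.count);
}
console.log(counts); // [ 2, 3, 4, 3, 2 ]
```

A tuner like this keeps oscillating around whatever count currently gives the best throughput, which fits the point that the right number changes over time and with the workload.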

> This attribute has the following properties:
> - It's only dependent on the hardware, the operating system and the
> WebWorker implementation (thus, it is not dynamically computed by the
> user agent at each call and two calls in the same
> hardware//OS//WebWorker implementation have the same result).
> - In the same running conditions (same memory available, same number of
> processes running concurrently...) running the "same algorithm" (an easy
> delegation algorithm) has a significantly better performance with
> (navigator.optimalWorkerNumber) workers than
> (navigator.optimalWorkerNumber - 1) workers
> - In the same running conditions, running the same algorithm has no
> significantly better performance with (navigator.optimalWorkerNumber + 1)
> workers than (navigator.optimalWorkerNumber) workers

I do not think it is possible to satisfy all of the above conditions at
the same time.

On Thu, 12 Nov 2009, David Bruant wrote, in response to Boris saying much
the same as what I wrote above:
>
> => You're perfectly right. I reformulate the definition of "running
> conditions" (appearing in conditions 2 and 3) as:
> "same memory available, same number of processes running concurrently,
> no other worker running working on the same document".

On Thu, 12 Nov 2009, Boris Zbarsky wrote:
>
> That doesn't help that much, unfortunately. Most simply, consider a
> quad-core chip and workers that are completely cpu-bound. If there are
> no other processes running, the optimal number of workers is 4. If
> there is one other process which is also completely cpu-bound running,
> the optimal number of workers is 3 (in the sense that 4 may well not
> satisfy your condition 3). This is really the same issue as having one
> worker running, but it's some process not under the browser's control.
>
> If, on the other hand, the workers are I/O bound (esp. network I/O
> latency bound), then the optimal number of workers can well be larger
> than the number of cores...

On Thu, 12 Nov 2009, David Bruant wrote:
>
> => If you are comparing "no other processes running" and "one other
> process which is also completely cpu-bound running", you are not in what
> I've called "same running conditions" (because the number of concurrent
> processes is different).

The point is that the constant can't be constant if it has to return a
different number based on conditions outside the UA's control.

> I reformulate conditions 2 and 3 this way:
> - In "blank conditions" (no other processes/threads running on the CPU,
> enough memory to allocate the workers), running the same algorithm (an
> easy delegation algorithm) has significantly better performance with
> (navigator.optimalWorkerNumber) dedicated workers than with
> (navigator.optimalWorkerNumber - 1) dedicated workers
> - In "blank conditions", running the same algorithm has no significantly
> better performance with (navigator.optimalWorkerNumber + 1) dedicated
> workers than with (navigator.optimalWorkerNumber) dedicated workers

This isn't especially useful, since "blank conditions" are never met by a
running script (for one, the script is running).

> As a reminder, the purpose of this attribute is to decide, at the
> beginning of a delegation algorithm, what is the "optimal" number of
> workers to divide the work in a way that is optimal regarding the
> hardware, the OS and the worker implementation.
> No matter the running conditions, 2 calls return the same value for the
> same hardware//OS//Worker implementation.
> The idea behind this property is that even if you start running the
> algorithm with a lot of concurrent processes, you create a number of
> workers that cannot be run concurrently at the beginning, but you may
> use the resources optimally later (if the other processes/threads stop
> running, which you cannot control but can still hope happens during the
> "algorithm lifetime"; according to the spec, "workers are expected to be
> long-lived").

Basically, this would encourage authors to always use too many workers,
as far as I can tell.

On Thu, 12 Nov 2009, David Bruant wrote:
>
> => I think it happens very often. While I'm writing this e-mail, "no
> process" is running. About fifty processes are runnable, but not
> running. They are passively waiting. My CPU is barely used.

You're lucky. :-) My CPU is almost always at at least 25%.

> My point is that this number may be available very easily. For example,
> in my dual-core, Linux, Firefox 3.5, the number is 2. Why withhold
> information that can be useful and reliable (more so than measurement,
> at least!)?

It's actually probably quite rarely 2. It depends on all kinds of
factors, like the kind of algorithm, what other programs are running,
etc.

I still haven't added this feature, as I do not believe the arguments
presented form a convincing case. However, if you are still interested in
pursuing this feature, I encourage you to convince a browser vendor to
implement it, as discussed here:

   http://wiki.whatwg.org/wiki/FAQ#Is_there_a_process_for_adding_new_features_to_a_specification.3F

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

From kornel at geekhood.net Tue Dec 1 02:38:57 2009
From: kornel at geekhood.net (Kornel Lesiński)
Date: Tue, 1 Dec 2009 10:38:57 +0000
Subject: [whatwg] updateWithSanitizedHTML (was Re: innerStaticHTML)
In-Reply-To: <7789133a0911301555k60e48190k24df1fb23ec60406@mail.gmail.com>
References: <7789133a0911301555k60e48190k24df1fb23ec60406@mail.gmail.com>
Message-ID:

> The WebKit community is considering taking up such an experimental
> implementation. Here's my current proposal for how this might work:
>
> http://docs.google.com/Doc?docid=0AZpchfQ5mBrEZGQ0cDh3YzRfMTJzbTY1cWJrNA&hl=en
>
> I would appreciate any feedback on the design.

A whitelist requires developers to know about the potential risks of each
element/property, and that's not obvious to everyone: e.g. one might want
to allow object/embed (for harmless YouTube videos) without realizing
that it enables XSS.

It's also non-obvious that the style attribute is an XSS risk (via the
behavior property). A higher-level filtering option could allow the style
attribute, and only filter out that property. The current proposal would
need another whitelist for CSS properties.

And even a whitelist for CSS properties couldn't be used to implement a
"No external access" policy (allow images with data: urls, allow http:
links, but not http: images). This would be useful for webmails and other
places where a website doesn't want to allow 3rd parties to track views.

A "No clickjacking" option might be useful as well.

-- 
regards, Kornel Lesiński

From lachlan.hunt at lachy.id.au Tue Dec 1 05:28:32 2009
From: lachlan.hunt at lachy.id.au (Lachlan Hunt)
Date: Tue, 01 Dec 2009 14:28:32 +0100
Subject: [whatwg]
 <figure><* caption>
In-Reply-To:
References:
Message-ID: <4B151A00.1030602@lachy.id.au>

Philip Jägenstedt wrote:
> As currently specced, the proper usage of <figure> is:
>
> <figure>
> <dd>A Bunny</dd>
> <dt>The Cutest Animal</dt>
> </figure>
>
> Apart from all that has been said about legacy parsing, leaking style in
> IE, etc I would (perhaps not be the first to) add:
>
> 1. It seems quite easy to confuse or mistype dd/dt. Without guessing how
> often authors will get it wrong, I think everyone agrees that (all else
> equal) a syntax which is harder to confuse/mistype is better.

Yes, I expect we'll see a lot of authors get them reversed, using the dd
for the caption if they want the caption below the content. This is
likely to occur since existing authors have already learned that dt comes
before dd when used within a dl, and because old habits die hard, they're
likely to repeat the pattern within figure.

> 2. Only the caption needs to be marked up, the content is implicitly
> everything else. While some content may need a wrapping element for
> styling, e.g. usually does not.
>
> 3. Aesthetics. (My eyes are bleeding, but I can't speak for anyone
> else's.)

Some additional reasons why using dt/dd in figure is a bad solution:

The simplest workaround presented so far to solve the styling issue in IE
is basically to do this:
...
Caption
And then rather than style the figure itself, give some style to the div
because some styles don't work properly when applied to the figure.

This effectively makes the figure element itself entirely useless,
especially given that all browsers lack support for it. It would be far
easier for authors to just stick with this entirely hack-free alternative
that doesn't use the new elements:
...

Caption

At least until browsers actually implement support for <figure> and
IE6/7's market share becomes negligible. It will take a couple of years
for those events to occur, and there's no need to rush into using the new
elements yet.

> The main difficulty with coming up with something better seems to have
> been finding a name for an element which isn't already taken. If that's
> the only issue, why not just take some inspiration from