[whatwg] base64 entities

Thu Aug 26 15:45:45 PDT 2010

On Wed, Aug 25, 2010 at 6:37 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> On 8/25/10 7:41 PM, Adam Barth wrote:
>> 2) Decoding base64 results in binary data.  We'll need to convert that
>> data to characters in order to deal with it in the DOM.  We always
>> use UTF-8 for that transformation, regardless of the document's
>> encoding.
>
> Note that this issue means that using atob or btoa for dealing with this is
> a huge pain if non-ASCII chars are involved, since those take and return
> byte arrays masquerading as JS strings, not actual Unicode strings.

I'm slightly confused about how that works.  How do you represent
arbitrary binary data as characters?  Another option is to provide a
base64 encoder/decoder that uses UTF-8 to encode/decode the binary.
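
For concreteness, here is a rough sketch of what such a UTF-8-aware
encoder/decoder could look like when layered on top of atob/btoa.  The
names utf8ToBase64 and base64ToUtf8 are made up for illustration and
are not part of any proposal; the sketch also assumes an environment
that provides TextEncoder and TextDecoder.

```javascript
// Hypothetical helpers, not an existing or proposed API.
// btoa/atob work on "binary strings": one char (0..255) per byte.

function utf8ToBase64(text) {
  // Unicode string -> UTF-8 bytes -> binary string -> base64.
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return btoa(binary);
}

function base64ToUtf8(b64) {
  // base64 -> binary string -> UTF-8 bytes -> Unicode string.
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const encoded = utf8ToBase64("\u00e9");   // "é" is bytes C3 A9 in UTF-8
const roundTrip = base64ToUtf8(encoded);  // back to the one-character "é"
const rawAtob = atob(encoded);            // two chars, one per byte
```

Calling atob directly on the same input yields a two-character string
(one character per byte) rather than the original single character,
which is the "byte arrays masquerading as JS strings" problem Boris
describes.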

On Thu, Aug 26, 2010 at 1:38 AM, Martin Janecke <whatwg.org at kaor.in> wrote:
> Is it necessary to consider compatibility issues here? In HTML4 this
> seems to have been valid code (-> http://validator.w3.org/check):

It's always necessary to consider compatibility.  Perhaps one of our
friends with the ability to grep the web would be kind enough to tell
us how common &% followed by base64 characters followed by ; is.

On Thu, Aug 26, 2010 at 2:58 AM, Julian Reschke <julian.reschke at gmx.de> wrote:
> Not convinced. There's already one way to escape these things, and this is
> supported in all UAs.

Which way is that?

> I don't see how adding another mechanism will help those who can't use the
> first one properly. For instance, people unable to escape "<", ">" and "&"
> are likely also unable to get the UTF-8 conversion right.

Escaping just those characters is insufficient.  The appeal of this
approach is that authors don't need the right blacklist of dangerous
characters.  By the way, there are already folks doing something
similar manually now.  They send the untrusted bytes as base64 and
decode them using JavaScript.

On Thu, Aug 26, 2010 at 1:25 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> Sorta.  It'll let you put the data in <script>, but it won't verify that the
> data doesn't change the meaning of the script, obviously, or inject script
> of its own to run.

Because <script> does not decode entities in HTML, the attacker will
be limited to what he or she can do with alphanumeric characters, +,
/, and trailing =.  Of course, if the entity appears in a string
context (as is pretty common), the attacker won't be able to break out
of the string context, even by including </script> in the attack string
(which is a common vulnerability in hand-rolled escaping schemes).
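
To illustrate the alphabet argument, here is a small sketch that
base64-encodes a typical attack string and checks that the result
contains nothing outside alphanumerics, +, /, and trailing =.  The
toBase64 helper and the variable names are hypothetical, and the
sketch only demonstrates the property of the encoded text, not the
entity syntax itself.

```javascript
// Hypothetical helper: base64-encode the UTF-8 bytes of a string.
function toBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return btoa(binary);
}

// An attack string that tries to close the string literal and the
// enclosing script element.
const attack = '";</script><script>alert(1)</script>';
const payload = toBase64(attack);

// The encoded form uses only the base64 alphabet.
const onlyBase64Alphabet = /^[A-Za-z0-9+\/]*={0,2}$/.test(payload);

// Embedding it in a string context: no quote, backslash, or "<" can
// appear, so the literal cannot be terminated early.
const embedded = 'var data = "' + payload + '";';
```

Since neither a quote character nor "<" can occur in the encoded
text, the embedded string literal ends only where the author put the
closing quote.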

On Thu, Aug 26, 2010 at 1:30 PM, Julian Reschke <julian.reschke at gmx.de> wrote:
> I now get the point about the additional problems in script, but I fail to
> see how the proposal addresses this, unless expanding these entities is
> supposed to happen *after* parsing the script.

Yes.  That's precisely what happens.

Kind regards,
Adam