[whatwg] multipart/form-data filename encoding: unicode and special characters

Ashley Sheridan ash at ashleysheridan.co.uk
Tue May 1 19:38:12 PDT 2012


On Tue, 2012-05-01 at 21:12 -0400, Evan Jones wrote:

> I am not an experienced web standards wonk, so please forgive me if I'm making a mistake here.
> 
> When uploading files that contain special characters in their name, it appears to me that it is unspecified as to how those file names should be escaped. As a result, Webkit/Safari/Chrome appear to handle these filenames in one way, while Firefox handles them in another. I'm implementing the server side of this equation, and it is unclear to me what I should be doing. Am I missing something? Webkit even has a bug on this issue that states "I suggest working with WHATWG or HTML WG to get something specified in HTML5, and getting browsers converge on that." Is anyone working on this?
> 
> 
> EXAMPLE
> 
> Create a file named: bàz'\"hi%22.txt  e.g. using the Unix command: touch bàz\'\\\"hi%22.txt
> 
> 
> Firefox (13.0 beta on Mac) sends the following header, backslash escaping the double quote but not escaping the backslash.
> 
> Content-Disposition: form-data; name="somefile"; filename="bàz'\\"hi%22.txt"
> 
> 
> Webkit (latest nightly r115711 on Mac): %-escapes the double quote, but does nothing to the literal %
> 
> Content-Disposition: form-data; name="somefile"; filename="bàz'\%22hi%22.txt"
> 
> 
> THE SPECS: HTML5 states:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#multipart-form-data
> 
> Encode the (now mutated) form data set using the rules described by RFC 2388. […] File names […] must use the character encoding selected above, though the precise name may be approximated if necessary (e.g. […]). User agents must not use the RFC 2231 encoding suggested by RFC 2388.
> 
> 
> … this seems contradictory: Encode using RFC 2388, but do not use the encoding suggested by the RFC. Worse, no browser actually follows the RFC (e.g. they all use UTF-8 encoded parameter values), so that doesn't seem like the right answer. Is there a way out of this mess?
> 
> Evan
> 
> --
> http://evanjones.ca/
> 


Although this test case does expose an issue, I would question what
real-world problem it is likely to cause. It uses several characters
that are considered unsafe on the file systems of the most popular
operating system (Windows, whether NTFS or FAT32), and therefore, by
association, on other operating systems too, where users are probably
(even unconsciously) avoiding those characters purely for
interoperability reasons.
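
To make the point about unsafe characters concrete, here is a minimal
sketch in Python. It assumes the usual set of characters that Windows
rejects in file names (< > : " / \ | ? *); the helper name
windows_unsafe_chars is made up for illustration.

```python
# Characters that Windows does not allow in file names (assumed list:
# the commonly documented reserved set, not taken from the thread).
WINDOWS_RESERVED = set('<>:"/\\|?*')


def windows_unsafe_chars(filename):
    """Return the reserved characters found in filename, in order of first appearance."""
    seen = []
    for ch in filename:
        if ch in WINDOWS_RESERVED and ch not in seen:
            seen.append(ch)
    return seen


# The literal file name from the test case: bàz'\"hi%22.txt
test_name = "bàz'\\\"hi%22.txt"

# Reports the backslash and the double quote; the apostrophe, the
# percent sign and the non-ASCII letter are all legal on Windows.
print(windows_unsafe_chars(test_name))
```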

The Webkit method looks the better of the two with regard to how
server-side languages might interpret it, but it would need work to
ensure that everything that should be escaped is escaped, and that
everything that should be unescaped on the server is unescaped
correctly.
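
As a rough illustration of why it "would need work", the sketch below
mimics the two behaviours described in the report and shows that the
Webkit-style escaping cannot be reversed reliably while the literal %
is left alone. The function names are hypothetical, and the escaping
rules are only what the two example headers show, not the browsers'
actual code.

```python
def firefox_style(name):
    # As reported for Firefox 13 beta: " becomes \", backslash is left as-is.
    return name.replace('"', '\\"')


def webkit_style(name):
    # As reported for the Webkit nightly: " becomes %22, a literal % is left as-is.
    return name.replace('"', '%22')


def naive_percent_decode(value):
    # Hypothetical server-side step that reverses only the %22 escape.
    return value.replace('%22', '"')


# The literal file name from the test case: bàz'\"hi%22.txt
original = "bàz'\\\"hi%22.txt"

print(firefox_style(original))
print(webkit_style(original))

# The round trip is lossy: the %22 that was already in the name is
# decoded as well, so the server ends up with a different file name.
print(naive_percent_decode(webkit_style(original)) == original)

# Two distinct names also collide on the wire under this scheme.
print(webkit_style('a"b') == webkit_style('a%22b'))
```

Escaping the % itself as well (e.g. as %25) would make the scheme
reversible, which is the kind of extra work meant above.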

-- 
Thanks,
Ash
http://www.ashleysheridan.co.uk




