[whatwg] URL spec and IDN
Anne van Kesteren
annevk at annevk.nl
Mon Mar 17 04:13:24 PDT 2014
On Wed, Feb 19, 2014 at 11:53 PM, Joshua Cranmer <Pidgeot18 at verizon.net> wrote:
> I've noted that the URL specification is currently rather vague when it
> comes to IDN, and has some uncertain comments about issues related to
> IDNA2003, IDNA2008, and UTS #46.
Yeah, it is a clusterfuck. I'm working with the guys behind UTS #46 on
cleaning it up, but due to vacation et al it's taking some time.
> Roughly speaking, in my experience, there are three kinds of labels:
> A-labels, U-labels, and "displayable" U-labels. A-labels are the
> Punycode-encoded version of the labels used for DNS (normalized to ASCII
> lower-case, naturally). U-labels are the results of converting an A-label to
> Unicode. "Displayable" U-labels are A-labels converted to U-labels only if
> they do not contain a Unicode homograph attack. My drawing a distinction
> between the "displayable" U-label and the regular kind is out of concern
> that the definition of "displayable" may change over time (e.g., certain
> script combinations are newly permitted/prohibited), whereas the U-label
> derived from an A-label should be constant.
Agreed. At some point we should make this clearer in the specification.
> Given these three kinds of labels, it ought to be possible (IMO) to convert
> a generic domain in any (i.e., unnormalized) format. The specification
> currently provides for a domainToASCII and a domainToUnicode function which
> naturally map to producing A-labels and U-labels, but contains a note
> suggesting that they shouldn't be implemented due to the "IDNA clusterfuck."
> The way to a "displayable" U-label would seem to me to come most naturally
> via |new URL("http://" + domain).host|.
No, we should have a dedicated domainToUI() or some such. A parsed URL
contains A-labels. We might want to have something similar for URLs
themselves, to convert percent-encoding and such.
> Looking at the spec, it's not clear if the host, href, and other methods are
> supposed to return U-labels or A-labels (or some potential mix of the two).
It's A-labels; see http://url.spec.whatwg.org/#concept-host-parser for details.
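
A rough illustration of that behavior, as a sketch only: it assumes a
runtime with a global URL class that follows the URL Standard (a recent
browser or Node.js). The domain bücher.example is made up, and
parsedHost() is a hypothetical helper, not something from the spec.

```javascript
// Hypothetical helper: parse a URL string and return its serialized host.
function parsedHost(input) {
  return new URL(input).host;
}

// Unicode input comes back as A-labels (Punycode with the "xn--" prefix).
console.log(parsedHost("http://bücher.example/"));        // "xn--bcher-kva.example"

// Case differences in the input are normalized away before encoding.
console.log(parsedHost("http://BÜCHER.example/"));        // "xn--bcher-kva.example"

// Input that is already an A-label passes through unchanged.
console.log(parsedHost("http://xn--bcher-kva.example/")); // "xn--bcher-kva.example"

// href carries the same A-label form.
console.log(new URL("http://bücher.example/").href);      // "http://xn--bcher-kva.example/"
```

So a caller who wants something suitable for display has to convert back
explicitly; the parsed URL itself never hands out U-labels.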
> I'm guessing the reason why the domainTo* methods are unspecified is
> inconsistent handling of IDNA2008 by current web browsers, ...
> * Chrome's documentation calls out ignoring STD3 rules (i.e., permitting
> more ASCII characters) and disallowing unassigned code points. IE's
> documentation does not suggest what they do here.
You want to allow e.g. "_" as that is used by subdomains. However, if
you ignore STD3, you need additional checks later on to prevent
reparsing issues. The URL Standard calls out the specific code points
that are problematic here.
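
A small sketch of both halves of that point, again assuming a runtime
whose global URL class follows the URL Standard. The hostnames are
invented, and tryHostname() is a hypothetical helper that returns null
when parsing fails.

```javascript
// Hypothetical helper: return the parsed hostname, or null if the parser rejects the input.
function tryHostname(input) {
  try {
    return new URL(input).hostname;
  } catch (e) {
    return null; // the URL constructor throws a TypeError on parse failure
  }
}

// "_" is not allowed by STD3 rules, but the URL parser accepts it in a host.
console.log(tryHostname("http://_dmarc.example.com/")); // "_dmarc.example.com"

// "%2F" percent-decodes to "/", which would change how the URL reparses,
// so the host parser rejects it as a forbidden code point.
console.log(tryHostname("http://ex%2Fample.com/"));     // null
```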
> 1. Expressly identify how to normalize and process an IDN address under
> IDNA2008 + UTR #46 + other modifications that reflects reality. I'm not
> qualified to know what happens at precise edge cases here.
Yeah, this is the plan, once UTR #46 incorporates some changes I proposed. See
http://www.unicode.org/review/pri264/ if you're interested.
> 2. Resolve that URL should reflect U-labels as much as possible while
> placing the burden of avoiding Unicode homograph attacks on the browser
> implementors rather than JS consumers of the API.
Currently it's A-labels. Mostly because all other parts of the URL are