[whatwg] Web Addresses vs Legacy Extended IRI (again)

Giovanni Campagna scampa.giovanni at gmail.com
Sun Mar 29 06:31:38 PDT 2009


2009/3/29 Anne van Kesteren <annevk at opera.com>:
> On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna
> <scampa.giovanni at gmail.com> wrote:
>>
>> 2009/3/29 Anne van Kesteren <annevk at opera.com>:
>>>
>>> I'm not sure if you're correct about those differences, but even if you
>>> are they are not the only differences. E.g. LEIRIs perform normalization if
>>> the input encoding is non-Unicode. URLs do not. URLs can encode their query
>>> component per the input encoding (and do so for HTML and some APIs).
>>> LEIRIs cannot.
>>
>> What is the problem with normalization? Is there a standard for
>> conversion from non-Unicode to Unicode?
>> I guess not, so normalization (which should always be done) is perfectly
>> legal.
>
> It's about Unicode Normalization. (And it should not always be done.)

If I convert from ISO-8859-1 and find "À" (decimal 192), I can emit
either "À" U+00C0 LATIN CAPITAL LETTER A WITH GRAVE, or "A" U+0041 LATIN
CAPITAL LETTER A followed by " ̀" U+0300 COMBINING GRAVE ACCENT.
The first is NFC, the second is NFD, and both are legal and simple.
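
A minimal sketch of the two forms, using only Python's standard library
(the variable names are illustrative assumptions, not taken from any
specification):

```python
import unicodedata

# Byte 0xC0 (decimal 192) is "À" in ISO-8859-1.
decoded = bytes([192]).decode("iso-8859-1")

# NFC keeps the precomposed character; NFD splits it into
# a base letter followed by a combining mark.
nfc = unicodedata.normalize("NFC", decoded)
nfd = unicodedata.normalize("NFD", decoded)

print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00C0']
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0041', 'U+0300']
```

Both strings render identically but compare unequal code point by code
point, which is why the choice of form matters when addresses are compared.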


>> In addition, IRIs are defined as a sequence of Unicode codepoints. It
>> does not matter how those codepoints are stored (ASCII, ISO-8859-1,
>> UTF-8), only the Unicode version of them.
>
> Please read the IRI specification again. Specifically section 3.1.

The specification says that an IRI must be in normalized UCS when it is
created from user input; otherwise, if it is in a non-Unicode encoding,
it must be converted to Unicode (and the conversion should normalize);
otherwise, if it is already in UTF-8 / 16 / 32, it must be converted to
UCS but not normalized.
I don't see a particular problem with this.
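
A rough sketch of that reading of section 3.1, in Python. The helper name
`iri_to_ucs`, its parameters, and the set of encodings treated as
Unicode-based are hypothetical, and NFC is assumed as the normalization
form; this is not an implementation of the IRI specification.

```python
import unicodedata

# Assumption: these are the encodings treated as "already Unicode".
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}

def iri_to_ucs(raw, encoding=None):
    """Hypothetical helper: return an IRI as a str of Unicode code points.

    raw is a str for user input (no character encoding involved),
    or bytes accompanied by the name of their character encoding.
    """
    if isinstance(raw, str):
        # Created from user input: normalize.
        return unicodedata.normalize("NFC", raw)
    if encoding.lower() in UNICODE_ENCODINGS:
        # Already Unicode-based: decode only, do not normalize.
        return raw.decode(encoding)
    # Non-Unicode encoding: convert, and normalize while converting.
    return unicodedata.normalize("NFC", raw.decode(encoding))
```

Under these assumptions, `iri_to_ucs("A\u0300")` is normalized to the
single code point U+00C0, while the same two code points arriving as UTF-8
bytes pass through unchanged.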

>> This is the same as URL5s, by the way, because none of them is defined
>> on octets and both use the RFC3986 method for percent-encoding (using
>> UTF-8)
>
> No, it's not always using UTF-8.

RFC 3986 never creates percent-encoding itself (percent-encoding is used
for unspecified binary data), but it says that text components should be
encoded as UTF-8 and that the rules are established by scheme-specific
syntaxes.
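
To show what is at stake in the UTF-8 question, here is a small Python
sketch using `urllib.parse.quote`. The variable names are illustrative,
and ISO-8859-1 merely stands in for whatever input encoding a document
might use.

```python
from urllib.parse import quote, unquote

text = "\u00c0"  # "À"

# Percent-encode the octets of the UTF-8 form of the character.
utf8_form = quote(text, encoding="utf-8")

# Percent-encode the octets of the ISO-8859-1 form instead.
latin1_form = quote(text, encoding="iso-8859-1")

print(utf8_form)    # %C3%80
print(latin1_form)  # %C0

# Decoding recovers the character only if the same encoding is assumed.
print(unquote(latin1_form, encoding="iso-8859-1") == text)  # True
```

The same character yields two different percent-encoded strings, so a
receiver has to know which encoding the sender used.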

> --
> Anne van Kesteren
> http://annevankesteren.nl/
>

Giovanni


