[whatwg] Consecutive hyphen-minus characters in comments/in ACE-strings of IDNs
Ian Hickson
ian at hixie.ch
Thu Jan 6 17:10:26 PST 2011
On Tue, 2 Nov 2010, Martin Janecke wrote:
>
> In 10.1.6 Comments the current HTML spec
> http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#comments
> says:
>
> > Following this sequence, the comment may have text, with the additional
> > restriction that the text must not [...] contain two consecutive U+002D
> > HYPHEN-MINUS characters (--) [...]
>
> Section 5 of RFC 3490 http://tools.ietf.org/html/rfc3490#section-5
> defines the ACE-prefix in Internationalized Domain Names to be "xn--",
> i.e. always containing two consecutive hyphen-minus characters.
>
> This leads to the odd situation that correctly ASCII-compatible encoded
> IDNs cannot be used in HTML comments. For example, the wide-spread habit
> of commenting out parts of HTML code in web pages fails when the code
> contains those otherwise valid URLs. This really happens in practice
> when working with IDNs (my personal experience) and I assume this
> incompatibility will cause a growing number of pages to be invalid in
> future, as the number of used IDNs grows, which will happen for sure, as
> ICANN has approved internationalized top level domain names this year.
>
> Can the problems be prevented? E.g. by making "xn--" and "XN--" valid in
> comments?
>
> May it even be justified to make "--" valid in comments again? As I
> understand
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2006-May/006337.html
> and following replies, "--" used to be valid earlier in the spec and was
> then changed to make HTML more compatible with SGML, although HTML(5) is
> explicitly not SGML anymore. Making "--" valid won't affect any
> previously valid or invalid HTML page in any negative way, will it?

The main reason, IIRC, that we disallow "--" in comments in text/html is
that it is disallowed in XML, and that flagging it helps authors catch
cases where they are commenting out comments.

The question, I guess, is which of the following we think is more
important:

 * Helping authors not write HTML markup that might be hard to convert to
   XML, and helping authors avoid nesting comments accidentally, by
   flagging "--" sequences in comments.

 * Getting out of the way of authors who want to put "--" sequences in
   comments, e.g. because they use "--" as a long dash (as I do all the
   time!), or because they want to comment out punycoded URLs.

Currently the spec assumes the former is more important. Personally, I
think the latter is rather more useful, but then I use "--" as long
dashes all the time! When this was last studied, the weight of argument
was presumably on the stricter "disallow --" side of things.

I'm open to changing this back; does anyone else have an opinion on this?
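
For anyone who wants to see the conflict concretely, here is a minimal
Python 3 sketch. It is only an illustration, not anything from the spec or
from a validator: the function name, the variable names, and the host
"bücher.example" are made up for the example, and it checks only the
"--" restriction quoted above, not the other restrictions elided from that
quote. It relies on Python's built-in "idna" codec, which implements the
RFC 3490 ToASCII conversion.

```python
def violates_double_hyphen_rule(comment_text: str) -> bool:
    """Return True if the text contains two consecutive U+002D characters.

    Hypothetical helper: it models only the "--" restriction, nothing else.
    """
    return "--" in comment_text


# ACE form of an illustrative IDN; "b\u00fccher" is "bücher".
ace_host = "b\u00fccher.example".encode("idna").decode("ascii")

# The text an author would place between "<!--" and "-->" when
# commenting out a link to that host.
commented_out = f' <a href="http://{ace_host}/">link</a> '

print(ace_host)
print(violates_double_hyphen_rule(commented_out))                 # punycoded URL
print(violates_double_hyphen_rule(" a long dash -- like this "))  # "--" as a dash
print(violates_double_hyphen_rule(" a single - hyphen "))         # no "--" at all
```

Both use cases from the second bullet trip the same check, because the
ACE prefix "xn--" always contains the forbidden sequence; a comment with
only single hyphens does not.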
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'