[whatwg] Possible Sub-Delimiters in Fragments (Was: Feature Request: Media Elements as Targets for Links)

Nils Dagsson Moskopp nils at dieweltistgarnichtso.net
Wed Dec 19 20:50:06 PST 2012


Philip Jägenstedt <philipj at opera.com> wrote on Wed, 19 Dec 2012
11:19:14 +0100:

> […]
>
> Redefining/extending what the fragment component does for HTML is
> somewhat risky, so it really comes down to what exactly the
> processing should be.
> 
> What should a browser do with a URL ending with #foo&t=10 if there is
> an element on the page with id="foo&t=10"? What about #foo&bar if
> there is an element with id="foo"? I would be surprised if treating
> #foo& the same as #foo were Web compatible...

I wrote a script to extract attribute values from the archive of 8915
HTML pages (<http://html5accessibility.com/HTML5data/html.zip>). The
script and files containing all id and href attributes can be found at
<http://daten.dieweltistgarnichtso.net/src/htmlattrib>.

Following are some results for some possible sub-delimiters between
element id and media fragment, generated using the following script:

===== snip =====

#!/bin/sh
echo `grep "$1" html-attrib-id -c` \
  id attributes containing “"$1"”
echo `grep "$1\S*=" html-attrib-id -c` \
  id attributes containing “"$1"” followed by something containing “=”
echo `grep '#' html-attrib-href | cut -d'#' -f2 | grep "$1" -c` \
  href attributes containing “"$1"” in fragment
echo `grep '#' html-attrib-href | cut -d'#' -f2 | grep "$1\S*=" -c` \
  href attributes containing "$1" followed by something containing “=” \
  in fragment

===== snap =====
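
A note on the script above: grep treats its pattern as a regular
expression, so an argument of "$" is an end-of-line anchor and matches
every line, which may explain the very large counts for U+0024 below.
A pattern beginning with "-" is parsed as an option. The following is
a minimal sketch of a literal-matching variant, not the script that
produced the numbers in this mail. The helper names
count_containing and count_followed_by_equals are made up for
illustration, and [^ \t] only approximates \S.

```sh
#!/bin/sh
# Sketch only: count_containing and count_followed_by_equals are
# hypothetical helper names, not part of the original script.

# Count lines in FILE ($2) that contain DELIM ($1) as a literal string.
# -F disables regular expression matching; -e keeps a DELIM that starts
# with "-" from being parsed as an option.
count_containing() {
    grep -c -F -e "$1" -- "$2"
}

# Count lines in FILE ($2) where some occurrence of DELIM ($1) is
# followed by optional non-blank characters and then "=".
# DELIM is assumed to be non-empty.
count_followed_by_equals() {
    DELIM=$1 awk '
        BEGIN { d = ENVIRON["DELIM"] }
        {
            s = $0
            while ((i = index(s, d)) > 0) {
                s = substr(s, i + length(d))
                if (match(s, /^[^ \t]*=/)) { n++; break }
            }
        }
        END { print n + 0 }
    ' "$2"
}
```

Assuming the same input files, count_containing '-' html-attrib-id
would count ids containing a hyphen. For hrefs, the fragments could
first be written to a scratch file with
grep '#' html-attrib-href | cut -d'#' -f2 and then counted the same
way.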

From the data set, it seems to me that U+007E TILDE would be a pretty safe
choice for separation of element id and media fragment if processing
should be kept to a minimum (just splitting on the delimiter).
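
To illustrate what "just splitting on the delimiter" could look like,
here is a rough sketch in shell. It assumes the split happens at the
first occurrence of the delimiter, which is my assumption and not
something settled in this thread. split_fragment is a hypothetical
name.

```sh
#!/bin/sh
# Sketch only: split_fragment is a hypothetical helper.
# Splits FRAGMENT ($1) at the first occurrence of DELIM ($2).
# Prints the element id on the first line and the media fragment
# (empty if there is no delimiter) on the second line.
split_fragment() {
    frag=$1
    delim=$2
    id=$frag
    media=
    case $frag in
        *"$delim"*)
            # Quoting "$delim" makes the shell match it literally.
            id=${frag%%"$delim"*}
            media=${frag#*"$delim"}
            ;;
    esac
    printf '%s\n%s\n' "$id" "$media"
}
```

With this, split_fragment 'foo~t=10' '~' prints "foo" and then
"t=10", while a fragment without a tilde is returned unchanged as the
id.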

With more elaborate processing (only splitting on the delimiter if a
U+003D EQUALS SIGN appears after it), we might also use:
- U+0021 EXCLAMATION MARK
- U+0027 APOSTROPHE
- U+002A ASTERISK
- U+002C COMMA
- U+003B SEMICOLON
- U+0040 COMMERCIAL AT
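
The more elaborate rule could be sketched as follows: split only when
a U+003D EQUALS SIGN appears somewhere after the delimiter, otherwise
treat the whole fragment as an element id. Splitting at the first
delimiter is again an assumption on my part, and split_if_equals is a
hypothetical name.

```sh
#!/bin/sh
# Sketch only: split_if_equals is a hypothetical helper.
# Splits FRAGMENT ($1) at the first DELIM ($2), but only if an "="
# appears somewhere after that delimiter. Otherwise the whole fragment
# is printed as the element id and the media fragment line is empty.
split_if_equals() {
    frag=$1
    delim=$2
    id=$frag
    media=
    case $frag in
        *"$delim"*=*)
            # Quoting "$delim" keeps characters like "*" literal.
            id=${frag%%"$delim"*}
            media=${frag#*"$delim"}
            ;;
    esac
    printf '%s\n%s\n' "$id" "$media"
}
```

For example, split_if_equals 'foo!t=10' '!' yields "foo" and "t=10",
whereas 'wow!' stays intact as an id because no "=" follows the
exclamation mark.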

I did check for characters U+0028 LEFT PARENTHESIS and U+0029 RIGHT
PARENTHESIS but did not include the results for aesthetic reasons. My
shell script also does weird things when given an argument of U+002D
HYPHEN-MINUS so that is missing as well.

Any faults in my reasoning? Also, where do I get a bigger data set?


[Boring stuff follows]


Regarding U+0021 EXCLAMATION MARK:

4 id attributes containing “!”
0 id attributes containing “!” followed by something containing “=”
2232 href attributes containing “!” in fragment
630 href attributes containing ! followed by something containing “=”
in fragment

Regarding U+0024 DOLLAR SIGN:

558023 id attributes containing “$”
1 id attributes containing “$” followed by something containing “=”
89837 href attributes containing “$” in fragment
0 href attributes containing $ followed by something containing “=” in
fragment

Regarding U+0026 AMPERSAND:

78 id attributes containing “&”
56 id attributes containing “&” followed by something containing “=”
1362 href attributes containing “&” in fragment
1346 href attributes containing & followed by something containing “=”
in fragment

Regarding U+0027 APOSTROPHE:

23 id attributes containing “'”
0 id attributes containing “'” followed by something containing “=”
339 href attributes containing “'” in fragment
9 href attributes containing ' followed by something containing “=” in
fragment

Regarding U+002A ASTERISK:

19 id attributes containing “*”
0 id attributes containing “*” followed by something containing “=”
18 href attributes containing “*” in fragment
0 href attributes containing * followed by something containing “=” in
fragment

Regarding U+002B PLUS SIGN:

28 id attributes containing “+”
1 id attributes containing “+” followed by something containing “=”
93 href attributes containing “+” in fragment
19 href attributes containing + followed by something containing “=” in
fragment

Regarding U+002C COMMA:

130 id attributes containing “,”
0 id attributes containing “,” followed by something containing “=”
428 href attributes containing “,” in fragment
10 href attributes containing , followed by something containing “=” in
fragment

Regarding U+003B SEMICOLON:

88 id attributes containing “;”
0 id attributes containing “;” followed by something containing “=”
222 href attributes containing “;” in fragment
8 href attributes containing ; followed by something containing “=” in
fragment

Regarding U+0040 COMMERCIAL AT:

8 id attributes containing “@”
0 id attributes containing “@” followed by something containing “=”
208 href attributes containing “@” in fragment
15 href attributes containing @ followed by something containing “=” in
fragment

Regarding U+007E TILDE:

2 id attributes containing “~”
0 id attributes containing “~” followed by something containing “=”
1 href attributes containing “~” in fragment
0 href attributes containing ~ followed by something containing “=” in
fragment

-- 
Nils Dagsson Moskopp // erlehmann
<http://dieweltistgarnichtso.net>


