[whatwg] Possible Sub-Delimiters in Fragments (Was: Feature Request: Media Elements as Targets for Links)
Nils Dagsson Moskopp
nils at dieweltistgarnichtso.net
Wed Dec 19 20:50:06 PST 2012
Philip Jägenstedt <philipj at opera.com> wrote on Wed, 19 Dec 2012
11:19:14 +0100:
> […]
>
> Redefining/extending what the fragment component does for HTML is
> somewhat risky, so it really comes down to what exactly the
> processing should be.
>
> What should a browser do with a URL ending with #foo&t=10 if there is
> an element on the page with id="foo&t=10"? What about #foo&bar if
> there is an element with id="foo"? I would be surprised if treating
> #foo& the same as #foo were Web compatible...
I wrote a script to extract attribute values from an archive of 8915
HTML pages (<http://html5accessibility.com/HTML5data/html.zip>). The
script and the files containing all id and href attributes can be found
at <http://daten.dieweltistgarnichtso.net/src/htmlattrib>.
Following are some results for some possible sub-delimiters between
element id and media fragment, generated using the following script:
===== snip =====
#!/bin/sh
echo `grep "$1" html-attrib-id -c` \
id attributes containing “"$1"”
echo `grep "$1\S*=" html-attrib-id -c` \
id attributes containing “"$1"” followed by something containing “=”
echo `grep '#' html-attrib-href | cut -d'#' -f2 | grep "$1" -c` \
href attributes containing “"$1"” in fragment
echo `grep '#' html-attrib-href | cut -d'#' -f2 | grep "$1\S*=" -c` \
href attributes containing "$1" followed by something containing “=” \
in fragment
===== snap =====
From the data set, it seems to me that U+007E TILDE would be a pretty
safe choice for separating element id and media fragment if processing
should be kept to a minimum (just splitting on the delimiter).
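To make "just splitting on the delimiter" concrete, here is a minimal
sketch in POSIX sh. The function name split_fragment and the output
format are made up for illustration; splitting at the first tilde
(rather than the last) is an assumption, not something any
specification prescribes.

```shell
#!/bin/sh
# split_fragment FRAGMENT (hypothetical helper): split at the first "~"
# into an element id and a media fragment. No other processing is done.
split_fragment() {
    fragment=$1
    delim='~'
    case $fragment in
        *"$delim"*)
            element_id=${fragment%%"$delim"*}     # text before the first "~"
            media_fragment=${fragment#*"$delim"}  # text after the first "~"
            ;;
        *)
            element_id=$fragment                  # no delimiter: plain element id
            media_fragment=
            ;;
    esac
    printf 'id=%s media=%s\n' "$element_id" "$media_fragment"
}

split_fragment 'video1~t=10'
split_fragment 'section-2'
```

An id that itself contains a tilde would be cut at that point, which is
why the low tilde counts in the data set matter for this variant.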
With more elaborate processing (only splitting on the delimiter if a
U+003D EQUALS SIGN appears after it), we might also use:
- U+0021 EXCLAMATION MARK
- U+0027 APOSTROPHE
- U+002A ASTERISK
- U+002C COMMA
- U+003B SEMICOLON
- U+0040 COMMERCIAL AT
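
The more elaborate rule could look roughly like the following sketch.
split_if_equals is a hypothetical helper; it assumes that the split
happens at the first delimiter and that any U+003D EQUALS SIGN later in
the fragment is enough to trigger it.

```shell
#!/bin/sh
# split_if_equals FRAGMENT DELIM (hypothetical helper): split at the first
# DELIM only when a "=" appears somewhere after it; otherwise the whole
# fragment is treated as an element id.
split_if_equals() {
    fragment=$1
    delim=$2
    case $fragment in
        *"$delim"*=*)
            element_id=${fragment%%"$delim"*}     # text before the first DELIM
            media_fragment=${fragment#*"$delim"}  # text after the first DELIM
            ;;
        *)
            element_id=$fragment                  # no DELIM followed by "="
            media_fragment=
            ;;
    esac
    printf 'id=%s media=%s\n' "$element_id" "$media_fragment"
}

split_if_equals 'intro;t=10' ';'
split_if_equals 'a;b' ';'
```

With this rule an id such as "a;b" stays intact, because no equals sign
follows the semicolon.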
I did check for characters U+0028 LEFT PARENTHESIS and U+0029 RIGHT
PARENTHESIS but did not include the results for aesthetic reasons. My
shell script also does weird things when given an argument of U+002D
HYPHEN-MINUS so that is missing as well.
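
The hyphen-minus trouble most likely comes from grep parsing a pattern
that starts with "-" as an option. Passing the pattern with -e avoids
that, and -F or a bracket expression keeps characters such as "$" or
"*" from being read as regular expression syntax. A lone "$" as a basic
regular expression matches the end of every line, which may explain the
very large DOLLAR SIGN counts below. The following is only a sketch:
count_delim and the sample data are made up, and it assumes a
single-character delimiter other than "^" or "]".

```shell
#!/bin/sh
# count_delim FILE CHAR (hypothetical helper): print the number of lines
# containing CHAR, then the number of lines where CHAR is followed by "="
# with no whitespace in between.
count_delim() {
    file=$1
    delim=$2
    # -F: fixed string, not a regex; -e: pattern may start with "-".
    # grep exits with status 1 when the count is 0, hence "|| true".
    plain=$(grep -c -F -e "$delim" "$file") || true
    # Inside a bracket expression "$", "*" and "-" are literal characters.
    with_eq=$(grep -c -e "[$delim][^[:space:]]*=" "$file") || true
    printf '%s %s\n' "$plain" "$with_eq"
}

# Made-up sample data, one attribute value per line.
sample=$(mktemp)
printf '%s\n' 'foo-bar' 'price$' 'a-b=c' 'plain' > "$sample"
count_delim "$sample" '-'
count_delim "$sample" '$'
rm -f "$sample"
```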
Any faults in my reasoning? Also, where do I get a bigger data set?
[Boring stuff follows]
Regarding U+0021 EXCLAMATION MARK:
4 id attributes containing “!”
0 id attributes containing “!” followed by something containing “=”
2232 href attributes containing “!” in fragment
630 href attributes containing ! followed by something containing “=”
in fragment
Regarding U+0024 DOLLAR SIGN:
558023 id attributes containing “$”
1 id attributes containing “$” followed by something containing “=”
89837 href attributes containing “$” in fragment
0 href attributes containing $ followed by something containing “=” in
fragment
Regarding U+0026 AMPERSAND:
78 id attributes containing “&”
56 id attributes containing “&” followed by something containing “=”
1362 href attributes containing “&” in fragment
1346 href attributes containing & followed by something containing “=”
in fragment
Regarding U+0027 APOSTROPHE:
23 id attributes containing “'”
0 id attributes containing “'” followed by something containing “=”
339 href attributes containing “'” in fragment
9 href attributes containing ' followed by something containing “=” in
fragment
Regarding U+002A ASTERISK:
19 id attributes containing “*”
0 id attributes containing “*” followed by something containing “=”
18 href attributes containing “*” in fragment
0 href attributes containing * followed by something containing “=” in
fragment
Regarding U+002B PLUS SIGN:
28 id attributes containing “+”
1 id attributes containing “+” followed by something containing “=”
93 href attributes containing “+” in fragment
19 href attributes containing + followed by something containing “=” in
fragment
Regarding U+002C COMMA:
130 id attributes containing “,”
0 id attributes containing “,” followed by something containing “=”
428 href attributes containing “,” in fragment
10 href attributes containing , followed by something containing “=” in
fragment
Regarding U+003B SEMICOLON:
88 id attributes containing “;”
0 id attributes containing “;” followed by something containing “=”
222 href attributes containing “;” in fragment
8 href attributes containing ; followed by something containing “=” in
fragment
Regarding U+0040 COMMERCIAL AT:
8 id attributes containing “@”
0 id attributes containing “@” followed by something containing “=”
208 href attributes containing “@” in fragment
15 href attributes containing @ followed by something containing “=” in
fragment
Regarding U+007E TILDE:
2 id attributes containing “~”
0 id attributes containing “~” followed by something containing “=”
1 href attributes containing “~” in fragment
0 href attributes containing ~ followed by something containing “=” in
fragment
--
Nils Dagsson Moskopp // erlehmann
<http://dieweltistgarnichtso.net>