[whatwg] Citing multiple <blockquote> elements in HTML5

Wed Dec 3 13:01:43 PST 2008

Ian Hickson wrote:
> On Wed, 3 Dec 2008, Calogero Alex Baldacchino wrote:
>   
>> But isn't it worth spending a word everywhere in the spec to say when 
>> something is a quirk for backward compatibility, which might go away in 
>> the future, and when it's not, because that's not needed?
>>     
>
> None of the implementation requirements in HTML5 will go away in the 
> future. We will always have to define how implementations are to handle all 
> inputs, today, tomorrow, and 100 years from now. Authors aren't going to 
> stop writing invalid documents, unfortunately; and even if they did, the 
> documents that exist today aren't going anywhere. (One of the goals of the 
> HTML5 project is to document how someone in 2100 AD, or even 21000 AD, 
> should handle Web pages of today, so that today's heritage isn't lost.)
>
>
>   

Ok, agreed. Given the nature of the web (and of web authors' 
practices), a strict conformance requirement (as there might be for a 
C compiler) will never be a good idea.

>> I mean, if you allow space characters inside an id value, as a parsing rule,
>> you can face something like '<div id="foo bar" >', that is, an id consisting
>> of more than one token. Is it good to leave it untouched? Yes? Ok, but what
>> does that mean for CSS, since CSS is cited as one reason to allow space
>> characters? That is, can a browser handle an id selector that starts with
>> the '#' character and is broken by a blank space?
>>     
>
> Sure:
>
>    #foo\ bar { ... }
>
> ...would match an element with id="foo bar".
>
>
>   

Right, now I remember... sorry for the confusion.

>> Now, let's say instead that a user agent conforming to the HTML 5 
>> specification must cut off any token after the first one (I know that 
>> "foo bar" is currently taken as is), that is, <div id="foo bar"> becomes 
>> <div id="foo "> and <div id=" foo "> is valid too. In such a case, 
>> skipping any spaces too, and specifying the same behaviour for strings 
>> passed to .getElementById(), could be a nice graceful degradation for 
>> documents that don't conform to the rule "the value [of an id attribute] 
>> must not contain any space characters", but it might fail with CSS 
>> selectors such as 'div[id="foo bar"]'.
>>     
>
> I don't follow you there. What problem are you trying to solve?
>
>   

Just trying to explain why I suggested that behaviour (stripping space 
characters) in my first message on the subject. I was wrongly ignoring 
the case of id="foo bar" and only considering id=" foo ", but I was not 
confusing authoring and parsing rules (even if I admit I sometimes have 
strict conformance in mind). If the latter were the only "naughty boy" 
out there, stripping spaces might have made some sense (though it would 
not be the best choice without touching other things that may be out of 
scope).
>   
>> Perhaps a compromise, if acceptable for backward compatibility, might be:
>> - when the id value must be compared to a fragment identifier, strip any
>> trailing space characters; if the match fails, escape any other space
>> characters both in the id value and in the fragid and try again;
>>     
>
> Why not just do what we do now, and treat the attribute as-is?
>
>
>   
>> - when an attribute is defined to hold an url and its value has spaces in its
>> path/query/fragment, escape them before resolving the url (not sure if
>> needed);
>>     
>
> Again, aren't the current rules for handling URLs as defined in HTML5 
> enough?
>
>   
>   

Maybe the first is wrong, and I'm still unsure about the second. My 
concern is that a character-by-character comparison between an id value 
and a fragment identifier may fail in several ways.

What about href="#foo bar " and id="foo bar "? The current rules would 
strip the trailing space only from the href, so the match would fail 
(but we might survive broken links). Escaping both and then comparing 
would succeed, as would escaping and then unescaping the href value 
before comparing. (Should it be pointed out somewhere that a fragment 
identifier must be unescaped before it is compared to an id or a name? 
Is it there and I've missed it? Having space characters in the 
unreserved production means they don't need to be escaped, but does it 
also mean they must be decoded from their pct-encoded form after parsing 
and for resolving?) Likewise, stripping the trailing spaces in both 
cases would succeed, but would fail when comparing id="foo bar " with 
href="#foo bar%20" (which is a valid URL according to the current 
parsing rules), even with the escaping rules (in this case the trailing 
space of the id value must stay there).

And what about id="foo%20bar" in http://foo.example.org/foo.html and 
href="#foo bar" on the same page, or on a page having the same base URL, 
or a base element with href="http://foo.example.org/foo.html"?

My point is that, since comparisons for matching purposes happen after 
URL parsing and resolution, and the id value is not involved in those 
steps, character-by-character comparisons may fail without a prior 
normalization of both the fragment identifier and the id value (or of 
one of them). However, if the above is already solved by the parsing and 
resolving rules and I've misunderstood the spec, I withdraw all of it 
and apologize. Or, perhaps, must a valid URL with a valid fragment that 
is equivalent to, but not exactly matching, an id value be considered a 
broken link?
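
To make the comparison problem concrete, here is a minimal sketch in 
Python. It assumes the fragment is percent-decoded before being compared 
with the id value, which is only my reading and not something I can 
confirm the spec requires; fragment_matches_id is a hypothetical helper, 
not an API taken from any browser or specification.

```python
from urllib.parse import unquote


def fragment_matches_id(fragment: str, id_value: str, decode: bool = True) -> bool:
    """Compare a URL fragment (without the leading '#') with an id value.

    Assumption: when `decode` is true, percent-encoded triplets in the
    fragment are decoded first; the id value is always taken as-is.
    """
    candidate = unquote(fragment) if decode else fragment
    return candidate == id_value


# A raw character-by-character comparison fails for an escaped fragment...
print(fragment_matches_id("foo%20bar", "foo bar", decode=False))  # False
# ...while decoding the fragment first makes the two match.
print(fragment_matches_id("foo%20bar", "foo bar"))  # True
# A trailing space kept in the id but missing from the fragment still fails.
print(fragment_matches_id("foo%20bar", "foo bar "))  # False
# An id that literally contains "%20" stops matching once the fragment is decoded.
print(fragment_matches_id("foo%20bar", "foo%20bar"))  # False
```

Whichever side is normalized, the last two cases show that some pair of 
"equivalent" values ends up not matching, which is the point of the 
question above.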

>
>   
>> Anyway, if the id value is also a fragment identifier, which might have 
>> space characters (since parsing rules prescribe to add such characters 
>> to the unreserved production), does the (authoring) rule "the value must 
>> not contain any space characters" make sense?
>>     
>
> Sure, why wouldn't it make sense? If IDs have spaces in them, you can't 
> refer to them from space-separated lists of IDs, so to avoid authoring 
> problems, authors will want to be told when they accidentally use spaces.
>
>
>   

I'll try to make that point a bit clearer, since the reference to URL 
parsing rules was wrong; the question is a different one.
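
(For reference, the point about space-separated lists of IDs can be 
sketched as follows. This is only an illustration in Python: 
ids_referenced and can_reference are hypothetical helpers, and 
str.split() splits on any whitespace, which is a superset of the HTML5 
space characters.)

```python
def ids_referenced(id_list: str) -> list[str]:
    """Split a space-separated list of IDs into its tokens."""
    return id_list.split()


def can_reference(id_list: str, id_value: str) -> bool:
    """Return True if `id_value` appears as one token of `id_list`."""
    return id_value in ids_referenced(id_list)


print(can_reference("intro foo bar", "foo"))      # True
print(can_reference("intro foo bar", "foo bar"))  # False: split into "foo" and "bar"
```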

It comes from the double nature of the id attribute as both an ID 
and a fragment identifier. According to RFC 3986, unless I have 
misunderstood something there, after a URI is divided into its 
components, pct-encoded triplets may be safely decoded (and should be, 
to interpret each component correctly). Thus "%20foo%20bar%20" and 
" foo bar " are equivalent, and both are valid as dereferenced 
<fragment-identifier> components (while only the former is conforming as 
part of a complete URI, since in RFC 3986 spaces are not 'unreserved'). 
The latter, however, is a non-conforming ID according to the rule "an id 
value must not contain any space characters", which is something of a 
restriction on fragment-identifier conformance. As long as conforming 
user agents leave the value as is, that's not a concern; formally, 
though, is it something to be solved or pointed out somehow in the spec? 
When a validator or an authoring tool finds something like,

<!-- The following section is a review of Los Angeles inside an article 
about California
        - just to create a context for the example -->
<section id="Los Angeles" >
...
</section>

should it only report the id value as an error, or must it also say that 
the value is a valid fragment identifier, in case the author is using 
the id as an anchor?
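
As a rough sketch of the two answers such a tool could give, assuming 
the HTML5 "space characters" are space, tab, LF, FF and CR (check_id is 
a hypothetical helper, not part of any real validator):

```python
from urllib.parse import quote

# Assumption: the HTML5 "space characters" are space, tab, LF, FF and CR.
SPACE_CHARACTERS = " \t\n\f\r"


def check_id(id_value: str) -> dict:
    """Report whether an id breaks the no-space authoring rule, and show
    the percent-encoded fragment that would refer to it in a URI."""
    has_space = any(ch in SPACE_CHARACTERS for ch in id_value)
    return {
        "has_space": has_space,
        "fragment": "#" + quote(id_value, safe=""),
    }


print(check_id("Los Angeles"))
# {'has_space': True, 'fragment': '#Los%20Angeles'}
```

The id fails the authoring rule, yet a conforming fragment that refers 
to it can still be written.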

>
> What terminology would you prefer rather than "subtree"? (We can't say 
> document, since we are also trying to define conformance rules for 
> disconnected subtrees handled from scripts.)
>
>
>   

Hmm, it may depend on what kinds of manipulation you have in mind, on 
whether a disconnected subtree must in any case be a whole document to 
fulfil the uniqueness rule, and perhaps also on what the subtree concept 
might turn into in future DOM Core versions. So maybe a clarification of 
what a subtree is, with respect to both the document (as a tree) and the 
ways scripts can handle it, would be enough to 'scope' the id 
visibility, instead of looking for new terminology.

I mean, if ID matching is relevant for scripts accessing the matching 
element through the getElementById() method, then currently the concept 
of subtree always coincides with a document tree, and a disconnected 
subtree must be a document without a browsing context. Otherwise, if 
other DOM manipulations are involved, the concept of subtree may change. 
For instance, a script might implement its own scanning routine, 
treating the id attribute like any other attribute, which leads to the 
idea that any non-leaf node may be the root of a subtree (that is, a 
subtree is identified with any possible document fragment). Furthermore, 
a possible future version of the DOM Core interfaces might move the 
getElementById method to the Node interface, leading to the same result.

Thus a generic definition of 'subtree' (or no definition, or a 
definition relying on a specific DOM feature or on script handling) 
might result in a variable concept, with a variable scope for ID 
uniqueness. That might make sense in a working draft, at least until a 
first definition of the Web DOM Core specification, or until some reason 
arises to restrict or enlarge the concept.

Otherwise, if it has been settled with broad consensus that a subtree is 
always a document tree, the term might be changed to the expression "a 
document, with or without a browsing context", or (equivalently) be 
defined as "a document subtree having a node of type document as its 
root" (to cover the case of dynamically created documents). Or, if a 
subtree can be either a whole document or a document subtree detached 
from its owner document (i.e. a node removed from a document together 
with its descendants, or a tree of nodes whose ownerDocument property is 
undefined or null), it might be defined just as such, leaving the term 
'subtree' wherever it is now (but would such a manipulation be 
consistent with the authoring uniqueness rule when the subtree is 
inserted into an actual document?).
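
The last question can be illustrated with a small sketch in Python. 
Node here is a hypothetical stand-in for a DOM element, and 
duplicate_ids is an illustrative helper; neither reflects an actual DOM 
interface.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: an optional id plus children."""
    id: Optional[str] = None
    children: list["Node"] = field(default_factory=list)


def ids_in_subtree(root: Node):
    """Yield every id found in the subtree rooted at `root` (depth-first)."""
    if root.id is not None:
        yield root.id
    for child in root.children:
        yield from ids_in_subtree(child)


def duplicate_ids(root: Node) -> set[str]:
    """Return the ids that occur more than once below `root`."""
    counts = Counter(ids_in_subtree(root))
    return {name for name, n in counts.items() if n > 1}


document = Node(children=[Node(id="a"), Node(id="b")])
detached = Node(children=[Node(id="b"), Node(id="c")])

# Each tree, taken on its own, satisfies the uniqueness rule.
print(duplicate_ids(document))  # set()
print(duplicate_ids(detached))  # set()

# Inserting the detached subtree into the document breaks uniqueness.
document.children.append(detached)
print(duplicate_ids(document))  # {'b'}
```

Two subtrees that each satisfy the rule separately can violate it once 
merged, so the scope chosen for 'subtree' decides when the rule is 
checked.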
> The getElementById() method will be defined more precisely than the vague 
> wording in the DOM specs. I believe Simon Pieters is working on that.
>
>
>   

I acknowledge this.

> CSS doesn't search for a single match for IDs, it just looks for whether 
> an element matches the selector or not. So it doesn't care if there are 
> duplicates. But anyway, CSS is out of scope for this mailing list.
>
>   

I agree. I just wondered whether it might be a concern for consistent 
manipulation through both the DOM and CSS, but I can't think of a 
concrete example where such a concern would arise without being a side 
effect of bad programming, which is out of scope for both CSS and the 
DOM. I also acknowledge that this might be in the scope of Web DOM Core, 
since it has been established that it is out of scope for the 
HTML-specific DOM (which doesn't define any basic element properties and 
access methods, only HTML-specific ones; I find this consistent with the 
choice to define some stand-alone interfaces instead of always 
inheriting from the basic counterparts).
