[whatwg] Citing multiple <blockquote> elements in HTML5

Tue Dec 2 22:08:35 PST 2008

On Wed, 3 Dec 2008, Calogero Alex Baldacchino wrote:
> 
> When you read "The value must not contain any space characters.", is it 
> an authoring rule for conforming documents, for you? Ok.

Right, statements that place requirements on what the values must be are 
authoring requirements.

> When you read "*If the value is not the empty string, user agents must 
> associate the element with the given value (exactly, including any space 
> characters)* for the purposes of ID matching within the subtree the 
> element finds itself (e.g. for selectors in CSS or for the 
> |getElementById()| method in the DOM).", is it a parsing rule for 
> conforming user agents, for you? Ok.

Right, rules that say how to handle values are rules for implementors.
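
To illustrate the split between the two kinds of rules, here is a minimal 
sketch (not spec text). The function names authoring_errors and 
id_for_matching are made up for this example: the first plays the role of 
a conformance checker applying the authoring rules, the second the role of 
a user agent, which keeps the value exactly as written.

```python
from typing import List, Optional

# HTML5 "space characters": space, tab, line feed, form feed, carriage return.
SPACE_CHARACTERS = " \t\n\f\r"


def authoring_errors(id_value: str) -> List[str]:
    """Conformance-checker view: report violations of the authoring rules."""
    errors = []
    if id_value == "":
        errors.append("id must contain at least one character")
    if any(ch in SPACE_CHARACTERS for ch in id_value):
        errors.append("id must not contain any space characters")
    return errors


def id_for_matching(id_value: str) -> Optional[str]:
    """User-agent view: keep the value exactly as given, unless it is empty."""
    return id_value if id_value != "" else None
```

With id="foo bar", the checker reports an error, while the user agent 
still uses "foo bar", space included, for ID matching.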

> But, isn't it worth to spend a word everywhere in the spec to tell when 
> it's a quirk for backward compatibility, which might go away in the 
> future, and when it's not, because that's not needed?

None of the implementation requirements in HTML5 will go away in the 
future. We will always have to define how implementations are to handle all 
inputs, today, tomorrow, and 100 years from now. Authors aren't going to 
stop writing invalid documents, unfortunately; and even if they did, the 
documents that exist today aren't going anywhere. (One of the goals of the 
HTML5 project is to document how someone in 2100 AD, or even 21000 AD, 
should handle Web pages of today, so that today's heritage isn't lost.)

> I mean, if you allow spacing characters inside an id value, as a parsing rule,
> you can face something like '<div id="foo bar" >', that is an id consisting of
> more than one token. Is it good to leave it in untouched? Yes? Ok, but what
> does it mean for CSS's, since there is a reference to them as one reason to
> allow space characters? That is, can a browser handle an id selector starting
> with the '#' character and being broken by a blank space?

Sure:

   #foo\ bar { ... }

...would match an element with id="foo bar".
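
For anyone generating such selectors from arbitrary ID values, the 
escaping can be sketched roughly as follows. This is a simplified 
illustration under my own assumptions, not a complete implementation of 
CSS's escaping rules (NUL, for instance, is not handled), and id_selector 
is a hypothetical helper name.

```python
import string

# Characters that may appear unescaped in a CSS identifier (ASCII subset).
SAFE = string.ascii_letters + string.digits + "-_"


def id_selector(id_value: str) -> str:
    """Build a CSS ID selector for a non-empty ID value (simplified sketch)."""
    if id_value == "":
        raise ValueError("the empty string cannot be used as an ID")
    if id_value == "-":
        # A lone hyphen is not a valid identifier on its own.
        return "#\\-"
    out = []
    for index, ch in enumerate(id_value):
        code = ord(ch)
        # A digit cannot start an identifier, even after a leading hyphen.
        starts_ident = index == 0 or (index == 1 and id_value[0] == "-")
        if code < 0x20 or code == 0x7F or (ch in string.digits and starts_ident):
            # Hexadecimal escape, terminated by a single space.
            out.append("\\%x " % code)
        elif code >= 0x80 or ch in SAFE:
            out.append(ch)
        else:
            # Any other character, including the space, gets a backslash.
            out.append("\\" + ch)
    return "#" + "".join(out)
```

Calling id_selector("foo bar") yields the selector shown above.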

> Or better, is it legal in CSS?

CSS doesn't care about the syntax of IDs; any string (except the empty 
string) can be used as an ID from CSS. However, CSS questions are out of 
scope for this mailing list, so I'll leave it at that.

> Now, let's say, instead, that a user agent, conforming with HTML 5 
> specifications, must cut off any token after the first one (I know 
> actually "foo bar" is taken as is), that is <div id="foo bar"> becomes 
> <div id="foo "> and <div id=" foo "> is valid too. In such a case, 
> skipping any spaces too, and stating the same behaviour for strings 
> passed to .getElementById() could be nice as a graceful degradation for 
> documents non-conforming with the rule "the value [of an id attribute] 
> must not contain any space characters", but such might fail with CSS 
> selectors such as 'div[id="foo bar"]'.

I don't follow you there. What problem are you trying to solve?

> Perhaps a compromise, if acceptable for backward compatibility, might be:
> - when the id value must be compared to a fragment identifier, strip any
> trailing space characters; if the match fails, escape any other space
> characters both in the id value and in the fragid and try again;

Why not just do what we do now, and treat the attribute as-is?

> - when an attribute is defined to hold an url and its value has spaces in its
> path/query/fragment, escape them before resolving the url (not sure if
> needed);

Again, aren't the current rules for handling URLs as defined in HTML5 
enough?

> - for the purpose of ID matching through the DOM 'getElementById' method,
> leave the id value untouched;
> - for the purpose of ID matching through CSS selectors accessing it as an
> attribute, leave the id value untouched;
> - for the purpose of ID matching through CSS selectors directly accessing it
> (e.g. '#foo') either choose the first sequence of non-spacing characters or
> let the match fail (I can't decide what's better, but perhaps the former would
> fail as well, since I guess anyone coding <div id="foo bar"> not only as a
> fragment identifier, but also for styling, might have the nice idea to write
> "#foo bar { font-weight : bold; }" as well).

These are out of scope for this working group, but if you think CSS or the 
DOM should change, then I recommend bringing up these issues with those 
groups.

> Anyway, if the id value is also a fragment identifier, which might have 
> space characters (since parsing rules prescribe to add such characters 
> to the unreserved production), does the (authoring) rule "the value must 
> not contain any space characters" make sense?

Sure, why wouldn't it make sense? If IDs have spaces in them, you can't 
refer to them from space-separated lists of IDs, so to avoid authoring 
problems, authors will want to be told when they accidentally use spaces.
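
The problem with space-separated lists can be seen in a small sketch. The 
helper names below are hypothetical; the only thing assumed from the spec 
is that such lists are split on space characters.

```python
import re
from typing import List, Set

# One or more HTML5 space characters: space, tab, LF, FF, CR.
SPACE_RUN = re.compile(r"[ \t\n\f\r]+")


def split_id_list(value: str) -> List[str]:
    """Split a space-separated list of IDs into its tokens."""
    return [token for token in SPACE_RUN.split(value) if token]


def resolve_id_list(value: str, ids_in_document: Set[str]) -> List[str]:
    """Return the tokens that actually name an element in the document."""
    return [token for token in split_id_list(value) if token in ids_in_document]
```

If a document contains elements with id="foo bar" and id="baz", then the 
list "foo bar baz" splits into the tokens "foo", "bar" and "baz", and only 
"baz" resolves. No list value can ever produce the token "foo bar".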

> Now let's come to the duplicated ids issue. Again, what's what? When 
> it's said, "The id attribute represents its element's unique identifier. 
> *The value must be unique in the subtree within which the element finds 
> itself and must contain at least one character.*", I think that's what 
> you call an authoring rule. So, I don't think it was so bad to ask for a 
> clarification on the subtree nature. And if a subtree happened to match, 
> eventually, an element subtree inside a document, was the suggestion for 
> a getElementById method on the HTMLElement interface so awful? 

What terminology would you prefer rather than "subtree"? (We can't say 
document, since we are also trying to define conformance rules for 
disconnected subtrees handled from scripts.)

> Otherwise, let's consider (again) the second paragraph:
> 
> "If the value is not the empty string, user agents must associate the element
> with the given value (exactly, including any space characters) *for the
> purposes of ID matching within the subtree the element finds itself (e.g. for
> selectors in CSS or for the |getElementById()| method in the DOM).*"
> 
> It's a parsing rule, isn't it? But it tells also the id must be unique 
> in the whole document for the purpose of ID matching through the 
> getElementById() method in the DOM, because the only object capable to 
> get an element by its id is an instance of the Document interface. So, 
> any choice should be taken on what to do with duplicated ids. Solving 
> the question at the parser level (i.e. defaulting any duplicated id to 
> the empty string) would be consistent with both the fragment identifier 
> behaviour (only the first occurrence is valid) and the uniqueness rule, 
> but might break some semantics (i.e. a hyperlink used to create an 
> instance of a <dfn>, or a <blockquote> with a cite attribute referencing 
> a <cite> element, both with a duplicated id not being the first 
> occurrence). On the other hand, leaving the duplicated id in the 
> document requires some changes in the Document's getElementById() 
> method, since the W3C DOM Core does not define a unique behaviour in 
> such a case, and I've expressed a few doubts on solving this by adding an 
> equivalent method on the HTMLDocument interface; anyway the 
> getElementById() behaviour must be defined for such situations, and 
> having it to pick the first match may be a solution (but might cause 
> side/unwanted effects if misused in actual documents, and leaves no 
> chance to access directly to any element with a duplicated id, but if 
> I'm not careful when choosing an ID, I can complain just with myself... 
> - anyway, the uniqueness fulfillment might become problematic when 
> dynamically putting together pieces of code, perhaps from different 
> sources, e.g. using XMLHTTPRequests, or because of externally syndicated 
> content, but this is in the scope of careful programming).

The getElementById() method will be defined more precisely than the vague 
wording in the DOM specs. I believe Simon Pieters is working on that.

> From the point of view of CSS, both choices may be consistent with 
> coupled rules such as "#foo { font-size : 13; }" and "#foo { font-size : 
> 14; }", since both would refer to the same element because of cascading 
> rules; on the other side, something like 'div[id="foo"] {/*something 
> here*/}' or a direct reference to an ID selector as a descendant of 
> different elements might perhaps isolate different elements in the 
> document (whether to allow such or not is outside html scope - but are 
> such cases in the wild?), and for the purpose of compatibility with 
> document styled that way, leaving duplicated ids in the document would 
> be a better choice. But, in such cases, shouldn't the DOM elements 
> selection be consistent with the CSS elements selection (i.e. to avoid 
> side-effects when CSS rules manipulate the DOM itself)? That is, if 
> through CSS it were possible to reach elements with duplicated ids in 
> different subtrees of a document tree (according to the definition of 
> all nodes descendant of a non-leaf node as being part of its subtree) 
> and to manipulate their content, shouldn't it be possible through the 
> DOM too?

CSS doesn't search for a single match for IDs; it just looks for whether 
an element matches the selector or not. So it doesn't care if there are 
duplicates. But anyway, CSS is out of scope for this mailing list.
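
Purely as an illustration of that per-element test (a sketch with made-up 
names, not how any particular engine is written): the selector is checked 
against each element on its own, so every element carrying the ID 
matches, duplicates included.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    """A stand-in for a DOM element; only the fields this sketch needs."""
    tag: str
    id: str


def matches_id_selector(element: Element, wanted_id: str) -> bool:
    """Per-element test: does this one element match '#wanted_id'?"""
    return element.id != "" and element.id == wanted_id


def select_all(elements: List[Element], wanted_id: str) -> List[Element]:
    """Apply the per-element test to every element, in document order."""
    return [e for e in elements if matches_id_selector(e, wanted_id)]
```

Given a div and a span that both have id="foo", select_all returns both, 
because nothing in the test depends on the other elements in the tree.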

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'