[whatwg] Fwd: Entity parsing

Fri Apr 24 15:14:31 PDT 2009

On 23 May 2008, at 03:50, Ian Hickson wrote:

> On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
>>
>> 1) Is it useful to handle unterminated entities followed by an
>> alphanumerical character like IE does? [...]
>>
>> 2) HTML 4.01 allows the semicolon to be omitted in certain cases.
>> [...] Firefox and Safari both support this, and it would
>> seem meaningless to change the way conforming documents are parsed
>> [...]
>>
>> 3) Will new entities ever be needed? If yes, can new entities adopt
>> existing conformance criteria and parsing rules?
>>
>> [...]
>
> New entities have since been added, and the rules for parsing entities
> (sorry, "named character references") have been changed a bit.
> However, I am reluctant to change this from what we have now, since
> what we have now works well. How strongly do you feel about this?

I think I may have expressed my concern in rather too abstract terms  
previously.

The named character references currently present in HTML5 can be  
subdivided (roughly) into the following subsets:

	IE4 < HTML4 < HTML5

Approximately 100 named character references are included in the IE4  
set, 200 in the HTML4 set, and 2,000 in the HTML5 set.

When a named character reference is followed by a semicolon, it  
clearly has to be expanded, but how to handle non-semicolon-terminated  
character references is less obvious.

Let &IE4 (resp. &HTML4, &HTML5) be a non-semicolon-terminated named  
character reference from the IE4 (resp. HTML4, HTML5) set, and let .  
(full stop) represent any character other than semicolon, and ^  
(circumflex) any character which is (roughly) not an ASCII letter or  
digit (i.e., [^a-zA-Z0-9]).  Not completely unreasonable sets of  
character references to expand (outside of attribute values) include:

	1) &IE4^
	2) &IE4.
	3) &HTML4^
	4) &IE4. &HTML4^
	5) &HTML4.
	6) &IE4. &HTML5^
	7) &HTML4. &HTML5^
	8) &HTML5.

(The set of character references to be expanded in attribute values  
could be obtained by replacing . by ^ above.)
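
To make the eight alternatives concrete, here is a minimal sketch in
Python.  The three tiny tables are assumptions chosen for illustration
only (the real sets contain roughly 100, 200 and 2,000 names), and
expand(), POLICIES and in_attribute are hypothetical names, not taken
from any browser or from the spec.  The sketch also assumes that a
semicolon-terminated reference is always expanded and that end of input
counts as a non-alphanumeric character.

```python
import re

# Hypothetical sample tables; membership is assumed for illustration only.
IE4 = {"copy": "\u00a9", "eacute": "\u00e9"}
HTML4 = {**IE4, "ndash": "\u2013"}
HTML5 = {**HTML4, "check": "\u2713"}

# Each policy is a list of (name table, follower rule) pairs.
# "." = any character other than semicolon, "^" = not an ASCII letter or digit.
POLICIES = {
    1: [(IE4, "^")],
    2: [(IE4, ".")],
    3: [(HTML4, "^")],
    4: [(IE4, "."), (HTML4, "^")],
    5: [(HTML4, ".")],
    6: [(IE4, "."), (HTML5, "^")],
    7: [(HTML4, "."), (HTML5, "^")],
    8: [(HTML5, ".")],
}


def _longest_unterminated(run, rules):
    """Return the longest name the rules allow at the start of run, or ""."""
    best = ""
    for names, follower in rules:
        if follower == "^":
            # The name must be followed by a non-alphanumeric character,
            # so it has to be the whole alphanumeric run.
            candidates = [run] if run in names else []
        else:
            # Any prefix of the run that is a known name qualifies.
            candidates = [run[:k] for k in range(1, len(run) + 1)
                          if run[:k] in names]
        for name in candidates:
            if len(name) > len(best):
                best = name
    return best


def expand(text, policy, in_attribute=False):
    """Expand named character references in text under one of the policies."""
    rules = POLICIES[policy]
    if in_attribute:
        # In attribute values, "." is replaced by "^".
        rules = [(names, "^") for names, _ in rules]
    out = []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        match = re.match(r"[A-Za-z0-9]+", text[i + 1:])
        run = match.group(0) if match else ""
        end = i + 1 + len(run)
        if run in HTML5 and text[end:end + 1] == ";":
            # Semicolon-terminated references are expanded under every policy.
            out.append(HTML5[run])
            i = end + 1
            continue
        name = _longest_unterminated(run, rules)
        if name:
            out.append(HTML5[name])
            i += 1 + len(name)
        else:
            out.append("&")
            i += 1
    return "".join(out)
```

With these sample tables, expand("10&ndash20", 3) leaves the text
unchanged, whereas expand("10&ndash20", 5) expands the reference to an
en dash; expand("&copyright", 2) expands the `&copy' prefix, but
expand("&copyright", 2, in_attribute=True) does not.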

Currently, Opera follows 1), IE follows 2), and Safari and Firefox
follow 3).

My main concern is that &HTML4^ is actually legitimate in HTML4 and  
works in both Safari and Firefox today, and that HTML5 should not  
change the rendering of valid HTML4 pages unless there is a good  
reason to do so.

4) does not break any valid HTML4 pages, nor does it cause any
character references to be expanded that are not already expanded in
either IE or both Safari and Firefox, so it should be possible to
implement.

[Options 5), 6) and 8) can, to a greater or lesser extent, be  
specified more easily, but might be too controversial. There are pages  
relying on, e.g., `10&ndash20' to work, though, so handling character  
references in a more liberal way would actually have some benefits;  
only invalid mark-up would be affected in any case; and the negative  
effects are to a certain extent mitigated by the more conservative
treatment in attribute values.  That being said, I do of course
realise that it will be seen as safer not to expand too many character  
references as long as the actual impact remains difficult to quantify.]

-- 
Øistein E. Andersen