[whatwg] Fwd: Entity parsing

Fri Jul 17 18:38:10 PDT 2009

On 5 Jun 2009, at 00:49, Ian Hickson wrote:
>
> Could you give an example of what you mean? I'm having trouble following
> your description.

> On Fri, 24 Apr 2009, Øistein E. Andersen wrote:
>>
>>
>> Let &IE4 (resp. &HTML4, &HTML5) be a non-semicolon-terminated named
>> character reference from the IE4 (resp. HTML4, HTML5) set,

&IE4 includes &eacute, &iuml
&HTML4 includes in addition &pi, &oelig and
&HTML5 includes in addition &SHcy, &rcaron.

>> and let .
>> (full stop) represent any character other than semicolon, and ^
>> (circumflex) any character which is (roughly) not an ASCII letter or
>> digit (i.e., [^a-zA-Z0-9]).  Not completely unreasonable sets of
>> character references to expand (outside of attribute values) include:
>>
>> 	1) &IE4^
               e.g., caf&eacute (café)
>>
>> 	2) &IE4.
               e.g., na&iumlve (naïve)
>>
>> 	3) &HTML4^
               e.g., 2&pi (2π)
>>
>> 	4) &IE4. &HTML4^
               e.g., na&iumlve (naïve), 2&pi (2π)
>>
>> 	5) &HTML4.
               e.g., hors d'&oeliguvre (hors d'œuvre)
>>
>> 	6) &IE4. &HTML5^
               e.g., na&iumlve (naïve), &SHcy(A/K) [Ш(A/K)]
>>
>> 	7) &HTML4. &HTML5^
               e.g., hors d'&oeliguvre (hors d'œuvre), &SHcy(A/K) [Ш(A/K)]
>>
>> 	8) &HTML5.
               e.g., Dvo&rcaron&aacutek (Dvořák)
>>
>> [...]
>> Currently, Opera follows 1),
      i.e., expands caf&eacute, but not na&iumlve or 2&pi
>> IE 2),
      i.e., expands caf&eacute and na&iumlve, but not 2&pi
>> and Safari and Firefox 3).
      i.e., expand caf&eacute and 2&pi, but not na&iumlve
>>
>>
>> My main concern is that &HTML4^ is actually legitimate in HTML4 and
>> works in both Safari and Firefox today, and that HTML5 should not change
>> the rendering of valid HTML4 pages unless there is a good reason to do
>> so.

Non-semicolon-terminated entities that were conforming in HTML4, like  
&pi and &mdash when they are not followed by a letter or digit  
(roughly speaking), are currently expanded in Safari and Firefox, and  
requiring this to change would be a regression affecting existing pages.
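
To make the difference between the policies concrete, here is a rough
Python sketch of the rules listed above. It is only an illustration
under stated assumptions, not code from any browser or from the HTML5
draft: the three tables are tiny hypothetical subsets of the real
entity sets, the names expand, POLICY_1, etc. are made up for this
example, end of input is assumed to count as an acceptable following
character, and semicolon-terminated references are deliberately left
untouched.

```python
import re

# Hypothetical, heavily truncated subsets of the three entity sets.
IE4 = {"eacute": "é", "iuml": "ï", "aacute": "á"}
HTML4 = {**IE4, "pi": "π", "oelig": "œ"}
HTML5 = {**HTML4, "SHcy": "Ш", "rcaron": "ř"}

# A policy is a list of (table, follow) pairs, where follow is
# "." (any character other than semicolon) or
# "^" (any character other than an ASCII letter or digit).
POLICY_1 = [(IE4, "^")]    # Opera, as described above
POLICY_2 = [(IE4, ".")]    # IE
POLICY_3 = [(HTML4, "^")]  # Safari and Firefox
POLICY_8 = [(HTML5, ".")]


def _follow_ok(nxt, follow):
    """Check the character after the name (empty string at end of input)."""
    if nxt == ";":
        return False  # semicolon-terminated references are out of scope
    if follow == ".":
        return True
    return re.fullmatch(r"[A-Za-z0-9]", nxt) is None


def _match(text, start, rules):
    """Return the longest (name, char) allowed by any rule, or None."""
    best = None
    for table, follow in rules:
        for name, char in table.items():
            end = start + len(name)
            if text.startswith(name, start) and _follow_ok(text[end:end + 1], follow):
                if best is None or len(name) > len(best[0]):
                    best = (name, char)
    return best


def expand(text, rules):
    """Expand non-semicolon-terminated references according to rules."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "&":
            hit = _match(text, i + 1, rules)
            if hit:
                out.append(hit[1])
                i += 1 + len(hit[0])  # skip "&" and the name
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

With these toy tables, expand("2&pi", POLICY_3) gives "2π", whereas
expand("2&pi", POLICY_2) leaves the input unchanged, which is exactly
the regression described above. Combined policies such as 4) can be
written as [(IE4, "."), (HTML4, "^")].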

> As far as I can tell HTML5 more or less matches what legacy pages  
> need,

You keep repeating this, and also that much work has been done to get  
entity parsing right and that you really do not want to change it.  It  
seems to me that you have tried to follow IE's behaviour closely,  
which is not completely unreasonable.  I have not seen evidence of any  
analysis of legacy pages supporting this decision, though; on the  
contrary, more or less anecdotal evidence sent to the mailing list(s)  
seems to suggest that certain modifications might make the algorithm  
work better for legacy pages. Replicating IE may well be good enough  
and seems like a reasonably safe option, but HTML5 does not completely  
follow IE in other areas, and I do not quite see why entity parsing  
should be treated differently.

-- 
Øistein E. Andersen