[whatwg] Entity parsing

Ian Hickson ian at hixie.ch
Mon Jun 25 00:28:42 PDT 2007


On Sun, 24 Jun 2007, Øistein E. Andersen wrote:
> 
> Personally, I would prefer something along these lines:
> 
> I. All entities are created equal (the burden of carrying a semicolon 
> shall be equally distributed amongst all).

For authors, this is now the case.

For implementations, we are pretty much constrained by what IE does.


> II. Abuse of the semicolon shall not be legally enforced (its omission 
> shall be conforming unless it separates the entity from a following 
> [ASCII] letter or digit).

Well, I had that allowed before, but people complained. :-) For some of 
the entities, though, we have to have a semicolon, for compatibility. So 
if you want consistency, it has to be required everywhere.
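
For illustration, the rule proposed in point II can be written as a 
one-line check. This is only a sketch of the quoted proposal, not of what 
the spec or any browser does; the function name and the convention of 
passing an empty string at end of input are assumptions made for the 
example.

```python
import string

# ASCII letters and digits: the characters that, under proposal II,
# must be separated from the entity name by a semicolon.
ASCII_ALNUM = set(string.ascii_letters + string.digits)


def semicolon_may_be_omitted(next_char: str) -> bool:
    """Return True if proposal II would let the semicolon be omitted.

    `next_char` is the single character that follows the entity name,
    or "" if the entity name ends the input (an assumed convention).
    """
    return next_char not in ASCII_ALNUM
```

Under this reading, "&mdash " (followed by a space) could drop the 
semicolon, while "&ampx" could not.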


> III. Entities living in attribute values are to be treated as 
> first-class citizens (the same rules shall apply to them).

Again, for authors this is done, but for compatibility reasons we're 
constrained in what we can say for implementations.


> We clearly should, to the extent possible, try to avoid bizarre quirks, 
> and the current rules for entity parsing are not exactly straightforward 
> or intuitive. HTML5 currently follows IE7 much more closely than Safari, 
> Firefox and Opera do, which seems to suggest that some of the quirks 
> could be dispensed with.

It's possible, though people kept pointing out problems, which is how we 
ended up where we are now.


> At any rate, web pages containing "&" + entity name followed by 
> [^A-Za-z0-9] are probably more likely not to have been authored for IE 
> and therefore relying on standard SGML behaviour, so it would probably 
> be more backwards-compatible to treat such occurrences as "&" + entity 
> name + ";" (i.e., expand the entity).

Well, we'd have to prove this somehow with real research.
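
To make the suggested behaviour concrete, here is a minimal sketch in 
Python. It expands "&" + entity name whenever the name is followed by 
something other than an ASCII letter or digit, with or without a 
semicolon. It illustrates the quoted suggestion only; it is not the 
HTML5 algorithm and not what IE does. The function name is made up for 
the example, and the entity table is Python's standard 
html.entities.name2codepoint (the HTML 4 names), used here as a 
stand-in.

```python
import re
from html.entities import name2codepoint

# "&", then a maximal run of ASCII letters/digits, then an optional ";".
# Because the run is maximal, whatever follows the name is by
# construction not an ASCII letter or digit (or is the end of input).
_REF = re.compile(r"&([A-Za-z0-9]+);?")


def expand_sgml_style(text: str) -> str:
    """Expand known entity names whether or not a semicolon follows."""

    def repl(match: re.Match) -> str:
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            # Unknown name (e.g. "ampfoo"): leave the text untouched.
            return match.group(0)
        return chr(codepoint)

    return _REF.sub(repl, text)
```

With this sketch, "A &mdash B" comes out with a real dash, while 
"&ampfoo" is left alone because "ampfoo" is not a known name. Whether 
that matches what existing pages expect is exactly the open question.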


> Of course, conformance checkers would be more than welcome to signal 
> that a certain current browser is unable to handle "A &mdash B" as 
> expected, but this need not mean that all future browsers should be 
> required not to handle it "properly" (as per arguably [in the original 
> sense] more sensible SGML rules).

Calling SGML "sensible" is a slippery slope! :-)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

