[whatwg] Entity parsing
Øistein E. Andersen
html5 at xn--istein-9xa.com
Sat Jun 23 15:34:40 PDT 2007
On 15 Jun 2007, at 2:5AM, Ian Hickson wrote:
> it's pragmatic (after all, why require the semicolon?), and is equivalent to
> not requiring quotes around attribute values.
This would be a good simile if the optionality were systematic,
but it currently applies to a highly erratic set of entities (cf. &yuml vs &Yuml).
(The semicolon has now been made required, which is probably the only
sane option unless the parsing rules change substantially.)
On 23 Jun 2007, at 7:12PM, Sam Ruby wrote:
> Before, "A &mdash B" == "A — B", now "A &mdash B" == "A &mdash B".
> Is that what we really want? Testing with Firefox, the old behavior is preferable.
Personally, I would prefer something along these lines:
I. All entities are created equal (the burden of carrying a semicolon shall be
equally distributed amongst all).
II. Abuse of the semicolon shall not be legally enforced (its omission shall be
conforming unless it separates the entity from a following [ASCII] letter or digit).
III. Entities living in attribute values are to be treated as first-class citizens (the
same rules shall apply to them).
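
To make rules I-III concrete, here is a minimal sketch in Python of how such
a parser could behave. It is only an illustration of the proposal, not the
algorithm in the HTML5 draft or in any browser; the function name
expand_entities and the four-entry ENTITIES table are hypothetical stand-ins
for a real implementation and the full entity list.

```python
import re

# Hypothetical, deliberately tiny entity table (assumption: a real parser
# would carry the complete list of named entities).
ENTITIES = {
    "amp": "&",
    "mdash": "\u2014",
    "yuml": "\u00ff",
    "Yuml": "\u0178",
}

# "&" followed by the longest run of ASCII letters/digits, then an optional ";".
# Because the name is matched greedily, a name without ";" can never be
# followed directly by another ASCII letter or digit (rule II).
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def expand_entities(text):
    """Expand known entities, with or without the trailing semicolon."""

    def replace(match):
        name = match.group(1)
        # Rule I: every known entity is treated the same way.
        # Unknown names are left untouched, including any ";".
        return ENTITIES.get(name, match.group(0))

    return _ENTITY.sub(replace, text)


# Rule III: the same function would be applied to text and attribute values.
print(expand_entities("A &mdash B"))   # semicolon omitted before a space
print(expand_entities("A &mdash; B"))  # semicolon present
print(expand_entities("&mdashfoo"))    # letters follow directly: not expanded
```

Under this sketch, "&mdashfoo" stays as it is because the name actually read
is "mdashfoo", which is not in the table; the semicolon is only needed where
the entity would otherwise run into a following letter or digit.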
We clearly should, to the extent possible, try to avoid bizarre quirks, and the
current rules for entity parsing are not exactly straightforward or intuitive.
HTML5 currently follows IE7 much more closely than Safari, Firefox and Opera do,
which seems to suggest that some of the quirks could be dispensed with.
At any rate, web pages containing "&" + entity name followed by [^A-Za-z0-9]
are probably more likely not to have been authored for IE and therefore to rely
on standard SGML behaviour, so it would probably be more backwards-compatible
to treat such occurrences as "&" + entity name + ";" (i.e., expand the entity).
Of course, conformance checkers would be more than welcome to signal that
a certain current browser is unable to handle "A &mdash B" as expected, but
this need not mean that all future browsers should be required not to handle
it "properly" (as per arguably [in the original sense] more sensible SGML rules).
Øistein E. Andersen