[whatwg] Entity parsing

Wed Jun 27 16:24:07 PDT 2007

On 26 Jun 2007, at 4:35AM, Ian Hickson wrote:

> The informal research I did when updating the spec suggests that the 
> current state of the spec is what is better.

(It is difficult to say anything sensible without knowing either the nature
of the research undertaken or the options under consideration.)

> I don't really know how to do more research
> -- it's quite hard to programatically tell when an entity 
> should be expanded and when it shouldn't.

True, but the problem is not insurmountable; or rather, useful information
can be extracted without necessarily making these decisions explicitly.

I do not know what you have done already, but something like the following
for each entity &ref; would be useful for the discussion:
    — total number of "&ref";
    — number of "&ref;";
    — number of "&ref" followed by /[a-zA-Z0-9]/;
    — the N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9&]+/.
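
The four counts above could be gathered with something like the following
Python sketch. The function name entity_stats, its parameters and the
assumption that the corpus is available as a single string are illustrative
only; nothing like this was part of the research discussed here.

```python
import re
from collections import Counter


def entity_stats(text, ref, n=5):
    """Collect the four statistics proposed above for one entity name.

    `text` is the corpus as a single string and `ref` is the entity name
    without '&' or ';' (e.g. "eacute"). Both are hypothetical inputs.
    """
    name = re.escape(ref)
    # Total number of "&ref", whatever follows it.
    total = len(re.findall("&" + name, text))
    # Number of properly terminated "&ref;".
    terminated = len(re.findall("&" + name + ";", text))
    # Number of "&ref" immediately followed by a letter or digit.
    followed = len(re.findall("&" + name + "(?=[a-zA-Z0-9])", text))
    # Most frequent contexts matching /[a-zA-Z0-9]*&ref[a-zA-Z0-9&]+/.
    contexts = Counter(
        m.group(0)
        for m in re.finditer("[a-zA-Z0-9]*&" + name + "[a-zA-Z0-9&]+", text)
    )
    return {
        "total": total,
        "terminated": terminated,
        "followed_by_alnum": followed,
        "top_contexts": contexts.most_common(n),
    }


sample = "R&eacutesum&eacute and r&eacute;sum&eacute; plus R&eacutesum&eacute"
print(entity_stats(sample, "eacute"))
```

No decision about whether a given entity "should" be expanded is made here;
the counts only describe what occurs in the corpus.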

Without any real data, it is not really persuasive to argue, e.g., that
conforming HTML 4.01 documents that are currently handled correctly by Firefox
and Safari must be handled differently in the future for the sake of backwards
compatibility.

The only argument for following IE that I have been able to find in the archives
is the following in a post from Simon Pieters on 14th Aug 2006 in the thread
“Parsing Entities”:

> I guess that for compat with IE and the Web[1] we have to treat
> "R&eacutesum&eacute" as if it were "R&eacute;sum&eacute;". [...]
> [1] http://www.google.com/search?q=R%26eacutesum%C3%A9

The implication seems to be that R&eacutesum&eacute can be found on the Web
and therefore should be supported. But Google also tells us something else:

    (1) "r&eacutesumé": 572
    (2) +résumé: 114,000,000
    (3) r&eacute;sum&eacute; -"r&eacute;sum&eacute;s": 16,300
    (4) +"rÃ©sumÃ©": 1,000

Actually, (1) covers not only r&eacutesum&eacute but also code like
r&amp;eacutesumé, so the number of occurrences that parser quirks could
rescue is lower than 572.

As could be expected, (1) is quite rare compared to (2), all the correctly
encoded variants. Whether 0.0005% should be regarded as significant
(supposing that résumé is representative) may be a contentious issue, but
it is interesting to note that other errors — unwanted conversion of & to &amp;
in (3) and a typical encoding problem in (4) — are actually significantly
more common, and these cannot be corrected at all.
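
For reference, the proportions mentioned above can be checked with a few
lines of Python. The hit counts are the Google estimates quoted in this
message; the variable and function names are only illustrative.

```python
# Google hit counts as quoted above (estimates, not exact figures).
unterminated = 572          # (1) "r&eacutesumé"
correct = 114_000_000       # (2) +résumé
double_escaped = 16_300     # (3) literal r&eacute;sum&eacute;
mojibake = 1_000            # (4) +"rÃ©sumÃ©"


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of (1) relative to the correctly encoded variants in (2).
print(f"(1) vs (2): {percent(unterminated, correct):.4f}%")   # 0.0005%
# How many times more common (3) and (4) are than (1).
print(f"(3) vs (1): {double_escaped / unterminated:.1f}x")    # 28.5x
print(f"(4) vs (1): {mojibake / unterminated:.1f}x")          # 1.7x
```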

-- 
Øistein E. Andersen