[whatwg] Entity parsing

Øistein E. Andersen html5 at xn--istein-9xa.com
Mon Jun 25 17:50:39 PDT 2007


On 25 Jun 2007, at 8:28AM, Ian Hickson wrote:

> On Sun, 24 Jun 2007, Øistein E. Andersen wrote:
>
>> HTML5 currently follows IE7 much more closely than Safari,
>> Firefox and Opera do, which seems to suggest that some of the quirks
>> could be dispensed with.
>
> It's possible, though people kept pointing out problems, which is how we 
> ended up where we are now.

I have probably missed parts of this discussion, but most of the arguments
I have seen rely on the assumption that whatever IE does is more compatible
with the Web as it is. That is probably a good approximation, but
replicating every single detail is not necessarily the best thing to do.

> Calling SGML "sensible" is a slippery slope! :-)

Sure, I did not mean to imply that all aspects of SGML are sensible :-)

(Bad connotations aside, SGML’s rules for optional semicolons
happen to be less contrived than IE’s.)

>> [It might be a good idea to accept a missing semicolon at the end of words.]
>
> Well, we'd have to prove this somehow with real research.

Yes, research is really missing here.

Whatever we do, some pages will break, and it is not a priori impossible
that a compromise between the IE and SGML rules would be both less quirky
and more compatible with existing content.

I am unable to do a proper corpus study on this, but the following
examples suggest that following IE blindly may not be optimal.
All markup is extracted from real Web pages, and the author’s intent
was quite obvious from the context. The numbers in parentheses indicate
the number of pages found using Google.


I] Should be expanded

    1) only SGML expands
            &mdash
                IE (incorrect): &mdash
                SGML (correct): —

    2) only IE expands
            fianc&eacutee (390), caf&eacutes (1,460), na&iumlve (716)
                IE (correct): fiancée, cafés, naïve
                SGML (incorrect): fianc&eacutee, caf&eacutes, na&iumlve

    3) neither expands
            &oeliguvre (719), c&oeligur (3,720)
                both (incorrect): &oeliguvre, c&oeligur
                intended: œuvre, cœur

II] Should not be expanded

    1) IE expands
            moral&ethics, roses&thorns
                IE (incorrect): moralðics, rosesþs
                SGML (correct): moral&ethics, roses&thorns

    2) SGML expands
            Alpha&Omega, once&forall
                IE (correct): Alpha&Omega, once&forall
                SGML (incorrect): AlphaΩ, once∀

    3) both expand
            rose&thorn
                both (incorrect): roseþ
                intended: rose&thorn
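
The two behaviours in the examples above can be approximated by a pair of
simplified models. The Python sketch below is only an illustration inferred
from these examples, not a description of either implementation: the
function names, the tiny entity tables, the choice of ASCII letters and
digits as name characters, and the "longest Latin-1 prefix" rule for IE are
all assumptions.

```python
import re

# Tiny illustrative tables (assumption); the real entity sets are much larger.
LATIN1 = {"eacute": "\u00e9", "iuml": "\u00ef", "eth": "\u00f0", "thorn": "\u00fe"}
OTHER = {"mdash": "\u2014", "oelig": "\u0153", "Omega": "\u03a9", "forall": "\u2200"}
ALL = {**LATIN1, **OTHER}

# "&", then a maximal run of name characters, then an optional semicolon.
REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def expand_sgml_like(text):
    """Expand a reference only if the whole name run is a known entity."""
    def repl(m):
        name = m.group(1)
        return ALL[name] if name in ALL else m.group(0)
    return REF.sub(repl, text)


def expand_ie_like(text):
    """Expand any entity when ';' is present; otherwise expand only
    Latin-1 entities, even when more letters follow (prefix match)."""
    def repl(m):
        name, semi = m.groups()
        if semi and name in ALL:
            return ALL[name]
        # Try the longest Latin-1 entity name that prefixes the run.
        for end in range(len(name), 0, -1):
            if name[:end] in LATIN1:
                return LATIN1[name[:end]] + name[end:] + semi
        return m.group(0)
    return REF.sub(repl, text)
```

Under these assumptions, expand_ie_like("fianc&eacutee") gives "fiancée"
while expand_sgml_like leaves it untouched, and the reverse holds for
"Alpha&Omega".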


The examples I have found in category II] are all quite rare, but it is not unlikely
that more common ones exist.

Opera and Google both seem to err on the side of caution by expanding
entities only when both IE and SGML do, i.e., only in case II.3) above.
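
One way to read that cautious behaviour is as the intersection of the two
rule sets: without a semicolon, expand only when the whole run of name
characters is exactly a Latin-1 entity name. The Python sketch below is a
guess at such a rule, not confirmed Opera or Google behaviour; the function
name, the small tables and the definition of name characters are
assumptions.

```python
import re

# Tiny illustrative tables (assumption); the real entity sets are much larger.
LATIN1 = {"eacute": "\u00e9", "iuml": "\u00ef", "eth": "\u00f0", "thorn": "\u00fe"}
OTHER = {"mdash": "\u2014", "oelig": "\u0153", "Omega": "\u03a9", "forall": "\u2200"}
ALL = {**LATIN1, **OTHER}

# "&", then a maximal run of name characters, then an optional semicolon.
REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def expand_cautiously(text):
    """Expand only where a prefix-matching rule and a whole-name rule agree."""
    def repl(m):
        name, semi = m.groups()
        if semi and name in ALL:
            return ALL[name]        # properly terminated reference
        if not semi and name in LATIN1:
            return LATIN1[name]     # whole run is a Latin-1 entity name
        return m.group(0)           # anything else is left as written
    return REF.sub(repl, text)
```

With this rule, "rose&thorn" is still expanded, whereas "roses&thorns",
"Alpha&Omega" and "fianc&eacutee" are all left as written.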

It is also interesting to notice that reasonably common words belonging to class
I.2), which are handled by IE, are apparently no more frequent than words from I.3),
which no (popular) current browser handles correctly.

I am looking forward to seeing more extensive research on this.

-- 
Øistein E. Andersen



