[whatwg] Entity definitions in XHTML

David Carlisle davidc at nag.co.uk
Thu Jan 17 13:02:55 PST 2013


On 17/01/2013 18:58, Ian Hickson wrote:
> On Thu, 17 Jan 2013, David Carlisle wrote:
>>
>> By adding
>>
>> "-//W3C//ENTITIES HTML MathML Set//EN//XML"
>>
>> To the list in
>>
>> 13.2 Parsing XHTML documents
>>
>> Of Identifiers that are recognised when parsing XHTML syntax documents.
>
> What problem does this solve?

We tried to spell out various problems in the referenced document at

http://www.w3.org/2003/entities/2007doc/xhtmlpubid.html

But basically it solves the problem that the existing list leads to a
situation where data corruption and user confusion are both inevitable,
as the only way to enable entities to be loaded into an XHTML agent is
to reference a DTD that defines a different, incompatible set of entities.
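
A minimal sketch of the lookup under discussion, assuming a user agent
simply tests the document's public identifier against a fixed list. The
function name is hypothetical, and the list is deliberately partial: it
holds only two well-known identifiers for the DTDs named in this thread,
not the full list in the spec.

```python
# Partial, illustrative list (assumption): the real list in the spec
# section on parsing XHTML documents is longer.
RECOGNISED_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD MathML 2.0//EN",
})

# The identifier this message asks to have added.
PROPOSED_ID = "-//W3C//ENTITIES HTML MathML Set//EN//XML"


def loads_html_entities(public_id, recognised=RECOGNISED_PUBLIC_IDS):
    """Return True if a (hypothetical) XHTML agent would load the HTML
    entity set for a document declaring this public identifier."""
    return public_id in recognised


# Today: only legacy DTD identifiers switch the entity set on.
print(loads_html_entities("-//W3C//DTD MathML 2.0//EN"))
print(loads_html_entities(PROPOSED_ID))

# With the requested addition, a document can name a matching entity set.
print(loads_html_entities(PROPOSED_ID, RECOGNISED_PUBLIC_IDS | {PROPOSED_ID}))
```

The point of the sketch is that the only keys that unlock the entities
are names of DTDs whose own definitions differ from what gets loaded.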


>
>
>> The current list gives no way to specify the identifier of a compatible
>> set of entity definitions so makes it highly likely that documents will
>> be interpreted differently by an XHTML user agent and a standard XML
>> toolchain.
>
> I do not understand what this means. Can you give an example?

Yes. If, for example, you use &loang; then in an XHTML user agent, if
you specify one of the blessed DTD identifiers, the HTML entity set will
be loaded and the entity will expand to U+27EC (MATHEMATICAL LEFT WHITE
TORTOISE SHELL BRACKET) as intended. However, this character was added
in Unicode 5.1, years after MathML2 and XHTML 1, specifically to support
this entity, so the definitions in the legacy DTDs are different.

Currently you have to specify the XHTML 1 DTD or the MathML 2 DTD. If
you use the former, then in any (normally configured) XML toolchain you
will get the XHTML 1 DTD, the entity will not be defined, and the entire
document will be rejected with a fatal error. If you specify the latter,
then the MathML2 DTD will be loaded and the entity will expand to the
Asian punctuation character U+3018 (LEFT WHITE TORTOISE SHELL BRACKET).
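
The three outcomes described above can be reproduced in a small sketch.
It rests on stated assumptions: Python's html module stands in for the
entity set a browser loads, and internal DTD subsets stand in for the
external DTDs (the standard-library parser does not fetch external
DTDs). The legacy value is the one quoted in this message.

```python
import html
import xml.etree.ElementTree as ET

# Browser side: html.unescape uses the HTML named character reference
# table, standing in for the entity set an XHTML user agent loads.
browser_char = html.unescape("&loang;")

# Legacy side: an internal subset carrying the older definition quoted
# above (U+3018), standing in for the MathML 2 DTD.
LEGACY_DOC = '<!DOCTYPE p [<!ENTITY loang "&#x3018;">]><p>&loang;</p>'
legacy_char = ET.fromstring(LEGACY_DOC).text

# No definition at all, standing in for a DTD that lacks the entity.
UNDEFINED_DOC = "<p>&loang;</p>"


def is_rejected(doc):
    """Return True if the XML parser refuses the document outright."""
    try:
        ET.fromstring(doc)
    except ET.ParseError:
        return True
    return False


print(f"browser entity set: U+{ord(browser_char):04X}")
print(f"legacy definition:  U+{ord(legacy_char):04X}")
print("undefined entity rejected:", is_rejected(UNDEFINED_DOC))
```

The same source text therefore yields one character in the browser, a
different character with the legacy definition, and a fatal error where
the entity is not declared.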

The sole purpose of the requested change is to allow the document to
reference a set of entity definitions that matches the definitions that
will be used in the browser.

>
>
> Fundamentally, I'd rather be removing these magic strings than adding
> more. If there's a compatibility need, then we should add it, but if the
> browsers don't already support the string, then there's no compat need
> that I can see.

It _used_ to be possible to reference a usable DTD. The one from the
MathML2 spec worked in Firefox (every version up to 3), Internet
Explorer, and any other browser of the period that I was aware of. It
was your first drafts of html(5) that introduced this bug by restricting
the doctype handling in a way that excluded any DTD that defined the
correct set of entities. Browsers have since converged on that erroneous
list.

There is something very broken with the process if it is impossible to
fix bugs in the spec once some implementations implement the broken spec
text.


There is more to compatibility than compatibility between the browsers.
For XHTML there needs to be compatibility between browsers and XML tools
(otherwise why use XML at all? I know you would rather people didn't,
but so long as the spec allows them to, it should not mandate a
situation that makes document corruption so likely).

David





