[whatwg] Entity definitions in XHTML
David Carlisle
davidc at nag.co.uk
Thu Jan 17 13:02:55 PST 2013
On 17/01/2013 18:58, Ian Hickson wrote:
> On Thu, 17 Jan 2013, David Carlisle wrote:
>>
>> By adding
>>
>> "-//W3C//ENTITIES HTML MathML Set//EN//XML"
>>
>> To the list in
>>
>> 13.2 Parsing XHTML documents
>>
>> Of Identifiers that are recognised when parsing XHTML syntax documents.
>
> What problem does this solve?
We tried to spell out the various problems in the referenced document at
http://www.w3.org/2003/entities/2007doc/xhtmlpubid.html
but basically it solves the problem that the existing list makes data
corruption and user confusion inevitable, because the only way to get
entities loaded into an XHTML user agent is to reference a DTD that
defines a different, incompatible set of entities.
>
>
>> The current list gives no way to specify the identifier of a compatible
>> set of entity definitions so makes it highly likely that documents will
>> be interpreted differently by an XHTML user agent and a standard XML
>> toolchain.
>
> I do not understand what this means. Can you give an example?
Yes. If, for example, you use &loang; in an XHTML user agent and
specify one of the blessed DTD identifiers, the HTML entity set will be
loaded and the entity will expand to U+27EC (MATHEMATICAL LEFT WHITE
TORTOISE SHELL BRACKET) as intended. However, this character was added in
Unicode 5.1, years after MathML2 and XHTML 1, specifically to support this
entity, so the definitions in the legacy DTDs are different.
Currently you have to specify the XHTML 1 DTD or the MathML 2 DTD. If you
use the former, then any (normally configured) XML toolchain will load
the XHTML 1 DTD, the entity will not be defined, and the entire
document will be rejected with a fatal error. If you specify the latter, then
the MathML2 DTD will be loaded and the entity will expand to the Asian
punctuation character U+3018 (LEFT WHITE TORTOISE SHELL BRACKET).
The sole purpose of the requested change is to allow the document to
reference a set of entity definitions that matches the definitions that
will be used in the browser.
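
To make the three outcomes concrete, here is a minimal Python sketch. It
is an illustration only, under stated assumptions: `html.unescape` stands
in for the HTML entity set that a browser loads, a document with no
definition of the entity stands in for the XHTML 1 DTD case, and the
one-line internal subset is a hypothetical stand-in for the MathML 2
DTD's definition of loang, not the real DTD. The names `parse_text` and
`legacy_doc` are made up for the example.

```python
import html
import xml.etree.ElementTree as ET


def parse_text(doc):
    """Parse a small XML document and return the text of its root element."""
    return ET.fromstring(doc).text


# Case 1: the HTML entity set (stand-in for what an XHTML user agent loads).
browser_char = html.unescape("&loang;")

# Case 2: an XML parser that has no definition for the entity
# (stand-in for the XHTML 1 DTD case). The document is rejected outright.
try:
    parse_text("<p>&loang;</p>")
    undefined_is_fatal = False
except ET.ParseError:
    undefined_is_fatal = True

# Case 3: a legacy-style definition (hypothetical stand-in for the
# MathML 2 DTD, declared here in an internal subset for brevity).
legacy_doc = '<!DOCTYPE p [<!ENTITY loang "&#x3018;">]><p>&loang;</p>'
legacy_char = parse_text(legacy_doc)

print(f"HTML entity set:   U+{ord(browser_char):04X}")
print(f"undefined entity is a fatal error: {undefined_is_fatal}")
print(f"legacy definition: U+{ord(legacy_char):04X}")
```

The same source text therefore produces three different results depending
on which definitions the processor happens to load.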
>
>
> Fundamentally, I'd rather be removing these magic strings than adding
> more. If there's a compatibility need, then we should add it, but if the
> browsers don't already support the string, then there's no compat need
> that I can see.
It _used_ to be possible to reference a usable DTD. The MathML2 spec
worked in Firefox (every version up to 3), Internet Explorer, and any
other browser of the period that I was aware of. It was your first
drafts of html(5) that introduced this bug, by restricting the doctype
handling in a way that excluded any DTD that defined the correct set of
entities. Browsers have since converged on that erroneous list.
There is something very broken with the process if it is impossible to
fix bugs in the spec once some implementations have implemented the
broken spec text.
There is more to compatibility than compatibility between the browsers.
For XHTML there needs to be compatibility between browsers and XML tools
(otherwise why use XML at all? I know you would rather people didn't, but
so long as the spec allows them to, it should not mandate a situation
that makes document corruption so likely).
David