[whatwg] Entity definitions in XHTML

Ian Hickson ian at hixie.ch
Thu Jan 17 15:31:34 PST 2013


On Thu, 17 Jan 2013, David Carlisle wrote:
> 
> http://www.w3.org/2003/entities/2007doc/xhtmlpubid.html
> 
> But basically it solves the problem that the existing list leads to a
> situation where data corruption and user confusion are both inevitable,
> as the only way to enable entities to be loaded into an XHTML agent is
> to reference a DTD that defines a different, incompatible set of
> entities.

This seems to be predicated on the assumption that the proposed new 
identifier would identify a different DTD than the existing identifiers.

That assumption is false. They would all identify the same DTD.


> > > The current list gives no way to specify the identifier of a 
> > > compatible set of entity definitions so makes it highly likely that 
> > > documents will be interpreted differently by an XHTML user agent and 
> > > a standard XML toolchain.
> > 
> > I do not understand what this means. Can you give an example?
> 
> Yes.  If, for example, you use &loang; in an XHTML user agent and
> specify one of the blessed DTD identifiers, the HTML entity set will be
> loaded and the entity will expand to U+27EC (MATHEMATICAL LEFT WHITE
> TORTOISE SHELL BRACKET), as intended. However, this character was added
> in Unicode 5.1, years after MathML2 and XHTML 1, specifically to support
> this use, so the definitions in the legacy DTDs are different.

There's only one DTD that XHTML UAs are supposed to have in their 
catalogues at this point.


> Currently you have to specify the XHTML 1 DTD or the MathML 2 DTD. If
> you use the former, then in any (normally configured) XML toolchain you
> will get the XHTML 1 DTD, the entity will not be defined, and the entire
> document is rejected with a fatal error. If you specify the latter, then
> the MathML2 DTD will be loaded and the entity will expand to the Asian
> punctuation character U+3018 (LEFT WHITE TORTOISE SHELL BRACKET).

&loang; is defined to map to U+027EC in the DTD that the identifiers in 
the spec map to. If your tool chain is still using the legacy DTDs, just 
update your tool chain.
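
To make the disagreement concrete, here is a minimal sketch of how the
same entity reference can expand differently, or fail outright, depending
on which definitions a generic XML parser sees. It assumes the entity
under discussion is &loang; (the name the HTML entity table uses for
U+27EC). The two internal subsets, and the names MODERN, LEGACY, expand
and try_expand, are hypothetical stand-ins for the two sets of
definitions, not the actual DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line internal subsets standing in for the two DTDs.
# MODERN mirrors the mapping described for the HTML entity set;
# LEGACY mirrors the mapping the thread attributes to the MathML 2 DTD.
MODERN = '<!DOCTYPE p [<!ENTITY loang "&#x27EC;">]>'
LEGACY = '<!DOCTYPE p [<!ENTITY loang "&#x3018;">]>'


def expand(doctype: str) -> str:
    """Parse a tiny document and return its text after entity expansion."""
    return ET.fromstring(doctype + "<p>&loang;</p>").text


def try_expand(doctype: str) -> str:
    """Like expand(), but report a parse failure instead of raising."""
    try:
        return f"U+{ord(expand(doctype)):04X}"
    except ET.ParseError as exc:
        return f"fatal: {exc}"


if __name__ == "__main__":
    print("modern definitions:", try_expand(MODERN))
    print("legacy definitions:", try_expand(LEGACY))
    # With no definition at all, the parser rejects the document.
    print("no definitions:    ", try_expand(""))
```

The same source text yields two different characters under the two sets
of definitions, and a fatal error when no definition is visible, which is
the divergence being argued about above.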


> > Fundamentally, I'd rather be removing these magic strings than adding 
> > more. If there's a compatibility need, then we should add it, but if 
> > the browsers don't already support the string, then there's no compat 
> > need that I can see.
> 
> It _used_ to be possible to reference a usable DTD. The MathML2 spec
> worked in Firefox (every version up to 3), Internet Explorer, and any
> other browser of the period that I was aware of. It was your first
> drafts of html(5) that introduced this bug by restricting the doctype 
> handling in a way that excluded any DTD that defined the correct set of 
> entities. Currently browsers have converged on that erroneous list.

The list in the spec was based on what browsers implemented.


> There is something very broken with the process if it is impossible to
> fix bugs in the spec once some implementations implement the broken spec
> text.

Welcome to the Web. Lots of things are broken due to this kind of thing... 
(pushState being my favourite example...)


> There is more to compatibility than compatibility between the browsers.
> For XHTML there needs to be compatibility between browsers and XML tools
> (otherwise why use XML at all? I know you would rather people didn't, but
> so long as the spec allows them to, it should not mandate a situation
> that makes document corruption so likely).

There is no such mandate. The spec merely provides a catalogue of public 
identifiers and their modern meaning. Nothing stops XML users from using 
any other identifier, in particular SYSTEM identifiers. The spec 
discourages people from using DTDs in general, because of precisely the 
kinds of issues that are being discussed here, but the XML spec allows it, 
and that's what controls this at the end of the day (especially in the 
case of software that isn't using the HTML spec's catalogue).

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


