[whatwg] Entity definitions in XHTML

Thu Jan 17 16:03:12 PST 2013

On 17/01/2013 23:31, Ian Hickson wrote:
> On Thu, 17 Jan 2013, David Carlisle wrote:
>>
>> http://www.w3.org/2003/entities/2007doc/xhtmlpubid.html
>>
>> But basically it solves the problem that the existing list leads to
>> a situation where data corruption and user confusion are both
>> inevitable, as the only way to enable entities to be loaded into
>> an XHTML agent is to reference a DTD that defines a different,
>> incompatible set of entities.
>
> This seems to be predicated on the assumption that the proposed new
> identifier would identify a different DTD than the existing
> identifiers.

The proposed identifier _by definition_ identifies the list that is in
the HTML spec. Not surprising since you extract the list from the same
place.
>
> This is false. They would all identify the same DTD.
>

No, they don't. That is the trouble. Only the proposed one identifies
that list. The others are all pre-existing identifiers that identify
incompatible sets. It is fine in a browser context that you override
that and load the HTML5 set in all cases, but while you may control the
browser you can't control existing workflows that already use these
identifiers for the purposes for which they were defined: to identify
the XHTML and MathML2 DTDs.

Browsers do not validate, so they can effectively
use an implicit catalog that switches in the data URL with the HTML
entities, but since that contains no element definitions it would
completely break any XML tools that rely on validation.
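
As a rough illustration of the catalog point, the sketch below models
two catalogs as plain dictionaries. The Resource type, the catalog
names and the can_validate helper are hypothetical and exist only for
this example; the two public identifiers are those of the XHTML 1.0
Strict and MathML 2.0 DTDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for whatever a catalog entry resolves to."""
    name: str
    declares_elements: bool  # a validating parser needs element declarations


# Entities only, no element declarations (the data: URL case).
HTML_ENTITIES = Resource("HTML entity set", declares_elements=False)
XHTML1_DTD = Resource("XHTML 1.0 Strict DTD", declares_elements=True)
MATHML2_DTD = Resource("MathML 2.0 DTD", declares_elements=True)

XHTML1_ID = "-//W3C//DTD XHTML 1.0 Strict//EN"
MATHML2_ID = "-//W3C//DTD MathML 2.0//EN"

# A browser can map every listed identifier to the entity-only resource.
BROWSER_CATALOG = {XHTML1_ID: HTML_ENTITIES, MATHML2_ID: HTML_ENTITIES}

# An existing XML workflow maps each identifier to the DTD it was defined for.
TOOLCHAIN_CATALOG = {XHTML1_ID: XHTML1_DTD, MATHML2_ID: MATHML2_DTD}


def can_validate(catalog, public_id):
    """True if the resolved resource could be used by a validating parser."""
    return catalog[public_id].declares_elements
```

Under these assumptions, swapping BROWSER_CATALOG into a validating
workflow leaves nothing to validate against, which is why such a
workflow has to keep resolving the identifier to the original DTD.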

>>>> The current list gives no way to specify the identifier of a
>>>> compatible set of entity definitions so makes it highly likely
>>>> that documents will be interpreted differently by an XHTML
>>>> user agent and a standard XML toolchain.
>>>
>>> I do not understand what this means. Can you give an example?
>>
>> Yes.  If for example you use &loang; then in an XHTML User Agent, if
>> you specify one of the blessed DTD identifiers, the HTML entity set
>> will be loaded and the entity will expand to U+27EC (MATHEMATICAL
>> LEFT WHITE TORTOISE SHELL BRACKET) as intended. However, this
>> character was added in Unicode 5.1, years after MathML2 and XHTML 1,
>> specifically to support this entity, so the definitions in the
>> legacy DTDs are different.
>
> There's only one DTD that XHTML UAs are supposed to have in their
> catalogues at this point.

The only advantage of using XHTML as opposed to HTML syntax is that the
document is _not_ only parsed by an XHTML-specific UA but also passes
through some general XML toolchain. The current list seems
purpose-designed to break XML usage; it is also massively confusing for
any human looking at the file.

If you specify a DTD that defines the HTML entity set, no entities are
defined. If you specify a DTD which does not define them, they are all
defined. This is so obviously sub-optimal I honestly can't understand
how the bug can remain open for years after having been reported.

>
>
>> Currently you have to specify the XHTML 1 DTD or MathML 2 DTD. If
>> you use the former then in any (normally configured) XML toolchain
>> you will get the XHTML 1 DTD, the entity will not be defined, and
>> the entire document is rejected with a fatal error. If you specify
>> the latter then the MathML2 DTD will be loaded and the entity will
>> expand to the Asian punctuation character U+3018 (LEFT WHITE
>> TORTOISE SHELL BRACKET).
>
> &loang; is defined to map to U+027EC in the DTD that the identifiers
> in the spec map to. If your tool chain is still using the legacy
> DTDs, just update your tool chain.
>
>
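
The three outcomes described above can be reproduced with a small
sketch. It assumes the entity in question is named loang, and it uses
one-line internal subsets as hypothetical stand-ins for the real entity
sets (the actual DTDs are not loaded); the expand helper is likewise
made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-entity stand-ins for the three entity sets discussed.
HTML_SET = '<!ENTITY loang "&#x27EC;">'     # HTML entity list
MATHML2_SET = '<!ENTITY loang "&#x3018;">'  # legacy MathML 2 DTD
XHTML1_SET = ''                             # XHTML 1 DTD: not declared


def expand(entity_decls):
    """Parse the same document body against the given declarations."""
    doc = f'<!DOCTYPE p [{entity_decls}]><p>&loang;</p>'
    try:
        text = ET.fromstring(doc).text
    except ET.ParseError:
        # An undeclared entity is a fatal (well-formedness) error.
        return 'fatal error'
    return f'U+{ord(text):04X}'


for label, decls in [('HTML set', HTML_SET),
                     ('MathML 2', MATHML2_SET),
                     ('XHTML 1', XHTML1_SET)]:
    print(label, '->', expand(decls))
```

The document text is identical in all three runs; only the declarations
that the DOCTYPE resolves to differ, and with them the result.
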
>>> Fundamentally, I'd rather be removing these magic strings than
>>> adding more. If there's a compatibility need, then we should add
>>> it, but if the browsers don't already support the string, then
>>> there's no compat need that I can see.
>>
>> It _used_ to be possible to reference a usable DTD. The MathML2
>> spec worked in Firefox (every version up to 3) and Internet
>> Explorer and any other browser of the period that I was aware of.
>> It was your first drafts of html(5) that introduced this bug by
>> restricting the doctype handling in a way that excluded any DTD
>> that defined the correct set of entities. Currently browsers have
>> converged on that erroneous list.
>
> The list in the spec was based on what browsers implemented.

No. It is a subset of what Mozilla did but bears no relation to what IE
did, for example. But crucially Mozilla also looked at the SYSTEM Id
(that is, the URL), which allowed documents (e.g. the MathML2 spec) to
use a local DTD that defined an appropriate entity set as long as the
local DTD had "mathml" in its name. Special-casing magic URLs didn't
make it into the spec (which is probably a good thing), but that,
combined with the unfortunate list that doesn't include an identifier
for the current definitions, completely broke existing XHTML use that
was using the entities and gives no reasonable way to fix it (other
than not using entities at all).

(I've been advising people not to use entities in XHTML/MathML files
for 15 years, but you more than anyone ought to know that users don't
always follow advice, and the system should accommodate users with other
priorities.)

>
>
>> There is something very broken with the process if it is impossible
>> to fix bugs in the spec if some implementations implement the
>> broken spec text.
>
> Welcome to the Web. Lots of things are broken due to this kind of
> thing... (pushState being my favourite example...)

But in this case there is no request to remove an existing API. It is
just a request to add something that makes the web better and has
negligible ill effects.

>
>
>> There is more to compatibility than compatibility between the
>> browsers. For XHTML there needs to be compatibility between
>> browsers and XML tools (otherwise why use XML at all? I know you
>> would rather people didn't, but so long as the spec allows them to,
>> it should not mandate a situation that makes document corruption so
>> likely).
>
> There is no such mandate. The spec merely provides a catalogue of
> public identifiers and their modern meaning. Nothing stops XML users
> from using any other identifier, in particular SYSTEM identifiers.
> The spec discourages people from using DTDs in general, because of
> precisely the kinds of issues that are being discussed here, but the
> XML spec allows it, and that's what controls this at the end of the
> day (especially in the case of software that isn't using the HTML
> spec's catalogue).
>
As I note above, there are many existing systems using the public
identifiers of XHTML1 to refer to the XHTML1 DTD and using validating
parsers. They cannot simply switch in a catalog that makes their
existing document collections invalid. So they cannot make documents
using the XHTML1 public identifier load a DTD other than the XHTML1 DTD.

David