[whatwg] External document subset support
Brett Zamir
brettz9 at yahoo.com
Thu Jun 18 20:53:04 PDT 2009
Ian Hickson wrote:
> On Mon, 18 May 2009, Brett Zamir wrote:
>
>> Section 10.1, "Writing XHTML documents" observes: "According to the XML
>> specification, XML processors are not guaranteed to process the external
>> DTD subset referenced in the DOCTYPE."
>>
>> While this is true, since no doubt the majority of web browsers are
>> already able to process external stylesheets or scripts, might the very
>> useful feature of external entity files, be employed by XHTML 5 as a
>> stricter subset of XML (similar to how XML Namespaces re-annexed the
>> colon character) in order to allow this useful feature to work for XHTML
>> (to have access to HTML entities or other useful entities for one, as
>> well as enable a poor man's localization, etc.)?
>>
>
> While there are arguments on both sides of whether this is a good idea or
> not, I think the more important concern in this case is whether we can
> extend XML in this way. I think in practice we should leave this up to the
> XML specs and their successors. I don't think it would be appropriate for
> us to profile the XML spec in this way.
>
>
While it is not my purpose to prolong the debate on external DTDs, I
want to raise the following points (which came to light during a recent
re-review of the spec) because they expose a few serious issues which I
believe current browsers are getting wrong. If the browsers do not
address these issues, claims of real XHTML 5 support (as with XHTML 1.*
and plain XML support) will be unworkable. While I agree that any
changes to XML itself should be left to the XML specs, from what I can
now tell, closer adherence to the existing spec would solve most of the
existing problems, provided the browsers make the required changes.
I was pleasantly surprised to find that the spec seems to recommend
solutions which I believe avoid the more serious problem of a single
point of failure.
(The other complaints about DTDs, such as the wish to avoid cross-domain
DTDs for security reasons or to prevent DoS attacks, could be handled as
an optional restriction if that, in combination with adherence to the
existing recommendations, would satisfy concerns, though I personally do
not think the risk is comparable to that of including cross-domain
scripts.)
So what follows is what I have gleaned from these various statements as
applied to current browsers. I can provide specific citations, but I did
not wish to expand this post unnecessarily (though I list references at
the end).
The major issues which I think certain browsers ought to resolve, since
their behavior does not seem to be in accord with the XML spec and as a
result creates interoperability problems, are:
1) Firefox and WebKit should not treat a missing entity as a single
point of failure, as they do now (unless they switch to a validating
parser which finds no declaration in the external file and the user is
in validation mode), since such failures in a document with an external
DTD are NOT well-formedness errors unless the document deliberately
declares standalone="yes".
2) Internet Explorer, which as of IE8 no longer seems to require that
the document be completely described by the DTD, as I believe it did
earlier (though it will report errors if the document violates rules
which are specified), should, per the spec, really only report
validation errors at user option (ideally, I would say, off by default,
and activatable on a case-by-case as well as preference-based basis).
Disabling the option might speed things up, and would let the browser
work with documents which violate validity constraints. But this issue
is not as serious as #1, since #1 prevents even valid documents from
being interoperably viewed on the web.
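
As a rough illustration of point #1, here is a small sketch using
Python's bundled expat bindings, which act as a non-validating
processor. This is only my choice for demonstration and says nothing
about what any browser uses internally. The names ents.dtd, myEnt, and
try_parse are made up for the example, and the external DTD is never
actually loaded. As far as I understand expat's behavior, the undeclared
entity is reported as "skipped" rather than treated as a fatal error,
unless the document declares standalone="yes".

```python
from xml.parsers import expat

# Hypothetical documents: "ents.dtd" is never fetched, and "myEnt" is
# not declared anywhere the parser can see.
WITH_EXTERNAL_DTD = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE doc SYSTEM "ents.dtd">'
    '<doc>before &myEnt; after</doc>'
)
STANDALONE = WITH_EXTERNAL_DTD.replace(
    '<?xml version="1.0"?>', '<?xml version="1.0" standalone="yes"?>'
)


def try_parse(document):
    """Return (parsed_without_error, skipped_entity_names)."""
    skipped = []
    parser = expat.ParserCreate()
    # Invoked when the processor recognizes an entity reference whose
    # declaration it has not read.
    parser.SkippedEntityHandler = lambda name, is_param: skipped.append(name)
    try:
        parser.Parse(document, True)
    except expat.ExpatError:
        return False, skipped
    return True, skipped


print(try_parse(WITH_EXTERNAL_DTD))
print(try_parse(STANDALONE))
```

The skipped-entity callback is the hook an application could use to
decide how to display the unresolved reference instead of aborting.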
If these issues are addressed by those aiming for compliance, the only
disadvantages which will remain (and which are inherent in XML by
allowing the co-existence of validating and non-validating parsers) are
those issues described in http://www.w3.org/TR/REC-xml/#safe-behavior
and http://www.w3.org/TR/REC-xml/#proc-types , namely that:
1) some (entity-related) /well-formedness/ errors (e.g., if an entity is
used but not defined) will go undetected by a non-validating parser,
since such a parser need not load the entity's replacement text (which
is not a big problem, since a document author should presumably have
checked (with an application which does external entity substitution)
that their entities integrate properly with the text--it is less
important that they check for /validation/ errors, since as mentioned
above, these need only be reported optionally).
2) The application may not be notified by its processor of, e.g., entity
replacement values, if it is a non-validating processor (though
non-validating processors can also make such replacements). But since
these omissions are, as mentioned above, not to produce well-formedness
errors, there is no single point of failure here either (some content
may be missing, but its absence is indicated by an entity reference in
the displayed output).
3) A few validation issues, such as duplicate declarations (which might
include attribute defaults) can lead to undefined behavior (though given
that validation is only optional even for validating applications, it
seems all applications will have to deal with this).
In other words, as the spec seems to indicate from my reading, users
going from one browser to the other will not face problems, unless:
1) They visit invalid documents and have the option to validate the
document turned on (it is only supposed to be an option) and expect
other browsers to report the same errors as well (not a big issue, since
a document which describes its validation constraints and then breaks
them is basically asking for trouble--and even here, the user is
supposed to have the option to view the document without validation).
2) They expect to see the entity replacement text (and at least, this is
not a single point of failure, and in many cases, such as when entities
are merely used to represent symbols, the text can be fully read without
any disruption in the document flow). Of course, doing the replacements
would be even better to avoid this problem, and the solution does not
require supporting validation.
There are also the following optional issues which browsers might wish
to consider (though if these are not implemented, the above fixes alone
would address the most serious problems):
1) Since even a non-validating processor is to inform the application
that it recognized but did not read an entity (if it does not replace
the entity's references with content found in an external DTD), a
browser like Opera (the only one which, as far as I can tell, does not
report such issues, even though it correctly avoids a single point of
failure) might (if not implementing #2 below) wish to consider doing so,
since a compliant processor is at least supposed to report such issues
to the application (to do with them as it sees fit). But there is
admittedly no obligation on the application to do so, and in any case,
such reporting is not to be a single point of failure. Still, it might
be nice to distinguish the display of entities which are not found from
deliberately escaped entities (e.g., the &myEnt; produced by a missing
entity currently appears the same in Opera (except in source view) as a
deliberately escaped &myEnt;).
2) Opera, Firefox, and WebKit (after the latter two fix the more serious
issue mentioned above) might also wish to consider expanding their XML
support for their users to:
a) Show a link to optionally expand each external parsed entity
reference or other entity (if they don't do the following)
b) Build on a non-validating parser to do automatic entity and
default attribute value replacement, and attribute value normalization
using an external DTD (at least same domain ones). The XML spec only
warns against relying on this for the sake of an application having the
freedom to switch between non-validating parsers which may or may not
all take these actions--this issue doesn't impact interoperability for
users (it only improves it), however, so even if there is no desire to
support validation, they can still offer entity replacement, etc. to
their users.
c) Implement a validating parser which can do entity and default
attribute value replacement, and attribute value normalization from an
external DTD, as well as optionally validate the document at user
discretion. This should not slow things down for the user, since the
spec itself indicates that reporting of validation errors is required
"at user option". This would give the user the best of both worlds--the
opportunity to fully read XML/XHTML files online (and without any
requirement to face a validation performance cost), and if they are, for
example, a document author, they could choose to take a client-side
performance hit to optionally check for validation. Of course, they'll
need to load the external files in either of these cases to be able to
do the replacements, but the document author will NOT need to provide
full DTD validation in the external DTD, so users will not be forced to
download DTDs reflecting the whole document structure (unless the
document author wishes to reference such files). Indeed, authors might
be encouraged not to include such content in their DTDs (performing
validation offline) so that they and their users can reduce bandwidth,
unless their purpose is to transparently show the validation (though DTD
validation is of course not very strong).
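
To make option 2b above a little more concrete, here is a minimal
sketch, again using Python's expat bindings purely for illustration, of
a non-validating parser which reads an external DTD only to pick up
entity declarations and then substitutes them, with no validation
involved. The DTD contents, the SAME_DOMAIN_DTDS table (standing in for
a same-domain fetch), and the function name are all assumptions of mine.

```python
from xml.parsers import expat

# Hypothetical stand-in for fetching same-domain DTD files. The DTD only
# declares an entity; it does not describe the document structure.
SAME_DOMAIN_DTDS = {"ents.dtd": '<!ENTITY myEnt "replacement text">'}

DOC = '<!DOCTYPE doc SYSTEM "ents.dtd"><doc>Hello &myEnt;!</doc>'


def expand_with_external_dtd(document):
    """Parse without validating, but expand entities from the external DTD."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to read the external subset unless standalone="yes".
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE
    )

    def load_external(context, base, system_id, public_id):
        # Only identifiers present in the table are loaded; anything
        # else raises KeyError, which aborts the parse.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(SAME_DOMAIN_DTDS[system_id], True)
        return 1

    parser.ExternalEntityRefHandler = load_external
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


print(expand_with_external_dtd(DOC))
```

Restricting the loader to a fixed table is one way to model the
"at least same domain ones" limitation; a real implementation would
need its own policy for which system identifiers it is willing to fetch.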
References:
http://www.w3.org/TR/REC-xml/#wf-entdeclared
http://www.w3.org/TR/REC-xml/#proc-types
http://www.w3.org/TR/REC-xml/#safe-behavior
http://www.w3.org/TR/REC-xml/#dt-vc (validity constraint definition)
http://www.w3.org/TR/REC-xml/#include-if-valid
regards,
Brett