[whatwg] External document subset support

Brett Zamir brettz9 at yahoo.com
Sun May 24 22:35:57 PDT 2009


Henri Sivonen wrote:
> On May 18, 2009, at 11:50, Brett Zamir wrote:
>
>> Henri Sivonen wrote:
>>> On May 18, 2009, at 09:36, Brett Zamir wrote:
>> Also, as far as heavy server loads for frequent DTDs, entities could 
>> be deliberately not defined at a resolvable URL.
>
> There are existing XML doctypes out there with resolvable URIs, so 
> you'd need a blacklist to bootstrap such a solution.
>
As you suggest on your site, "If, for legacy reasons, you must process 
some well-known DTDs, please make your entity resolver retrieve those 
DTDs from a local catalog." I would think the big browsers would be 
fully capable of doing this (XML allows for it by distinguishing public 
and system identifiers), and for any DTDs which exploded in popularity 
before obtaining a public identifier, I would imagine a blacklist could 
work.
>> The same problems of denial-of-service could exist with stylesheet 
>> requests, script requests, etc.
>
> No, styles and scripts are commonly site-specific, so there isn't a 
> Web-wide single point of failure whose URI gets copied around as 
> boilerplate.
>
Well, again, as mentioned below, they can be of wider use, but I see 
your point that the effects on other sites would indeed most likely be 
stronger if the source site went down. I think that's a risk they 
should be free to take (just as when people choose to share or rely on 
external scripts), but if there's enough feeling against that, the issue 
could be addressed by requiring browsers to only access DTDs on the same 
domain.
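
A minimal sketch of how the options mentioned so far (a local catalog 
keyed on public identifiers, a blacklist, and a same-domain restriction) 
could combine into one decision. It is in Python only for brevity; the 
catalog entry, the blacklisted host, and the names resolve_dtd and 
origin are assumptions made up for illustration, not anything a browser 
actually implements. The same-domain check is written as a 
scheme/host/port comparison, which is only one possible reading of 
"same domain".

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical catalog: public identifiers mapped to locally bundled DTDs.
LOCAL_CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "catalog/xhtml1-strict.dtd",
}

# Hypothetical blacklist of hosts whose DTD URLs get copied as boilerplate.
BLACKLISTED_HOSTS = {"www.w3.org"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple used for the comparison."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def resolve_dtd(document_url, public_id, system_id):
    """Return ("local", path), ("fetch", url) or ("skip", reason)."""
    # 1. A known public identifier is served from the local catalog.
    if public_id in LOCAL_CATALOG:
        return ("local", LOCAL_CATALOG[public_id])
    if not system_id:
        return ("skip", "no system identifier")
    # 2. Resolve the system identifier relative to the document.
    dtd_url = urljoin(document_url, system_id)
    if urlsplit(dtd_url).hostname in BLACKLISTED_HOSTS:
        return ("skip", "blacklisted host")
    # 3. Only fetch when the DTD lives on the document's own origin.
    if origin(dtd_url) != origin(document_url):
        return ("skip", "cross-origin")
    return ("fetch", dtd_url)
```

Under these assumptions, a document at http://example.org/docs/a.xml 
that names only the system identifier entities/en.dtd would have it 
fetched from its own server, while a reference to a DTD on www.w3.org 
without a catalogued public identifier would be skipped.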
>> Even some sites, like Yahoo, have encouraged referring to their 
>> frequently accessed external files to take advantage of caching.
>
> At least the serving infrastructure for those URIs has been designed 
> for high load unlike the server for many existing DTD URIs out there. 
Again, I say either let them take the risk if they actually publish a 
DTD that is likely to become popular, allow a blacklist, or, if really 
need be, limit it to the same domain.
> Furthermore, JS libraries have obvious functionality in existing 
> browsers, so it's unlikely that authors would reference JS libraries 
> as part of boilerplate without actually intending to take the perf hit 
> of loading the library.
>
Presumably most XML users will be including doctypes which include a 
public identifier. Use of lesser-known XML dialects will probably 
presume some knowledge of what is happening, and even then, the official 
provider of the dialect will probably know not to provide their DTD 
directly as a referenceable DTD.
>> The spec could even insist on same-domain, though I don't see any 
>> need for that.
>
> Without same-origin (as in not even performing a CORS GET), you'd need 
> to blacklist at least w3.org due to existing references out there. 
Sounds fine, though I am assuming w3.org references already have a 
PUBLIC identifier for their DTDs.
> (Note that for security, same-origin/CORS is must-have anyway.)
>
A must-have if you don't trust the origin, yes. But plenty of sites 
include scripts from other sites for ads or analysis. It would not be 
such a big loss in the case of DTDs to restrict to same domain, however.
>> I also disagree with throwing our hands up in the air about character 
>> entities (or thinking that the (English-based) HTML ones are 
>> sufficient).
>
> That's a text input method issue that needs to be solved on the 
> authoring side for text input of all kind--not just text input for 
> writing XML in a text editor.
>
So, what's wrong with doing it in XML? If you're saying that text 
editors need to better support Unicode, then sure, but that's not a 
complete solution, given the cumbersomeness of finding obscure 
characters, etc. which can more simply be defined once in a DTD and 
forgotten. It's a nice feature for a text format which can be created 
across a variety of editors.
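
As a small illustration of "defined once and forgotten", here is a 
sketch using Python's standard xml.etree.ElementTree. The entity names 
aleph and schwa are made up for the example, and the declarations sit 
in the internal subset only so that the snippet is self-contained; the 
point under discussion is that the same declarations could live in an 
external DTD shared by many documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity names for characters that are awkward to type.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY aleph "&#x5D0;">
  <!ENTITY schwa "&#x259;">
]>
<p>&aleph; and &schwa;</p>"""

# The parser expands the declared entities while building the tree.
root = ET.fromstring(DOCUMENT)
text = root.text  # the two characters, with no entity references left
```

The author types the mnemonic names in any plain text editor, and the 
parsed document carries the actual Unicode characters.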
>> Moreover, the browser with the largest market share offers such 
>> support already, and those who depend on it may already view other 
>> browsers not supporting the standard as "broken".
>
> IE doesn't support XHTML or SVG which are the popular XML formats one 
> might want to load into a browsing context.
>
Again, if there is an offline use, there is a browsing use. Just because 
not everyone is rushing to use XML in this way does not mean that a lot 
of people would not like to share their XML in such a fashion, 
especially document-centric XML (and even data-centric XML).

Yes, a Firefox/Opera/Safari user who tries XHTML in IE will find it 
"broken", while a user of Firefox, etc. visiting an XML file dependent 
on an external DTD will find it broken. Firefox/Opera/Safari should be 
free to offer this positive feature to their users, even if IE doesn't 
come on board (to their eventual detriment I would think), while I would 
hope Firefox et al would implement this one feature on top of their 
already existing support for showing XML as a tree. As I said, IE is 
offering functionality which users of other browsers will think is 
broken in their browser. I think that is due to these browsers not 
having gone far enough, rather than IE having gone too far. Just because 
the spec technically makes it optional doesn't mean entity resolution, 
at least for same-domain DTDs identified only by a system identifier, 
shouldn't become the de facto standard given the features it offers.

>>> Loading same-origin DTDs for the purpose of localization is a 
>>> semi-defensible case, but it's a lot of complexity for a use case 
>>> that is way on the wrong side of 80/20 on the Web scale.
>> How so?
>
> Localized sites are a minority on the Web, and chances that localized 
> Web apps would switch to a client-side localization method that relies 
> on server-side negotiation of the localization and requires XML to 
> work seem dim.
>
Maybe, but it is also very easy to use. I would hope browsers (and the 
specs guiding their collective behavior) could consider the convenience 
for document authors. Firefox developers, for example, are well familiar 
with entity-based localization and some are eager to use it for remote 
XUL. I've seen an increasing number of .xhtml extension documents 
already out in the wild, despite a lack of support in IE, and despite 
such a change (without customizable external DTDs) offering arguably 
fewer benefits to the document creator than easy localization (though 
XHTML could benefit from such DTD localization as well).
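
To show how little the author has to do, here is a rough sketch of 
entity-based localization in Python. The locale tables, the entity 
names, and the localize helper are all hypothetical. The declarations 
are inlined into the internal subset so the snippet runs on its own, 
whereas in the scenario discussed here each locale's declarations would 
sit in a separate external DTD on the same server.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-locale entity declarations (one external DTD each
# in the scenario under discussion).
LOCALE_DTDS = {
    "en": '<!ENTITY page.title "Welcome">'
          '<!ENTITY page.body "Hello, world">',
    "fr": '<!ENTITY page.title "Bienvenue">'
          '<!ENTITY page.body "Bonjour, le monde">',
}

# Hypothetical document that refers to its strings only by entity name.
TEMPLATE = "<page><title>&page.title;</title><body>&page.body;</body></page>"


def localize(locale):
    """Parse the template with the chosen locale's entity declarations."""
    doc = "<!DOCTYPE page [" + LOCALE_DTDS[locale] + "]>" + TEMPLATE
    return ET.fromstring(doc)


french = localize("fr")
title = french.findtext("title")
```

The document itself never changes; only the set of entity declarations 
is swapped per locale.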
>> Even if it is a niche group which uses TEI, Docbook, etc. or who 
>> wants to be able to build say a browser extension which can take 
>> advantage of their rich semantics, this is still a use for citizens 
>> of the web.
>
> If you need a browser extension for content, you shut out users of 
> browsers that don't have the particular extension available. It's like 
> using Flash.
>
While I agree that having to use an extension would limit the usefulness 
(that's why I'm so passionate about seeing browsers implement it), I'm 
talking about extensions that build interesting optional interfaces to 
that content--for example to perform an XQuery on the content (I've made 
a Firefox extension which does this) or to give a simple interface 
allowing users to highlight content or search only within special 
semantic tags (e.g., <date/>, <said/>, <bibl/>, etc. tags in TEI). But I 
very much agree that browsers should all implement the basic 
infrastructure: 1) XML tree for non-formatted XML, 2) CSS rendering of 
pure XML, 3) External DTD support, 4) Recognition of dialects like XHTML 
within larger XML fragments, and they're already almost there.

Beyond this being about open technologies, it is also about being able 
to innovate. Even Flash can be supplanted over time by open standards, 
not to mention specialized languages with a much smaller audience. But 
using TEI isn't really going to break anything as long as you can at 
least load and view the document. Yes, there is a concern about the 
babelization of semantics, but that is only a concern for document 
authors, and again I don't think XHTML can or should fill all semantic 
markup needs.
>> If people can push forward with backwards-incompatible technologies 
>> like the video element, 3d-animation, or whatever, it seems not much 
>> to ask to support the humble external entity file... :)
>
> The upside of video and 3D is much more significant than the upside of 
> supporting external DTDs.
>
So animation is more important than Shakespeare? A lot of classical 
literature is richly encoded in XML languages like TEI. No doubt the 
readers of Shakespeare are fewer than the users of video and 3D, but I 
don't think that means they are less important, especially when the 
implementation must, I would imagine, be quite a bit easier as well.
>>> Besides, if the use case for DTDs is localization within an origin, 
>>> the server can perform the XML parse and reserialize into DTDless 
>>> XML. (That's how I've implemented this pattern in the past without 
>>> client-side support.)
>>>
>> That is assuming people are aware of scripting and have access to 
>> such resources.
>
> Localization with DTDs but without scripting is already tricky, since 
> one would need to tweak conneg. 
Sorry, I'm not sure what you mean here. JavaScript could support cases 
of dynamic localization if DOM methods like 
document.createEntityReference() were implemented along with external 
DTD support.
>
>> Wasn't it one of the aims of the likes of XSL, XQuery, and XForms to 
>> use a syntax which doesn't require knowledge of an unrelated 
>> scripting language (and those are pretty complex examples unlike 
>> entities)?
>
> Web browsers don't support XSL-FO, XQuery or XForms.
I for one hope they will. There seems to be a fair amount of interest in 
XForms at the very least. But my point is that it seems to be a W3C goal 
(and a good one) to make technologies which avoid a need for specialized 
scripting knowledge or services.
> (XSLT support isn't something that can be generalized to feature 
> triage policy applicable to new features today.)
>
Sorry, I don't follow.
>> (Btw, you and I discussed this before, though I didn't get a response 
>> from you to my last post: 
>> https://bugzilla.mozilla.org/show_bug.cgi?id=22942#c109 ; I don't 
>> mean to go off-topic but you might wish to consider or respond to 
>> some of its points as well...)
>
> Oh. I didn't make the connection. I didn't reply there, because using 
> Bugzilla as a discussion forum--particularly when the discussion turns 
> to advocacy--is frowned upon. 
I thought we were addressing rationales related to the legitimacy of 
implementing the bug, but all right.
> Are there some particular points that I haven't addressed here that 
> you'd like to re-raise?
>
I think we're mostly rehashing it anyways. :)

best wishes,
Brett

