[whatwg] Space characters
Ian Hickson
ian at hixie.ch
Thu Jun 14 18:09:28 PDT 2007
On Mon, 6 Nov 2006, Henri Sivonen wrote:
> On Nov 6, 2006, at 07:34, Ian Hickson wrote:
> > On Sun, 5 Nov 2006, Henri Sivonen wrote:
> > >
> > > Is there a reason why the definition of space characters does not
> > > match the XML 1.0 and RELAX NG definition of white space (space,
> > > tab, CR, LF) but also includes (line tabulation and form feed)? Is
> > > the deviation from XML 1.0 needed for backwards compatibility with
> > > text/html UAs?
> >
> > I made the parser consider VT and FF as being whitespace based on, as
> > I recall, a complete examination of every Unicode character's
> > behaviour in the parsers I was testing. The definition of "space
> > characters" matches the parser's behaviour for consistency.
> >
> > The definition of "space characters" doesn't affect the XML parser
> > stage as far as I can recall, only attribute parsing and DOM
> > conformance.
>
> The potential problem with it affecting DOM conformance is that it may
> have ripple effects to running XML tooling inside a browser engine.
> Gecko has an XPath implementation. Disruptive Innovations has created a
> RELAX NG implementation for Gecko. Running the schemas from
> syntax.whattf.org on a DOM inside Gecko would be interesting, since it
> would allow checking DOM snapshots modified by scripts. There may be
> other reasons to run XML machinery on an HTML DOM in a browser. Both
> XPath and RELAX NG assume that white space-separated tokens follow the
> XML notion of white space. Not being able to use the native XPath and
> RELAX NG notions of splitting on white space would be seriously uncool.
> Of course, a browser engine might get away with tampering with the XPath
> or RELAX NG notions of white space since the additional characters don't
> occur in XML. But does it make sense to inflict the cost of such
> tweaking on the XML parts of browser engines?
>
> Would there be serious compatibility problems if the HTML5 parsing
> algorithm required VT and FF to be mapped to space (after expanding
> NCRs) and the higher-level parts of the spec defined white space as
> space, tab, CR and LF?

Well, I don't much care about VT, but I really think we should round-trip
form feed. Consider, for instance, RFCs, which have form feeds. I don't
like the idea of dropping them on the floor when you convert RFCs to HTML
and back to text again.
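
As a rough illustration of the trade-off discussed above, here is a minimal Python sketch. The names XML_WS, HTML_WS, split_tokens and map_vt_ff_to_space are hypothetical and not taken from any spec or implementation; the sketch only assumes the two character sets named in this thread (space, tab, CR, LF, with or without VT and FF).

```python
# Hypothetical illustration; none of these names come from a spec.
XML_WS = " \t\r\n"              # XML 1.0 / RELAX NG white space: space, tab, CR, LF
HTML_WS = XML_WS + "\x0b\x0c"   # the same set plus VT (U+000B) and FF (U+000C)


def split_tokens(value, ws):
    """Split value into tokens separated by runs of characters from ws."""
    tokens, current = [], []
    for ch in value:
        if ch in ws:
            # A separator ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def map_vt_ff_to_space(text):
    """Sketch of the proposed parser-level mapping of VT and FF to space."""
    return text.translate({0x0B: " ", 0x0C: " "})


attr = "a\x0cb c"
xml_view = split_tokens(attr, XML_WS)     # FF stays inside the first token
html_view = split_tokens(attr, HTML_WS)   # FF acts as a separator

# After the mapping, both definitions agree, but the form feed is gone.
mapped = map_vt_ff_to_space(attr)
```

With the two definitions side by side, the same attribute value yields different token lists, which is the mismatch XPath or RELAX NG code would see on an HTML DOM. Mapping FF to space in the parser removes the mismatch, at the cost that the form feed cannot be recovered when converting back to text.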
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'