[whatwg] Space characters

Ian Hickson ian at hixie.ch
Thu Jun 14 18:09:28 PDT 2007

On Mon, 6 Nov 2006, Henri Sivonen wrote:
> On Nov 6, 2006, at 07:34, Ian Hickson wrote:
> > On Sun, 5 Nov 2006, Henri Sivonen wrote:
> > > 
> > > Is there a reason why the definition of space characters does not 
> > > match the XML 1.0 and RELAX NG definition of white space (space, 
> > > tab, CR, LF) but also includes (line tabulation and form feed)? Is 
> > > the deviation from XML 1.0 needed for backwards compatibility with 
> > > text/html UAs?
> > 
> > I made the parser consider VT and FF as being whitespace based on, as 
> > I recall, a complete examination of every Unicode character's 
> > behaviour in the parsers I was testing. The definition of "space 
> > characters" matches the parser's behaviour for consistency.
> > 
> > The definition of "space characters" doesn't affect the XML parser 
> > stage as far as I can recall, only attribute parsing and DOM 
> > conformance.
> 
> The potential problem with it affecting DOM conformance is that it may 
> have ripple effects to running XML tooling inside a browser engine. 
> Gecko has an XPath implementation. Disruptive Innovations has created a 
> RELAX NG implementation for Gecko. Running the schemas from 
> syntax.whattf.org on a DOM inside Gecko would be interesting, since it 
> would allow checking DOM snapshots modified by scripts. There may be 
> other reasons to run XML machinery on an HTML DOM in a browser. Both 
> XPath and RELAX NG assume that white space-separated tokens follow the 
> XML notion of white space. Not being able to use the native XPath and 
> RELAX NG notions of splitting on white space would be seriously uncool. 
> Of course, a browser engine might get away with tampering with the XPath 
> or RELAX NG notions of white space since the additional characters don't 
> occur in XML. But does it make sense to inflict the cost of such 
> tweaking on the XML parts of browser engines?
> 
> Would there be serious compatibility problems if the HTML5 parsing 
> algorithm required VT and FF to be mapped to space (after expanding 
> NCRs) and the higher-level parts of the spec defined white space as 
> space, tab, CR and LF?

Well, I don't much care about VT, but I really think we should round-trip 
form feed. Consider, for instance, RFCs, which have form feeds. I don't 
like the idea of dropping them on the floor when you convert RFCs to HTML 
and back to text again.
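
To make the two notions of white space in this thread concrete, here is a 
minimal Python sketch. It assumes only the character sets named above: 
space, tab, CR and LF for XML 1.0 and RELAX NG, plus VT (U+000B) and FF 
(U+000C) for the broader set. The names XML_WS, HTML_WS, split_tokens and 
map_vt_ff_to_space are hypothetical and chosen for illustration; they do 
not come from any spec or implementation.

```python
# XML 1.0 / RELAX NG white space: space, tab, CR, LF.
XML_WS = " \t\r\n"

# The broader set discussed in this thread additionally includes
# line tabulation (VT, U+000B) and form feed (FF, U+000C).
HTML_WS = XML_WS + "\x0b\x0c"


def split_tokens(value, ws):
    """Split value on runs of the characters in ws, dropping empty tokens."""
    # Map every separator to U+0020, then split on that single character.
    # str.split() with no argument is avoided on purpose: it uses Python's
    # own, wider idea of white space.
    table = {ord(ch): " " for ch in ws}
    return [tok for tok in value.translate(table).split(" ") if tok]


def map_vt_ff_to_space(text):
    """Sketch of the proposed mapping: replace VT and FF with U+0020."""
    return text.translate({0x0B: " ", 0x0C: " "})


attr = "a\x0cb c"                        # "a", FF, "b", space, "c"
xml_tokens = split_tokens(attr, XML_WS)    # FF stays inside the first token
html_tokens = split_tokens(attr, HTML_WS)  # FF acts as a separator
mapped = map_vt_ff_to_space(attr)          # FF is gone after mapping
```

With the mapping applied first, both splitters agree on the tokens, which 
is the appeal of the proposal. The cost is that the form feed no longer 
exists in the mapped text, so it cannot be written back out.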

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
