[whatwg] Space characters
Henri Sivonen
hsivonen at iki.fi
Mon Nov 6 02:37:01 PST 2006
On Nov 6, 2006, at 07:34, Ian Hickson wrote:
> On Sun, 5 Nov 2006, Henri Sivonen wrote:
>>
>> Is there a reason why the definition of space characters does not match
>> the XML 1.0 and RELAX NG definition of white space (space, tab, CR, LF)
>> but also includes (line tabulation and form feed)? Is the deviation from
>> XML 1.0 needed for backwards compatibility with text/html UAs?
>
> I made the parser consider VT and FF as being whitespace based on, as I
> recall, a complete examination of every Unicode character's behaviour in
> the parsers I was testing. The definition of "space characters" matches
> the parser's behaviour for consistency.
>
> The definition of "space characters" doesn't affect the XML parser stage
> as far as I can recall, only attribute parsing and DOM conformance.

The potential problem with it affecting DOM conformance is that it
may have ripple effects on running XML tooling inside a browser
engine. Gecko has an XPath implementation. Disruptive Innovations has
created a RELAX NG implementation for Gecko. Running the schemas from
syntax.whattf.org on a DOM inside Gecko would be interesting, since
it would allow checking DOM snapshots modified by scripts. There may
be other reasons to run XML machinery on an HTML DOM in a browser.

Both XPath and RELAX NG assume that white space-separated tokens
follow the XML notion of white space. Not being able to use the
native XPath and RELAX NG notions of splitting on white space would
be seriously uncool. Of course, a browser engine might get away with
tampering with the XPath or RELAX NG notions of white space since the
additional characters don't occur in XML. But does it make sense to
inflict the cost of such tweaking on the XML parts of browser engines?
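
To make the mismatch concrete, here is a minimal Python sketch. It is not
code from any browser engine or from the spec; the names XML_WHITE_SPACE,
DRAFT_SPACE_CHARACTERS and split_tokens are made up for illustration. It
splits the same attribute value once with the XML set (space, tab, CR, LF)
and once with the set that also includes VT and FF.

```python
# Hypothetical names, for illustration only.
XML_WHITE_SPACE = " \t\r\n"                               # space, tab, CR, LF
DRAFT_SPACE_CHARACTERS = XML_WHITE_SPACE + "\x0b\x0c"     # plus VT and FF


def split_tokens(value, space_chars):
    """Split value into tokens separated by runs of characters in space_chars."""
    tokens = []
    current = []
    for ch in value:
        if ch in space_chars:
            # A separator ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# A class-like attribute value with a form feed between "foo" and "bar".
value = "foo\x0cbar baz"

print(split_tokens(value, XML_WHITE_SPACE))         # ['foo\x0cbar', 'baz']
print(split_tokens(value, DRAFT_SPACE_CHARACTERS))  # ['foo', 'bar', 'baz']
```

With the XML set, the form feed stays inside the first token, so tooling
that splits the XML way sees two tokens where the draft definition sees
three.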

Would there be serious compatibility problems if the HTML5 parsing
algorithm required VT and FF to be mapped to space (after expanding
NCRs) and the higher-level parts of the spec defined white space as
space, tab, CR and LF?
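
A rough Python sketch of what I mean follows. It is a toy under stated
assumptions, not the HTML5 parsing algorithm: it only handles numeric
character references of the form &#NNN; and &#xHHH;, and the names
expand_ncrs and normalize_attribute_value are hypothetical. The point is
only the ordering: expand NCRs first, then map VT and FF to space, so
that later stages only ever see space, tab, CR and LF as white space.

```python
import re

# Simplified pattern for numeric character references (assumption: a
# terminating semicolon is required; real-world parsing is more lenient).
NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Translation table mapping VT (U+000B) and FF (U+000C) to space.
VT_FF_TO_SPACE = str.maketrans({"\x0b": " ", "\x0c": " "})


def expand_ncrs(text):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        if match.group(1) is not None:
            code_point = int(match.group(1), 16)   # hexadecimal form
        else:
            code_point = int(match.group(2))       # decimal form
        if code_point > 0x10FFFF:
            return match.group(0)                  # out of range: leave as-is
        return chr(code_point)
    return NCR.sub(replace, text)


def normalize_attribute_value(raw):
    """Expand NCRs first, then map VT and FF to space."""
    return expand_ncrs(raw).translate(VT_FF_TO_SPACE)


# A literal VT plus FF and VT written as character references.
print(normalize_attribute_value("a&#x0C;b\x0bc&#11;d"))  # a b c d
```

After this step the value contains no VT or FF, however they were
written in the source, so the DOM only needs the XML notion of white
space.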
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/