[whatwg] Valid Unicode
hsivonen at iki.fi
Sat Dec 2 15:42:11 PST 2006
On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> It would not be wise for HTML5 to limit itself to the more constrained
> character set of XML. In particular, the form feed character is
> pretty popular,
> This is yet another case where "take HTML5, read it into a DOM, and
> serialize it as XML, and voilà: you have valid XHTML" doesn't work.
What I am advocating is making sure that *conforming* HTML5 documents
can be serialized as XHTML5 without data loss. This is important in
order to be able to promise that an "XML tool chain" can be used for
processing *conforming* HTML5 by sticking an HTML5 parser in front of
the processing pipeline (for *non-browser* use cases like data
mining, content management or conformance checking, where scripts
aren't executed and CSS rendering isn't performed). The motivation is to
make processing HTML5 in non-browser apps less expensive without
giving an incentive for the solutions to violate the spec ad hoc on
For example, an "XML tool chain" is important enough for my
conformance checking service that if at this point the assumption of
*conforming* HTML5 being convertible to XHTML5 were broken in corner
cases, I'd probably come up with ad hoc trickery for masking it
instead of throwing away the tool chain. I'd prefer not having to do
that and not having to explain to everyone else who finds an "XML
tool chain" to be of value what tricks I needed to pull off to fake it.
I am not suggesting that HTML5 browsers halt and catch fire upon
finding a form feed. And it is obvious that lossless conversion of
all possible non-conforming HTML5 documents to XML is impossible
anyway, so making that a goal would not be worthwhile.
But what legitimate and popular use would a form feed have in HTML5?
Why can't we call it non-conforming? Are there use cases other than
converting .txt RFCs to HTML with regexps without bothering to get
rid of the form feeds?
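
To make the form feed problem concrete, here is a minimal sketch of the
check an "XML tool chain" front end would need. It assumes the XML 1.0
"Char" production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]). The function name xml10_forbidden is made up for
illustration and is not part of any spec or library.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML10_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml10_forbidden(text):
    """Return (offset, code point) pairs for characters XML 1.0 cannot carry.

    Hypothetical helper for illustration only.
    """
    return [
        (m.start(), "U+%04X" % ord(m.group()))
        for m in _NOT_XML10_CHAR.finditer(text)
    ]


if __name__ == "__main__":
    # A form feed (U+000C) between two "pages" of text.
    print(xml10_forbidden("page one\fpage two"))  # [(8, 'U+000C')]
```

Any text for which this returns a non-empty list cannot be written out as
well-formed XML 1.0 without dropping or replacing characters, which is
exactly the data loss at issue.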
hsivonen at iki.fi