[whatwg] Valid Unicode

Sam Ruby rubys at intertwingly.net
Sat Dec 2 17:47:15 PST 2006

On 12/2/06, Henri Sivonen <hsivonen at iki.fi> wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> > It would not be wise for HTML5 to limit itself to the more constrained
> > character set of XML.  In particular, the form feed character is
> > pretty popular,

BTW, I copied and pasted the wrong table.  The characters I mentioned
are merely discouraged (and include such things as Microsoft smart
quotes mislabeled as iso-8859-1).  The actual allowed set in XML 1.0 is
as follows:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

For XML 1.1 the list is as follows:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
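
As a rough sketch (Python; the function names are made up for
illustration and come from neither spec), the two productions above
translate directly into range checks on code points.  It shows that
form feed (#xC) falls outside the XML 1.0 set but inside the XML 1.1
set:

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp is in the XML 1.0 set quoted above."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if code point cp is in the XML 1.1 set quoted above."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


FORM_FEED = 0x0C  # the character mentioned earlier in the thread

if __name__ == "__main__":
    print("XML 1.0 allows form feed:", is_xml10_char(FORM_FEED))
    print("XML 1.1 allows form feed:", is_xml11_char(FORM_FEED))
```

Neither set admits #x0, the surrogate range, or #xFFFE and #xFFFF.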

> > This is yet another case where "take HTML5, read it into a DOM, and
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
> What I am advocating is making sure that *conforming* HTML5 documents
> can be serialized as XHTML5 without dataloss.

Then you will also need to disallow newlines in attribute values.
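
The reason, as I read it, is XML attribute-value normalization: a
literal newline inside an attribute value is handed to the application
as a space.  A small sketch using Python's standard
xml.etree.ElementTree (the element and attribute names are arbitrary)
shows the effect, and also that a serializer can avoid the loss only by
writing the newline as a character reference:

```python
import xml.etree.ElementTree as ET

# A literal newline inside an attribute value, as a naive serializer
# might write it out.
literal = '<p title="line one\nline two"/>'

# The same value with the newline written as a character reference.
escaped = '<p title="line one&#10;line two"/>'

literal_value = ET.fromstring(literal).get("title")
escaped_value = ET.fromstring(escaped).get("title")

if __name__ == "__main__":
    print(repr(literal_value))  # newline has been normalized to a space
    print(repr(escaped_value))  # newline survives the round trip
```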

In any case, I understand the desire; my read is that the WG's desire
for backwards compatibility is stronger.  Limiting the character set to
the allowable XML 1.1 character set should not be a problem for
backwards compatibility purposes.

- Sam Ruby

