[whatwg] Valid Unicode

Sam Ruby rubys at intertwingly.net
Sat Dec 2 08:24:36 PST 2006


On 12/1/06, Elliotte Harold <elharo at metalab.unc.edu> wrote:
> Henri Sivonen wrote:
>
> >> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> >> 7. Are the noncharacters from the last two characters of each plane
> >> allowed (?)
> >
> > I don't have particularly strong feelings here. Putting those characters
> > in HTML is a bad idea, but allowing them is not a problem for HTML5 to
> > XHTML5 conversion and they aren't a common problem like C1 controls.
>
> FFFE and FFFF are specifically forbidden by XML so they should probably
> be forbidden here too. I think the others are allowed.

Unicode (not XML) reserves U+D800 – U+DFFF as well as U+FFFE and U+FFFF.

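XML 1.0 only allows the following characters:

#x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF].

Of those, it discourages the following: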

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
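
For illustration, here is a rough Python sketch of a check against the
XML 1.0 Char production (#x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD],
[#x10000-#x10FFFF]). The names is_xml10_char and first_disallowed are
made up for this example and are not taken from any spec or library.

```python
# Sketch only: the ranges follow the XML 1.0 Char production;
# the function names are hypothetical.
XML10_CHAR_RANGES = [
    (0x9, 0xA),           # tab, line feed
    (0xD, 0xD),           # carriage return
    (0x20, 0xD7FF),       # excludes the surrogate block that follows
    (0xE000, 0xFFFD),     # excludes U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # all supplementary planes
]


def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)


def first_disallowed(text: str):
    """Return (index, code point) of the first non-Char in text, or None."""
    for i, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return i, ord(ch)
    return None
```

Under this check, form feed (U+000C) and U+FFFE are rejected, while
U+FDD0 and U+1FFFE pass: XML 1.0 allows them even though it discourages
them.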

It would not be wise for HTML5 to limit itself to the more constrained
character set of XML.  In particular, the form feed character is
pretty popular, and XML 1.0 does not allow it.

This is yet another case where "take HTML5, read it into a DOM, and
serialize it as XML, and voilà: you have valid XHTML" doesn't work.
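
A small Python sketch of that failure, assuming the standard library's
xml.etree.ElementTree stands in for "a DOM" (the helper name
serializes_to_well_formed_xml is made up for this example). The
serializer writes the form feed out unchanged, and the parser then
refuses to read the result back.

```python
import xml.etree.ElementTree as ET


def serializes_to_well_formed_xml(text: str) -> bool:
    """Put text into an element, serialize it, and try to parse it back."""
    el = ET.Element("p")
    el.text = text
    # ElementTree escapes &, < and > but does not check character validity.
    serialized = ET.tostring(el, encoding="unicode")
    try:
        ET.fromstring(serialized)
    except ET.ParseError:
        # The parser rejects characters outside the XML 1.0 Char set.
        return False
    return True
```

With this helper, ordinary text round-trips, while text containing
"\x0c" (form feed) does not.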

> --
> Elliotte Rusty Harold  elharo at metalab.unc.edu
> Java I/O 2nd Edition Just Published!
> http://www.cafeaulait.org/books/javaio2/
> http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/

- Sam Ruby


