[whatwg] Valid Unicode

Tue Apr 22 04:18:33 PDT 2008

On Fri, 1 Dec 2006, Elliotte Harold wrote:
>
> In 9.1.3 we see
> 
> Text must consist of valid Unicode characters other than U+0000. Text should
> not contain control characters other than space characters.
> 
> 
> Later in 9.2.3.1 we find:
> 
> If the number is not a valid Unicode character (e.g. if the number is higher
> than 1114111), or if the number is zero, then return a character token for the
> U+FFFD REPLACEMENT CHARACTER character instead.
> 
> 
> I do not think the Unicode spec defines the notion of a "valid Unicode
> character". (It does define a valid Unicode code unit sequence, but that's a
> little different. A code unit sequence generally consists of more than one
> character.) Thus I suggest we need to be more precise here about what is and
> is not a valid Unicode character.

The spec is much more precise now. Is it ok?
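
For illustration only, here is a minimal Python sketch of the 9.2.3.1 rule 
as quoted above. The function name resolve_numeric_reference is 
hypothetical. It implements only the two conditions the quoted text names 
explicitly (the number zero, and numbers above 1114111). Which other 
numbers count as "not a valid Unicode character" is the question being 
asked, so the sketch passes them through unchanged.

```python
REPLACEMENT_CHARACTER = "\uFFFD"
MAX_CODE_POINT = 0x10FFFF  # 1114111, the limit named in the quoted text


def resolve_numeric_reference(number: int) -> str:
    """Map the number from a numeric character reference to a character.

    Hypothetical helper: covers only the two cases the quoted rule spells out.
    """
    if number == 0 or number > MAX_CODE_POINT:
        return REPLACEMENT_CHARACTER
    # Assumption: every other number is passed through as-is.
    return chr(number)
```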

> In particular:
> 
> 1. Are private use characters allowed?

Yes.

> 2. Are control characters allowed (probably yes, based on other parts of 
> the spec).

Not as raw characters. Control characters that aren't in U+0080..U+009F 
are allowed as entities.

> 3. Are surrogate characters allowed? (probably no)

No.

> 4. Are non-characters beyond 10FFFF allowed (no)

No.

> 5. Are reserved but currently undefined characters allowed (yes)

Yes.

> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> 7. Are the noncharacters from the last two characters of each plane 
> allowed (?)

Not as raw characters but, for now, yes as entities.
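
The seven answers above can be summarised as a rough Python sketch. All of 
the names here (classify, SPACE_CHARACTERS, and the helpers) are 
hypothetical, and the set of "space characters" is an assumption (tab, LF, 
FF, CR, space) rather than something stated in this thread. The function 
returns a pair: whether the code point is allowed as a raw character, and 
whether it is allowed as an entity.

```python
# Assumption: the "space characters" are tab, LF, FF, CR and space.
SPACE_CHARACTERS = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_control(cp: int) -> bool:
    # C0 controls, DELETE, and C1 controls.
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def classify(cp: int) -> tuple[bool, bool]:
    """Return (allowed_raw, allowed_as_entity) for a code point number."""
    # U+0000, anything beyond U+10FFFF, and surrogates: never allowed.
    if cp <= 0 or cp > 0x10FFFF or is_surrogate(cp):
        return (False, False)
    # Noncharacters: not raw but, for now, allowed as entities.
    if is_noncharacter(cp):
        return (False, True)
    # Controls: raw only if they are space characters (assumption above);
    # as entities unless they fall in U+0080..U+009F.
    if is_control(cp):
        return (cp in SPACE_CHARACTERS, not (0x80 <= cp <= 0x9F))
    # Everything else, including private use and unassigned code points.
    return (True, True)
```

For example, classify(0xE000) covers question 1 (private use), 
classify(0xD800) question 3, and classify(0xFDD0) question 6.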

On Sun, 3 Dec 2006, Henri Sivonen wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> > 
> > It would not be wise for HTML5 to limit itself to the more constrained 
> > character set of XML.  In particular, the form feed character is 
> > pretty popular,
> > 
> > This is yet another case where "take HTML5, read it into a DOM, and 
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
> 
> What I am advocating is making sure that *conforming* HTML5 documents 
> can be serialized as XHTML5 without dataloss. This is important in order 
> to be able to promise that an "XML tool chain" can be used for 
> processing *conforming* HTML5 by sticking an HTML5 parser in front of 
> the processing pipeline (for *non-browser* use cases like data mining, 
> content management or conformance checking where scripts aren't executed 
> nor CSS rendering performed). The motivation is to make processing HTML5 
> in non-browser apps less expensive without giving an incentive for the 
> solutions to violate the spec ad hoc on their own.
> 
> For example, an "XML tool chain" is important enough for my conformance 
> checking service that if at this point the assumption of *conforming* 
> HTML5 being convertible to XHTML5 was broken in corner cases, I'd 
> probably come up with ad hoc trickery for masking it instead of throwing 
> away the tool chain. I'd prefer not having to do that and not having to 
> explain to everyone else who finds an "XML tool chain" to be of value 
> what tricks I needed to pull off to fake it.
> 
> I am not suggesting that HTML5 browsers halt and catch fire upon finding 
> a form feed. And it is obvious that lossless conversion of all possible 
> non-conforming HTML5 documents to XML is impossible anyway, so making 
> that a goal would not be worthwhile.
> 
> But what legitimate and popular use would a form feed have in HTML5? Why 
> can't we call it non-conforming? Are there use cases other than 
> converting .txt RFCs to HTML with regexps without bothering to get rid 
> of the form feeds?

I don't think that it would be valuable to make that use case raise 
errors.
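
To make the form feed mismatch concrete, here is a small Python sketch 
that checks text against the Char production of XML 1.0. The function 
names is_xml10_char and xml10_offenders are hypothetical. It is only meant 
to show which characters a DOM-to-XML serializer would trip over, not how 
any particular tool chain handles them.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml10_offenders(text: str) -> list[tuple[int, int]]:
    """Return (index, code point) for each character XML 1.0 cannot hold."""
    return [
        (i, ord(ch)) for i, ch in enumerate(text) if not is_xml10_char(ord(ch))
    ]
```

Running xml10_offenders over text from a converted .txt RFC would report 
each form feed (U+000C), since XML 1.0 permits only tab, LF and CR below 
U+0020.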

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'