[whatwg] Valid Unicode
Ian Hickson
ian at hixie.ch
Tue Apr 22 04:18:33 PDT 2008
On Fri, 1 Dec 2006, Elliotte Harold wrote:
>
> In 9.1.3 we see
>
> Text must consist of valid Unicode characters other than U+0000. Text should
> not contain control characters other than space characters.
>
>
> Later in 9.2.3.1 we find:
>
> If the number is not a valid Unicode character (e.g. if the number is higher
> than 1114111), or if the number is zero, then return a character token for the
> U+FFFD REPLACEMENT CHARACTER character instead.
>
>
> I do not think the Unicode spec defines the notion of a "valid Unicode
> character". (It does define a valid Unicode code unit sequence, but that's a
> little different. A code unit sequence generally consists of more than one
> character.) Thus I suggest we need to be more precise here about what is and
> is not a valid Unicode character.
The spec is much more precise now. Is it ok?
> In particular:
>
> 1. Are private use characters allowed?
Yes.
> 2. Are control characters allowed (probably yes, based on other parts of
> the spec).
Not as raw characters. Control characters that aren't in U+0080-U+009F are
allowed as entities.
> 3. Are surrogate characters allowed? (probably no)
No.
> 4. Are non-characters beyond 10FFFF allowed (no)
No.
> 5. Are reserved but currently undefined characters allowed (yes)
Yes.
> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> 7. Are the noncharacters from the last two characters of each plane
> allowed (?)
Not as raw characters but, for now, as entities, yes.
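
To make the seven answers above easier to check against each other, here is a
rough sketch of them as a single function. It is illustrative only and is not
text from the spec. The function name `allowed`, the `(raw_ok, entity_ok)`
return shape, and the treatment of U+007F as a control character are
assumptions made for this example, as is the set of "space characters"
(tab, LF, FF, CR, space).

```python
# Assumption: the "space characters" exempted from the control-character rule.
SPACE_CHARACTERS = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def is_noncharacter(cp):
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def allowed(cp):
    """Return (raw_ok, entity_ok) for a code point, per the answers above."""
    if cp <= 0 or cp > 0x10FFFF:
        return (False, False)  # zero, or beyond the Unicode range (answer 4)
    if 0xD800 <= cp <= 0xDFFF:
        return (False, False)  # surrogates (answer 3)
    if is_noncharacter(cp):
        return (False, True)   # noncharacters: entities only, for now (6, 7)
    if 0x80 <= cp <= 0x9F:
        return (False, False)  # C1 controls: neither raw nor entity (answer 2)
    if (cp < 0x20 or cp == 0x7F) and cp not in SPACE_CHARACTERS:
        return (False, True)   # other controls: entities only (answer 2)
    # Private use (answer 1) and reserved/unassigned (answer 5) land here.
    return (True, True)
```

For instance, `allowed(0xE000)` (private use) and `allowed(0x0C)` (form feed)
both come out as `(True, True)`, while `allowed(0xFDD0)` gives `(False, True)`.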
On Sun, 3 Dec 2006, Henri Sivonen wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> >
> > It would not be wise for HTML5 to limit itself to the more constrained
> > character set of XML. In particular, the form feed character is
> > pretty popular,
> >
> > This is yet another case where "take HTML5, read it into a DOM, and
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
>
> What I am advocating is making sure that *conforming* HTML5 documents
> can be serialized as XHTML5 without data loss. This is important in order
> to be able to promise that an "XML tool chain" can be used for
> processing *conforming* HTML5 by sticking an HTML5 parser in front of
> the processing pipeline (for *non-browser* use cases like data mining,
> content management or conformance checking where scripts aren't executed
> nor CSS rendering performed). The motivation is to make processing HTML5
> in non-browser apps less expensive without giving an incentive for the
> solutions to violate the spec ad hoc on their own.
>
> For example, an "XML tool chain" is important enough for my conformance
> checking service that if at this point the assumption of *conforming*
> HTML5 being convertible to XHTML5 was broken in corner cases, I'd
> probably come up with ad hoc trickery for masking it instead of throwing
> away the tool chain. I'd prefer not having to do that and not having to
> explain to everyone else who finds an "XML tool chain" to be of value
> what tricks I needed to pull off to fake it.
>
> I am not suggesting that HTML5 browsers halt and catch fire upon finding
> a form feed. And it is obvious that lossless conversion of all possible
> non-conforming HTML5 documents to XML is impossible anyway, so making
> that a goal would not be worthwhile.
>
> But what legitimate and popular use would a form feed have in HTML5? Why
> can't we call it non-conforming? Are there use cases other than
> converting .txt RFCs to HTML with regexps without bothering to get rid
> of the form feeds?
I don't think that it would be valuable to make that use case raise
errors.
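
For anyone who wants to see where the HTML-to-XML conversion discussed above
breaks, a minimal sketch follows. It checks text against the Char production
of XML 1.0, which excludes U+000C along with most other C0 controls. The
helper names `is_xml10_char` and `xml10_blockers` are made up for this
example, and the check covers only character data, not element or attribute
names.

```python
def is_xml10_char(cp):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml10_blockers(text):
    """List (index, code point label) for characters XML 1.0 cannot carry."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


# A form feed left over from a paginated .txt RFC is the only blocker here.
print(xml10_blockers("RFC page 1\x0cRFC page 2"))
```

A tool chain that wants to stay on XML would have to drop or replace whatever
this reports before serializing, which is the kind of ad hoc trickery the
quoted message is worried about.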
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'