[whatwg] Forbidden characters in text/html
ian at hixie.ch
Wed Jun 6 17:46:00 PDT 2007
On Sun, 19 Mar 2006, Henri Sivonen wrote:
> Since U+0000 has no legitimate reason to be there just to get dropped,
> is any encounter of U+0000 a parse error?
> The way the spec is written, U+000D does not occur in the character
> stream immediately before tokenization, but (as in XML!) it *can* appear
> in the tree construction stage, because an NCR can expand into U+000D.
> (I'm not suggesting any changes here--just noting how it is.)
> Since U+000D can occur in the tree construction stage, I think the point
> under "18.104.22.168.7. How to handle tokens in the main phase" that says "A
> character token that is one of U+0009 CHARACTER TABULATION,
> U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C FORM FEED (FF), or
> U+0020 SPACE" should include U+000D as well.
Good point. Fixed.
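
For readers following along, here is a minimal sketch of the situation described above. It assumes a toy preprocessing step that normalizes CR LF and lone CR to LF, and a toy expander for decimal numeric character references. The names WHITESPACE, is_whitespace_token, preprocess, and expand_decimal_ncrs are hypothetical and do not come from the spec. It is meant only to show why U+000D can still reach tree construction and why the whitespace list needs it.

```python
import re

# Hypothetical set mirroring the whitespace list quoted above, with U+000D added.
WHITESPACE = frozenset({
    "\u0009",  # CHARACTER TABULATION
    "\u000A",  # LINE FEED (LF)
    "\u000B",  # LINE TABULATION
    "\u000C",  # FORM FEED (FF)
    "\u000D",  # CARRIAGE RETURN (CR), reachable through an NCR such as &#13;
    "\u0020",  # SPACE
})


def is_whitespace_token(ch: str) -> bool:
    """Return True if a single-character token counts as whitespace."""
    return ch in WHITESPACE


def preprocess(stream: str) -> str:
    """Toy newline normalization: CR LF and lone CR both become LF."""
    return stream.replace("\r\n", "\n").replace("\r", "\n")


def expand_decimal_ncrs(text: str) -> str:
    """Toy expansion of decimal references like &#13; (no error handling)."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


if __name__ == "__main__":
    # The literal CR LF is normalized away, but the NCR reintroduces a CR
    # after preprocessing, so the later stage still has to classify it.
    tree_input = expand_decimal_ncrs(preprocess("a\r\nb&#13;c"))
    print(repr(tree_input))
    print(is_whitespace_token("\r"))
```

A real tokenizer handles many more reference forms and error cases. The only point of the sketch is the ordering: normalization runs first, reference expansion runs later.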
> On the other hand, I am wondering why the list of characters that
> implements the concept of whitespace in the tokenization and tree
> construction stages includes U+000B LINE TABULATION and U+000C FORM FEED
> (FF). Are they required for backwards-compatibility? I would guess they
> do not actually show up on the Web that often. According to the W3C
> Validator, those characters do not need to be allowed for formal
> backwards compatibility with HTML4--on the contrary.
I don't have an opinion about U+000B. What would you want changed?
U+000C is allowed because converting text files to HTML can easily end up
inserting FF characters. (e.g. RFCs have FF characters, conversion to HTML
often leaves them.) I see no harm in allowing them really.
> In order to make all conforming HTML5 documents serializable as XHTML5,
> it would be necessary to have a catch-all restriction stating that a
> document is non-conforming if parsing it causes a non-XML character (
> http://www.w3.org/TR/REC-xml/#NT-Char ) to appear in the DOM. For
> clarity, it would be nice to have the same restriction on the pre-parse
> character stream, but such a restriction is not strictly necessary for
I don't really think we can guarantee that all conforming HTML5 documents
be serializable as XHTML5 anyway. I'm reluctant to add catch-all clauses,
because they tend to have unexpected consequences.
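
As an illustration of what such a catch-all check would involve, here is a rough sketch of a test against the XML 1.0 Char production linked above. The helper names is_xml_char and first_non_xml_char are hypothetical, and the sketch assumes the check runs over text that has already been decoded to Unicode code points.

```python
from typing import Optional

# Code point ranges of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_RANGES = (
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(ch: str) -> bool:
    """Return True if the single character matches the XML 1.0 Char production."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in XML_CHAR_RANGES)


def first_non_xml_char(text: str) -> Optional[int]:
    """Return the index of the first character outside Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None


if __name__ == "__main__":
    # U+000C FORM FEED is outside Char, so text converted from a paginated
    # plain-text file would be flagged by a check like this one.
    print(first_non_xml_char("RFC page\u000Cnext page"))
    print(first_non_xml_char("plain text"))
```

Note that U+000C, which is allowed as whitespace above, falls outside this production, so the two requirements pull in different directions.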
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'