[whatwg] Forbidden characters in text/html

Wed Jun 6 17:46:00 PDT 2007

On Sun, 19 Mar 2006, Henri Sivonen wrote:
> 
> Since U+0000 has no legitimate reason to be there just to get dropped, 
> is any encounter of U+0000 a parse error?

Yes. Fixed.

> The way the spec is written, U+000D does not occur in the character 
> stream immediately before tokenization, but (as in XML!) it *can* appear 
> in the tree construction stage, because an NCR can expand into U+000D. 
> (I'm not suggesting any changes here--just noting how it is.)

Indeed.
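
To illustrate the point above, here is a minimal Python sketch. It assumes
that the pre-tokenization step replaces CRLF pairs and lone CRs with LF
(an assumption about the preprocessing, not text taken from the spec), and
that NCRs are expanded afterwards. The helper names normalize_newlines and
expand_ncrs are hypothetical.

```python
import re

# Matches decimal (&#13;) and hexadecimal (&#xD;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def normalize_newlines(stream: str) -> str:
    """Assumed preprocessing: CRLF and lone CR both become LF."""
    return stream.replace("\r\n", "\n").replace("\r", "\n")


def expand_ncrs(text: str) -> str:
    """Expand numeric character references into the characters they name."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point > 0x10FFFF:
            # Outside the Unicode range: leave the reference untouched.
            return match.group(0)
        return chr(code_point)
    return _NCR.sub(_replace, text)


source = "a\r\nb\rc&#13;d"
pre_tokenization = normalize_newlines(source)    # no literal U+000D remains
after_expansion = expand_ncrs(pre_tokenization)  # the NCR reintroduces U+000D
```

After normalization the stream contains no U+000D, but expanding the NCR
puts one back, so later stages still have to cope with it.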

> Since U+000D can occur in the tree construction stage, I think the point 
> under "8.2.2.3.7. How to handle tokens in the main phase" that says "A 
> character token that is one of U+0009 CHARACTER TABULATION, 
> U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C FORM FEED (FF), or 
> U+0020 SPACE" should include U+000D as well.

Good point. Fixed.
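
As a rough sketch of what the amended list amounts to, the whitespace
check can be written as a set membership test. The names
TREE_CONSTRUCTION_WHITESPACE and is_whitespace_char are hypothetical; the
set is the five characters from the quoted spec text plus U+000D.

```python
# The five characters quoted from the spec, plus U+000D as suggested above.
TREE_CONSTRUCTION_WHITESPACE = frozenset({
    "\u0009",  # CHARACTER TABULATION
    "\u000A",  # LINE FEED (LF)
    "\u000B",  # LINE TABULATION
    "\u000C",  # FORM FEED (FF)
    "\u000D",  # CARRIAGE RETURN (CR), reachable through an NCR
    "\u0020",  # SPACE
})


def is_whitespace_char(ch: str) -> bool:
    """Return True if a single-character token counts as whitespace."""
    return ch in TREE_CONSTRUCTION_WHITESPACE
```

With this set, a character token holding U+000D is treated like the other
whitespace characters rather than as ordinary text.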

> On the other hand, I am wondering why the list of characters that 
> implements the concept of whitespace in the tokenization and tree 
> construction stages includes U+000B LINE TABULATION and U+000C FORM FEED 
> (FF). Are they required for backwards-compatibility? I would guess they 
> do not actually show up on the Web that often. According to the W3C 
> Validator, those characters do not need to be allowed for formal 
> backwards compatibility with HTML4--on the contrary. 
> http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fform-feed-in-tag.html 
> http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fline-tabulation-in-tag.html

I don't have an opinion about U+000B. What would you want changed?

U+000C is allowed because converting text files to HTML can easily end up 
inserting FF characters. (For example, RFCs contain FF characters, and 
conversion to HTML often leaves them in.) I see no real harm in allowing them.

> In order to make all conforming HTML5 documents serializable as XHTML5, 
> it would be necessary to have a catch-all restriction stating that a 
> document is non-conforming if parsing it causes a non-XML character ( 
> http://www.w3.org/TR/REC-xml/#NT-Char ) to appear in the DOM. For 
> clarity, it would be nice to have the same restriction on the pre-parse 
> character stream, but such a restriction is not strictly necessary for 
> XHTML-serializability.

I don't really think we can guarantee that all conforming HTML5 documents 
will be serializable as XHTML5 anyway. I'm reluctant to add catch-all clauses, 
because they tend to have unexpected consequences.
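
For reference, the check that the proposed restriction would imply can be
sketched from the XML 1.0 Char production linked in the quoted text. This
is only an illustration of that production, not something the spec
requires; is_xml_char and find_non_xml_chars are hypothetical names.

```python
from typing import List, Tuple


def is_xml_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_non_xml_chars(text: str) -> List[Tuple[int, str]]:
    """List (index, 'U+XXXX') for each character that XML 1.0 cannot carry."""
    return [
        (index, "U+{:04X}".format(ord(ch)))
        for index, ch in enumerate(text)
        if not is_xml_char(ch)
    ]
```

Note that U+000B and U+000C both fall outside the production, so a
document that keeps a form feed in a text node is one case that would not
serialize as XML 1.0.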

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'