[whatwg] Forbidden characters in text/html
Henri Sivonen
hsivonen at iki.fi
Sun Mar 19 08:29:25 PST 2006
On Mar 11, 2006, at 03:21, Ian Hickson wrote:
> On Sat, 25 Feb 2006, Henri Sivonen wrote:
>>
>> On Feb 25, 2006, at 02:02, Ian Hickson wrote:
>>
>>> On Sat, 23 Jul 2005, Henri Sivonen wrote:
>>>>
>>>> Which characters should a text/html HTML5 conformance checker consider
>>>> forbidden? The same characters that are forbidden in XML 1.0 (\0, FF,
>>>> etc.)? Or some other set?
>>>
>>> In what context?
>>
>> In the pre-parse Unicode character stream on one hand and in the
>> post-parse (that is NCRs expanded) character data and attribute values
>> on the other. IIRC, in XML 1.0 (but not 1.1) the restrictions are the
>> same in both cases.
>
> Well, the spec says to drop U+0000, and do something with U+000D such that
> U+000D never appears in the parse stream; the post-parse is just the DOM.
>
> Does that answer your question?
Sorry, still going on about this:
Since U+0000 has no legitimate reason to be there just to get
dropped, is any encounter of U+0000 a parse error?
The way the spec is written, U+000D does not occur in the character
stream immediately before tokenization, but (as in XML!) it *can*
appear in the tree construction stage, because an NCR can expand into
U+000D. (I'm not suggesting any changes here--just noting how it is.)
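A rough Python sketch of the situation described above, under stated assumptions rather than as the spec's actual algorithm: `preprocess` and `expand_ncrs` are hypothetical helpers, the CR handling (CR LF and lone CR both becoming LF) is my assumption about what "do something with U+000D" means, and the NCR expansion is deliberately simplified to references that end in a semicolon.

```python
import re

def preprocess(stream: str) -> str:
    """Hypothetical input-stream stage: drop U+0000 and normalize CR."""
    stream = stream.replace("\u0000", "")
    # Assumption: CR LF pairs and lone CRs both become LF.
    return stream.replace("\r\n", "\n").replace("\r", "\n")

# Simplified numeric character reference: &#13; or &#xD; (semicolon required).
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def expand_ncrs(text: str) -> str:
    """Hypothetical, simplified NCR expansion done after preprocessing."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code > 0x10FFFF:
            return match.group(0)  # out of Unicode range: leave untouched
        return chr(code)
    return _NCR.sub(repl, text)

source = "a\r\nb\u0000c&#13;d"
pre = preprocess(source)   # what the tokenizer would see
post = expand_ncrs(pre)    # what tree construction would see

print("CR before tokenization:", "\r" in pre)
print("CR after NCR expansion:", "\r" in post)
```

In this sketch the literal CR and the U+0000 are gone before tokenization, yet the `&#13;` reference still puts a U+000D into the character data afterwards.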
Since U+000D can occur in the tree construction stage, I think the
point under "8.2.2.3.7. How to handle tokens in the main phase" that
says "A character token that is one of one of U+0009 CHARACTER
TABULATION, U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C
FORM FEED (FF), or U+0020 SPACE" should include U+000D as well.
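To illustrate the suggestion, here is a small Python sketch. The names `DRAFT_WHITESPACE`, `PROPOSED_WHITESPACE`, and `is_whitespace_token` are made up for this example; the first set is just the five characters quoted above, and the second adds U+000D.

```python
# The five characters listed in the quoted passage of the draft.
DRAFT_WHITESPACE = frozenset("\u0009\u000A\u000B\u000C\u0020")

# The same set with U+000D CARRIAGE RETURN added, as suggested above.
PROPOSED_WHITESPACE = DRAFT_WHITESPACE | {"\u000D"}

def is_whitespace_token(ch: str, table=PROPOSED_WHITESPACE) -> bool:
    """Return True if a single-character token counts as whitespace."""
    return ch in table

# A CR produced by an NCR such as &#13; is classified differently
# depending on which set the tree construction stage uses.
print(is_whitespace_token("\r", DRAFT_WHITESPACE))
print(is_whitespace_token("\r", PROPOSED_WHITESPACE))
```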
On the other hand, I am wondering why the list of characters that
implements the concept of whitespace in the tokenization and tree
construction stages includes U+000B LINE TABULATION and U+000C FORM
FEED (FF). Are they required for backwards-compatibility? I would
guess they do not actually show up on the Web that often. According
to the W3C Validator, those characters do not need to be allowed for
formal backwards compatibility with HTML4--on the contrary.
http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fform-feed-in-tag.html
http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fline-tabulation-in-tag.html
In order to make all conforming HTML5 documents serializable as
XHTML5, it would be necessary to have a catch-all restriction stating
that a document is non-conforming if parsing it causes a non-XML
character ( http://www.w3.org/TR/REC-xml/#NT-Char ) to appear in the
DOM. For clarity, it would be nice to have the same restriction on
the pre-parse character stream, but such a restriction is not
strictly necessary for XHTML-serializability.
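As a sketch of what such a catch-all check could look like in a conformance checker, the Python below tests strings against the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The function names are hypothetical, and the sketch assumes the checker can hand over each text node and attribute value from the DOM as a string.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def non_xml_chars(text: str):
    """Return (index, character) pairs for every non-XML character."""
    return [(i, ch) for i, ch in enumerate(text) if not is_xml_char(ch)]

# Hypothetical use: run over every text node and attribute value in the
# post-parse DOM and report the document as non-conforming on any hit.
for value in ["plain text\r\n", "form\u000Cfeed", "line\u000Btab"]:
    hits = non_xml_chars(value)
    print(repr(value), "ok" if not hits else f"non-XML characters at {hits}")
```

Note that U+000D passes this check while U+000B and U+000C do not, which is the mismatch with the whitespace list discussed earlier.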
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/