[whatwg] Comment Syntax and Parsing
Lachlan Hunt
lachlan.hunt at lachy.id.au
Tue Jan 24 03:38:53 PST 2006
Anne van Kesteren wrote:
> Quoting Henri Sivonen <hsivonen at iki.fi>:
>> I guess the XML style is the simplest thing that could work. :-/
>
> You are talking about conformance, but what do you want the parser to
> do? And also there is talk about whitespace between -- and > but currently all
> kinds of characters are allowed there (including - for instance).
It's important to decide upon what is to be considered a conformant
comment and what is not before we can settle upon the best way to parse
it. That way we can ensure that all conforming comments are handled
correctly and that error handling can be defined in an appropriate and
compatible way.
As for how to parse it, I'll use these test cases to demonstrate what I
consider to be the most sane way to handle comments. (Assume EOF at the
end of each one.)
Test Case | Comment Content | Output
-----------------------------------|--------------------------|--------------
PA<!>SS | "" | PASS
PA<! ->SS | " -" | PASS
PA<! -->SS | " " | PASS
PA<!->SS | "-" | PASS
PA<!- ->SS | "- -" | PASS
PA<!- ->SS --> | "- -" | PASS -->
PA<!- <!-->SS --> | "- <!" | PASS -->
PA<!- <!-- ->SS --> | "- <!-- -" | PASS -->
PA<!- -->SS | "- " | PASS
PA<!- -- >SS | "- " | PASS
PA<!-- FAIL -->SS | " FAIL " | PASS
PA<!--> FAIL -->SS | "> FAIL " | PASS
PA<!--> FAIL <!-- -->SS | "> FAIL <!-- " | PASS
PA<!--> FAIL <!-- -- -->SS | "> FAIL <!-- -- " | PASS
PA<!-- > FAIL -- >SS | " > FAIL " | PASS
P<!-- -- >AS<!-- -->S | " " (2 comments) | PASS
PA<!-- FAIL -- FAIL -->SS | " FAIL -- FAIL " | PASS
P<!-- -- -->AS<!-- -- -->S | " -- " (2 comments) | PASS
PA<!-- -- -- -->SS | " -- -- " | PASS
PA<!-- FAIL -- FAIL -- FAIL -->SS | " FAIL -- FAIL -- FAIL " | PASS
PA<!--- FAIL -->SS | "- FAIL " | PASS
PA<!--- FAIL --->SS | "- FAIL -" | PASS
<!-- ->FAIL | " ->FAIL" |
<!--- ->FAIL | "- ->FAIL" |
PA<!--->-->SS | "->" | PASS
<!-- --- -> | (not sure) |
PA<!-- --- -->SS | " --- " | PASS
PA<!--- --- --->SS | "- --- -" | PASS
As for actually defining how that is parsed, I believe it should work
something like this. Throughout this algorithm, (x) stands for the
current input character; it is not a literal character. The following
isn't perfect, and I'm sure I've made some mistakes, but it should (I
believe) handle the above cases as described.
<!
    * Switch to marked section open state

Marked Section Open State
  --
    * Create comment token
    * Switch to comment state
  DOCTYPE
    * (DOCTYPE state)
  else (easy parse error)
    * Create comment token
    * Append (x) to comment token
    * Switch to comment end state

Comment State
  -
    * Switch to comment dash state
  EOF
    * Emit comment token and stop
  else
    * Append (x) to comment token
    * Remain in comment state

Comment Dash State
  -
    * Switch to comment end state
  EOF
    * Append '-' to the comment token
    * Emit comment token and stop
  else
    * Append '-' and (x) to comment token
    * Switch to comment state

Comment End State
  >
    * Emit comment token
    * Switch to data state
  -
    * Append '-' to comment token
  else (easy parse error)
    * Append '--' to comment token
    * Consume every character up to, but not including, the first
      occurrence of '>' or EOF (whichever comes first)
    * Append the characters to the comment token
    * If the comment token string matches /--\s*$/,
      then strip those characters.
      (This ensures that <!-- foo --> and <!-- foo -- > have the same
      comment data)
    * Emit the comment token
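
For anyone who wants to try the states above against the test cases, here
is a rough Python sketch. It is not a definitive implementation. The name
tokenize and the short state labels are made up for illustration. It also
makes three assumptions that the description leaves open: only the "<!--"
branch of the marked section open state is covered; in the "else" branch
of the comment end state, (x) itself counts as the first of the consumed
characters; and after that branch emits the token, the closing '>' (if
any) is consumed and parsing returns to the data state.

```python
import re


def tokenize(text):
    """Return (visible_text, comments); hypothetical helper, '<!--' branch only."""
    out = []        # characters seen in the data state
    comments = []   # data of each emitted comment token
    i, n = 0, len(text)
    while i < n:
        if not text.startswith("<!", i):
            out.append(text[i])
            i += 1
            continue
        # Marked section open state: only the "--" branch is sketched.
        if not text.startswith("--", i + 2):
            raise NotImplementedError("only the '<!--' branch is sketched")
        i += 4
        token = []
        state = "comment"
        while state != "done":
            start = i
            x = text[i] if i < n else None   # None stands for EOF
            i = min(i + 1, n)                # consume (x)
            if state == "comment":
                if x == "-":
                    state = "dash"
                elif x is None:
                    state = "done"
                else:
                    token.append(x)
            elif state == "dash":
                if x == "-":
                    state = "end"
                elif x is None:
                    token.append("-")
                    state = "done"
                else:
                    token.append("-" + x)
                    state = "comment"
            else:  # comment end state
                if x == ">":
                    state = "done"
                elif x == "-":
                    token.append("-")
                else:
                    # Parse error: keep the '--', then take everything from
                    # (x) up to, but not including, the next '>' or EOF.
                    stop = text.find(">", start)
                    if stop == -1:
                        stop = n
                    data = "".join(token) + "--" + text[start:stop]
                    # Strip a trailing "--" plus optional whitespace.
                    token = [re.sub(r"--\s*$", "", data)]
                    i = min(stop + 1, n)     # assumption: skip the '>' itself
                    state = "done"
        comments.append("".join(token))
    return "".join(out), comments
```

Calling tokenize("PA<!-- FAIL -- FAIL -->SS") should give the text "PASS"
and a single comment " FAIL -- FAIL ", matching the corresponding row of
the table. The "<!" cases that do not start with "--" are deliberately
left out, since the description of that branch would need more work
before it matches the table.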
--
Lachlan Hunt
http://lachy.id.au/