[whatwg] Comment Syntax and Parsing

Tue Jan 24 03:38:53 PST 2006

Anne van Kesteren wrote:
> Quoting Henri Sivonen <hsivonen at iki.fi>:
>> I guess the XML style is the simplest thing that could work. :-/
> 
> You are talking about conformance, but what do you want the parser to 
> do? And also there is talk about whitespace between -- and > but currently all 
> kinds of characters are allowed there (including - for instance).

We need to decide what counts as a conforming comment before we can
settle on the best way to parse it.  That way we can ensure that all
conforming comments are handled correctly and that error handling is
defined in an appropriate and compatible way.

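As for how to parse it, I'll use these test cases to demonstrate what I
consider to be the sanest way to handle comments.  Comment Content is
the data of the resulting comment token(s), and Output is the text that
remains once the comments are removed.  (Assume EOF at the end of each
one.)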

Test Case                          | Comment Content          | Output
-----------------------------------|--------------------------|--------------
PA<!>SS                            | ""                       | PASS
PA<! ->SS                          | " -"                     | PASS
PA<! -->SS                         | " "                      | PASS
PA<!->SS                           | "-"                      | PASS
PA<!- ->SS                         | "- -"                    | PASS
PA<!- ->SS -->                     | "- -"                    | PASS -->
PA<!- <!-->SS -->                  | "- <!"                   | PASS -->
PA<!- <!-- ->SS -->                | "- <!-- -"               | PASS -->
PA<!- -->SS                        | "- "                     | PASS
PA<!- -- >SS                       | "- "                     | PASS
PA<!-- FAIL -->SS                  | " FAIL "                 | PASS
PA<!--> FAIL -->SS                 | "> FAIL "                | PASS
PA<!--> FAIL <!-- -->SS            | "> FAIL <!-- "           | PASS
PA<!--> FAIL <!-- -- -->SS         | "> FAIL <!-- -- "        | PASS
PA<!-- > FAIL -- >SS               | " > FAIL "               | PASS
P<!-- -- >AS<!-- -->S              | " " (2 comments)         | PASS
PA<!-- FAIL -- FAIL -->SS          | " FAIL -- FAIL "         | PASS
P<!-- -- -->AS<!-- -- -->S         | " -- " (2 comments)      | PASS
PA<!-- -- -- -->SS                 | " -- -- "                | PASS
PA<!-- FAIL -- FAIL -- FAIL -->SS  | " FAIL -- FAIL -- FAIL " | PASS
PA<!--- FAIL -->SS                 | "- FAIL "                | PASS
PA<!--- FAIL --->SS                | "- FAIL -"               | PASS
<!-- ->FAIL                        | " ->FAIL"                |
<!--- ->FAIL                       | "- ->FAIL"               |
PA<!--->-->SS                      | "->"                     | PASS
<!-- --- ->                        | (not sure)               |
PA<!-- --- -->SS                   | " --- "                  | PASS
PA<!--- --- --->SS                 | "- --- -"                | PASS

As for actually defining how that is parsed, I believe it should work
something like this.  Throughout this algorithm, (x) represents the
current input character rather than a literal character.  The following
isn't perfect, and I'm sure I've made some mistakes, but it should (I
believe) handle the above cases as described.

<!
   * Switch to marked section open state

Marked Section Open State
   --
     * Create comment token
     * Switch to comment state
   DOCTYPE
     * (DOCTYPE state)
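   else (easy parse error)
     * Create comment token
     * Append every character from (x) up to, but not including, the
       first '>' or EOF (whichever comes first) to the comment token
     * If the comment token string matches /--\s*$/, strip those
       characters
     * Emit the comment token, consume the '>' (if present) and switch
       to data state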

Comment State
   -
     * Switch to comment dash state
   EOF
     * Emit comment token and stop
   else
     * Append (x) to comment token
     * Remain in comment state

Comment Dash State
   -
     * Switch to comment end state
   EOF
     * Append '-' to the comment token
     * Emit comment token and stop
   else
     * Append '-' and (x) to comment token
     * Switch to comment state

Comment End State
   >
     * Emit comment token
     * Switch to data state
   -
     * Append '-' to comment token
   else (easy parse error)
     * Append '--' to comment token
     * Consume every character up to, but not including, the first
       occurrence of '>' or EOF (whichever comes first)
     * Append the characters to the comment token
     * If the comment token string matches /--\s*$/,
       then strip those characters.
       (This ensures that <!-- foo --> and <!-- foo --   > have the same
        comment data)
     * Emit the comment token
     * Consume the '>' (if present) and switch to data state
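
As a rough illustration, the states above can be sketched in Python and
checked against the table.  This is only a sketch under a few
assumptions, not a reference implementation: the names tokenize,
_comment and _scan_to_gt are made up for the example, the DOCTYPE branch
is left out, a "<!" that is not followed by "--" is read as "scan to the
next '>' and strip a trailing /--\s*$/" (which is what the first rows of
the table expect), and the closing '>' is consumed once the comment has
been emitted.

```python
import re

TRAILING_DASHES = re.compile(r"--\s*$")


def _scan_to_gt(text, i, data):
    """Append text[i:] up to (not including) the next '>' or EOF, then
    strip a trailing '--' plus whitespace. Returns (data, next_index)."""
    end = text.find(">", i)
    if end == -1:
        end = len(text)  # EOF reached before any '>'
    data = TRAILING_DASHES.sub("", data + text[i:end])
    return data, min(end + 1, len(text))  # assumption: '>' is consumed


def _comment(text, i):
    """Run the comment / comment dash / comment end states from index i."""
    data, state = "", "comment"
    while i < len(text):
        x = text[i]
        if state == "comment":
            if x == "-":
                state = "dash"
            else:
                data += x
            i += 1
        elif state == "dash":
            if x == "-":
                state = "end"
            else:
                data += "-" + x
                state = "comment"
            i += 1
        else:  # comment end state
            if x == ">":
                return data, i + 1
            if x == "-":
                data += "-"
                i += 1
            else:  # easy parse error: put '--' back and scan to '>'
                return _scan_to_gt(text, i, data + "--")
    # EOF handling for each state
    if state == "dash":
        data += "-"
    elif state == "end":
        return _scan_to_gt(text, i, data + "--")
    return data, i


def tokenize(text):
    """Return (output_text, [comment_data, ...]); end of string is EOF."""
    out, comments = [], []
    i = 0
    while i < len(text):
        if not text.startswith("<!", i):
            out.append(text[i])
            i += 1
            continue
        i += 2  # marked section open state
        if text.startswith("--", i):
            data, i = _comment(text, i + 2)
        else:  # assumption: DOCTYPE is not handled in this sketch
            data, i = _scan_to_gt(text, i, "")
        comments.append(data)
    return "".join(out), comments


print(tokenize("PA<!-- FAIL -- FAIL -->SS"))  # ('PASS', [' FAIL -- FAIL '])
```

Each row of the table can be fed to tokenize() as a string; the first
element of the result corresponds to the Output column and the second to
the Comment Content column.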

-- 
Lachlan Hunt
http://lachy.id.au/