[whatwg] <% text %> and <? text ?> in corporate intranet html content

Tue Feb 9 21:26:01 PST 2010

On 2/9/10 11:56 PM, Tab Atkins Jr. wrote:
> On Tue, Feb 9, 2010 at 9:05 PM, Biju<bijumaillist at gmail.com>  wrote:
>> What should a user agent display when html content is...
>>
>> <html><body>
>> <%@ page language="java" %>
>> </body></html>
>>
>> At present IE and Safari display blank
>>
>> Firefox display<%@ page language="java" %>

As does Opera, and Firefox with the HTML5 parser enabled.

>> But for
>> <html><body>
>> abc<? echo ">"  ?>  xyz
>> </body></html>
>>
>> Firefox display...
>> abc " ?>  xyz

As does Opera, and Firefox with the HTML5 parser enabled.

> Can someone else with more familiarity with the parser algorithm help
> out here?

For the "<%@" case, the tokenizer state machine goes through the
following states [1]:

   Data state -> Tag open state

When it encounters a '%' in the "Tag open" state, the specification
says:

     Parse error. Emit a U+003C LESS-THAN SIGN character token
     and reconsume the current input character in the data state.[2]

The tokenizer is then back in the data state, where it stays until the
next '&' or '<' or EOF is seen.  A '>' is not special in that state, so
the entire string up to the </body> will be treated as literal text.
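
To make the "<%@" walk-through concrete, here is a minimal Python sketch
of just the two states involved.  It is not the specification's
tokenizer: the function name tokenize_fragment is made up for this
example, character references ('&') are not modeled, and every "Tag
open" branch other than the parse-error one quoted above is left
unimplemented.

```python
def tokenize_fragment(text):
    """Return (emitted_text, parse_error_count) for a tag-free fragment.

    Hypothetical helper: models only the data state and the
    "anything else" branch of the tag open state.
    """
    out = []
    errors = 0
    state = "data"
    i = 0
    while i < len(text):
        ch = text[i]
        if state == "data":
            if ch == "<":
                state = "tag open"
            else:
                # '>' and every other character is plain text here.
                out.append(ch)
            i += 1
        else:  # tag open state
            if (ch.isascii() and ch.isalpha()) or ch in "/!?":
                raise NotImplementedError("branch not modeled in this sketch")
            # Parse error: emit '<' and reconsume the current character
            # in the data state (so i is deliberately not advanced).
            errors += 1
            out.append("<")
            state = "data"
    if state == "tag open":
        # Input ended right after '<': it is emitted as text as well.
        errors += 1
        out.append("<")
    return "".join(out), errors


print(tokenize_fragment('<%@ page language="java" %>'))
```

Under these assumptions the whole "<%@ ... %>" string comes back as
text, with a single parse error recorded at the '%'.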

For the "<?" case, the state transitions are [1],[2]:

   Data state -> Tag open state -> Bogus comment state

The specification then says to:

   Consume every character up to and including the first U+003E
   GREATER-THAN SIGN character (>) or the end of the file (EOF),
   whichever comes first. Emit a comment token whose data is the
   concatenation of all the characters starting from and including
   the character that caused the state machine to switch into the bogus
   comment state, up to and including the character immediately before
   the last consumed character (i.e. up to the character just before the
   U+003E or EOF character). (If the comment was started by the end of
   the file (EOF), the token is empty.)

   Switch to the data state. [3]

In other words, end the bogus comment at the first '>' and then resume
normal parsing.  In this testcase the first '>' is the one inside the
echo string, so the comment data is '? echo "' and everything after it,
up to the next '<' or '&' or EOF, is treated as literal text.
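
The bogus comment rule can be sketched the same way.  Again this is
only an illustration under stated assumptions: tokenize_bogus is a
made-up name, tokens are plain (kind, data) tuples, and any '<' that is
not followed by '?' is simply not handled.

```python
def tokenize_bogus(text):
    """Split text into ("text", ...) and ("comment", ...) tokens.

    Hypothetical helper: only "<?" is modeled, which sends the tag open
    state into the bogus comment state.
    """
    tokens = []
    buf = []

    def flush():
        # Emit any pending run of literal characters as one text token.
        if buf:
            tokens.append(("text", "".join(buf)))
            buf.clear()

    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch != "<":
            buf.append(ch)
            i += 1
            continue
        nxt = text[i + 1] if i + 1 < n else ""
        if nxt != "?":
            raise NotImplementedError("only '<?' is modeled in this sketch")
        flush()
        # Comment data starts at the '?' that triggered the switch and
        # runs up to, but not including, the first '>' (or to EOF).
        end = text.find(">", i + 1)
        if end == -1:
            tokens.append(("comment", text[i + 1:]))
            i = n
        else:
            tokens.append(("comment", text[i + 1:end]))
            i = end + 1  # the '>' itself is consumed; back to data state
    flush()
    return tokens


for token in tokenize_bogus('abc<? echo ">"  ?>  xyz'):
    print(token)
```

With the testcase from the original question, the comment token ends at
the '>' inside the quoted string, and the closing quote, the "?>" and
"xyz" all land in the following text token.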

So the currently-specified behavior in fact matches the observed Firefox 
behavior (with either parser) on these simple testcases.

-Boris

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tag-open-state
[3] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#bogus-comment-state