[whatwg] document.write("\r"): the spec doesn't say how to handle it.
David Flanagan
dflanagan at mozilla.com
Thu Nov 3 11:13:04 PDT 2011
On 11/3/11 4:21 AM, Henri Sivonen wrote:
> On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan<dflanagan at mozilla.com> wrote:
>> Firefox, Chrome and Safari all seem to do the right thing: wait for the next
>> character before tokenizing the CR.
> See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247
I hadn't used the live dom viewer before. That's really useful!
> Firefox tokenizes the CR immediately, emits an LF and then skips over
> the next character if it is an LF. When I designed the solution
> Firefox uses, I believed it was more correct and more compatible with
> legacy than whatever the spec said at the time.
I'm having a Duh! moment... I currently wait for the next character, but
what you describe also works, and allows the document.write() spec to
make sense.
> Chrome seems to wait for the next character before tokenizing the CR.
>
>> And I think this means that the description of document.write needs to be changed.
> All along, I've thought that having U+0000 and CRLF handling as a
> stream preprocessing step was bogus and both should happen upon
> tokenization. So far, I've managed to convince Hixie about U+0000
> handling.
Each tokenizer state would have to add a rule for CR that said "emit
LF, save the current tokenizer state, and set the tokenizer state to
the after CR state". Actually, tokenizer states that already have a rule
for LF or whitespace would have to integrate this CR rule into that
rule. The new after CR state would then have two rules: on LF, it would
skip the character and restore the saved state; on anything else, it
would push the character back and restore the saved state.
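
The rules above can be sketched roughly as follows. This is only a toy
model, not anything from the spec or from a real browser: the class
name ToyTokenizer, the single "data" state standing in for every real
tokenizer state, and the write()/text() methods are all made up for
illustration. Each write() call plays the role of one document.write()
call, so a CR at the end of one call and an LF at the start of the next
still collapse into a single LF.

```python
class ToyTokenizer:
    """Toy model of the CR rule only; it does no real HTML tokenization."""

    def __init__(self):
        self.state = "data"      # hypothetical stand-in for any real tokenizer state
        self.saved_state = None  # state to return to once "after_cr" is resolved
        self.output = []         # characters emitted so far

    def write(self, chunk):
        """Feed one chunk of input, as if from a single document.write() call."""
        i = 0
        while i < len(chunk):
            ch = chunk[i]
            if self.state == "after_cr":
                # Both after-CR rules restore the saved state.
                self.state = self.saved_state
                self.saved_state = None
                if ch == "\n":
                    i += 1  # rule 1: skip the LF that completes a CRLF pair
                # rule 2: otherwise "push back" the character by not advancing,
                # so it is reprocessed in the restored state
                continue
            if ch == "\r":
                # CR rule: emit LF right away, save the state, enter after-CR
                self.output.append("\n")
                self.saved_state = self.state
                self.state = "after_cr"
            else:
                self.output.append(ch)
            i += 1

    def text(self):
        return "".join(self.output)


t = ToyTokenizer()
t.write("a\r")   # the CR is tokenized immediately as LF
t.write("\nb")   # the LF arriving in the next call is skipped
print(repr(t.text()))
```

The point of the design is that nothing has to wait for the next
character: the LF is emitted as soon as the CR is seen, and the only
thing carried across calls is which state the tokenizer is in.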
>> Similarly, what should the tokenizer do if the document.write emits half of
>> a UTF-16 surrogate pair as the last character?
> The parser operates on UTF-16 code units, so a lone surrogate is emitted.
The spec seems pretty unambiguous that it operates on codepoints (though
I implemented mine using 16-bit code units). §13.2.1: "The input to the
HTML parsing process consists of a stream of Unicode code points". Also
§13.2.2.3 includes a list of codepoints beyond the BMP that are parse
errors. And finally, the tests in
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test
require unpaired surrogates to be converted to the U+FFFD replacement
character. (Though my experience is that modifying my tokenizer to pass
those tests causes other tests to fail, which makes me wonder whether
unpaired surrogates are supposed to be replaced in only some tokenizer
states rather than in all of them.)
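
To make the code unit versus code point distinction concrete, here is a
minimal sketch of turning 16-bit code units into code points, replacing
unpaired surrogates with U+FFFD as those tests expect. The function
name code_units_to_code_points is invented for this example. It
replaces unconditionally, which is an assumption given the open
question about tokenizer states, and it treats its input as complete,
so it does not handle a high surrogate at the end of one
document.write() call being completed by the next call.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def code_units_to_code_points(units):
    """Convert a list of 16-bit code units to a list of code points.

    Valid surrogate pairs are combined; any unpaired surrogate becomes U+FFFD.
    Assumes the list is the whole input (no pair split across calls).
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low_next = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low_next:
            # Combine a high/low pair into one code point beyond the BMP.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone high surrogate, or a low surrogate with no preceding high.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out


# "a" followed by a valid pair, then a lone high surrogate
print([hex(cp) for cp in code_units_to_code_points([0x61, 0xD83D, 0xDE00, 0xD83D])])
```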
Thanks, Henri!
David