[whatwg] document.write("\r"): the spec doesn't say how to handle it.

Henri Sivonen hsivonen at iki.fi
Fri Nov 4 00:44:08 PDT 2011


On Thu, Nov 3, 2011 at 8:13 PM, David Flanagan <dflanagan at mozilla.com> wrote:
> Each tokenizer state would have to add a rule for CR that said "emit LF,
> save the current tokenizer state, and set the tokenizer state to "after CR
> state".

The Validator.nu/Gecko tokenizer returns a "last input code unit
processed was CR" flag to the caller. If the tokenizer sees a CR, it
processes it and returns to the caller immediately with the flag set
to true. The caller is responsible for checking whether the next
input code unit is an LF, skipping over it if so, and calling the
tokenizer again. This way, the tokenizer itself does not need the
capability of skipping over a character, and the same machinery that
is normally used for dealing with arbitrary buffer boundaries and for
early returns after script end tags (or timers, before the parser
moved off the main thread) does the work.
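
For what it's worth, here is a minimal illustrative sketch of that
calling convention. It is not the actual Validator.nu/Gecko code; the
names and the one-code-unit-at-a-time "tokenizer" stub are made up
for the example.

final class CrFlagSketch {
    // Stub "tokenizer": processes one code unit, turning CR into LF, and
    // reports to the caller whether the code unit it just processed was a CR.
    private static boolean tokenizeOne(char c, StringBuilder out) {
        if (c == '\r') {
            out.append('\n'); // a CR is emitted as LF
            return true;      // flag: last input code unit was a CR
        }
        out.append(c);
        return false;
    }

    static String feed(String[] buffers) {
        StringBuilder out = new StringBuilder();
        boolean lastWasCR = false; // caller-held state; survives buffer boundaries
        for (String buf : buffers) {
            for (int i = 0; i < buf.length(); i++) {
                char c = buf.charAt(i);
                if (lastWasCR && c == '\n') {
                    lastWasCR = false; // skip the LF half of a CRLF pair
                    continue;
                }
                lastWasCR = tokenizeOne(c, out);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A CRLF split across two buffers (say, two document.write() calls)
        // still collapses into a single LF.
        System.out.println(feed(new String[] {"a\r", "\nb"}).equals("a\nb"));
    }
}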

>> The parser operates on UTF-16 code units, so a lone surrogate is emitted.
>
> The spec seems pretty unambiguous that it operates on codepoints

The spec is empirically wrong. The wrongness has been reported. The
spec tries to retrofit Unicode theoretical purity onto legacy where no
purity existed.

The tokenizer operates on UTF-16 code units. document.write() feeds
UTF-16 code units to the tokenizer without lone surrogate
preprocessing. Neither the tokenizer nor the tree builder does
anything about lone surrogates. When consuming a byte stream, the
converter that converts the (potentially unaligned and potentially
foreign-byte-order) UTF-16-encoded byte stream into a stream of
UTF-16 code units is responsible for treating unpaired surrogates as
conversion errors.
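
To illustrate that division of labor, here is a rough sketch of the
kind of unpaired-surrogate check the byte stream converter makes.
This is only a sketch, and replacing an error with U+FFFD is an
assumption on my part, not a description of the actual Gecko
converter.

final class SurrogateCheckSketch {
    // Scan a buffer of UTF-16 code units and replace any unpaired
    // surrogate with U+FFFD; valid surrogate pairs pass through untouched.
    static String check(char[] units) {
        StringBuilder out = new StringBuilder(units.length);
        for (int i = 0; i < units.length; i++) {
            char c = units[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < units.length && Character.isLowSurrogate(units[i + 1])) {
                    out.append(c).append(units[++i]); // valid pair
                } else {
                    out.append('\uFFFD'); // high surrogate without a low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                out.append('\uFFFD'); // low surrogate without a preceding high surrogate
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(check(new char[] {'a', '\uD800', 'b'}).equals("a\uFFFDb"));
        System.out.println(check("a\uD83D\uDE00b".toCharArray()).equals("a\uD83D\uDE00b"));
    }
}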

Sorry about not mentioning earlier that the problematic tests are also
problematic in this sense.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/


