[whatwg] document.write("\r"): the spec doesn't say how to handle it.
Ian Hickson
ian at hixie.ch
Mon Feb 13 14:56:36 PST 2012
On Mon, 19 Dec 2011, Henri Sivonen wrote:
> On Wed, Dec 14, 2011 at 2:00 AM, Ian Hickson <ian at hixie.ch> wrote:
> > I can remove the text "one at a time", if you like. Would that be
> > satisfactory? Or I guess I could change the spec to say that the
> > parser should process the characters, rather than the tokenizer, since
> > really it's the whole shebang that needs to be involved (stream
> > preprocessor and everything). Any opinions on what the right text is
> > here?
>
> I'd like the CRLF preprocessing to be defined as an eager stateful
> operation so that there's one bit of state: "last was CR". Then, input
> is handled as follows:
> If the input character is CR, set "last was CR" to true and emit LF.
> If the input character is LF and "last was CR" is true, don't emit
> anything and set "last was CR" to false.
> If the input character is LF and "last was CR" is false, emit LF.
> Else set "last was CR" to false and emit the input character.
I've done something like this (but simpler to spec).
I've also done the second change I suggested above.
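For illustration, the eager stateful preprocessing Henri describes in the
quoted text could be sketched roughly as follows in Python. This follows the
four quoted steps, not the wording that ended up in the spec. The class name
NewlineNormalizer and its feed() method are hypothetical names chosen for
this example.

```python
class NewlineNormalizer:
    """Normalizes CR and CRLF to LF using one bit of state ("last was CR").

    Hypothetical helper for illustration; not taken from the spec.
    """

    def __init__(self):
        self.last_was_cr = False

    def feed(self, chunk: str) -> str:
        """Process one chunk of input and return the normalized output."""
        out = []
        for ch in chunk:
            if ch == "\r":
                # CR: remember it and emit LF immediately (eager).
                self.last_was_cr = True
                out.append("\n")
            elif ch == "\n":
                # LF right after CR was already emitted as LF; swallow it.
                if not self.last_was_cr:
                    out.append("\n")
                self.last_was_cr = False
            else:
                self.last_was_cr = False
                out.append(ch)
        return "".join(out)
```

Because the state lives across calls, feeding "a\r" and then "\nb" as two
separate chunks (as two document.write() calls might) produces a single LF
between "a" and "b" rather than two.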
On Thu, 3 Nov 2011, David Flanagan wrote:
>
> The spec seems pretty unambiguous that it operates on codepoints (though
> I implemented mine using 16-bit code units). §13.2.1: "The input to
> the HTML parsing process consists of a stream of Unicode code points".
> Also §13.2.2.3 includes a list of codepoints beyond the BMP that are
> parse errors. And finally, the tests in
> http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test
> require unpaired surrogates to be converted to the U+FFFD replacement
> character. (Though my experience is that modifying my tokenizer to pass
> those tests causes other tests to fail, which makes me wonder whether
> unpaired surrogates are only supposed to be replaced in some but not all
> tokenizer states)
This has changed a bit. In particular, "Unicode code point" is currently
defined in a way that is (in theory) black-box indistinguishable from
UTF-16 handling, but without making "astral characters" into second-class
citizens.
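
As a rough illustration of the relationship between 16-bit code units and
code points discussed above, here is a minimal Python sketch that combines
surrogate pairs into astral code points and replaces unpaired surrogates
with U+FFFD, as the html5lib tests David mentions expect. The function name
code_units_to_code_points is hypothetical, and the sketch makes no claim
about which tokenizer states the replacement should apply in.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def code_units_to_code_points(units):
    """Convert a list of 16-bit code units to a list of code points.

    Hypothetical helper for illustration; unpaired surrogates become U+FFFD.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid surrogate pair: combine into one astral code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone high or low surrogate: substitute the replacement character.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out
```

For example, the code units 0xD801 0xDC7E combine into the single code point
U+1047E, while a lone 0xD800 between two BMP characters becomes U+FFFD.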
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'