[whatwg] document.write("\r"): the spec doesn't say how to handle it.

Ian Hickson ian at hixie.ch
Mon Feb 13 14:56:36 PST 2012


On Mon, 19 Dec 2011, Henri Sivonen wrote:
> On Wed, Dec 14, 2011 at 2:00 AM, Ian Hickson <ian at hixie.ch> wrote:
> > I can remove the text "one at a time", if you like. Would that be 
> > satisfactory? Or I guess I could change the spec to say that the 
> > parser should process the characters, rather than the tokenizer, since 
> > really it's the whole shebang that needs to be involved (stream 
> > preprocessor and everything). Any opinions on what the right text is 
> > here?
> 
> I'd like the CRLF preprocessing to be defined as an eager stateful
> operation so that there's one bit of state: "last was CR". Then, input
> is handled as follows:
> If the input character is CR, set "last was CR" to true and emit LF.
> If the input character is LF and "last was CR" is true, don't emit
> anything and set "last was CR" to false.
> If the input character is LF and "last was CR" is false, emit LF.
> Else set "last was CR" to false and emit the input character.

I've done something like this (but simpler to spec).

I've also done the second change I suggest above.
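
For illustration, Henri's four rules quoted above can be sketched as a 
small stateful filter. This is only a sketch of the quoted proposal, not 
the spec's wording (which, as noted, is simpler); the class and method 
names are made up for the example.

```python
class NewlineNormalizer:
    """Eager CR/LF normalizer with one bit of state, "last was CR".

    Hypothetical helper: the name and the feed() method are assumptions
    made for this example, not anything defined by the spec.
    """

    def __init__(self):
        self.last_was_cr = False

    def feed(self, chunk):
        """Normalize one chunk of input, e.g. one document.write() argument."""
        out = []
        for ch in chunk:
            if ch == "\r":
                # CR: remember it and emit LF immediately.
                self.last_was_cr = True
                out.append("\n")
            elif ch == "\n":
                if self.last_was_cr:
                    # LF right after CR: already emitted, so swallow it.
                    self.last_was_cr = False
                else:
                    out.append("\n")
            else:
                self.last_was_cr = False
                out.append(ch)
        return "".join(out)
```

Because the flag survives between calls, feeding "a\r" and then "\nb" to 
the same instance yields "a\n" followed by "b", so a CRLF pair split 
across two writes still collapses to a single LF.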


On Thu, 3 Nov 2011, David Flanagan wrote:
> 
> The spec seems pretty unambiguous that it operates on codepoints (though 
> I implemented mine using 16-bit code units). §13.2.1: "The input to 
> the HTML parsing process consists of a stream of Unicode code points".  
> Also §13.2.2.3 includes a list of codepoints beyond the BMP that are 
> parse errors.  And finally, the tests in 
> http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test 
> require unpaired surrogates to be converted to the U+FFFD replacement 
> character.  (Though my experience is that modifying my tokenizer to pass 
> those tests causes other tests to fail, which makes me wonder whether 
> unpaired surrogates are only supposed to be replaced in some but not all 
> tokenizer states)

This has changed a bit. In particular, "Unicode code point" is currently 
defined in a way that is (in theory) black-box indistinguishable from 
UTF-16 handling, but without making "astral characters" into second-class 
citizens.
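
To make the code unit versus code point distinction concrete, here is a 
minimal sketch that combines 16-bit code units into code points and 
turns unpaired surrogates into U+FFFD, as David says the html5lib tests 
expect. The function name is made up for the example, and it replaces 
lone surrogates unconditionally; whether that is right in every 
tokenizer state is exactly the open question raised above.

```python
REPLACEMENT_CHARACTER = 0xFFFD


def code_units_to_code_points(units):
    """Combine 16-bit code units into Unicode code points.

    Hypothetical helper for illustration only. Unpaired surrogates are
    replaced with U+FFFD regardless of context, which is an assumption.
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # Lead surrogate: valid only if a trail surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                trail = units[i + 1]
                out.append(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT_CHARACTER)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Trail surrogate with no preceding lead surrogate.
            out.append(REPLACEMENT_CHARACTER)
        else:
            out.append(unit)
        i += 1
    return out
```

For example, the pair 0xD801, 0xDC7E combines into the single code point 
U+1047E, while a lone 0xD800 between "A" and "B" comes out as U+FFFD.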

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

