[whatwg] Byte-wise tokenization algorithm

Sun Dec 21 08:35:53 PST 2008

Ian Hickson wrote:
> Yes. (At least, that's the intent; if you find anything that contradicts 
> that, please let me know.)

Great. I'll be sure to ping you if I find out otherwise.

> Looking just at parsing, yes, probably...

I suppose the big pivot point is "as if". A byte-wise implementation
would globally replace "character" with "byte", and any U+xxxx
designation with its UTF-8 encoded byte sequence. HTML 5 dictates the
end behavior, not the actual algorithm implementation, no?
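
To make the "as if" point concrete, here is a rough sketch in Python (not
the language of any real implementation). The function names are
hypothetical, and the NUL-to-U+FFFD substitution is only there to
illustrate swapping a U+xxxx designation for its UTF-8 bytes; it is not a
claim about what the spec requires at any particular step. The idea is
that a scan over UTF-8 bytes for ASCII delimiters should give the same
result as a scan over characters, since the bytes 0x00 and 0x3C never
occur inside a multi-byte UTF-8 sequence.

```python
# Illustrative only: hypothetical helpers, not taken from the HTML 5 spec.

LT = "\u003c".encode("utf-8")            # U+003C '<' is the single byte 0x3C
REPLACEMENT = "\ufffd".encode("utf-8")   # U+FFFD is the three bytes EF BF BD


def chunks_by_char(text):
    """Character-wise: replace U+0000 with U+FFFD, then split on U+003C."""
    return text.replace("\u0000", "\ufffd").split("\u003c")


def chunks_by_byte(data):
    """Byte-wise: the same two steps, expressed over UTF-8 encoded bytes."""
    return data.replace(b"\x00", REPLACEMENT).split(LT)


sample = "caf\u00e9 <b>na\u00efve\u0000</b>"
by_char = chunks_by_char(sample)
by_byte = [chunk.decode("utf-8") for chunk in chunks_by_byte(sample.encode("utf-8"))]
print(by_char == by_byte)
```

If the two results agree for every input, the byte-wise version behaves
"as if" it were the character-wise one, which is all that matters for the
end behavior.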

> But an HTML5 implementation, 
> according to the spec, must at a minimum support the UTF-8 and 
> Windows-1252 encodings, so the overall implementation might not depending 
> on exactly how this is done.

The plan is to convert Windows-1252 into UTF-8 before processing; with a
reasonably good iconv implementation, support for lots of encodings is
possible. The implementation might not be fully conforming if iconv
doesn't perform the proper (possibly context-sensitive; I haven't
checked) substitution when it doesn't recognize a character, but it
should be close.
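
A minimal sketch of that pre-conversion step, using Python's built-in
codecs as a stand-in for iconv (so the function name and the
`errors="replace"` policy are assumptions for illustration, not what iconv
or the spec actually does). Bytes that the codec cannot map become U+FFFD
here; whether that matches the required substitution is exactly the open
question above.

```python
# Illustrative only: Python's "cp1252" codec stands in for iconv.

def to_utf8(raw, source_encoding="cp1252"):
    """Transcode raw bytes to UTF-8 before tokenizing.

    Unmappable bytes are replaced with U+FFFD (an assumed policy).
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# 0xE9 is e-acute and 0x93/0x94 are curly quotes in Windows-1252.
print(to_utf8(b"caf\xe9 \x93hi\x94"))

# 0x81 has no assignment in Python's cp1252 table, so it becomes U+FFFD.
print(to_utf8(b"\x81"))
```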