[whatwg] Byte-wise tokenization algorithm
Edward Z. Yang
edwardzyang at thewritingpot.com
Sun Dec 21 08:35:53 PST 2008
Ian Hickson wrote:
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)
Great. I'll be sure to ping you if I find out otherwise.
> Looking just at parsing, yes, probably...
I suppose the big pivot point is "as if". A byte-wise implementation
would replace "character" with "byte" throughout, and any U+xxxx
designation with its UTF-8 encoded byte sequence. HTML 5 dictates end
behavior, not the actual algorithm implementation, no?
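
A minimal sketch of what that substitution might look like, in Python
purely for illustration. The names utf8_bytes and split_at_tag_open are
hypothetical and come from neither the spec nor any library; the point
is only that a scan for an ASCII byte over UTF-8 input should give the
same result as a scan for the corresponding character over decoded text.

```python
# Illustrative sketch only; helper names are hypothetical.

def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 byte sequence for one Unicode code point."""
    return chr(code_point).encode("utf-8")

# "U+xxxx" designations rewritten as byte sequences.
LESS_THAN = utf8_bytes(0x003C)    # b"<"
REPLACEMENT = utf8_bytes(0xFFFD)  # b"\xef\xbf\xbd"

def split_at_tag_open(data: bytes) -> list:
    """Split UTF-8 bytes before each "<" byte, without decoding.

    Every byte of a multi-byte UTF-8 sequence is >= 0x80, so the byte
    0x3C can only ever stand for U+003C.
    """
    chunks = []
    start = 0
    for i, b in enumerate(data):
        if b == LESS_THAN[0] and i > start:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks
```

For example, split_at_tag_open("é<b>ü".encode("utf-8")) yields the
encoded forms of "é" and "<b>ü", the same split a character-wise scan
would produce.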
> But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not,
> depending on exactly how this is done.
The plan is to convert Windows-1252 into UTF-8 before processing; with a
reasonably good iconv implementation, support for lots of encodings is
possible. The implementation might not be fully conforming if iconv
doesn't perform the proper (possibly context-sensitive; I haven't
checked) substitution when it doesn't recognize a character, but it
should be close.
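
A rough sketch of that conversion step, using Python's built-in cp1252
codec as a stand-in for iconv (an assumption for illustration; iconv
itself may behave differently). The function name windows1252_to_utf8 is
hypothetical. Bytes the codec does not define are replaced with U+FFFD
here; whether that matches the substitution the spec requires is not
something this sketch verifies.

```python
# Illustrative sketch only; stands in for an iconv-based conversion.

def windows1252_to_utf8(data: bytes) -> bytes:
    """Transcode Windows-1252 bytes to UTF-8 before tokenizing.

    errors="replace" turns bytes that Python's cp1252 codec leaves
    undefined (for example 0x81) into U+FFFD instead of raising.
    """
    text = data.decode("cp1252", errors="replace")
    return text.encode("utf-8")
```

With this, windows1252_to_utf8(b"caf\xe9") gives b"caf\xc3\xa9", and the
undefined byte b"\x81" comes out as the UTF-8 form of U+FFFD.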