[imps] Byte-wise tokenization algorithm

Edward Z. Yang edwardzyang at thewritingpot.com
Sat Dec 20 20:32:01 PST 2008

I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:

1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since every character the algorithm matches
against lies in the ASCII range.

2. Would such an implementation be conforming?
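
To illustrate the reasoning behind question 1, here is a minimal sketch
(in Python rather than PHP, purely for brevity). It relies on the UTF-8
property that every byte of a multi-byte sequence is 0x80 or above, so an
ASCII byte such as "<" can never appear inside an encoded non-ASCII
character. The function names and the sample string are made up for
illustration, and this only demonstrates delimiter matching; it says
nothing about the conformance question in 2.

```python
def split_on_lt_bytes(data):
    """Split UTF-8 bytes at each '<' (0x3C) using byte operations only."""
    return data.split(b"<")


def split_on_lt_chars(text):
    """Reference version: split decoded text at each '<' code point."""
    return text.split("<")


# Hypothetical sample mixing 1-, 2-, 3-, and 4-byte UTF-8 sequences.
sample = "caf\u00e9 <b>\u65e5\u672c\u8a9e</b> \U0001F600"
encoded = sample.encode("utf-8")

# Splitting at an ASCII byte always lands on a character boundary,
# so each piece is itself valid UTF-8 and decodes cleanly.
byte_pieces = [piece.decode("utf-8") for piece in split_on_lt_bytes(encoded)]
char_pieces = split_on_lt_chars(sample)
assert byte_pieces == char_pieces

# Collect the bytes of every non-ASCII character and check that none of
# them falls in the ASCII range (all are >= 0x80).
non_ascii_bytes = [b for ch in sample if ord(ch) > 0x7F
                   for b in ch.encode("utf-8")]
assert all(b >= 0x80 for b in non_ascii_bytes)
```

If the same holds for every character the tokenizer compares against, a
byte-wise scan should find the same match positions as a
code-point-wise scan on valid input.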


