[imps] Byte-wise tokenization algorithm

Edward Z. Yang edwardzyang at thewritingpot.com
Sat Dec 20 20:32:01 PST 2008


I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:

1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since every character the algorithm matches
against is in the ASCII range.

2. Would such an implementation be conforming?
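
To make the reasoning behind question 1 concrete, here is a minimal
sketch in Python (not PHP, and not taken from any HTML5 library). It
relies on the fact that every byte of a multi-byte UTF-8 sequence has
its high bit set, so an ASCII byte such as "<" can never appear inside
one. The helper name split_on_ascii_byte and the sample string are
made up for illustration.

```python
# Hypothetical helper for illustration only; not part of any HTML5 library.
def split_on_ascii_byte(data: bytes, delimiter: bytes) -> list:
    """Split UTF-8 encoded bytes on one ASCII byte, using byte operations only."""
    if len(delimiter) != 1 or delimiter[0] >= 0x80:
        raise ValueError("delimiter must be a single ASCII byte")
    return data.split(delimiter)


# Sample input mixing ASCII markup with 2-byte and 3-byte characters.
text = "caf\u00e9<b>\u65e5\u672c\u8a9e</b>"
data = text.encode("utf-8")

# Split once at the byte level and once at the character level.
byte_chunks = split_on_ascii_byte(data, b"<")
char_chunks = text.split("<")

# Each byte-level chunk is itself valid UTF-8, so decoding never fails
# and the result matches the character-level split.
decoded = [chunk.decode("utf-8") for chunk in byte_chunks]
```

This only shows that searching for ASCII delimiters in valid UTF-8 is
safe at the byte level. It says nothing about whether a tokenizer built
this way would be conforming, which is question 2.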

Cheers,
Edward


