[whatwg] Byte-wise tokenization algorithm

Sat Dec 20 21:41:33 PST 2008

On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>
> I am currently working on a PHP5 implementation of the HTML5 
> specification. PHP has abysmal Unicode support, and implementing Unicode 
> streams in userspace may be unacceptably slow. Thus, my questions:
> 
> 1. Given an input stream that is known to be valid UTF-8, is it possible 
> to implement the tokenization algorithm with byte-wise operations only? 
> I think it's possible, since all of the character matching parts of the 
> algorithm map to characters in ASCII space.

Yes. (At least, that's the intent; if you find anything that contradicts 
that, please let me know.)
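
To illustrate why byte-wise matching is safe on valid UTF-8: every byte of 
a multi-byte UTF-8 sequence is 0x80 or higher, so a search for an ASCII 
byte such as "<" (0x3C) can never match in the middle of a character. The 
following is only a minimal sketch of that property, not the spec's 
tokenizer state machine; the function name and the text/tag split are 
assumptions made up for the example.

```python
def split_on_ascii_delimiters(data: bytes):
    """Split UTF-8 bytes into ('text', chunk) and ('tag', chunk) pieces.

    Hypothetical, deliberately simplified helper: it only looks for the
    ASCII bytes '<' and '>' and ignores comments, attributes, quoting, etc.
    """
    pieces = []
    pos = 0
    while pos < len(data):
        lt = data.find(b"<", pos)          # byte search, no decoding
        if lt == -1:
            pieces.append(("text", data[pos:]))
            break
        if lt > pos:
            pieces.append(("text", data[pos:lt]))
        gt = data.find(b">", lt)
        if gt == -1:                        # unterminated tag: keep as text
            pieces.append(("text", data[lt:]))
            break
        pieces.append(("tag", data[lt:gt + 1]))
        pos = gt + 1
    return pieces


# Input mixes 2-, 3- and 4-byte UTF-8 sequences with ASCII markup.
sample = "<p>na\u00efve \u263a \U0001047e</p>".encode("utf-8")
pieces = split_on_ascii_delimiters(sample)

# Every chunk is still valid UTF-8 because no split fell inside a character.
for kind, chunk in pieces:
    chunk.decode("utf-8")
```

Because the cuts only ever happen at ASCII bytes, each chunk can be 
decoded (or passed through untouched) without ever tracking character 
boundaries.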

> 2. Would such an implementation be conforming?

Looking just at parsing, yes, probably... But an HTML5 implementation, 
according to the spec, must at a minimum support the UTF-8 and 
Windows-1252 encodings, so the overall implementation might not be 
conforming, depending on exactly how this is done.
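
One possible way to reconcile the two, offered only as a sketch and not as 
something the spec prescribes: transcode Windows-1252 input to UTF-8 up 
front, then run the byte-wise tokenizer on the result. The helper name 
below is hypothetical. Note that Python's "cp1252" codec rejects the five 
bytes Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a 
real implementation would need to decide how to handle those.

```python
def to_utf8(data: bytes, encoding: str) -> bytes:
    """Return the input re-encoded as UTF-8 (hypothetical helper).

    Only the two encodings mentioned above are handled in this sketch.
    """
    if encoding.lower() in ("utf-8", "utf8"):
        data.decode("utf-8")               # validate only; raises if invalid
        return data
    if encoding.lower() in ("windows-1252", "cp1252"):
        return data.decode("cp1252").encode("utf-8")
    raise ValueError("unsupported encoding in this sketch: " + encoding)


# 0xE9 is e-acute; 0x93 and 0x94 are curly double quotes in Windows-1252.
legacy = b"caf\xe9 \x93hi\x94"
utf8_bytes = to_utf8(legacy, "windows-1252")
```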

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'