[whatwg] Byte-wise tokenization algorithm
Ian Hickson
ian at hixie.ch
Sat Dec 20 21:41:33 PST 2008
On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>
> I am currently working on a PHP5 implementation of the HTML5
> specification. PHP has abysmal Unicode support, and implementing Unicode
> streams in userspace may be unacceptably slow. Thus, my questions:
>
> 1. Given an input stream that is known to be valid UTF-8, is it possible
> to implement the tokenization algorithm with byte-wise operations only?
> I think it's possible, since all of the character matching parts of the
> algorithm map to characters in ASCII space.
Yes. (At least, that's the intent; if you find anything that contradicts
that, please let me know.)
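
As a rough illustration of why this should work (a sketch in Python rather
than PHP, not taken from the spec): in valid UTF-8, every byte of a
multi-byte sequence is 0x80 or higher, so a byte equal to an ASCII value
such as 0x3C can only be that ASCII character. The helper name
find_tag_opens is hypothetical.

```python
# Sketch only: scan UTF-8 bytes for '<' without decoding to code points.
# find_tag_opens is a hypothetical helper, not part of any spec or library.

LT = 0x3C  # ASCII '<'


def find_tag_opens(data: bytes) -> list:
    """Return the byte offsets of every '<' in a valid UTF-8 byte string."""
    offsets = []
    for i, b in enumerate(data):
        # Lead and continuation bytes of multi-byte sequences are all
        # >= 0x80, so this comparison can never match inside one.
        if b == LT:
            offsets.append(i)
    return offsets


# "<p>café ☺</p>" contains a two-byte and a three-byte character.
sample = "<p>caf\u00e9 \u263a</p>".encode("utf-8")
offsets = find_tag_opens(sample)
```

The offsets are byte positions, not character positions, so anything that
reports positions in characters would still need to count sequences.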
> 2. Would such an implementation be conforming?
Looking just at parsing, yes, probably... But an HTML5 implementation,
according to the spec, must at a minimum support the UTF-8 and
Windows-1252 encodings, so the overall implementation might not be,
depending on exactly how this is done.
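
One possible way to cover Windows-1252 while keeping a byte-wise tokenizer
is to transcode such input to UTF-8 up front. The following is only a
sketch in Python under that assumption; the function name cp1252_to_utf8
is hypothetical, and the handling of the five bytes that Python's cp1252
codec leaves undefined (mapping them to the matching C1 control
characters) is an assumption rather than something stated in this thread.

```python
# Sketch only: convert Windows-1252 bytes to UTF-8 before tokenizing.
# cp1252_to_utf8 is a hypothetical helper name.


def cp1252_to_utf8(data: bytes) -> bytes:
    """Transcode Windows-1252 input to UTF-8, one byte at a time."""
    chars = []
    for b in data:
        one = bytes([b])
        try:
            chars.append(one.decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D, which
            # Python's codec rejects, become U+0081, U+008D, and so on.
            chars.append(one.decode("latin-1"))
    return "".join(chars).encode("utf-8")


# 0x93 and 0x94 are curly double quotes in Windows-1252.
converted = cp1252_to_utf8(b"\x93hi\x94")
```

After this step the tokenizer only ever sees UTF-8, so the byte-wise
matching described above is unaffected by the source encoding.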
HTH,
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'