[whatwg] Byte-wise tokenization algorithm
foolistbar at googlemail.com
Sun Dec 21 01:07:40 PST 2008
On 21 Dec 2008, at 05:41, Ian Hickson wrote:
>> 1. Given an input stream that is known to be valid UTF-8, is it
>> possible to implement the tokenization algorithm with byte-wise
>> operations? I think it's possible, since all of the character
>> matching parts of the algorithm map to characters in ASCII space.
> Yes. (At least, that's the intent; if you find anything that
> contradicts that, please let me know.)
Indeed it is possible (or at least it certainly was a year and a half
ago, but I have seen nothing change that would stop it).
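
To illustrate why this works, here is a rough Python sketch (not taken from
any real tokenizer; the function name and sample string are made up). It
relies on the fact that in valid UTF-8 every byte of a multi-byte sequence
is 0x80 or above, so a byte-wise search for an ASCII character such as "<"
can never match in the middle of another character.

```python
def find_tag_opens(data):
    """Return the byte offsets of every '<' in a valid UTF-8 byte string."""
    offsets = []
    pos = data.find(b"<")
    while pos != -1:
        offsets.append(pos)
        pos = data.find(b"<", pos + 1)
    return offsets


# Hypothetical sample input mixing ASCII markup with non-ASCII text.
sample_text = "<p>caf\u00e9 \u2603 <b>na\u00efve</b></p>"
sample_bytes = sample_text.encode("utf-8")

# Byte-wise scan, never decoding the input.
byte_offsets = find_tag_opens(sample_bytes)

# Cross-check: scan the decoded characters and convert each hit back to
# the byte offset it would have in the UTF-8 encoding.
char_offsets = [
    len(sample_text[:i].encode("utf-8"))
    for i, ch in enumerate(sample_text)
    if ch == "<"
]
```

Both scans should agree on where the tags start, which is all the
tokenizer's character matching needs.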
>> 2. Would such an implementation be conforming?
> Looking just at parsing, yes, probably... But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not be
> conforming, depending on exactly how this is done.
That should be no problem: just convert Windows-1252 to UTF-8 using
strtr() (as it is an SBCS, i.e. a single-byte character set, this is
simple enough; doing the inverse is not). See the attached file. Then
all you need to do is normalize the
character set name to match all aliases of Windows-1252 and UTF-8, as
well as mapping ISO-8859-1 and US-ASCII (and all their aliases) to
Windows-1252. [...] does that (the only dependency is for getting the
file via HTTP, and that can just be replaced with cURL if you wish to
just require that).
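
For anyone who wants to see the shape of this without the attachment, here
is a minimal sketch in Python rather than PHP. It is not the attached file:
all names are hypothetical, the alias table is deliberately incomplete, and
passing the five bytes that Windows-1252 leaves undefined through as C1
control characters is an assumption on my part. The per-byte lookup table
plays the role that the replacement-pairs array would play with strtr().

```python
def build_cp1252_table():
    """Build a 256-entry table mapping each byte to its UTF-8 encoding."""
    table = []
    for byte in range(256):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Assumption: bytes undefined in Windows-1252 (0x81, 0x8D,
            # 0x8F, 0x90, 0x9D) become the same-numbered code point.
            char = chr(byte)
        table.append(char.encode("utf-8"))
    return table


CP1252_TO_UTF8 = build_cp1252_table()

# Illustrative, incomplete alias table. ISO-8859-1 and US-ASCII labels
# are mapped to Windows-1252, as described above.
ENCODING_ALIASES = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
}


def normalize_label(label):
    """Return the canonical encoding name, or None if it is unknown."""
    return ENCODING_ALIASES.get(label.strip().lower())


def cp1252_to_utf8(data):
    """Convert Windows-1252 bytes to UTF-8 bytes, one byte at a time."""
    return b"".join(CP1252_TO_UTF8[byte] for byte in data)


def to_utf8(data, label):
    """Return UTF-8 bytes for input declared with the given label."""
    encoding = normalize_label(label)
    if encoding == "utf-8":
        return data  # assumed already valid UTF-8; not validated here
    if encoding == "windows-1252":
        return cp1252_to_utf8(data)
    raise ValueError("unsupported encoding label: %r" % label)
```

The forward direction is simple because every input byte maps to exactly
one output sequence. After this step the tokenizer only ever sees UTF-8.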
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 4352 bytes
Desc: not available