[whatwg] Byte-wise tokenization algorithm

Geoffrey Sneddon foolistbar at googlemail.com
Sun Dec 21 01:07:40 PST 2008


On 21 Dec 2008, at 05:41, Ian Hickson wrote:

>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)

Indeed it is possible (or at least it certainly was a year and a half
ago, and I have seen no change since that would prevent it).
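
The reason it works is that in valid UTF-8 every byte of a multi-byte
sequence is 0x80 or above, so an ASCII byte such as "<" can never turn
up inside another character. A minimal sketch of that, in Python for
brevity (find_tag_opens is an illustrative name of mine, not anything
from the spec):

```python
LT = 0x3C  # the byte for "<" in ASCII and therefore in UTF-8


def find_tag_opens(data: bytes) -> list:
    """Return the byte offset of every "<" in a valid UTF-8 byte string.

    Only single bytes are compared; nothing is decoded. This is safe
    because lead and continuation bytes of multi-byte sequences are
    always >= 0x80 and so can never equal an ASCII byte.
    """
    return [i for i, b in enumerate(data) if b == LT]


if __name__ == "__main__":
    sample = "é<p>日本語</p>".encode("utf-8")
    for offset in find_tag_opens(sample):
        # Slicing at these offsets never cuts a character in half.
        print(offset, sample[:offset].decode("utf-8"))
```

The same argument holds for every other character the tokenizer
matches against, as long as they are all in the ASCII range.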

>> 2. Would such an implementation be conforming?
>
> Looking just at parsing, yes, probably... But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not depending
> on exactly how this is done.

That should be no problem: just convert Windows-1252 to UTF-8 using
strtr(); see the attached file. (As Windows-1252 is a single-byte
character set this is simple enough; doing the inverse is not.) Then all
you need to do is normalize the character set name so that all aliases
of Windows-1252 and UTF-8 are matched, and map ISO-8859-1 and US-ASCII
(and all their aliases) to Windows-1252.
<http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php>
does that (the only dependency is for fetching the file via HTTP, which
can be replaced with cURL if you would rather just require that).
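
For anyone who cannot get at the attachment, here is a rough sketch of
the same table-driven idea, written in Python rather than PHP. It is not
the attached file. The function names and the short alias sets are
illustrative only (a real list of aliases is much longer), and sending
the five bytes that Python's cp1252 codec leaves undefined to the
same-numbered code points is an assumption on my part.

```python
def _build_table():
    """Map each Windows-1252 byte to the UTF-8 bytes for its character."""
    table = []
    for b in range(256):
        try:
            ch = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's
            # cp1252 codec. Assumption: pass them through as U+0081 etc.
            ch = chr(b)
        table.append(ch.encode("utf-8"))
    return table


_TABLE = _build_table()

# Illustrative subsets only, not complete alias lists.
UTF8_LABELS = {"utf-8", "utf8"}
WINDOWS_1252_LABELS = {
    "windows-1252", "cp1252",
    "iso-8859-1", "iso_8859-1", "latin1",
    "us-ascii", "ascii",
}


def windows_1252_to_utf8(data: bytes) -> bytes:
    """Convert byte by byte; one input byte yields one to three output bytes."""
    return b"".join(_TABLE[b] for b in data)


def normalize_label(label: str) -> str:
    """Collapse a character set name onto one of the two supported encodings."""
    key = label.strip().lower()
    if key in UTF8_LABELS:
        return "UTF-8"
    if key in WINDOWS_1252_LABELS:
        return "Windows-1252"
    raise LookupError("unsupported encoding label: %r" % label)


def to_utf8(data: bytes, label: str) -> bytes:
    """Return data as UTF-8 so the tokenizer only ever sees one encoding."""
    if normalize_label(label) == "UTF-8":
        return data
    return windows_1252_to_utf8(data)
```

The forward direction needs only a 256-entry lookup because every
Windows-1252 byte stands alone. Going back would mean parsing multi-byte
sequences and deciding what to do with characters that have no
Windows-1252 equivalent.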


--
Geoffrey Sneddon
<http://gsnedders.com/>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: windows_1252_to_utf8.php
Type: text/php
Size: 4352 bytes
Desc: not available
URL: <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/attachments/20081221/3333feb0/attachment.bin>

