[whatwg] Byte-wise tokenization algorithm
Geoffrey Sneddon
foolistbar at googlemail.com
Sun Dec 21 01:07:40 PST 2008
On 21 Dec 2008, at 05:41, Ian Hickson wrote:
>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)
Indeed it is possible (or at least it certainly was a year and a half
ago, but I have seen nothing change that would stop it).
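
To illustrate why byte-wise matching is safe on valid UTF-8: every byte of a
multi-byte UTF-8 sequence is 0x80 or above, so an ASCII byte such as "<" can
never occur inside another character. The following is only a toy sketch in
Python, not the spec's tokenizer; the function name split_tags and its
two-state design are made up for this example.

```python
LT, GT = 0x3C, 0x3E  # ASCII '<' and '>'


def split_tags(data: bytes) -> list[tuple[str, bytes]]:
    """Toy two-state scanner over raw UTF-8 bytes (hypothetical helper).

    Returns ("tag", bytes) and ("text", bytes) tokens. It only ever compares
    bytes against ASCII values and never decodes the input.
    """
    tokens = []
    start = 0
    in_tag = False
    for i, b in enumerate(data):
        if not in_tag and b == LT:
            if i > start:
                tokens.append(("text", data[start:i]))
            start, in_tag = i, True
        elif in_tag and b == GT:
            tokens.append(("tag", data[start:i + 1]))
            start, in_tag = i + 1, False
    # Trailing bytes (including an unterminated tag) are emitted as text
    # in this toy version.
    if start < len(data):
        tokens.append(("text", data[start:]))
    return tokens


sample = "<p>caf\u00e9 \u2603</p>".encode("utf-8")
tokens = split_tags(sample)
```

Because the split points are always ASCII bytes, each token is itself valid
UTF-8 and can be decoded later, or not at all.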
>> 2. Would such an implementation be conforming?
>
> Looking just at parsing, yes, probably... But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not depending
> on exactly how this is done.
That should be no problem: just convert Windows-1252 to UTF-8 using
strtr() (as it is a single-byte character set (SBCS) this is simple enough;
doing the inverse is not); see the attached file. Then all you need to do is
normalize the character set name to match all aliases of Windows-1252 and
UTF-8, as well as mapping ISO-8859-1 and US-ASCII (and all their aliases) to
Windows-1252.
<http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php> does
that (the only dependency is for fetching the file via HTTP, which can be
replaced with cURL if you prefer to require only that).
--
Geoffrey Sneddon
<http://gsnedders.com/>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: windows_1252_to_utf8.php
Type: text/php
Size: 4352 bytes
Desc: not available
URL: <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/attachments/20081221/3333feb0/attachment-0001.bin>