[whatwg] Byte-wise tokenization algorithm

Sun Dec 21 09:18:32 PST 2008

On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson <ian at hixie.ch> wrote:
> On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>>
>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)

I think it should still work, but there are some cases where you have
to be a little careful. For example, "<table>foo" notionally results in
three parse errors according to the spec (one for each character token
that gets foster-parented), so "<table>☹" results in one parse error if
you work with Unicode characters but three if you treat each UTF-8 byte
as a separate character token (☹ is U+2639, which UTF-8 encodes as
three bytes).
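
To make the arithmetic concrete, here is a rough Python sketch. The
function name count_character_tokens is made up for illustration and is
not part of the spec or of html5lib; it only counts how many
single-character tokens a run of text would produce under each model.

```python
def count_character_tokens(text: str, bytewise: bool) -> int:
    """Count one token per Unicode code point, or one per UTF-8 byte.

    Hypothetical helper: it models only the token count, not the
    tokenizer or the foster-parenting logic itself.
    """
    if bytewise:
        return len(text.encode("utf-8"))
    return len(text)


print(count_character_tokens("foo", bytewise=False))     # 3
print(count_character_tokens("\u2639", bytewise=False))  # 1
print(count_character_tokens("\u2639", bytewise=True))   # 3
```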

But in practice, tokenisers emit sequence-of-many-characters tokens
instead of single-character tokens, so they only emit one parse error
for "<table>foo", and the html5lib test cases assume that behaviour,
and it should work identically if you have sequence-of-many-bytes
tokens instead.
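
As a rough illustration of the coalescing idea, the toy splitter below
works purely on bytes and emits each run of text as a single token. It
is a sketch only, not the HTML5 tokenization algorithm: tokenize_bytes
and its "tag"/"text" labels are invented for this example, and it
ignores comments, attributes, entities and everything else. It relies
on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or
above, so searching for the ASCII bytes "<" and ">" can never match in
the middle of a character.

```python
def tokenize_bytes(data: bytes):
    """Yield ("tag", bytes) or ("text", bytes) pairs from UTF-8 input.

    Toy model, assumed for illustration: anything from "<" up to the
    next ">" is a tag, and everything else is one coalesced text run.
    """
    pos = 0
    while pos < len(data):
        if data[pos:pos + 1] == b"<":
            end = data.find(b">", pos)
            if end == -1:
                # Unterminated tag: take the rest of the input.
                end = len(data) - 1
            yield ("tag", data[pos:end + 1])
            pos = end + 1
        else:
            end = data.find(b"<", pos)
            if end == -1:
                end = len(data)
            # The whole run becomes one token, however many bytes it has.
            yield ("text", data[pos:end])
            pos = end


for source in ("<table>foo", "<table>\u2639"):
    tokens = list(tokenize_bytes(source.encode("utf-8")))
    text_tokens = [t for t in tokens if t[0] == "text"]
    print(source, len(text_tokens))
```

Both inputs produce exactly one text token, so a tree builder that
reports one parse error per foster-parented text token would report the
same count whether the token holds characters or bytes.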

(Apparently only the distinction between zero and more than zero parse
errors matters as far as the spec is concerned, since that determines
whether the document is conforming. But it seems useful for
implementors to share test cases that are precise about exactly where
each parse error is emitted, since that helps find bugs, and so the
parse error count is still relevant.)

-- 
Philip Taylor
excors at gmail.com