[whatwg] Byte-wise tokenization algorithm

Sun Dec 21 09:19:39 PST 2008

On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:

> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace character globally with byte, and any U+xxxx designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given
algorithm.
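
As a rough illustration of what "wholly equivalent" would mean here (a sketch in Python, not anything taken from the spec; the helper names split_chars and split_bytes are hypothetical), splitting on U+003C as a character and splitting on its UTF-8 encoded byte yield the same pieces for well-formed UTF-8 input:

```python
def split_chars(text):
    """Character-wise: split a decoded string on U+003C LESS-THAN SIGN."""
    return text.split("\u003c")


def split_bytes(data):
    """Byte-wise: split raw UTF-8 bytes on the UTF-8 encoding of U+003C."""
    return data.split("\u003c".encode("utf-8"))


# Sample input containing non-ASCII characters around the delimiter.
sample = "caf\u00e9<b>na\u00efve"

by_char = split_chars(sample)
by_byte = [piece.decode("utf-8") for piece in split_bytes(sample.encode("utf-8"))]

# Both approaches must produce identical pieces to count as equivalent.
assert by_char == by_byte
```

This only holds because no byte of a multi-byte UTF-8 sequence can equal an ASCII byte; it says nothing about input that is not valid UTF-8.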

>> But an HTML5 implementation,
>> according to the spec, must at a minimum support the UTF-8 and
>> Windows-1252 encodings, so the overall implementation might not
>> depending on exactly how this is done.
>
> The plan is to convert Windows-1252 into UTF-8 before processing; with
> a reasonably good iconv implementation, support for lots of encodings is
> possible. The implementation might not be fully conforming if iconv
> doesn't perform the proper (possibly context-sensitive; I haven't
> checked) substitution when it doesn't recognize a character, but it
> should be close.

I've never seen any way of getting iconv (at least via PHP) to do what
HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is,
however, possible using mbstring (which also has the advantage of not
being system dependent), as well as with PHP 6's Unicode support.
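
For comparison, here is a minimal sketch of the replace-with-U+FFFD behaviour, written in Python rather than PHP and using only the built-in "replace" error handler. The function names are hypothetical, and the number of U+FFFD characters emitted for a given malformed sequence is whatever Python's decoder produces, which is not necessarily what HTML 5 specifies:

```python
def decode_utf8_lenient(data):
    """Decode UTF-8 bytes, substituting U+FFFD for bytes that are not valid."""
    return data.decode("utf-8", errors="replace")


def to_utf8(data, source_encoding):
    """Transcode bytes from source_encoding to UTF-8 before processing.

    Bytes the source codec cannot map become U+FFFD instead of raising.
    """
    text = data.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# 0xFF can never appear in well-formed UTF-8, so it is replaced.
lenient = decode_utf8_lenient(b"a\xffb")

# 0xE9 is LATIN SMALL LETTER E WITH ACUTE in Windows-1252.
converted = to_utf8(b"caf\xe9", "cp1252")
```

Here `lenient` contains a U+FFFD between "a" and "b", and `converted` holds the UTF-8 bytes for "café".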

--
Geoffrey Sneddon
<http://gsnedders.com/>