[whatwg] Byte-wise tokenization algorithm
foolistbar at googlemail.com
Sun Dec 21 09:19:39 PST 2008
On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:
> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace character globally with byte, and any U+xxxx designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?
It states that what is done must be wholly equivalent to the given algorithm.
>> But an HTML5 implementation,
>> according to the spec, must at a minimum support the UTF-8 and
>> Windows-1252 encodings, so the overall implementation might not
>> on exactly how this is done.
> The plan is to convert Windows-1252 into UTF-8 before processing; with a
> reasonably good iconv implementation, support for lots of encodings is
> possible. The implementation might not be fully conforming if iconv
> doesn't perform the proper (possibly context-sensitive; I haven't
> checked) substitution when it doesn't recognize a character, but it
> should be close.
I've never seen any way of getting iconv (at least via PHP) to do what
HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is,
however, possible using mbstring (which also has the advantage of not
being system-dependent), as well as with PHP6's Unicode support.
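
To make the required end behaviour concrete, here is a minimal sketch in
Python rather than PHP (it does not show the mbstring call itself). The
function name to_utf8_with_replacement is made up for illustration; it
relies only on the standard errors="replace" decoding mode, which
substitutes U+FFFD for bytes that are invalid in the source encoding. It
is a rough illustration of the idea, not a claim of full conformance with
the spec's decoding rules.

```python
def to_utf8_with_replacement(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Normalise input to UTF-8 before tokenizing (hypothetical helper).

    Bytes that are not valid in source_encoding are decoded as U+FFFD
    (REPLACEMENT CHARACTER) instead of raising an error.
    """
    # errors="replace" makes the decoder emit U+FFFD for undecodable bytes.
    text = raw.decode(source_encoding, errors="replace")
    # Re-encode so a byte-wise tokenizer only ever sees well-formed UTF-8.
    return text.encode("utf-8")


# 0xFF can never appear in well-formed UTF-8, so it becomes U+FFFD.
cleaned = to_utf8_with_replacement(b"abc\xffdef")

# Windows-1252 input is transcoded: 0xE9 (e with acute) becomes 0xC3 0xA9.
transcoded = to_utf8_with_replacement(b"caf\xe9", "cp1252")
```

After this step the tokenizer can work on bytes throughout, since every
invalid sequence has already been turned into the UTF-8 encoding of
U+FFFD (EF BF BD).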