[whatwg] Web Encodings

Anne van Kesteren annevk at opera.com
Wed Aug 19 13:52:38 PDT 2009


On Wed, 19 Aug 2009 22:47:57 +0200, Anne van Kesteren <annevk at opera.com> wrote:
> Today every browser implements their own encoding label matching  
> algorithm, supports their own list of encodings, their own list of  
> encoding label aliases, and everything sort of works, but not really.
>
> HTML5 solves part of this problem by defining exactly how to identify an  
> encoding label alias in a text/html stream. It also defines which  
> encoding label matching algorithm to use, UTS22, but we found out that  
> this is incompatible with (existing) sites that specify EUC_JP at the  
> HTTP level and actually want to be decoded per UTF-8 according to a  
> <meta> in the text/html stream. This works fine if you have a strict  
> encoding label matching algorithm, but with UTS22, EUC_JP and EUC-JP  
> become the same thing, while only the latter is the actual encoding  
> label.
>
> Another problem HTML5 does not solve is giving a definitive list of  
> encodings clients have to implement to be compatible with a large body  
> of Web content. This means new clients will have to reverse engineer  
> that list from existing clients which I think is bad.

To continue, I'd like to request help documenting which encodings, encoding label aliases, and matching rules each browser supports, so we can figure out what the rules should be. So far I have documented what Opera supports here:

  http://wiki.whatwg.org/wiki/Web_Encodings

I've also done research into the matching algorithm here:

  http://dump.testsuite.org/2009/encoding-matching/

So far I have found that Internet Explorer and Firefox are the strictest when it comes to matching and the most compatible with deployed content (not entirely unexpected). So it is very likely that a final document describing the algorithm should be based on these browsers.
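
To make the EUC_JP versus EUC-JP difference concrete, here is a minimal Python sketch of the two matching styles. It rests on several assumptions: the loose matcher is a simplified reading of the UTS22 charset alias matching rules (ignore case, drop everything that is not an ASCII letter or digit, drop zeros not preceded by a digit); the strict matcher is assumed to be a plain case-insensitive comparison after trimming whitespace, which may not be exactly what any browser does; and KNOWN_LABELS, loose_key, match_loose, and match_strict are hypothetical names invented for this example.

```python
import re

# Hypothetical label set for illustration only; real browsers ship far
# larger lists, and documenting those lists is the point of this thread.
KNOWN_LABELS = {"euc-jp", "utf-8"}


def loose_key(label):
    """Reduce a label to a comparison key (simplified reading of UTS22)."""
    # Drop every character that is not an ASCII letter or digit, then lowercase.
    key = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Drop runs of zeros that are not preceded by a digit.
    return re.sub(r"(?<![0-9])0+", "", key)


def match_loose(label):
    """Return the known label whose loose key equals that of `label`, else None."""
    wanted = loose_key(label)
    for known in sorted(KNOWN_LABELS):
        if loose_key(known) == wanted:
            return known
    return None


def match_strict(label):
    """Assumed strict rule: trim whitespace, compare case-insensitively."""
    candidate = label.strip().lower()
    return candidate if candidate in KNOWN_LABELS else None


if __name__ == "__main__":
    for label in ("EUC-JP", "EUC_JP"):
        print(label, "loose:", match_loose(label), "strict:", match_strict(label))
```

Under these assumptions the strict matcher does not recognise EUC_JP, so a client can fall through to the encoding declared in the <meta> element, while the loose matcher treats EUC_JP as EUC-JP and decodes such pages with the wrong encoding.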


-- 
Anne van Kesteren
http://annevankesteren.nl/

