[whatwg] Spec comments, sections 1-2

Wed Aug 5 05:25:20 PDT 2009

On Wed, 05 Aug 2009 02:01:59 +0200, Ian Hickson <ian at hixie.ch> wrote:
> I'm pretty sure that character encoding support in browsers is more of a
> "collect them all" kind of thing than really based on content that
> requires it, to be honest.

Really? I think a lot of them are actually used. If you have data showing
otherwise, I'd love to trim the number of encodings the Web needs to a
smaller list than what we currently ship with. Ideally this becomes a
fixed list across all Web languages.

> If someone can provide a firm list of encodings that they are confident
> are required for a certain substantial percentage of the Web, I'm happy  
> to add the list to the spec.

Can you not run a survey over your large dataset to find this out? I
also read somewhere that Adam Barth was able to add code to Google Chrome
to figure out a better algorithm for Content-Type sniffing. Maybe
something similar could be done here?

By the way, we've encountered problems with the Unicode encoding
matching algorithm, particularly on some Asian sites. I think we need to
switch HTML5 back to something more akin to what WebKit/Gecko/Trident
do. I realize this means more magic lists, but the current algorithm does
not seem to cut it. E.g. sites rely on the fact that EUC_JP is not a
recognized encoding but EUC-JP is. The current algorithm ignores
punctuation in labels, so it treats both as the same encoding.
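
To make the EUC_JP example concrete, here is a rough Python sketch of the
two approaches. It is only an illustration: the label table is a tiny,
made-up stand-in for a browser's real list, the function names are
hypothetical, and the loose matcher is a simplified take on the Unicode
(UTS #22 style) matching rules, not any browser's actual code.

```python
import re

# Hypothetical, deliberately tiny label table (an assumption for this
# sketch); real browsers ship much longer lists.
KNOWN_LABELS = ["euc-jp", "shift_jis", "utf-8"]


def strict_match(label):
    """List-based lookup: case-insensitive, but punctuation must match."""
    key = label.strip().lower()
    return key if key in KNOWN_LABELS else None


def _loose_key(label):
    # Keep only ASCII letters and digits, lowercased.
    key = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Drop zeros that are not preceded by a digit ("utf008" -> "utf8").
    return re.sub(r"(?<![0-9])0+", "", key)


# Index the known labels by their loose key.
_LOOSE_INDEX = {_loose_key(name): name for name in KNOWN_LABELS}


def loose_match(label):
    """Loose lookup: punctuation and case differences are ignored."""
    return _LOOSE_INDEX.get(_loose_key(label))


if __name__ == "__main__":
    for label in ("EUC-JP", "EUC_JP"):
        print(label, strict_match(label), loose_match(label))
```

With the list-based lookup, EUC_JP is rejected and the page falls back to
whatever the browser would otherwise pick; with the loose lookup, it is
silently treated as EUC-JP, which is the behaviour change that some sites
trip over.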

-- 
Anne van Kesteren
http://annevankesteren.nl/