[whatwg] Default encoding to UTF-8?

Jukka K. Korpela jkorpela at cs.tut.fi
Tue Dec 6 22:48:17 PST 2011

2011-12-07 2:36, Leif Halvard Silli wrote:

> This entire thread started with a user problem.

As far as I can see, the problem presented was: “I still frequently 
break websites and webapps simply by entering my name (Faruk Ateş).” 
Fixing such issues requires that sites and applications be modified to 
_deal with_ any characters, which means that they minimally need to 
_parse_ input data as UTF-8 encoded. Of course their 
authors need to specify that the form data is to be submitted as UTF-8 
encoded, normally by making the page UTF-8 encoded and declaring it as 
such. This is surely the most trivial side of the matter.
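
To make the parsing side concrete, here is a minimal sketch in Python of a 
form handler that decodes submitted data as UTF-8. The field name "name" 
and the query string are made up for illustration; only the standard 
library function urllib.parse.parse_qs is used.

```python
from urllib.parse import parse_qs

# Hypothetical form submission: "Faruk Ateş" sent by a browser that encoded
# the form data as UTF-8 ("ş" becomes the bytes C5 9F, percent-encoded).
submitted = "name=Faruk+Ate%C5%9F"

# A handler that parses the input as UTF-8 recovers the name intact.
as_utf8 = parse_qs(submitted, encoding="utf-8", errors="strict")
print(as_utf8["name"][0])  # Faruk Ateş

# A handler that assumes ISO-8859-1 reads the same bytes as two unrelated
# characters, so the name comes out garbled.
as_latin1 = parse_qs(submitted, encoding="iso-8859-1")
print(repr(as_latin1["name"][0]))
```

The handler only gets this right if the submission really is UTF-8, which 
is why the page has to be declared as such.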

Pages that currently cannot handle the letter “ş” in input data would 
not behave any better if browsers started treating them as UTF-8 
encoded, which is what the proposed change would mean. On the contrary, 
they would work worse. They probably work today for some set of 
characters outside ASCII, such as those in ISO-8859-1, and the change 
would stop that: letters like “â” would then be transmitted as UTF-8 
encoded, while the form handler assumes another encoding and sees the 
data as something completely different.
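
The breakage can be sketched in a few lines of Python. The function name 
submit_and_read is hypothetical; it just models a browser encoding the 
text one way and a form handler decoding it another way.

```python
def submit_and_read(text, browser_encoding, handler_encoding):
    """Encode text as the browser would, then decode it as the handler does."""
    wire_bytes = text.encode(browser_encoding)
    return wire_bytes.decode(handler_encoding, errors="replace")

# Today: page treated as ISO-8859-1, handler assumes ISO-8859-1. Works.
print(submit_and_read("â", "iso-8859-1", "iso-8859-1"))  # â

# After the proposed change: browser sends UTF-8, handler is unchanged.
print(submit_and_read("â", "utf-8", "iso-8859-1"))  # Ã¢
```

The handler has not been touched, yet a character it used to accept now 
arrives as two different ones.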

> But with the proposed change, then even users *outside* the locales that
> share the default encoding of the sloppy author's locale, would benefit.

Exactly how would _any_ user benefit from the proposed change? I have 
shown that for the form data issue presented, the change would create 
serious problems, not solve any. The only exception is the rather 
theoretical case where form data processing is based on UTF-8, the page 
is actually UTF-8 encoded but its encoding is not declared in any way 
(any examples of such pages around?), and the user’s browser implies an 
encoding other than UTF-8. In this theoretical case, the error 
correction principle I’ve suggested (don’t just apply an encoding if it 
turns out that the page cannot be in that encoding) would probably fix 
the problem if the page contains non-ASCII characters.
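
One possible reading of that principle, as a rough sketch in Python: try 
the implied encoding strictly, and move on to other candidates only when 
the bytes cannot be valid in it. The function name and the fallback list 
are assumptions for illustration, not a description of what any browser 
does.

```python
def decode_with_correction(raw, implied, fallbacks=("utf-8", "iso-8859-1")):
    """Return (text, encoding_used), rejecting encodings the bytes cannot be in."""
    for encoding in (implied, *fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # the page cannot be in this encoding; try the next one
    # Nothing fits strictly: fall back to the implied encoding, lossily.
    return raw.decode(implied, errors="replace"), implied

# An ISO-8859-1 page containing "â" is not valid UTF-8, so an implied
# UTF-8 is rejected and the fallback is used instead.
print(decode_with_correction("â".encode("iso-8859-1"), "utf-8"))

# A pure ASCII page gives no evidence either way; the implied encoding stays.
print(decode_with_correction(b"Faruk", "utf-8"))
```

The check only helps when the page contains non-ASCII bytes that are 
invalid in the implied encoding. Single-byte encodings such as 
ISO-8859-1 accept almost any byte sequence, so the check rarely rejects 
them.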