Henri Sivonen hsivonen at iki.fi
Tue Jan 3 00:50:26 PST 2012

> A solution that would border on reasonable would be decoding as
> US-ASCII up to the first non-ASCII byte and then deciding between
> UTF-8 and the locale-specific legacy encoding by examining the first
> non-ASCII byte and up to 3 bytes after it to see if they form a valid
> UTF-8 byte sequence. But trying to gain more statistical confidence
> about UTF-8ness than that would be bad for performance (either due to
> stalling stream processing or due to reloading).

It's worth noting that the paragraph above states a "solution" to a
different problem, namely: "How to make it possible to use UTF-8
without declaring it?"

Adding autodetection wouldn't actually force authors to use UTF-8, so
the problem Faruk stated at the start of the thread (authors not using
UTF-8 throughout systems that process user input) wouldn't be solved.
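
For concreteness, the check described in the quoted paragraph could be
sketched roughly as follows. This is only an illustration, not how any
browser implements it: the function name sniff_after_ascii and the
"windows-1252" default for the locale-specific legacy encoding are
assumptions made up for the example, and the sketch works on a complete
byte buffer rather than a stream.

```python
from typing import Optional


def sniff_after_ascii(data: bytes, legacy: str = "windows-1252") -> Optional[str]:
    """Guess between UTF-8 and a legacy encoding from the first non-ASCII byte.

    Returns None when the data is pure ASCII, because no decision is
    needed yet. The function name and the `legacy` default are
    hypothetical, chosen only for this sketch.
    """
    # Everything before the first byte >= 0x80 decodes the same way as
    # US-ASCII in both candidate encodings.
    first = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if first is None:
        return None

    # The lead byte determines how long a UTF-8 sequence would have to be:
    # the lead byte itself plus up to 3 bytes after it.
    lead = data[first]
    if 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        # Continuation bytes, 0xC0/0xC1 and 0xF5..0xFF cannot start a
        # valid UTF-8 sequence.
        return legacy

    # Python's strict decoder also rejects overlong forms, surrogates and
    # truncated sequences. A streaming implementation would wait for more
    # bytes instead of treating truncation as a legacy-encoding signal.
    candidate = data[first:first + length]
    try:
        candidate.decode("utf-8")
    except UnicodeDecodeError:
        return legacy
    return "utf-8"
```

Note that the decision rests on a single byte sequence, which is exactly
the limited statistical confidence the quoted paragraph accepts in
exchange for not stalling stream processing or reloading.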

