[whatwg] Default encoding to UTF-8?
hsivonen at iki.fi
Tue Apr 3 04:59:25 PDT 2012
On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli
<xn--mlform-iua at xn--mlform-iua.no> wrote:
>> (I mean the performance impact of reloading the page or,
>> alternatively, the loss of incremental rendering.)
>> A solution that would border on reasonable would be decoding as
>> US-ASCII up to the first non-ASCII byte
> Thus possibly prescan of more than 1024 bytes?
I didn't mean a prescan. I meant proceeding with the real parse and
switching decoders in midstream. This would have the complication of
also having to change the encoding that the document object reports.
>> and then deciding between
>> UTF-8 and the locale-specific legacy encoding by examining the first
>> non-ASCII byte and up to 3 bytes after it to see if they form a valid
>> UTF-8 byte sequence.
> Except for the specifics, that sounds like more or less the idea I
> tried to state. Maybe it could be filed as a bug against Mozilla?
It's not clear that this is actually worth implementing or spending
time on at this stage.
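For concreteness, the decision described above could look roughly like
this (a sketch of my own, not anything an engine actually ships;
windows-1252 stands in for the locale-specific legacy fallback):

# Sketch only: look at the first non-ASCII byte and up to three bytes
# after it to decide between UTF-8 and a legacy fallback. Simplified:
# it does not reject overlong sequences or surrogate code points.
def sniff(data: bytes, legacy: str = "windows-1252") -> str:
    for i, b in enumerate(data):
        if b < 0x80:
            continue                    # still ASCII, keep scanning
        if 0xC2 <= b <= 0xDF:
            need = 1                    # two-byte UTF-8 sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                    # three-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                    # four-byte sequence
        else:
            return legacy               # not a possible UTF-8 lead byte
        tail = data[i + 1:i + 1 + need]
        if len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            return "utf-8"              # plausibly UTF-8
        return legacy
    return legacy                       # all ASCII: keep the locale default

Since everything before the first non-ASCII byte decodes identically
either way, a parser that had been treating the stream as US-ASCII
could switch decoders at that point without reparsing, which is the
midstream switch mentioned above.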
> However, there is one thing that should be added: The parser should
> default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII.
That would break form submissions.
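Forms are submitted in the page's encoding by default, so a page that
is really in a legacy encoding but got decoded as UTF-8 would send the
back end bytes it does not expect. A quick illustration, again
assuming windows-1252 as the legacy encoding:

from urllib.parse import quote

# The same field value percent-encodes differently depending on which
# encoding the page ended up being treated as.
print(quote("ä", encoding="windows-1252"))  # %E4
print(quote("ä", encoding="utf-8"))         # %C3%A4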
>> But trying to gain more statistical confidence
>> about UTF-8ness than that would be bad for performance (either due to
>> stalling stream processing or due to reloading).
> So here you say that it is better to start presenting early, and
> eventually reload [I think] if, during the presentation, the encoding
> choice shows itself to be wrong, than it would be to investigate too
> much and be absolutely certain before starting to present the page.
I didn't intend to suggest reloading.
>> Adding autodetection wouldn't actually force authors to use UTF-8, so
>> the problem Faruk stated at the start of the thread (authors not using
>> UTF-8 throughout systems that process user input) wouldn't be solved.
> If we take that logic to its end, then it would not make sense for the
> validator to display an error when a non-UTF-8 page contains a form,
> either. Because, after all, the back end (or whatever) could be
> non-UTF-8 based. The only way to solve that problem on those
> systems would be to send form content as character entities. (However,
> even then the page containing the form should still be UTF-8 in the
> first place, in order to be able to accept any content.)
Presumably, when an author reacts to an error message, (s)he not only
fixes the page but also the back end. When a browser makes encoding
guesses, it obviously cannot fix the back end.
> [ Original letter continued: ]
>>> Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
>>> detection. So it might still be a competitive advantage.
>> It would be interesting to know what exactly Chrome does. Maybe
>> someone who knows the code could enlighten us?
> +1 (But their approach looks similar to the 'border on sane' approach
> you presented. Except that they also seek to detect non-UTF-8 encodings.)
I'm slightly disappointed but not surprised that this thread hasn't
gained a message explaining what Chrome does exactly.
hsivonen at iki.fi