[whatwg] Character-encoding-related threads

Henri Sivonen hsivonen at iki.fi
Fri Mar 30 05:04:05 PDT 2012

On Thu, Dec 1, 2011 at 1:28 AM, Faruk Ates <farukates at me.com> wrote:
> We like to think that “every web developer is surely building things in UTF-8 nowadays” but this is far from true. I still frequently break websites and webapps simply by entering my name (Faruk Ateş).

Firefox 12 whines to the error console when submitting a form using an
encoding that cannot represent all Unicode. Hopefully, after Firefox
12 has been released, this will help Web authors who actually test
their sites with the error console open to locate forms that can corrupt
user input.
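For illustration, here is a minimal Python sketch (not browser code) of how this corruption happens: when a form is submitted in a legacy encoding, characters the encoding cannot represent get replaced with numeric character references, which servers rarely decode back into the original text.

```python
# Why legacy form encodings corrupt names like "Ateş" (U+015F).
name = "Faruk Ateş"

# UTF-8 round-trips everything:
assert name.encode("utf-8").decode("utf-8") == name

# windows-1252 cannot represent U+015F, so a submitting browser
# escapes it to "&#351;" (Python's xmlcharrefreplace handler
# mimics that substitution):
submitted = name.encode("windows-1252", errors="xmlcharrefreplace")
print(submitted)  # b'Faruk Ate&#351;'
```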

> On Wed, 7 Dec 2011, Henri Sivonen wrote:
>> I believe I was implementing exactly what the spec said at the time I
>> implemented that behavior of Validator.nu. I'm particularly convinced
>> that I was following the spec, because I think it's not the optimal
>> behavior. I think pages that don't declare their encoding should always
>> be non-conforming even if they only contain ASCII bytes, because that
>> way templates created by English-oriented (or lorem ipsum -oriented)
>> authors would be caught as non-conforming before non-ASCII text gets
>> filled into them later. Hixie disagreed.
> I think it puts an undue burden on authors who are just writing small
> files with only ASCII. 7-bit clean ASCII is still the second-most used
> encoding on the Web (after UTF-8), so I don't think it's a small thing.
> http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

I still think that allowing ASCII-only pages to omit the encoding
declaration is the wrong call. I agree with Simon's point about the
doctype and reliance on quirks.

Firefox Nightly (14 if all goes well) whines to the error console when
the encoding hasn't been declared and about a bunch of other encoding
declaration-related bad conditions. It also warns about ASCII-only
pages, because I didn't want to burn cycles detecting whether a page
is ASCII-only and because I think it's the wrong call not to whine
about ASCII-only templates that might get non-ASCII content later.
However, I suppressed the message about the lack of an encoding
declaration for different-origin frames, because it is so common for
ad iframes that contain only images or flash objects to lack an
encoding declaration that not suppressing the message would have made
the error console too noisy. It's cheaper to detect whether the
message is about to be emitted for a different-origin frame than to
detect whether it's about to be emitted for an ASCII-only page.
Besides, authors generally are powerless to fix the technical flaws of
different-origin embeds.
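For concreteness, the check being avoided can be sketched like this (illustrative Python, not the Gecko implementation): deciding that a page is ASCII-only requires scanning every byte of the stream, whereas the origin of a frame is known up front.

```python
# A sketch of the whole-stream scan Gecko avoids. Bytes below 0x80
# decode identically in every ASCII-compatible encoding, so an
# ASCII-only page renders the same whatever encoding is assumed --
# until non-ASCII content gets filled into the template later.
def is_ascii_only(body: bytes) -> bool:
    return all(b < 0x80 for b in body)

assert is_ascii_only(b"<!DOCTYPE html><title>Hi</title>")
assert not is_ascii_only("Ateş".encode("utf-8"))
```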

> On Mon, 19 Dec 2011, Henri Sivonen wrote:
>> Hmm. The HTML spec isn't too clear about when alias resolution happens,
>> so I (incorrectly, I now think) mapped only "UTF-16", "UTF-16BE" and
>> "UTF-16LE" (ASCII-case-insensitive) to UTF-8 in meta without considering
>> aliases at that point. Hixie, was alias resolution supposed to happen
>> first? In Firefox, alias resolution happens after, so <meta
>> charset=iso-10646-ucs-2> is ignored per the non-ASCII superset rule.
> Assuming you mean for cases where the spec says things like "If encoding
> is a UTF-16 encoding, then change the value of encoding to UTF-8", then
> any alias of UTF-16, UTF-16LE, and UTF-16BE (there aren't any registered
> currently, but "Unicode" might need to be one) would be considered a
> match.
> Currently, "iso-10646-ucs-2" is neither an alias for UTF-16 nor an
> encoding that is overridden in any way. It's its own encoding.

That's not reality in Gecko: there, iso-10646-ucs-2 is an alias for
UTF-16.

> I hope the above is clear. Let me know if you think the spec is vague on
> the matter.

Evidently, it's too vague, because I read the spec and implemented
something different from what you meant.
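To spell out the two readings, here is a hedged Python sketch; the alias table is a tiny illustrative stand-in for Gecko's, not the real registry, and the function names are mine.

```python
# Two orderings of alias resolution vs. the "UTF-16 family becomes
# UTF-8" override for <meta charset=...>. Illustrative only.
ALIASES = {"iso-10646-ucs-2": "UTF-16"}  # as Gecko treats the label
UTF16_FAMILY = ("UTF-16", "UTF-16BE", "UTF-16LE")

def resolve(label: str) -> str:
    label = label.strip().lower()
    return ALIASES.get(label, label.upper())

def meta_encoding_spec_reading(label: str) -> str:
    # Intended order: resolve the alias first, then apply the override.
    enc = resolve(label)
    return "UTF-8" if enc in UTF16_FAMILY else enc

def meta_encoding_gecko_reading(label: str):
    # Gecko's order at the time: match the raw label first, resolve
    # after; a resolved UTF-16 is then ignored because it is not an
    # ASCII superset.
    if label.strip().lower() in ("utf-16", "utf-16be", "utf-16le"):
        return "UTF-8"
    enc = resolve(label)
    return None if enc in UTF16_FAMILY else enc  # None = ignored

print(meta_encoding_spec_reading("iso-10646-ucs-2"))   # UTF-8
print(meta_encoding_gecko_reading("iso-10646-ucs-2"))  # None
```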

Henri Sivonen
hsivonen at iki.fi
