[whatwg] [WA1] Specifying Character Encoding

Henri Sivonen hsivonen at iki.fi
Thu Mar 1 01:12:28 PST 2007


On Mar 1, 2007, at 03:58, Ian Hickson wrote:
> On Sat, 9 Apr 2005, Lachlan Hunt wrote:
>>
>> In the current draft, for specifying the character encoding [1],
>> it is stated:
>>
>> | In XHTML, the XML declaration should be used for inline character
>> | encoding information.
>> |
>> | Authors should avoid including inline character encoding information.
>> | Character encoding information should instead be included at the
>> | transport level (e.g. using the HTTP Content-Type header).
>>
>> The second paragraph should only apply to HTML using the meta element,
>> not XHTML using the XML declaration.
>
> I don't understand why it would be ok for one and not the other.
...
> I could see an argument for removing the advice from the HTML5 spec  
> altogether, though. What do you think?

I think that encoding information should be included in the HTTP
payload. In my opinion, the spec should not advise against this.
Preferably, it would encourage putting the encoding information in
the payload. (The BOM or, in the case of XML, the UTF-8 default of
the XML encoding sniffing algorithm is fine.)
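
For concreteness, here is a minimal Python sketch of what in-payload
detection looks like for XML: check for a BOM, then for an encoding
pseudo-attribute in the XML declaration, and otherwise fall back to
UTF-8. It is a simplification of the detection described in Appendix F
of the XML spec, not a full implementation, and the function name is
mine.

import re

def sniff_xml_encoding(payload: bytes) -> str:
    # A BOM takes precedence over everything else.
    if payload.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if payload.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if payload.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    # Then look for an encoding pseudo-attribute in the XML declaration.
    m = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
                 payload)
    if m:
        return m.group(1).decode("ascii")
    # Otherwise XML defaults to UTF-8; no transport-level charset needed.
    return "UTF-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><p/>'))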

Rationale:
  1) Ruby's Postulate.
  2) It is just uncool that I have to add the charset meta to the
WA 1.0 spec if I download it to disk and typeset it for printing with
Prince, which does not see the original HTTP headers. Real documents
do get detached from HTTP. (See the sketch below.)
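
To illustrate item 2, a rough Python sketch of the manual step: copy
the charset from the HTTP header into a meta element when saving a
text/html resource to disk, so that a formatter that never sees the
headers still gets the encoding right. The helper name and the naive
string insertion are purely illustrative.

from urllib.request import urlopen

def save_with_charset_meta(url, path):
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "UTF-8"
        html = response.read().decode(charset)
    # Naive check and insertion; a real tool would parse the markup.
    if "charset" not in html.lower():
        meta = ('<meta http-equiv="Content-Type" '
                'content="text/html; charset=%s">' % charset)
        html = html.replace("<head>", "<head>" + meta, 1)
    with open(path, "w", encoding=charset) as out:
        out.write(html)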

For application/xml and application/xhtml+xml, an HTTP-level charset
is harmful, because the internal information is reliable and can be
sniffed efficiently, so the HTTP-level declaration is either redundant
or wrong.

For text/html, providing the charset at the HTTP level as well makes
sense, because sniffing the internal encoding information is
inefficient.
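
To show the asymmetry, a rough Python illustration: instead of looking
at a fixed-length prefix, the consumer of text/html has to prescan a
chunk of markup for a meta element that declares the charset. The
regular expression and the 1024-byte limit are simplifications of my
own, not the actual prescan behavior of any browser.

import re

META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)

def prescan_html_charset(payload, limit=1024):
    m = META_CHARSET.search(payload[:limit])
    if m:
        return m.group(1).decode("ascii")
    return None  # fall back to HTTP charset or a locale-dependent default

print(prescan_html_charset(
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=windows-1251"></head>'))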

The text/xml type is considered harmful: its charset parameter
(defaulting to US-ASCII when absent) overrides the encoding
information in the XML declaration.

I think it should be a conformance requirement that the HTTP-level  
encoding info and the internal payload info agree if both are supplied.
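
A conformance checker could implement that along these lines (a
sketch; Python's codecs registry stands in here for proper matching of
charset labels against the IANA registry):

import codecs

def encodings_agree(http_charset, declared_charset):
    try:
        return (codecs.lookup(http_charset).name ==
                codecs.lookup(declared_charset).name)
    except LookupError:
        # An unrecognized label is a conformance problem in its own right.
        return False

assert encodings_agree("utf-8", "UTF-8")
assert not encodings_agree("ISO-8859-1", "windows-1251")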

On Mar 1, 2007, at 09:13, Julian Reschke wrote:

> If a proxy transcodes xhtml today, and does not modify the XML  
> declaration (when present), it will break the content, right?

  * A transcoding proxy that does not modify the XML declaration and  
tampers with application/* is broken.
  * Before basing advice on conjecture about transcoding proxies, it
should be shown that transcoding proxies exist and are deployed (for
a good reason), and their true nature should be researched. (For now,
I am treating non-reverse transcoding proxies as an urban legend.)
  * Distributed UAs (where the proxy and the client are more tightly  
coupled than in an HTTP client/proxy case, such as Opera Mini) do not  
count.
  * Russian Apache is not a transcoding proxy. It is a transcoding
origin server.
  * Reverse proxies (e.g. http://apache.webthing.com/mod_proxy_html/)  
are origin servers as far as browsers are concerned. Whatever reverse  
proxies break is within the control of the reverse proxy operator.
  * When pages follow the best practice of being encoded as UTF-8,  
there is no legitimate reason to transcode.
  * Browsers have supported all the relevant Cyrillic and Japanese
encodings for years, so the argument about Russian and Japanese
transcoding proxy needs falls flat today.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/