[whatwg] [WA1] Specifying Character Encoding
Henri Sivonen
hsivonen at iki.fi
Thu Mar 1 01:12:28 PST 2007
On Mar 1, 2007, at 03:58, Ian Hickson wrote:
> On Sat, 9 Apr 2005, Lachlan Hunt wrote:
>>
>> In the current draft, for specifying the character encoding [1], it is
>> stated:
>>
>> | In XHTML, the XML declaration should be used for inline character
>> | encoding information.
>> |
>> | Authors should avoid including inline character encoding information.
>> | Character encoding information should instead be included at the
>> | transport level (e.g. using the HTTP Content-Type header).
>>
>> The second paragraph should only apply to HTML using the meta element,
>> not XHTML using the XML declaration.
>
> I don't understand why it would be ok for one and not the other.
...
> I could see an argument for removing the advice from the HTML5 spec
> altogether, though. What do you think?
I think that encoding information should be included in the HTTP
payload. In my opinion, the spec should not advise against this.
Preferably, it would encourage putting the encoding information in
the payload. (The BOM or, in the case of XML, the UTF-8 defaulting of
the XML sniffing algorithm are fine.)
Rationale:
1) Ruby's Postulate.
2) It is just uncool that I have to add the charset meta to the WA 1.0
spec if I download it to disk and typeset it for printing using
Prince, which does not see the original HTTP headers. Real documents
do get detached from HTTP.
For application/xml and application/xhtml+xml, HTTP-level charset is
harmful, because the internal info is reliable and efficiently
sniffed, so the HTTP-level stuff is either redundant or wrong.
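To illustrate why the internal info in XML is cheap to sniff, here is a minimal Python sketch: check for a BOM, then for an encoding pseudo-attribute in the XML declaration, and otherwise default to UTF-8. The function name is hypothetical, and the sketch makes simplifying assumptions: it ignores UTF-32 and BOM-less UTF-16, which the full detection described in the XML specification also covers.

```python
import codecs
import re

# Byte order marks checked first (UTF-32 deliberately omitted in this sketch).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches an XML declaration at the very start and captures its encoding name.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding of an XML byte stream."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declared encoding: fall back to the UTF-8 default.
    return "utf-8"
```

Only the first few bytes are examined, and no HTTP header is consulted, so the result is the same whether the document comes off the network or off a disk.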
For text/html, also providing HTTP-level charset makes sense, because
internal encoding info sniffing is inefficient.
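By contrast, finding the internal info in text/html means speculatively scanning raw bytes for a meta element before real parsing can start. The following is a deliberately simplified Python sketch of such a scan, not the algorithm of any spec or browser; the function name, the regular expression, and the 1024-byte limit are all assumptions made for illustration.

```python
import re
from typing import Optional

# Loose pattern: any <meta ...> tag that contains "charset=<label>".
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)


def prescan_html_charset(data: bytes, limit: int = 1024) -> Optional[str]:
    """Hypothetical helper: look for a charset label near the start of HTML."""
    # Only the first `limit` bytes are scanned (the limit is an assumption).
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Unlike the XML case, the label can sit at an arbitrary offset behind other markup, which is why a scan like this either needs a cutoff or risks reading a lot of data before giving up.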
The text/xml type is considered harmful.
I think it should be a conformance requirement that the HTTP-level
encoding info and the internal payload info agree if both are supplied.
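A conformance checker could test that requirement roughly as follows. This is a minimal Python sketch with hypothetical function names; it assumes the internal encoding label has already been extracted from the payload, and it uses Python's codec alias table to compare labels, which is not identical to the set of labels browsers recognize.

```python
import codecs
from email.message import Message
from typing import Optional


def http_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def same_encoding(a: str, b: str) -> bool:
    """Compare two labels, treating known aliases (utf8, UTF-8) as equal."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Unknown to Python: fall back to a case-insensitive comparison.
        return a.strip().lower() == b.strip().lower()


def encodings_agree(content_type: str, internal: Optional[str]) -> bool:
    """Return False only when both sources are present and disagree."""
    external = http_charset(content_type)
    if external is None or internal is None:
        return True  # the requirement only applies if both are supplied
    return same_encoding(external, internal)
```

For example, `encodings_agree("text/html; charset=ISO-8859-1", "utf-8")` reports a conflict, while a Content-Type without a charset parameter never does.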
On Mar 1, 2007, at 09:13, Julian Reschke wrote:
> If a proxy transcodes xhtml today, and does not modify the XML
> declaration (when present), it will break the content, right?
* A transcoding proxy that does not modify the XML declaration and
tampers with application/* is broken.
* Before basing advice on conjecture about transcoding proxies, it
should be shown that transcoding proxies exist, are deployed (and for
a good reason) and their true nature should be researched. (For now,
I am treating non-reverse transcoding proxies as an urban legend.)
* Distributed UAs (where the proxy and the client are more tightly
coupled than in an HTTP client/proxy case, such as Opera Mini) do not
count.
* Russian Apache is not a transcoding proxy. It is a transcoding
origin server.
* Reverse proxies (e.g. http://apache.webthing.com/mod_proxy_html/)
are origin servers as far as browsers are concerned. Whatever reverse
proxies break is within the control of the reverse proxy operator.
* When pages follow the best practice of being encoded as UTF-8,
there is no legitimate reason to transcode.
* Browsers have supported all the relevant Cyrillic and Japanese
encodings for years, so the argumentation about Russian and Japanese
transcoding proxy needs falls flat today.
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/