[whatwg] Character encoding of document.open()ed documents
hsivonen at iki.fi
Wed Mar 31 04:12:00 PDT 2010
Currently, the spec says that document.open() sets the document's character encoding to UTF-16. This is what IE does, except that IE uses the label "unicode" instead of "UTF-16".
Gecko and WebKit set the document's character encoding to UTF-8 even though the parser operates on UTF-16.
When loading external resources that don't have encoding labels, IE, Gecko and WebKit all use UTF-8 to decode the external resource.
Opera doesn't support document.charset or document.characterSet, but demo 37 and the demos discussed below show that Opera applies the default encoding (Windows-1252) to external resources referenced from document.open()ed documents.
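The reported encoding can be checked with a small probe along the following lines. This is only a sketch: the function name probeOpenedDocumentEncoding and the markup it writes are hypothetical and are not taken from the demos mentioned in this email. It reads whichever of document.characterSet or document.charset the browser exposes and returns null when neither exists.

```javascript
// Hypothetical helper (not from the original demos): open a document with
// document.open(), write some markup, and report the encoding it claims.
// `doc` is any Document-like object, e.g. an iframe's contentDocument.
function probeOpenedDocumentEncoding(doc) {
  doc.open();
  doc.write("<!DOCTYPE html><title>probe</title><p>probe</p>");
  doc.close();
  // Browsers differ in which of the two properties they expose.
  const reported =
    doc.characterSet !== undefined ? doc.characterSet : doc.charset;
  // Neither property is available: nothing to report.
  return reported === undefined ? null : reported;
}

// Browser-only usage; skipped where there is no global document.
if (typeof document !== "undefined" && document.body) {
  const frame = document.createElement("iframe");
  document.body.appendChild(frame);
  console.log(probeOpenedDocumentEncoding(frame.contentDocument));
}
```

The probe only reports the label the browser exposes. It says nothing about how unlabeled external resources are decoded, which has to be tested separately.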
Spec change request: Please change the spec to say that document.open() sets the document's character encoding to UTF-8 even though the parser operates on UTF-16 DOMStrings.
My real interest in this topic isn't so much about the initial character encoding value but about the effect of <meta charset> on document.open()ed documents.
Consider this demo in Gecko with the old HTML parser:
The demo alerts twice: first showing the REPLACEMENT CHARACTER and then showing LATIN SMALL LETTER R WITH ACUTE. First, Gecko parses the document with UTF-8 as the document's character encoding. During that parse, the value ISO-8859-2 from the meta is added to the cache entry for this stream (see my earlier email about reloading document.open()ed documents). Then the document is implicitly reloaded with ISO-8859-2 as the document's character encoding. This was implemented in https://bugzilla.mozilla.org/show_bug.cgi?id=255820 back when Gecko used UTF-16 instead of UTF-8 as the document's character encoding for document.open()ed docs. At that time, decoding external resources as UTF-16 made them fail to parse.
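The two alert values are what one would expect if the same byte is decoded once as UTF-8 and once as ISO-8859-2. The sketch below assumes the external resource contains the single byte 0xE0; that byte value is an assumption, not something stated about the demo. The helper name decodeAs is hypothetical. It uses the standard TextDecoder API, and the ISO-8859-2 label needs a runtime with full encoding support.

```javascript
// Hypothetical helper: decode a byte sequence under a given encoding label.
function decodeAs(label, byteValues) {
  return new TextDecoder(label).decode(new Uint8Array(byteValues));
}

// Assumption: the resource carries the lone byte 0xE0.
const resourceBytes = [0xe0];

// As UTF-8, 0xE0 starts a three-byte sequence that never completes,
// so the decoder emits U+FFFD REPLACEMENT CHARACTER.
const firstPass = decodeAs("utf-8", resourceBytes);

// As ISO-8859-2, 0xE0 maps to U+0155 LATIN SMALL LETTER R WITH ACUTE.
const secondPass = decodeAs("iso-8859-2", resourceBytes);

console.log(firstPass.codePointAt(0).toString(16)); // fffd
console.log(secondPass.codePointAt(0).toString(16)); // 155
```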
Curiously, the implicit reloading behavior isn't particularly robust. In some situations the reload doesn't happen. I don't know what the logic is.
Demo with the order of meta and script swapped: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/435
None of IE, WebKit or Opera let the meta charset in a document.open()ed document have any effect, which seems to suggest that Gecko might be trying unnecessarily hard in this scenario.
Due to the test case for https://bugzilla.mozilla.org/show_bug.cgi?id=255820, I made the meta charset change the document's character encoding (but not trigger a reload) when the HTML5 parser is enabled in Gecko. See demos 435 and 434 with html5.enable=true. However, it now seems that it might be better to revert that change to align with IE and WebKit--unless sites now depend on the Gecko behavior. Do other browser vendors have data showing sites depending on Gecko's behavior when loading external resources for document.open()ed docs?