[whatwg] Character encoding of document.open()ed documents

Henri Sivonen hsivonen at iki.fi
Wed Mar 31 04:12:00 PDT 2010


Currently, the spec says that document.open() sets the document's character encoding to UTF-16. This is what IE does except IE uses the label "unicode" instead of "UTF-16".
Demo: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/438

Gecko and WebKit set document's character encoding to UTF-8 even though the parser operates on UTF-16.
Demo: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/439

When loading external resources that don't have encoding labels, IE, Gecko and WebKit all use UTF-8 to decode the external resource.
Demo: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/437

Opera doesn't support document.charset or document.characterSet, but demo 37 and the demos discussed below show that Opera applies the default encoding (Windows-1252) to external resources referenced from document.open()ed documents.

Spec change request: Please change the spec to say that document.open() sets the document's character encoding to UTF-8 even though the parser operates on UTF-16 DOMStrings.

My real interest in this topic isn't so much about the initial character encoding value but about the effect of <meta charset> on document.open()ed documents.

Consider this demo in Gecko with the old HTML parser:
http://software.hixie.ch/utilities/js/live-dom-viewer/saved/434

The demo alerts two times: first showing the REPLACEMENT CHARACTER and then showing LATIN SMALL LETTER R WITH ACUTE. First, Gecko parses the document with UTF-8 as the document's character encoding. During that parse, the value ISO-8859-2 from the meta is added to the cache entry for this stream (see my earlier email about reloading document.open()ed documents). Then the document is implicitly reloaded with ISO-8859-2 as the document's character encoding. This was implemented in https://bugzilla.mozilla.org/show_bug.cgi?id=255820 back when Gecko used UTF-16 instead of UTF-8 as the document's character encoding for document.open()ed docs and using UTF-16 for external resources made the external resources fail to parse.

Curiously, the implicit reloading behavior isn't particularly robust. In some situations the reload doesn't happen. I don't know what the logic is.
Demo with the order of meta and script swapped: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/435

None of IE, WebKit or Opera let the meta charset in a document.open()ed document have any effect, which seems to suggests that Gecko might be trying unnecessarily hard in this scenario.

Due to the test case for https://bugzilla.mozilla.org/show_bug.cgi?id=255820 I made the meta charset change the document's character encoding (but not reload) when the HTML5 parser is enabled in Gecko. See demos 435 and 434 with html5.enable=true. However, now it seems it might be better to revert that change to align with IE and WebKit--unless sites now depend on the Gecko behavior. Do other browser vendors have data showing sites depending on Gecko's behavior when loading external resources for document.open()ed docs?

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/




More information about the whatwg mailing list