[whatwg] Character encoding of document.open()ed documents
and-py at doxdesk.com
Wed Mar 31 20:26:32 PDT 2010
Henri Sivonen wrote:
> Spec change request: Please change the spec to say that document.open()
> sets the document's character encoding to UTF-8
+1. UTF-16 is a troublesome encoding for [X]HTML documents and should
be consistently discouraged; as a non-superset of ASCII it interacts very
poorly with byte interfaces in HTTP and form submissions.
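To illustrate the non-superset point, here is a minimal Python sketch. The sample markup string and the variable names are made up for the example; it only shows that a byte-oriented consumer looking for ASCII markup finds it in UTF-8 but not in UTF-16.

```python
# Illustrative only: the sample string and variable names are assumptions.
markup = "<p>hi</p>"

ascii_bytes = markup.encode("ascii")
utf8_bytes = markup.encode("utf-8")
utf16le_bytes = markup.encode("utf-16-le")

# UTF-8 encodes ASCII characters as the identical single bytes.
same_as_ascii = utf8_bytes == ascii_bytes

# UTF-16LE follows each ASCII character with a zero byte, so a
# byte-oriented consumer scanning for b"<p>" will not find it.
found_in_utf16 = b"<p>" in utf16le_bytes
zero_bytes = utf16le_bytes.count(0)

print(same_as_ascii, found_in_utf16, zero_bytes)
```

The interleaved zero bytes are what trips up interfaces that assume ASCII-compatible bytes, such as header parsing or form-encoded bodies.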
No browser will actually try to submit a form as UTF-16 for this reason,
but it still causes problems; e.g. Firefox misleadingly sets the
`_charset_` hack field to 'UTF-16' even though the submission is
> even though the parser operates on UTF-16 DOMStrings.
The term 'UTF-16' can mean two very different things: either a sequence
of 16-bit code units (as in DOMString), or a sequence of bytes which,
taken as UTF-16LE or UTF-16BE, represent 16-bit code units. Unicode's
tradition of conflating the meanings of the code unit sequence and the
byte sequence has caused much confusion.
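The distinction can be sketched in Python as follows. The helper `utf16_code_units` is a hypothetical name introduced for this example, and Python's `str` is not a DOMString, so the code-unit list is only a stand-in for what a DOMString would hold.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    # Unpack the bytes as little-endian unsigned 16-bit integers.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# 'A', U+00E9, and one character outside the BMP (needs a surrogate pair).
text = "A\u00e9\U0001F600"

# Meaning 1: a sequence of 16-bit code units, with no byte order at all.
units = utf16_code_units(text)

# Meaning 2: a sequence of bytes, which depends on the byte order chosen.
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")

print([hex(u) for u in units])
print(le_bytes.hex(), be_bytes.hex())
```

The same four code units yield two different byte sequences, which is why labelling an in-memory string with the byte-level encoding name 'UTF-16' says nothing useful about how it should be serialised.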
DOM Level 3 LS made the mistake of saying that because DOMStrings are
UTF-16-code-units, XML documents parsed from
`LSInput.characterStream`/`StringData` should receive the `encoding`
'UTF-16', as if the parser had done a conversion from UTF-16-bytes to
characters, though no such process has actually taken place.
Consequently when you serialise a document parsed from a string in DOM
Level 3 LS you get an unexpected and unwanted UTF-16 document.
mailto:and at doxdesk.com