[whatwg] Establishing the character encoding and determinism
Henri Sivonen
hsivonen at iki.fi
Mon May 28 07:28:06 PDT 2007
For proper communication it is important that a document is decoded
reliably. In addition, for security reasons, it is important that
documents are decoded the same way by browsers and by gatekeeper tools.
To this end, I think at least for conforming documents the algorithm
for establishing the character encoding should be deterministic. I'd
like to request two things:
1) When sniffing for meta charset, the current draft allows a user
agent to give up sooner than after examining the first 512 bytes. To
make meta charset sniffing reliable and deterministic so that it
doesn't depend on flukes in buffering, I think UAs should (if there's
no transfer protocol level charset label and no BOM) be required to
consume bytes until they find a meta charset, reach EOF, or have
examined 512 bytes. That is, I think UAs should not be allowed to
give up earlier. (On the other hand, I think UAs should be allowed to
start examining the byte stream before 512 bytes have been buffered,
even when there is no IO error, since in general, byte stream buffer
management should be up to the IO libraries and outside the scope of
the HTML spec.)
2) Since the chardet step is optional and the spec doesn't make the
Mozilla chardet behavior normative, I think the document should be
considered non-conforming if the algorithm for establishing the
character encoding proceeds to step 6 (chardet) or step 7 (last resort
default).
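To illustrate request 1, here is a rough Python sketch of a sniffer that
keeps consuming bytes until it finds a meta charset, reaches EOF, or has
examined 512 bytes. The names sniff_meta_charset, META_CHARSET and
TrickleStream are made up for this example, and the regular expression
is a deliberately simplified stand-in for the draft's actual meta
scanning rules, not a copy of them.

```python
import io
import re

SNIFF_LIMIT = 512  # bytes to examine before giving up

# Simplified, hypothetical pattern: a charset name inside a <meta ...> tag.
# The lookahead requires a terminator after the name, so a name that is
# still arriving (e.g. b"charset=utf" with b"-8" to follow) never matches.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)(?=[\"'\s/>;])",
    re.IGNORECASE,
)


def sniff_meta_charset(stream, limit=SNIFF_LIMIT):
    """Return the lowercased charset label, or None if none is found
    before EOF or before `limit` bytes have been examined."""
    examined = b""
    while len(examined) < limit:
        # Take whatever the IO layer hands over; never ask past the limit.
        chunk = stream.read(limit - len(examined))
        if not chunk:  # EOF
            break
        examined += chunk
        match = META_CHARSET.search(examined)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


class TrickleStream(io.BytesIO):
    """Hypothetical worst-case buffering: one byte per read() call."""

    def read(self, size=-1):
        return super().read(1)


doc = b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>x</title>'
whole = sniff_meta_charset(io.BytesIO(doc))
trickled = sniff_meta_charset(TrickleStream(doc))
```

Here whole and trickled come out the same even though the second stream
delivers a single byte at a time, which is the point: how the bytes
happen to be buffered should not change the answer.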
Personally, I'd prefer formulating such a document conformance
requirement as part of the algorithm, but given recent feedback, I am
aware that most people wish to maintain a separation of processing
models and document conformance requirements. (For me, mixing the two
is what I have to do in my head anyway. ;-)
It wouldn't hurt, though, to say in the section on writing documents
that at least one of the following is required for document conformance:
* A transfer protocol-level character encoding declaration.
* A meta charset within the first 512 bytes.
* A BOM.
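A gatekeeper tool could check the three-way requirement above roughly
like this. This is only a sketch under assumptions: the function name
encoding_declarations is invented here, the set of BOMs checked (UTF-8
and the two UTF-16 byte orders) is my guess at what a checker would look
for, and the meta pattern is a simplification rather than the draft's
real scanning rules.

```python
import codecs
import re
from email.message import Message

SNIFF_LIMIT = 512  # the meta charset must appear within this many bytes

# Assumed set of BOMs; a real checker would follow the spec's list.
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Simplified, hypothetical meta charset pattern (not the draft's algorithm).
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)(?=[\"'\s/>;])",
    re.IGNORECASE,
)


def encoding_declarations(content_type, body):
    """Return which of the three declaration kinds are present.

    content_type is the Content-Type header value (or None) and body is
    the raw bytes of the document. An empty result means the document
    would fall through to chardet or the last resort default.
    """
    found = []

    # Transfer protocol-level declaration, e.g. "text/html; charset=utf-8".
    if content_type is not None:
        header = Message()
        header["Content-Type"] = content_type
        if header.get_content_charset() is not None:
            found.append("transport")

    # BOM at the very start of the byte stream.
    if body.startswith(BOMS):
        found.append("bom")

    # Meta charset that lies within the first 512 bytes.
    if META_CHARSET.search(body[:SNIFF_LIMIT]):
        found.append("meta")

    return found


def is_declared(content_type, body):
    """True if at least one declaration is present."""
    return bool(encoding_declarations(content_type, body))
```

For example, is_declared("text/html", b"<p>hello</p>") is false, so a
conformance checker following this proposal would flag that document.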
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/