[whatwg] Establishing the character encoding and determinism

Henri Sivonen hsivonen at iki.fi
Mon May 28 07:28:06 PDT 2007

For proper communication it is important that a document is decoded  
reliably. In addition, for security reasons, it is important that  
documents are decoded the same way by browsers and by gatekeeper tools.

To this end, I think at least for conforming documents the algorithm  
for establishing the character encoding should be deterministic. I'd  
like to request two things:

1) When sniffing for meta charset, the current draft allows a user  
agent to give up sooner than after examining the first 512 bytes. To  
make meta charset sniffing reliable and deterministic so that it  
doesn't depend on flukes in buffering, I think UAs should (if there's  
no transfer protocol level charset label and no BOM) be required to  
consume bytes until they find a meta charset, reach the EOF or have  
examined 512 bytes. That is, I think UAs should not be allowed to  
give up earlier. (On the other hand, I think UAs should be allowed to  
start examining the byte stream before 512 bytes have been buffered without  
an IO error, since in general, byte stream buffer management should  
be up to the IO libraries and outside the scope of the HTML spec.)
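
As a rough illustration of request 1, here is a minimal Python sketch of a sniffing loop that stops only on a found meta charset, EOF, or 512 examined bytes, regardless of how the stream happens to deliver its chunks. The function name, the regular expression and the stream interface are assumptions made for this example; the draft's actual prescan is a byte-level algorithm, not a regex, and this sketch assumes there is no protocol-level label and no BOM.

```python
import io
import re

SNIFF_LIMIT = 512  # bytes; the limit discussed in the message

# Simplified stand-in for the draft's prescan (an assumption, not the spec).
# The lookahead demands a delimiter after the label, so a label that is cut
# off at the end of the current buffer is never accepted prematurely.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)(?=[\"'\s/>;])",
    re.IGNORECASE,
)


def sniff_meta_charset(stream, limit=SNIFF_LIMIT):
    """Return the declared label (lowercased) or None.

    Keeps reading until a meta charset is found, EOF is reached, or
    `limit` bytes have been examined. It never gives up earlier, so the
    result does not depend on chunk sizes.
    """
    buf = b""
    while len(buf) < limit:
        chunk = stream.read(limit - len(buf))
        if not chunk:  # EOF
            break
        buf += chunk  # examine whatever has arrived so far
        match = META_CHARSET.search(buf)
        if match:
            return match.group(1).decode("ascii").lower()
    return None
```

Feeding the same document through `io.BytesIO` in one piece or through a wrapper that returns one byte per `read` call should give the same answer, which is the determinism property being asked for.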

2) Since the chardet step is optional and the spec doesn't make the  
Mozilla chardet behavior normative, I think the document should be  
considered non-conforming if the algorithm for establishing the  
character encoding proceeds to steps 6 (chardet) or 7 (last resort).

Personally, I'd prefer formulating such a document conformance  
requirement as part of the algorithm, but given recent feedback, I am  
aware that most people wish to maintain a separation of processing  
models and document conformance requirements. (For me, mixing the two  
is what I have to do in my head anyway. ;-)

It wouldn't hurt, though, to say in the section on writing documents  
that at least one of the following is required for document conformance:
  * A transfer protocol-level character encoding declaration.
  * A meta charset within the first 512 bytes.
  * A BOM.
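
A gatekeeper tool could check the "at least one of the following" rule roughly as sketched below. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, the meta charset pattern is a simplified regex rather than the draft's prescan, and only the UTF-8 and UTF-16 BOMs are recognised.

```python
import codecs
import re

SNIFF_LIMIT = 512  # bytes within which a meta charset must appear

# BOMs recognised by this sketch (an assumption; UTF-8 and UTF-16 only).
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# Simplified meta charset pattern, not the draft's byte-level prescan.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?[A-Za-z0-9._:-]+(?=[\"'\s/>;])",
    re.IGNORECASE,
)


def has_encoding_declaration(transport_charset, body):
    """Return True if at least one of the three listed conditions holds.

    transport_charset: charset label from the transfer protocol, or None.
    body: the document bytes (only the first SNIFF_LIMIT bytes are used
          for the meta charset check).
    """
    if transport_charset:  # transfer protocol-level declaration
        return True
    if body.startswith(BOMS):  # a BOM
        return True
    # a meta charset within the first 512 bytes
    return META_CHARSET.search(body[:SNIFF_LIMIT]) is not None
```

For example, `has_encoding_declaration(None, b"<p>hello</p>")` returns `False`, so such a document would be flagged as relying on chardet or the last resort.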

Henri Sivonen
hsivonen at iki.fi
