[whatwg] UTF and BOM terminology

Henri Sivonen hsivonen at iki.fi
Sun May 27 01:56:29 PDT 2007


"If the encoding is one of UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or  
UTF-32LE, then authors can use a BOM at the start of the file to  
indicate the character encoding."

That sentence should read:
"If the encoding is one of UTF-8, UTF-16, or UTF-32, then authors can  
use a BOM at the start of the file to indicate the character encoding."

The encoding labels with LE or BE in them denote BOMless variants, where  
the encoding label at the transfer protocol level gives the  
endianness. See http://www.ietf.org/rfc/rfc2781.txt. When the spec  
refers to UTF-16 with a BOM in a particular endianness, I think the  
spec should use "big-endian UTF-16" and "little-endian UTF-16".
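
To make the distinction concrete, here is a minimal Python sketch of BOM
sniffing. It is an illustration only, not text proposed for the spec: the
function name sniff_bom and its return shape are hypothetical, and the BOM
constants are the ones Python's codecs module happens to provide. Note that
the label it reports is the plain "UTF-16" or "UTF-32"; the byte order is
described separately rather than folded into an LE/BE label.

```python
import codecs
from typing import Optional, Tuple

# Check the 4-byte UTF-32 BOMs before the UTF-16 ones: the little-endian
# UTF-32 BOM (FF FE 00 00) begins with the little-endian UTF-16 BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32", "big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32", "little-endian"),
    (codecs.BOM_UTF8, "UTF-8", None),  # UTF-8 has no byte order
    (codecs.BOM_UTF16_BE, "UTF-16", "big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16", "little-endian"),
]


def sniff_bom(data: bytes) -> Optional[Tuple[str, Optional[str]]]:
    """Return (label, byte order) if data starts with a BOM, else None.

    Hypothetical helper for illustration; the label never carries LE/BE.
    """
    for bom, label, byte_order in _BOMS:
        if data.startswith(bom):
            return (label, byte_order)
    return None
```

For example, sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"))
reports the label "UTF-16" together with "big-endian", which matches the
"big-endian UTF-16" wording suggested above.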

Declaring endianness at the transfer protocol level has no benefit  
over using the BOM when the label is right, and there is always a  
chance of getting the label wrong, so the encoding labels with  
explicit endianness are harmful for interchange. In my opinion, the  
spec should avoid giving authors bad ideas by reinforcing these  
labels through repetition.
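
The failure mode can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not a description of any particular
user agent: the variable names are hypothetical, and the codec names
("utf-16", "utf-16-le", "utf-16-be") are Python's own, used here to stand in
for a protocol-level label.

```python
import codecs

text = "hi"

# BOM-less little-endian bytes: the byte order lives only in the label.
bomless = text.encode("utf-16-le")

# If the label is wrong (BE instead of LE), decoding still succeeds
# but silently yields different characters.
mislabeled = bomless.decode("utf-16-be")

# With a BOM, the byte stream itself carries the byte order, so the
# plain "utf-16" codec decodes it correctly and strips the BOM.
with_bom = codecs.BOM_UTF16_LE + bomless
from_bom = with_bom.decode("utf-16")

# Combining a BOM with an explicit-endianness codec leaves the BOM in
# the decoded text as U+FEFF, because that codec does not consume it.
bom_kept = with_bom.decode("utf-16-le")
```

Here from_bom equals the original text, mislabeled does not, and bom_kept
starts with a stray U+FEFF character.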

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/




