[whatwg] Character References in HTML

Lachlan Hunt lachlan.hunt at lachy.id.au
Wed Oct 12 09:51:46 PDT 2005

   This should probably belong in the parsing section, when you get up 
to writing it.

   In HTML4, according to SGML rules, numeric character references in 
the range from &#128; to &#159; are defined as UNUSED, which makes them 
"non-SGML characters".  Strictly speaking, it's not an error to refer to 
these characters with character references (even the validator only 
issues a warning: reference to a non-SGML character); but, AIUI, neither 
SGML nor HTML4 assigns any meaning to them.

Technically, these character references should refer to the Unicode C1 
control characters (U+0080 to U+009F), but reality dictates otherwise 
for text/html, thanks to IE and countless (poorly written) books and 
tutorials.  I therefore think the spec should say something along 
these lines:

   In HTML, decimal and hexadecimal numeric character references
   referring to code positions in the range from 128 to 159 (0x80 to
   0x9F) should be re-mapped to code positions in the Unicode character
   repertoire according to the CP1252 to Unicode table [CP1252].  This
   does not apply to XHTML.

   HTML documents must not use decimal or hexadecimal numeric character
   references in this range, although browsers should support them for
   backwards compatibility.  Authors should instead refer to the correct
   Unicode code position for these characters.
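
To illustrate the re-mapping proposed above, here is a minimal Python 
sketch.  The function name remap_code_point is hypothetical, and it 
assumes Python's built-in "cp1252" codec is an acceptable stand-in for 
the [CP1252] table.  That codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D 
unassigned; passing those through unchanged is an assumption of this 
sketch, not something the proposed text specifies.

```python
def remap_code_point(cp: int) -> int:
    """Map a numeric character reference's code position for text/html.

    Code positions 0x80 to 0x9F are looked up as CP1252 bytes; all
    other code positions are returned unchanged.
    """
    if 0x80 <= cp <= 0x9F:
        try:
            # Interpret the code position as a single CP1252 byte.
            return ord(bytes([cp]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: positions CP1252 leaves unassigned are kept
            # as the corresponding C1 control code position.
            return cp
    return cp


if __name__ == "__main__":
    for cp in (0x80, 0x93, 0x9F, 0x81, 0x41):
        print(f"&#{cp}; -> U+{remap_code_point(cp):04X}")
```

With this sketch, &#128; maps to U+20AC (EURO SIGN) and &#159; maps to 
U+0178, while references outside the range are left alone.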

Also, I think this would be a nice conformance requirement to see 
for authoring tools:

   HTML authoring tools should automatically convert these character
   references to either the equivalent Unicode code position or, if the
   file's encoding supports it, the character itself, according to the
   CP1252 to Unicode table [CP1252].
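
As a rough sketch of what such an authoring tool conversion could look 
like, the Python below rewrites references in the 128 to 159 range 
within a string of markup.  The name fix_references and its literal 
parameter are hypothetical, the regular expression only matches 
references terminated by a semicolon, and the built-in "cp1252" codec 
is assumed to stand in for the [CP1252] table.  References to positions 
that codec leaves unassigned are left untouched.

```python
import re

# Decimal (&#147;) or hexadecimal (&#x93;) numeric character references.
_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def fix_references(markup: str, literal: bool = False) -> str:
    """Rewrite references in the range 128 to 159 via CP1252.

    With literal=False the reference is replaced by a decimal reference
    to the equivalent Unicode code position; with literal=True it is
    replaced by the character itself (only sensible when the file's
    encoding can represent that character).
    """

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not 0x80 <= cp <= 0x9F:
            return match.group(0)  # outside the range: leave as is
        try:
            ch = bytes([cp]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # unassigned in CP1252: leave as is
        return ch if literal else f"&#{ord(ch)};"

    return _REF.sub(replace, markup)


if __name__ == "__main__":
    print(fix_references("&#147;quoted&#x94; &#65;"))
```

Whether a tool emits the corrected reference or the literal character 
would depend on the output encoding, which is why the choice is left 
to the caller here.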


None of that should apply to XHTML, since XML explicitly allows this 
range in the production for Char and, as far as I'm aware, no XHTML UA 
implements this buggy behaviour.

Lachlan Hunt
