[whatwg] Character References in HTML

Ian Hickson ian at hixie.ch
Tue Jun 5 17:20:09 PDT 2007

On Thu, 13 Oct 2005, Lachlan Hunt wrote:
> In HTML4, according to SGML rules, numeric character references in the 
> range from &#128; to &#159; are defined as UNUSED, which makes them 
> "non-SGML characters".  Strictly speaking, it's not an error to refer to 
> these characters with character references (even the validator only 
> issues a warning: reference to a non-SGML character); but, AIUI, neither 
> SGML nor HTML4 assigns any meaning to them. 
> http://lachy.id.au/log/2005/10/char-refs
> Technically, these character references should really refer to the 
> Unicode control characters, but reality dictates otherwise for 
> text/html, thanks to IE and countless (poorly written) books and 
> tutorials.  I, therefore, think the spec should say something along 
> these lines:
>   In HTML, numeric and hexadecimal character references referring to
>   code positions in the range from 128 to 159 (0x80 to 0x9F) should be
>   re-mapped to code positions in the Unicode character repertoire
>   according to the CP1252 to Unicode table [CP1252].  This does not
>   apply to XHTML.

Done. (With a must, and with an explicit table, since CP1252 doesn't 
define all those characters.)
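
For illustration only, here is a rough Python sketch of what such a 
remapping looks like. The names C1_REMAP, remap_code_point and 
decode_numeric_charrefs are made up for this example. The table lists 
the Windows-1252 assignments for 0x80 to 0x9F. The five positions that 
CP1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed 
through unchanged here, which is an assumption of the sketch and not 
necessarily what the spec's table says for them.

```python
import re

# Windows-1252 assignments for 0x80-0x9F. Positions 0x81, 0x8D, 0x8F,
# 0x90 and 0x9D are unassigned in CP1252 and are deliberately absent, so
# they pass through unchanged (an assumption of this sketch).
C1_REMAP = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E,
    0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C,
    0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9E: 0x017E, 0x9F: 0x0178,
}

# Matches decimal (&#128;) and hexadecimal (&#x80;) references only.
_CHARREF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def remap_code_point(cp):
    """Return the remapped code point, or cp itself if no remap applies."""
    return C1_REMAP.get(cp, cp)


def decode_numeric_charrefs(text):
    """Replace numeric character references in text with characters."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        cp = remap_code_point(cp)
        # Values that are not usable code points become U+FFFD here.
        if cp == 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return "\ufffd"
        return chr(cp)
    return _CHARREF.sub(repl, text)
```

With this, decode_numeric_charrefs("&#128;") gives U+20AC (the euro 
sign) rather than the C1 control character U+0080.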

>   HTML documents must not use numeric or hexadecimal character
>   references in this range, although browsers should support them for
>   backwards compatibility.  Authors should instead refer to the correct
>   Unicode code position for these characters.


> Also, I think this would be a nice conformance requirement to see for
> authoring tools:
>   HTML Authoring tools should automatically convert these character
>   references to either the equivalent Unicode code position or, if the
>   file's encoding supports it, the character itself, according to the
>   CP1252 to Unicode table [CP1252].

Not done, but it's redundant anyway since simply implementing the spec 
will do this automatically (the spec doesn't round-trip the out-of-range 
entities through the DOM).
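
To make the round-trip point concrete, here is a small Python sketch. 
It is not an authoring tool, and normalize_charrefs is a hypothetical 
name. It relies on the standard library's html.unescape, which applies 
the Windows-1252 remapping to references in the 128 to 159 range, and 
on the "xmlcharrefreplace" error handler to re-emit a reference only 
when the target encoding cannot represent the character. It assumes 
the input is plain text content, not arbitrary markup.

```python
import html


def normalize_charrefs(text_content, encoding="utf-8"):
    """Decode character references, then re-serialize for `encoding`.

    Assumes text_content is text content only (no tags), since the
    decoded text is re-escaped as a whole.
    """
    # Parsing step: &#150; becomes U+2013, &#128; becomes U+20AC, etc.
    decoded = html.unescape(text_content)
    # Serializing step: re-escape markup-significant characters first.
    escaped = html.escape(decoded, quote=False)
    # Characters the encoding cannot represent come back as decimal
    # references to their real Unicode code points.
    return escaped.encode(encoding, "xmlcharrefreplace").decode(encoding)
```

For example, normalize_charrefs("&#150;", "ascii") returns "&#8211;", 
and with "utf-8" it returns the en dash itself. In neither case does 
the original &#150; survive.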

> None of that should apply to XHTML, since XML explicitly allows this 
> range in the production for Char and, as far as I'm aware, no XHTML UA 
> implements this buggy behaviour.


Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
