[whatwg] Character References in HTML
lachlan.hunt at lachy.id.au
Wed Oct 12 09:51:46 PDT 2005
This probably belongs in the parsing section, once you get around
to writing it.
In HTML4, according to SGML rules, numeric character references in
the range from &#128; to &#159; are defined as UNUSED, which makes them
"non-SGML characters". Strictly speaking, it's not an error to refer to
these characters with character references (even the validator only
issues a warning: reference to a non-SGML character); but, AIUI, neither
SGML nor HTML4 assigns any meaning to them.
Technically, these character references should really refer to the
Unicode control characters, but reality dictates otherwise for
text/html, thanks to IE and countless (poorly written) books and
tutorials. I therefore think the spec should say something along
these lines:
In HTML, decimal and hexadecimal numeric character references referring to
code positions in the range from 128 to 159 (0x80 to 0x9F) should be
re-mapped to code positions in the Unicode character repertoire
according to the CP1252 to Unicode table [CP1252]. This does not
apply to XHTML.
HTML documents must not use decimal or hexadecimal numeric character
references in this range, although browsers should support them for
backwards compatibility. Authors should instead refer to the correct
Unicode code position for these characters.
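
To make the proposed browser behaviour concrete, here is a minimal sketch
in Python of how a text/html parser might resolve numeric character
references with the 128 to 159 re-mapping. The function names are
hypothetical, and Python's built-in "cp1252" codec stands in for the
[CP1252] table. That codec leaves five positions (0x81, 0x8D, 0x8F, 0x90,
0x9D) unassigned; passing those through unchanged is an assumption of this
sketch, not something the proposal above specifies.

```python
import re

# Matches decimal (&#150;) and hexadecimal (&#x96;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def remap_code_point(n: int) -> int:
    """Map code positions 128..159 through CP1252; leave all others unchanged."""
    if 0x80 <= n <= 0x9F:
        try:
            return ord(bytes([n]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: positions with no mapping in Python's cp1252 codec
            # are passed through as the original code position.
            return n
    return n


def resolve_references(text: str) -> str:
    """Replace numeric character references in text with the mapped characters."""

    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        n = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if n > 0x10FFFF:
            # Outside the Unicode range: leave the reference as written.
            return match.group(0)
        return chr(remap_code_point(n))

    return _NCR.sub(repl, text)


if __name__ == "__main__":
    # &#147; and &#148; become the curly quotes U+201C and U+201D.
    print(resolve_references("&#147;quoted&#148; and 10 &#x80;"))
```

An XHTML parser would skip remap_code_point entirely and use the code
position as written.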
I think this would also be a nice conformance requirement to see
for authoring tools:
HTML authoring tools should automatically convert these character
references to either the equivalent Unicode code position or, if the
file's encoding supports it, the character itself, according to the
CP1252 to Unicode table [CP1252].
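
The authoring-tool conversion could look roughly like the following Python
sketch. The function name and the choice to emit decimal references are
illustrative assumptions, and Python's "cp1252" codec again stands in for
the [CP1252] table. References whose position has no mapping in that codec
are left untouched here, which is a guess at sensible behaviour rather
than part of the proposal.

```python
import re

# Matches decimal (&#150;) and hexadecimal (&#x96;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def fix_c1_references(markup: str, encoding: str = "utf-8") -> str:
    """Rewrite references in the range 128..159 to their CP1252 equivalents.

    If the file's encoding can represent the character, emit the character
    itself; otherwise emit a reference to the correct Unicode code position.
    """

    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        n = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not 0x80 <= n <= 0x9F:
            return match.group(0)  # outside the problem range: keep as written
        try:
            char = bytes([n]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # assumption: no mapping, so leave it alone
        try:
            char.encode(encoding)
            return char  # the file's encoding supports the character
        except UnicodeEncodeError:
            return f"&#{ord(char)};"  # fall back to the correct code position

    return _NCR.sub(repl, markup)


if __name__ == "__main__":
    print(fix_c1_references("1990&#150;2005", "utf-8"))
    print(fix_c1_references("1990&#150;2005", "ascii"))
```

With a UTF-8 file the first call yields the en dash itself; with an ASCII
file the second call yields the reference &#8211; instead.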
None of that should apply to XHTML, since XML explicitly allows this
range in the production for Char and, as far as I'm aware, no XHTML UA
implements this buggy behaviour.
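
For reference, the XML 1.0 Char production is
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
A small Python sketch (the function name is only illustrative) confirms
that every code position from 128 to 159 falls inside it:

```python
def is_xml_char(cp: int) -> bool:
    """Return True if cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Check the whole range 128..159 (0x80..0x9F) against the production.
c1_allowed = all(is_xml_char(cp) for cp in range(0x80, 0xA0))

if __name__ == "__main__":
    print(c1_allowed)
```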