[whatwg] Fwd: Entity parsing
Øistein E. Andersen
liszt at coq.no
Fri Apr 24 15:14:31 PDT 2009
On 23 May 2008, at 03:50, Ian Hickson wrote:
> On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
>>
>> 1) Is it useful to handle unterminated entities followed by an
>> alphanumerical character like IE does? [...]
>>
>> 2) HTML 4.01 allows the semicolon to be omitted in certain cases.
>> [...] Firefox and Safari both support this, and it would
>> seem meaningless to change the way conforming documents are parsed
>> [...]
>>
>> 3) Will new entities ever be needed? If yes, can new entities adopt
>> existing conformance criteria and parsing rules?
>>
>> [...]
>
> New entities have since been added, and the rules for parsing entities
> (sorry, "named character references") have been changed a bit. However,
> I am reluctant to change this from what we have now, since what we have
> now works well. How strongly do you feel about this?
I think I may have expressed my concern in rather too abstract terms
previously.
The named character references currently present in HTML5 can be
subdivided (roughly) into the following subsets:
IE4 < HTML4 < HTML5
Approximately 100 named character references are included in the IE4
set, 200 in the HTML4 set, and 2,000 in the HTML5 set.
When a named character reference is followed by a semicolon, it
clearly has to be expanded, but how to handle non-semicolon-terminated
character references is less obvious.
Let &IE4 (resp. &HTML4, &HTML5) be a non-semicolon-terminated named
character reference from the IE4 (resp. HTML4, HTML5) set, and let .
(full stop) represent any character other than semicolon, and ^
(circumflex) any character which is (roughly) not an ASCII letter or
digit (i.e., [^a-zA-Z0-9]). Not completely unreasonable sets of
character references to expand (outside of attribute values) include:
1) &IE4^
2) &IE4.
3) &HTML4^
4) &IE4. &HTML4^
5) &HTML4.
6) &IE4. &HTML5^
7) &HTML4. &HTML5^
8) &HTML5.
(The set of character references to be expanded in attribute values
could be obtained by replacing . by ^ above.)
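The eight options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three dictionaries are tiny stand-ins for the real sets (their exact membership here is assumed, not taken from any browser), and the names OPTIONS and expand are invented for this example. Each option is written as a list of (set, follower) rules, where "." and "^" have the meaning given above, end of input is treated like ^, and semicolon-terminated references from the HTML5 set are always expanded.

```python
# Toy stand-ins for the three nested sets (membership assumed for illustration).
IE4 = {"amp": "&", "lt": "<", "gt": ">", "copy": "\u00a9"}
HTML4 = {**IE4, "ndash": "\u2013", "hellip": "\u2026"}
HTML5 = {**HTML4, "check": "\u2713"}

# Each option lists (set, follower) rules: "." = any non-semicolon character,
# "^" = any character that is not an ASCII letter or digit.
OPTIONS = {
    1: [(IE4, "^")],
    2: [(IE4, ".")],
    3: [(HTML4, "^")],
    4: [(IE4, "."), (HTML4, "^")],
    5: [(HTML4, ".")],
    6: [(IE4, "."), (HTML5, "^")],
    7: [(HTML4, "."), (HTML5, "^")],
    8: [(HTML5, ".")],
}


def _match(text, start, option, in_attribute):
    """Return (replacement, end_index) for a reference starting at `start`, or None."""
    # Try longer names first so that a short name never shadows a longer one.
    for name in sorted(HTML5, key=len, reverse=True):
        if not text.startswith(name, start):
            continue
        end = start + len(name)
        nxt = text[end:end + 1]  # empty string at end of input
        if nxt == ";":
            # Semicolon-terminated references are always expanded.
            return HTML5[name], end + 1
        for names, follower in OPTIONS[option]:
            if name not in names:
                continue
            # In attribute values, every "." rule is replaced by "^".
            effective = "^" if in_attribute else follower
            is_alnum = nxt.isascii() and nxt.isalnum()
            if effective == "." or not is_alnum:
                return names[name], end
    return None


def expand(text, option, in_attribute=False):
    """Expand named character references in `text` according to one option."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "&":
            hit = _match(text, i + 1, option, in_attribute)
            if hit:
                replacement, i = hit
                out.append(replacement)
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


print(expand("&copyright", 1))                     # &copyright
print(expand("&copyright", 2))                     # ©right
print(expand("&copyright", 2, in_attribute=True))  # &copyright
print(expand("10&ndash20", 4))                     # 10&ndash20
print(expand("10&ndash20", 5))                     # 10–20
```

With these toy sets, &ndash followed by a digit is left alone under 1) to 4) and only expanded from 5) onwards, whereas &copy followed by a letter is already expanded under 2).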
Currently, Opera follows 1), IE 2), and Safari and Firefox 3).
My main concern is that &HTML4^ is actually legitimate in HTML4 and
works in both Safari and Firefox today, and that HTML5 should not
change the rendering of valid HTML4 pages unless there is a good
reason to do so.
4) does not break any valid HTML4 pages, nor does it cause any
character references to be expanded that are not already expanded in
either IE or in both Safari and Firefox, so it should be possible to
implement.
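The claim about 4) can be checked mechanically. The sketch below is only a model under assumptions: each name is classified by the smallest set that contains it, each following character is classified as "alnum" or "other", and the helper name cases is invented for this example.

```python
from itertools import product

# Smallest set containing a name; each tier includes the previous ones.
TIERS = ["IE4", "HTML4", "HTML5"]


def cases(*rules):
    """Return the (tier, follower) combinations that a list of rules expands."""
    covered = set()
    for tier, follower in rules:
        tiers = TIERS[: TIERS.index(tier) + 1]
        followers = ["other"] if follower == "^" else ["alnum", "other"]
        covered |= set(product(tiers, followers))
    return covered


option2 = cases(("IE4", "."))                   # IE
option3 = cases(("HTML4", "^"))                 # Safari and Firefox; also valid HTML4
option4 = cases(("IE4", "."), ("HTML4", "^"))

# 4) expands exactly what 2) or 3) already expands, and covers all of 3).
print(option4 == option2 | option3)  # True
print(option3 <= option4)            # True
```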
[Options 5), 6) and 8) can, to a greater or lesser extent, be
specified more easily, but might be too controversial. There are pages
relying on, e.g., `10&ndash20' to work, though, so handling character
references in a more liberal way would actually have some benefits;
only invalid mark-up would be affected in any case; and the negative
effects are to a certain extent mitigated by the more conservative
treatment in attribute values. That being said, I do of course
realise that it will be seen as safer not to expand too many character
references as long as the actual impact remains difficult to quantify.]
--
Øistein E. Andersen