[whatwg] Entity parsing

Wed Jun 27 19:53:09 PDT 2007

On 28 Jun 2007, at 12:43AM, Ian Hickson wrote:

> Sadly none of the arguments in any direction right now are particularly 
> persuasive.

Indeed.

> I'm not really convinced that the data that the above proposed survey 
> might collect would actually help, since it doesn't tell us what was 
> intended by the author.

To a certain extent, this depends on the results.

Some conclusions can be drawn without actually knowing the author's intent
at all: if, for instance, "&foo[^;]" is exceedingly rare, then what the author meant
does not really matter, since the construct does not need to be supported anyway.
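
As a rough illustration of the kind of survey meant here, the following is a
minimal Python sketch that counts unterminated references in a piece of text.
The pattern and the name count_unterminated are assumptions made for this
example only; a real survey would also need a corpus and an entity table.

```python
import re
from collections import Counter

# Assumed pattern for "&foo[^;]": an ampersand, a name starting with a
# letter, then either end of input or a character that is neither ';'
# nor alphanumeric. The name captured is the longest alphanumeric run,
# since no entity table is consulted here.
UNTERMINATED = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(?=[^;A-Za-z0-9]|$)")

def count_unterminated(text):
    """Return a Counter of names used in references lacking a semicolon."""
    return Counter(m.group(1) for m in UNTERMINATED.finditer(text))

if __name__ == "__main__":
    sample = "AT&amp;T &copy 2007 caf&eacute and &foo."
    for name, n in sorted(count_unterminated(sample).items()):
        print(name, n)
```

Whether the resulting counts are "exceedingly rare" or not would depend
entirely on the documents sampled.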

I also tend to think that entities that form part of an actual word were most
likely meant to be expanded. Of course, 100% accuracy cannot be achieved,
but it is not needed for the results to be useful.

> Am I correct in assuming that you would like the spec changed? What would 
> you like the spec changed to, exactly?

I would really like an informed decision, and I currently get the impression
that rules are changed to follow IE by default rather than to handle existing
content, which may lead to unnecessarily complicated rules that do not
actually handle existing documents optimally.

More specifically, some of the points that probably should be
addressed are the following:

1) Is it useful to handle unterminated entities followed by an alphanumeric
character like IE does? The number of documents for which this actually helps
might be small compared to the number of documents that contain other,
incorrigible errors. The process also introduces errors, albeit not in conforming
documents. Is the gain worth the added complexity?

If so, then should this apply to all entities? (Probably not.) Would it be useful
to add to/remove from the set supported by IE7? (This may seem insane,
but we should try to avoid premature decisions.)

2) HTML 4.01 allows the semicolon to be omitted in certain cases. Does this
cause problems? Firefox and Safari both support this, and it would seem
meaningless to change the way conforming documents are parsed unless
it can be shown that, e.g., "&ndash " is actually meant as the literal text
"&ndash " more often than as "– ". (Conformance is a separate issue.)

3) Will new entities ever be needed? If yes, can new entities adopt existing
conformance criteria and parsing rules? 

4) Similar considerations for entities in attribute values.
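
To make the difference between points 1 and 2 concrete, here is a small
Python sketch of the two behaviours. Everything in it is an assumption for
illustration: the four-entry ENTITIES table, the expand function and its
lenient flag are hypothetical, and "not alphanumeric" is only a simplified
stand-in for the real termination rules of HTML 4.01 or of any browser.

```python
import re

# Illustrative subset only; real user agents use a much larger table.
ENTITIES = {"amp": "&", "copy": "\u00a9", "ndash": "\u2013", "nbsp": "\u00a0"}

# Try longer names first so the alternation prefers the longest known name.
_NAMES = "|".join(sorted(ENTITIES, key=len, reverse=True))
# Group 1: name, group 2: optional ';', group 3: the following character
# (empty at end of input), captured by lookahead without consuming it.
_REF = re.compile(r"&(%s)(;?)(?=(.?))" % _NAMES, re.S)

def expand(text, lenient=False):
    """Expand known references.

    lenient=False: an unterminated reference is expanded only when the
    next character is not alphanumeric (roughly point 2).
    lenient=True: an unterminated reference is expanded even when an
    alphanumeric character follows (roughly point 1).
    """
    def repl(m):
        name, semi, nxt = m.group(1), m.group(2), m.group(3)
        if semi or lenient or not nxt.isalnum():
            return ENTITIES[name]
        return m.group(0)  # leave the text untouched
    return _REF.sub(repl, text)

if __name__ == "__main__":
    for s in ("&copy 2007", "&copy2007", "AT&amp;T", "&ndash "):
        print(repr(s), repr(expand(s)), repr(expand(s, lenient=True)))
```

The two modes differ only for input such as "&copy2007", which is exactly
the class of documents whose frequency a survey would need to establish.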

-- 
Øistein E. Andersen