[whatwg] Possible bug in the character encoding detection algorithm
James Graham
jg307 at cam.ac.uk
Fri Mar 2 15:02:27 PST 2007
Given the following line of input:
<a b='c'>
012345678 - byte numbers for reference
I believe the steps in the spec have the following effect:
Match <a
Advance position to 2
Get an attribute
Advance position to 3
Attribute Name = b
Advance position to 4
Jump to step labeled "value"
(Presumably at this point we want to advance to position 5; the spec
does not mention this)
b = ' (b here is the spec's quote-byte variable, not the attribute name)
Advance position to 6
Attribute Value = c
Advance position to 7
Stop looking for an attribute
Get an attribute
Attribute Name = '
Advance position to 8
Stop looking for an attribute
Retract position to 7
Stop looking for an attribute
Get an attribute...
This seems to lead to an infinite loop (IIRC the same thing happens for
unquoted values). html5lib currently sidesteps the issue by not moving
the position back one after finding an attribute. However, that fails to
locate the character encoding in, e.g.:
<meta http-equiv="Content-Type<meta charset="utf-8">
One possibility is to get all the attributes and then, if the current
byte is an ASCII '<', move the position back one.
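
To make that suggestion concrete, here is a rough Python sketch of how it
might look. It is not the spec's algorithm and not html5lib's code: the
names get_attribute and find_charset are made up for illustration, only
the charset attribute of <meta> is checked, and it assumes (as the failing
example above implies) that a '<' ends an attribute even inside a quoted
value. get_attribute never steps back itself; the single step back happens
once, after all the attributes of a tag have been read.

```python
WS = b"\t\n\x0c\r "          # bytes treated as whitespace in this sketch
NAME_END = WS + b"=<>/"      # bytes that end a tag or attribute name


def get_attribute(data, pos):
    """Return ((name, value) or None, position of the first unread byte)."""
    n = len(data)
    while pos < n and (data[pos] in WS or data[pos] == ord("/")):
        pos += 1
    if pos >= n or data[pos] in b"<>":
        return None, pos                     # no attribute here
    start = pos
    while pos < n and data[pos] not in NAME_END:
        pos += 1
    name = data[start:pos].lower()
    while pos < n and data[pos] in WS:
        pos += 1
    if pos >= n or data[pos] != ord("="):
        return (name, b""), pos              # attribute without a value
    pos += 1                                 # step past '='
    while pos < n and data[pos] in WS:
        pos += 1
    quote = data[pos] if pos < n and data[pos] in b"\"'" else None
    if quote is not None:
        pos += 1                             # step past the opening quote
    start = pos
    while pos < n:
        byte = data[pos]
        if byte == ord("<") or byte == quote:
            break
        if quote is None and (byte in WS or byte == ord(">")):
            break
        pos += 1
    value = data[start:pos]
    if quote is not None and pos < n and data[pos] == quote:
        pos += 1                             # step past the closing quote
    return (name, value), pos


def find_charset(data):
    """Scan every tag; return the first <meta charset=...> value, or None."""
    pos, n = 0, len(data)
    while pos < n:
        if data[pos] == ord("<") and data[pos + 1:pos + 2].isalpha():
            start = pos = pos + 1
            while pos < n and data[pos] not in NAME_END:
                pos += 1
            tag = data[start:pos].lower()
            while True:                      # get all the attributes first
                attr, pos = get_attribute(data, pos)
                if attr is None:
                    break
                if tag == b"meta" and attr[0] == b"charset" and attr[1]:
                    return attr[1].decode("ascii", "replace")
            if pos < n and data[pos] == ord("<"):
                pos -= 1                     # the one step back suggested above
        pos += 1                             # "next byte"
    return None
```

Because every successful get_attribute call consumes at least one byte and
the step back only ever cancels the following "next byte" step, the outer
loop always makes progress, so <a b='c'> terminates and the second <meta
in the broken example is still seen.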
--
"The universe doesn't care what you believe. The wonderful thing about
science is that it doesn't ask for your faith, it just asks for your
eyes" --- http://xkcd.com/c154.html