[whatwg] Discrepancies between HTML and ES rules for parsing an integer or float

Mon Jan 23 16:18:31 PST 2012

On Wed, 3 Aug 2011, Aryeh Gregor wrote:
>
> Hixie just WONTFIXed two bugs that I thought might be of interest:
> 
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=12220
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296
> 
> Basically, HTML defines some algorithms for parsing integers, floats, 
> etc., which are used in converting DOM to IDL attributes for reflection 
> (among other things):
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#numbers
> 
> The algorithms for parsing integers and floats are almost exactly the 
> same as ECMAScript's parseInt() and parseFloat(), down to some of the 
> language being copied word-for-word, but with subtle differences 
> involving (at least) whitespace handling.  IMO, this is bad for several 
> reasons:
> 
> * It's confusing to both authors and implementers to have multiple 
> almost identical algorithms.  Nobody's going to expect the discrepancy 
> in the corner cases where it matters.
>
> * It's confusing to people reading the spec for there to be these extra 
> algorithms defined, whose relationship to the ES algorithms is not 
> obvious.  The HTML and ES algorithms are written in entirely different 
> styles and it's hard to tell what the differences are from side-by-side 
> inspections.
>
> * In at least some cases, all browsers match ES and none match the spec 
> -- see <http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296#c4>.
>
> * Browsers will have to maintain the ES algorithms as well as the HTML 
> algorithms, so even if the HTML algorithms are superior, it doesn't save 
> anyone the effort of understanding or implementing the ES algorithms.
> 
> So I think HTML should just defer to ES here.

The reasons for not doing so are:

 - The exact ES algorithm would need preprocessing anyway, to exclude 
   values like Infinity or NaN.

 - Having the algorithm depend on Unicode would mean HTML processing would 
   change over time without good reason. There's no need to support 
   non-ASCII characters in numeric attributes. (HTML generally is designed 
   to only use ASCII characters.)

 - It's simpler to implement from scratch if the HTML spec just defines 
   the algorithm than having to defer to another spec. This is especially 
   the case because the JS algorithms support features we don't need, e.g. 
   parseInt() supports a radix argument, and because the rules for parsing 
   floats in HTML are significantly more straightforward than in ES.

 - The JS algorithms allow approximations that are unnecessary to support 
   in the HTML spec.

 - If you're writing an HTML tool, it's simpler to just use an HTML 
   library that defines the HTML algorithms than to use both an HTML 
   library _and_ a JS library.

 - If you're writing a library, it's simpler to not have to include a JS 
   library just for a few parsing primitives.

 - If you're not going to use another library, then there's nothing gained 
   from referencing another spec.

 - It's simpler to spec and to understand if we're not deferring to other 
   specs for simple things like microsyntax parsers.
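
To illustrate the first point, here is a minimal sketch of the kind of 
wrapper that deferring to ES would still require. The helper name 
parseFloatForHtml is hypothetical and appears in neither spec; it only 
shows that parseFloat() results such as Infinity and NaN would have to be 
filtered out before being used as a numeric attribute value.

```typescript
// Hypothetical helper (not defined by HTML or ES): wraps ECMAScript's
// parseFloat() and rejects results that are not finite numbers.
function parseFloatForHtml(input: string): number | null {
  const value = parseFloat(input);
  // parseFloat("Infinity") returns Infinity and parseFloat("abc") returns
  // NaN; isFinite() is false for both, so both are reported as errors.
  return isFinite(value) ? value : null;
}

console.log(parseFloat("Infinity"));        // Infinity
console.log(parseFloatForHtml("Infinity")); // null
console.log(parseFloatForHtml("abc"));      // null
console.log(parseFloatForHtml("1.5e3"));    // 1500
```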

On Thu, 4 Aug 2011, Jonas Sicking wrote:
> 
> It would make sense to me to match ES here.

With the exception of the definition of "leading white space" and how 
approximations are handled in the face of hardware limitations, we do 
match ES. An implementation that wanted to share common code here would be 
able to already.
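
As a rough illustration of the white space difference, the sketch below 
compares parseInt() with a simplified reading of HTML's rules for parsing 
integers. It assumes HTML's set of space characters is space, tab, LF, FF, 
and CR; the function name parseHtmlInteger is hypothetical, and the code is 
not a complete or authoritative implementation of the spec's algorithm.

```typescript
// Assumed HTML "space characters": space, tab, LF, FF, CR (all ASCII).
const HTML_WHITESPACE = " \t\n\f\r";

// Hypothetical sketch of HTML's rules for parsing integers.
// Returns null where the HTML algorithm would report an error.
function parseHtmlInteger(input: string): number | null {
  let pos = 0;
  // Skip leading ASCII white space only.
  while (pos < input.length && HTML_WHITESPACE.indexOf(input.charAt(pos)) !== -1) {
    pos++;
  }
  // Optional sign.
  let sign = 1;
  if (input.charAt(pos) === "-") {
    sign = -1;
    pos++;
  } else if (input.charAt(pos) === "+") {
    pos++;
  }
  // Collect ASCII digits; trailing garbage is ignored, as in parseInt().
  const start = pos;
  while (pos < input.length && input.charAt(pos) >= "0" && input.charAt(pos) <= "9") {
    pos++;
  }
  if (pos === start) {
    return null; // no digits found
  }
  return sign * parseInt(input.slice(start, pos), 10);
}

// Both agree when the leading white space is ASCII.
console.log(parseInt(" 42px", 10));        // 42
console.log(parseHtmlInteger(" 42px"));    // 42

// U+00A0 (no-break space) counts as white space for ES, but not for HTML.
console.log(parseInt("\u00A042", 10));     // 42
console.log(parseHtmlInteger("\u00A042")); // null
```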

On Fri, 5 Aug 2011, Jonas Sicking wrote:
> 
> Sounds good. I'm for such a change yes.

There are two possible changes here: making the HTML spec's definition of 
parsing numbers use Unicode's varying definition of whitespace rather than 
a small set (which would make HTML parsing depend on non-ASCII values), or 
just referencing the JS spec directly. For the reasons described above, I 
have not done either at this time.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'