[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

Aryeh Gregor Simetrical+w3c at gmail.com
Thu Feb 3 17:25:51 PST 2011


On Thu, Feb 3, 2011 at 4:41 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> And all I'm saying is that there are at least three pieces of data here:
>
> 1)  innerText return value
> 2)  Selection.toString() return value
> 3)  What the browser actually copies
>
> My point is that browsers must be free to modify #3 as desired. Dictating it
> in a web spec is not acceptable, imo.

Sure.

> Agreed, I think; but should that be Selection.toString() or some other API?
>  That is, are we hijacking Selection.toString() because it's convenient, or
> because it's the right way to expose such an algorithm?

innerText seems like a reasonable place to put such an API, if only
because WebKit already does it.  It's not ideal a priori, but by the
consistency standards of the web platform it's not noticeably bad.  I
should particularly point out that your typical author is not going to
have the foggiest notion of separation of DOM and CSS and so on -- it
will make intuitive sense to authors to have it at innerText as much
as anywhere.

I did actually find a couple of sites that defined functions that
accepted an HTML string, created a div, assigned the HTML to the div's
innerHTML, and returned the innerText (or textContent if innerText is
unavailable):

http://api.opencast.naver.com/CS888/23
http://bbs.ptbus.com/thread-22143-1-1.html

I didn't find what they were actually used for, though.  Note that
this breaks if innerText doesn't work correctly for non-displayed
elements, so basically it will only do any prettification in IE.
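
For concreteness, the pattern on those sites looks roughly like the
sketch below.  This is my own reconstruction, not code copied from
either page: the name htmlToText and the ElementLike/DocumentLike
types are made up for illustration, and the document object is passed
in as a parameter only so the sketch doesn't depend on a real DOM.

```typescript
// Hypothetical structural types: just the members the helper touches.
interface ElementLike {
  innerHTML: string;
  innerText?: string;          // may be missing in some browsers
  textContent: string | null;
}

interface DocumentLike {
  createElement(tag: string): ElementLike;
}

// Parse an HTML string in a detached div and read its text back out.
// The div is never displayed, so whether innerText does any
// "prettification" here depends on how the browser treats
// non-displayed elements.
function htmlToText(html: string, doc: DocumentLike): string {
  const div = doc.createElement("div");
  div.innerHTML = html;
  // Prefer innerText when it exists, otherwise fall back to textContent.
  return div.innerText !== undefined ? div.innerText : (div.textContent ?? "");
}
```

In a page this would presumably be called as htmlToText(someHtml,
document).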

> Depending on your definition of "okay", yes.  I mean... we have an "okay"
> way that's interoperable now (I hope): Range.toString.  Except you don't
> think it does an okay job, clearly.  I agree on that; I don't necessarily
> agree that current browser Selection.toString does an "okay" job.

Actually, if browsers are willing to converge on making innerText work
like textContent and Selection.toString() work like Range.toString(),
I'd be okay with that.  There are use-cases for a standardized
plaintext conversion API, but at this point I think they're too
marginal to be worth the effort of actually specifying and
implementing.  Such an API is inherently going to be either not very
good or unreasonably complicated.  There's no reason at all that you
couldn't implement such an API in a JavaScript library -- I don't see
why it has to be part of the web platform.
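
To give a rough idea of what I mean by a library implementation, here
is a deliberately naive sketch.  Everything in it is an assumption of
mine rather than any browser's algorithm: the NodeLike tree type, the
BLOCK_TAGS list, and the whitespace rules are all made up, and it
ignores CSS entirely (which is exactly the "not very good" end of the
tradeoff).

```typescript
// Hypothetical minimal tree, standing in for real DOM nodes.
type TextNode = { kind: "text"; data: string };
type ElemNode = { kind: "element"; tag: string; children: NodeLike[] };
type NodeLike = TextNode | ElemNode;

// Assumed list of tags treated as block-level; a real converter
// would have to consult computed style instead.
const BLOCK_TAGS = ["div", "p", "li", "ul", "ol", "h1", "h2", "h3", "pre"];

function toPlainText(root: NodeLike): string {
  const out: string[] = [];
  const walk = (n: NodeLike): void => {
    if (n.kind === "text") {
      // Collapse runs of whitespace the way normal inline text would.
      out.push(n.data.replace(/\s+/g, " "));
      return;
    }
    if (n.tag === "br") {
      out.push("\n");
      return;
    }
    const isBlock = BLOCK_TAGS.indexOf(n.tag) !== -1;
    if (isBlock) out.push("\n");
    n.children.forEach(walk);
    if (isBlock) out.push("\n");
  };
  walk(root);
  return out
    .join("")
    .replace(/ *\n */g, "\n") // drop spaces next to line breaks
    .replace(/\n+/g, "\n")    // simplification: also merges repeated <br>
    .trim();
}
```

For a div holding two paragraphs, "Hello <b>world</b>" and "Second
line", this returns "Hello world" and "Second line" on separate lines,
where a textContent-style concatenation would run them together.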

I've been told Opera doesn't care about this and will implement
whatever is specced as long as it's web-compatible and not too
complicated to be worth the effort.  Gecko (at least that portion that
I'm talking to :) ) also seems skeptical of implementing anything very
complicated here.  But Maciej has told me that WebKit
doesn't want to scrap its elaborate plaintext-conversion APIs (which
have by far the best fidelity of any browser's from what I see).

So as I see it, the easiest solution would be for WebKit to agree to
move its APIs to prefixed versions if it wants to keep them, and
change behavior of the unprefixed ones to something like textContent.
(Possibly with minimal differences for web-compat -- Opera's versions
behave slightly differently, and IIRC I was told that's for web-compat
reasons.)

On the other hand, if WebKit is unwilling to accept anything other
than a complicated plaintext conversion algorithm here, I don't think
we're going to have interop in the foreseeable future no matter what.
Even if it gets specced, no one will want to implement it.  I'm not
clear on whether WebKit would be willing to implement a standardized
algorithm either, given the nonexistent web-compat issues.  So in that
case I'd try to ask Microsoft, and unless they side with WebKit, we
can at least have everyone but WebKit converge on
innerText/Selection.toString() behaving as similarly to
textContent/Range.toString() as possible.

How does that sound to everyone?

