[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

Aryeh Gregor Simetrical+w3c at gmail.com
Fri Feb 4 11:59:54 PST 2011

On Fri, Feb 4, 2011 at 5:22 AM, Tim Down <timdown at gmail.com> wrote:
> It sounds less than ideal to me. From the perspective of web
> developer, that removes useful functionality. I'm not too bothered
> about innerText, but it's not hard to come up with use cases for an
> implementation of Selection.toString that returns the text that is
> visually selected on the screen rather than the trivial concatenation
> of calling toString on its Ranges. For example, a bookmarklet to
> search the web for the text the user has selected in the current page,
> or a tooltip that show content relating to the current selected text.
> I don't think it's necessary to have perfect interoperability for this
> to be useful: the current situation is not that bad, although IE9
> worsens it since it implements the Range-toString-concatenation
> approach that is in the current spec and is now being suggested again.
> I also suspect that use of Selection.toString is fairly widespread and
> browsers changing their implementation to this could break a lot of
> pages.

Actually, from what I can see, Opera, IE8, and IE9 all include
display: none text in their selection stringification
(document.selection.createRange().text in IE8's case).  Firefox also
did until recently.  So compat problems seem unlikely, and it's also
hard to call it removing useful functionality when the functionality
has never been available in most users' browsers anyway.

If the use-case is to search for a selection, it would be enough to
skip over display: none elements.  Gecko, WebKit, and IE8 all
stringify in much more elaborate ways, with lots of intricate
whitespace rules.  If we decided we don't care about whitespace and
just want to skip display: none elements for Selection.toString(),
then we're talking about much simpler rules than what some browsers
currently do, so it sounds a lot more feasible.
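
To give a rough idea of how small the "just skip display: none" rule
is, here is a minimal sketch.  It is not what any browser does.  It
works on a whole subtree rather than a real selection, and the
plain-object nodes, the displayNone flag, and the names makeText,
makeElement, and visibleText are all made up for illustration.  In a
page, the flag would presumably come from something like
getComputedStyle(el).display === "none".

```javascript
// Hypothetical stand-ins for DOM nodes so the sketch runs outside a browser.
// nodeType 1 = element, 3 = text (same values as the DOM constants).
function makeText(data) {
  return { nodeType: 3, data, childNodes: [] };
}
function makeElement(tagName, displayNone, ...childNodes) {
  return { nodeType: 1, tagName, displayNone, childNodes };
}

// Concatenate text data in tree order, skipping any element flagged as
// display: none together with everything inside it. No whitespace rules.
function visibleText(root) {
  let s = "";
  for (const child of root.childNodes) {
    if (child.nodeType === 3) {
      s += child.data;
    } else if (child.nodeType === 1 && !child.displayNone) {
      s += visibleText(child);
    }
  }
  return s;
}

const para = makeElement("p", false,
  makeText("search "),
  makeElement("span", true, makeText("hidden ")),
  makeElement("b", false, makeText("this")));

console.log(JSON.stringify(visibleText(para)));
```

Whether output like this would actually satisfy the search or tooltip
use cases is a separate question.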

But it's not really clear to me that this behavior would be good
enough for authors' use-cases anyway.  It's really just a convenience
function -- authors can implement the functionality pretty easily in
JavaScript.  So in the absence of clear demand, I don't know why this
has to be provided by browsers instead of JS libraries.  Generally if
a feature is readily doable from JS, we only add browser support if
it's needed for performance or the feature is very useful, right?

On Fri, Feb 4, 2011 at 10:32 AM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> Until they try to use it on a disconnected subtree and it does something
> weird, right?

Well, it shouldn't do weird stuff on a disconnected subtree.  :)  It
doesn't in IE.

> I assume Maciej has particular use cases for them in mind?

I guess I'll let him speak for himself.

> This whole thing seems to me like an exercise in premature standardization.
>  Browsers are actively experimenting with their dom-to-text conversion APIs.
>  It'd be nice if it were happening behind vendor prefixes, but they started
> before such prefixing was popular in the DOM world.

Authors are using these features, and they're implemented
inconsistently.  If browsers are experimenting and you think there's
some chance that we'll eventually get a standardizable algorithm, then
I don't see why the new algorithm can't use a new prefixed name while
we reserve the legacy names for legacy-compatible behavior.

> Why?  What's the point?
> Or put another way, why is converting on innerText behaving like textContent
> better than converging on not having innerText at all?

From what WebKit and Opera people have told me, innerText is necessary
for web-compat for non-Gecko browsers.  There are sites out there that
use textContent if they sniff Firefox, and innerText otherwise.
innerText apparently can't be exactly the same as textContent --
Maciej said that "I know that if <br> doesn't produce newlines, stuff
will break", and Opera does add extra newlines for <br> (but doesn't
seem to change much else).

At this point, I'm going to say we should spec innerText as

1. Let s be the empty string.

2. For each descendant of the context node in tree order:

    1. If the descendant is a text node, append its data to s.

    2. If the descendant is a <br> element, append "\n" to s.

3. Return s.
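
The steps above can be sketched in JavaScript roughly as follows.
This is only an illustration of the proposed algorithm, not any
browser's implementation.  The plain-object nodes and the names text,
element, and simpleInnerText are assumptions made so that the example
runs outside a browser.

```javascript
// Hypothetical stand-ins for DOM nodes.
// nodeType 1 = element, 3 = text (same values as the DOM constants).
function text(data) {
  return { nodeType: 3, data, childNodes: [] };
}
function element(tagName, ...childNodes) {
  return { nodeType: 1, tagName: tagName.toUpperCase(), childNodes };
}

function simpleInnerText(contextNode) {
  // Step 1: start with the empty string.
  let s = "";
  // Step 2: visit each descendant (not the context node itself) in tree order.
  const visit = (node) => {
    for (const child of node.childNodes) {
      if (child.nodeType === 3) {
        s += child.data;              // 2.1: text node, append its data
      } else if (child.nodeType === 1 && child.tagName === "BR") {
        s += "\n";                    // 2.2: <br> element, append a newline
      }
      visit(child);
    }
  };
  visit(contextNode);
  // Step 3: return the accumulated string.
  return s;
}

const div = element("div",
  text("one"),
  element("br"),
  element("b", text("two")),
  text(" three"));

console.log(JSON.stringify(simpleInnerText(div))); // prints "one\ntwo three"
```

Without step 2.2 the result is just the concatenated text data, i.e.
what textContent returns for an element.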

That's basically what Opera does, and it apparently works for them.
At least we'll have a spec that's known to be basically
web-compatible, rather than no spec, and at least it's easy to spec
and implement.  If not all browsers are willing to go along, well, I
don't see any behavior here that browsers will all be willing to go
along with.  I do hope Gecko will implement innerText at some point,
even if it is mostly redundant with textContent, because other
browsers reportedly can't drop it, and that's a small browser
incompatibility that could easily be removed.

I'm slightly less sure about Selection.toString(), but I'd be inclined
to take the same general approach.  It's much better for authors to
have to code around browsers not offering them enough features than to
code around browsers offering them incompatible features.  At least
then they only have to do the coding work once.
