[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

Wed Feb 2 16:50:06 PST 2011

On Wed, Feb 2, 2011 at 5:30 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> This doesn't work for disconnected subtrees.  Or rather, it presupposes
> certain things about the browser's architecture that I don't think we want
> to presuppose.

Specifically what?  That browsers might not resolve CSS for
disconnected subtrees?  Note that AFAICT, WebKit treats innerText like
textContent for such subtrees, and Gecko returns the empty string when
you stringify a Selection that's not displayed.  This seems
unreasonable from an author perspective, but it's not a big deal, so I
can spec something different if it would be simpler for browsers.

(Not sure what it should be, though.  Empty string, textContent-like
behavior, or something that behaves like the normal algorithm except
ignoring CSS?  The last seems like the most complicated by far.  I'd
lean toward an empty string, because it seems the least mysterious.)
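
To make the three options concrete, here is a rough sketch on a made-up
tree model.  Nothing in it is a real DOM API or any browser's code:
TreeNode, BLOCK_TAGS, sampleTree and the three function names are
hypothetical, and the fixed tag list is only an assumed stand-in for
what the normal algorithm would otherwise learn from CSS.

```typescript
// Hypothetical minimal tree model; not a real DOM API.
type TextNode = { kind: "text"; data: string };
type ElementNode = { kind: "element"; tag: string; children: TreeNode[] };
type TreeNode = TextNode | ElementNode;

// Assumption: with no CSS available, block-ness is guessed from the tag name.
const BLOCK_TAGS: string[] = ["div", "p"];

// Option 1: a disconnected subtree always stringifies to the empty string.
function emptyStringOption(_root: TreeNode): string {
  return "";
}

// Option 2: textContent-like, concatenating all descendant text as-is.
function textContentOption(root: TreeNode): string {
  if (root.kind === "text") return root.data;
  return root.children.map(textContentOption).join("");
}

// Option 3: a layout-flavored conversion that ignores CSS: each block-level
// tag ends its text with a single newline (never doubled up).
function cssIgnoringOption(root: TreeNode): string {
  if (root.kind === "text") return root.data;
  const inner = root.children.map(cssIgnoringOption).join("");
  if (BLOCK_TAGS.indexOf(root.tag) === -1) return inner;
  return inner.charAt(inner.length - 1) === "\n" ? inner : inner + "\n";
}

// Models <div><p>one</p><p>two</p></div>, not attached to any document.
const sampleTree: TreeNode = {
  kind: "element",
  tag: "div",
  children: [
    { kind: "element", tag: "p", children: [{ kind: "text", data: "one" }] },
    { kind: "element", tag: "p", children: [{ kind: "text", data: "two" }] },
  ],
};
```

On sampleTree the three functions give "", "onetwo" and "one\ntwo\n"
respectively.  Even this toy version of the third option needs a
per-tag rule, which is why it looks like the most work to specify.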

> That may be ok for Selection (though not sure it is for programmatic ones;
> see https://bugzilla.mozilla.org/show_bug.cgi?id=585229), but I fail to see
> why it's OK for a DOM property like innerText.

In WebKit, innerText is essentially the same as selecting the node and
stringifying the Selection -- they use the same code and produce
almost exactly the same results in my tests (modulo stuff like
trailing newlines).  So maybe it shouldn't have been a DOM property,
but that's how it works.  IE8 behaves similarly.

> Note that until recently Gecko had no such dependency in
> selection.toString().   We made some changes because of the "it's not what
> the user sees" issue, but it's a pretty complicated problem, because due to
> CSS out-of-flows "what the user sees" and "a DOM range" might have very
> little to do with each other.
>
> You may want to read https://bugzilla.mozilla.org/show_bug.cgi?id=39098 for
> some background on this part.

What do you mean by "out-of-flows"?  Clearly we can't do better than
just an approximation here, since we're not going to handle stuff like
absolute positioning and so on.

> Generated content is tough, because there's no way to capture it with DOM
> ranges.  So if you're using DOM ranges to represent your selections, there's
> just no sane way to handle generated content.

From a UI perspective it's weird, yeah, but it doesn't seem hard.
You'd have to have the selection jump, so that it includes either the
whole stretch of generated content or none of it.  This is the way the
UI looks in Gecko right now for images that are displaying their alt
text, like:

data:text/html,<img alt=test>

From a programmatic perspective, it's also fairly straightforward to
see how it would work, as long as you don't demand that it be possible
to partially select generated content.  Of course, it might not be
straightforward to implement.
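
As a sketch of the all-or-nothing idea, assume the selection is a pair
of offsets into the rendered text and each stretch of generated content
is a known offset run.  OffsetRange, snapOffset and snapSelection are
hypothetical names, and snapping to the nearer boundary is just one
possible rule I picked for illustration, not what any browser does.

```typescript
// Half-open [start, end) offsets into the rendered text (hypothetical model).
type OffsetRange = { start: number; end: number };

// If the offset falls strictly inside a generated run, move it to the
// nearer boundary of that run (ties go to the run's end); otherwise
// return it unchanged.
function snapOffset(offset: number, generated: OffsetRange[]): number {
  for (const run of generated) {
    if (offset > run.start && offset < run.end) {
      return offset - run.start < run.end - offset ? run.start : run.end;
    }
  }
  return offset;
}

// Snap both endpoints so that every generated run ends up either fully
// inside or fully outside the selection.
function snapSelection(sel: OffsetRange, generated: OffsetRange[]): OffsetRange {
  return {
    start: snapOffset(sel.start, generated),
    end: snapOffset(sel.end, generated),
  };
}
```

With one generated run at [5, 9), a selection of [2, 6) snaps to [2, 5)
and excludes the run, while [2, 8) snaps to [2, 9) and includes all of
it.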

> Looking briefly over the code we use to serialize to text for copy/paste
> (but also for other purposes, so this code has several different modes,
> which complicates things), there's stuff there to deal specially with tabs,
> nested ordered lists, <h*> vertical spacing and indentation, non-breaking
> spaces, blockquote (especially of type="cite"), noscript/noframes/iframe,
> <p>, <pre> (especially inside blockquotes), <tr>, <td>/<th>, <dl>/<dt>,
> <span> (nesting level affects whether pretty line-wrapping happens or
> something like that), <q>, tags that are "block-level" in the HTML4 sense,
> <sup> and <sub>, <code>, <strong> and <b>, <em> and <i>, <u>.
>
> Plus there's the black magic about when to rewrap things and when to
> preserve the original whitespace or whatnot.
>
> See
> http://hg.mozilla.org/mozilla-central/file/1c2d53a2dcfb/content/base/src/nsPlainTextSerializer.cpp
> for details.

Thanks, I'll test those and take a look at that code.

> I should note that it's not clear to me how much we want to standardize what
> browsers actually copy when the user copies.  This seems like something that
> users may want to configure and where we want to let browsers experiment
> with heuristics and such; I have a really hard time believing that the
> current browser behavior here is the best we can do.

This occurred to me too.  It seems like a must to standardize how
innerText and Selection.toString() behave, because those are visible
to script and pretty widely used, and the interop story right now is
terrible.  Of course, there's nothing to stop implementations from
experimenting and passing the improvements back to the spec.

> That leaves the question of whether Selection.toString should produce the
> same string as the user copying and pasting would, of course. Perhaps it
> shouldn't.  I'm not sure we'd want to make what toString produces depend on
> new CSS layout modes, for example, since that could break scripts... but the
> user-facing copied text might want to depend on those.

I'm not sure why it would break many existing pages if it only kicks
in with new layout modes.  But maybe I don't have a good enough grasp
on how these functions are actually used.  I should probably comb
through a sample of web pages to see how people use this stuff.
(Unfortunately it's not so easy to search for Selection
stringification, but I can look for innerText.)