[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

Wed Feb 2 14:30:36 PST 2011

On 2/2/11 4:51 PM, Aryeh Gregor wrote:
> I've based the spec entirely on CSS, with no reference to
> specific HTML elements, because this matches what the user sees.

This doesn't work for disconnected subtrees.  Or rather, it presupposes 
certain things about the browser's architecture that I don't think we 
want to presuppose.

That may be OK for Selection (though I'm not sure it is for programmatic 
ones; see https://bugzilla.mozilla.org/show_bug.cgi?id=585229), but I 
fail to see why it's OK for a DOM property like innerText.
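
To make the disconnected-subtree concern concrete, here is a minimal 
sketch in Python. It is a toy model, not any browser's code: it assumes 
an engine that only computes style for elements that are in a document, 
and the names Element and css_based_text are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    text: str = ""
    # None models "no style has been computed for this element",
    # which is the assumed state of a subtree that is not in a document.
    display: Optional[str] = None
    children: List["Element"] = field(default_factory=list)


def css_based_text(el: Element) -> str:
    """Convert to text using only computed style, as a CSS-only spec would."""
    if el.display is None:
        # A purely CSS-based algorithm has nothing to go on here.
        raise ValueError("no computed style available for this element")
    if el.display == "none":
        return ""
    return el.text + "".join(css_based_text(c) for c in el.children)


# In-document tree: style is known, so the conversion works.
attached = Element("Hello ", "block", [
    Element("hidden", "none"),
    Element("world", "inline"),
])

# Same shape, but never inserted into a document: no style anywhere.
detached = Element("Hello ", None, [Element("world", None)])
```

In this model css_based_text(attached) gives "Hello world", while the 
detached tree cannot be converted at all unless the engine is willing to 
compute style for nodes outside a document, which is the architectural 
assumption in question.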

> A CSS dependency is unavoidable anyway because of things like display:
> none, so I see no reason to minimize it.

Note that until recently Gecko had no such dependency in 
selection.toString().  We made some changes because of the "it's not 
what the user sees" issue, but it's a pretty complicated problem: 
because of CSS out-of-flows, "what the user sees" and "a DOM range" 
might have very little to do with each other.

You may want to read https://bugzilla.mozilla.org/show_bug.cgi?id=39098 
for some background on this part.
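
As a rough illustration of the out-of-flow mismatch described above, 
here is a toy Python model. It is only a sketch: the Box type, the "top" 
coordinate and both helper functions are hypothetical and do not 
correspond to any browser API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str  # text content of the element
    top: int   # assumed vertical position after layout, in px


# Document order. The second element is imagined to be absolutely
# positioned, so it renders far below its neighbours.
dom_order = [
    Box("Intro", top=0),
    Box("Footnote", top=900),  # out-of-flow
    Box("Body", top=20),
]


def range_text(boxes: List[Box], start: int, end: int) -> str:
    """Text covered by a DOM-style range: a contiguous slice in document order."""
    return " ".join(b.text for b in boxes[start:end])


def visual_text(boxes: List[Box], top_min: int, top_max: int) -> str:
    """Text the user sees inside a vertical band of the page, top to bottom."""
    seen = sorted(
        (b for b in boxes if top_min <= b.top <= top_max),
        key=lambda b: b.top,
    )
    return " ".join(b.text for b in seen)


dom_selection = range_text(dom_order, 0, 3)        # range from Intro to Body
visible_selection = visual_text(dom_order, 0, 100)  # what looks selected
```

The range from "Intro" to "Body" necessarily contains "Footnote", even 
though the user dragging over the top of the page never saw it between 
the two.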

> * Currently the algorithm ignores generated content, matching all
> browsers.  I think it should generally include generated content,
> because that's what's visible to the user.  Would implementers be
> willing to do this?

Generated content is tough, because there's no way to capture it with 
DOM ranges.  So if you're using DOM ranges to represent your selections, 
there's just no sane way to handle generated content.

I assume you also read the non-noise parts of 
https://bugzilla.mozilla.org/show_bug.cgi?id=12460 (but they mostly say 
what I said above).
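
A small sketch of why ranges cannot address generated content, again as 
a toy Python model. The "before" field is a hypothetical stand-in for 
CSS ::before content, and range_to_string is a simplified illustration, 
not the DOM algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextNode:
    data: str         # characters that exist in the DOM
    before: str = ""  # stand-in for generated content; not part of the DOM


nodes = [
    TextNode("Chapter one", before="1. "),
    TextNode("Chapter two", before="2. "),
]


def rendered(nodes: List[TextNode]) -> List[str]:
    """What the user sees: generated content followed by the DOM text."""
    return [n.before + n.data for n in nodes]


def range_to_string(nodes: List[TextNode],
                    start: Tuple[int, int],
                    end: Tuple[int, int]) -> str:
    """Boundary points are (node index, offset into that node's data)."""
    (si, so), (ei, eo) = start, end
    if si == ei:
        return nodes[si].data[so:eo]
    parts = [nodes[si].data[so:]]
    parts += [n.data for n in nodes[si + 1:ei]]
    parts.append(nodes[ei].data[:eo])
    return "".join(parts)


# The widest possible range: from offset 0 of the first node
# to the end of the last node.
whole = range_to_string(nodes, (0, 0), (1, len(nodes[1].data)))
```

Offset 0 is the earliest boundary point a node has, and it already sits 
after the generated "1. ", so no (node, offset) pair can start or end 
inside the generated text.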

> * Are there any important special cases in how browsers behave that my
> tests omit?  I haven't tested all the display types yet, for example.

Looking briefly over the code we use to serialize to text for copy/paste 
(but also for other purposes, so this code has several different modes, 
which complicates things), there's stuff there to deal specially with 
tabs, nested ordered lists, <h*> vertical spacing and indentation, 
non-breaking spaces, blockquote (especially of type="cite"), 
noscript/noframes/iframe, [...], <pre> (especially inside blockquotes), 
<tr>, <td>/<th>, <dl>/<dt>, [...] (nesting level affects whether pretty 
line-wrapping happens or something like that), <q>, tags that are 
"block-level" in the HTML4 sense, [...] and [...], <code>, [...] and 
[...], [...] and [...], [...].

Plus there's the black magic about when to rewrap things and when to 
preserve the original whitespace or whatnot.

See 
http://hg.mozilla.org/mozilla-central/file/1c2d53a2dcfb/content/base/src/nsPlainTextSerializer.cpp 
for details.
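
To give a feel for the kind of element-specific handling listed above, 
here is a deliberately tiny serializer in Python built on the standard 
html.parser module. It is only a sketch and is not a port of 
nsPlainTextSerializer: the class name, the BLOCK set and every rule in 
it are assumptions chosen for illustration. It covers just three of the 
cases: line breaks around block-level tags, whitespace kept inside 
<pre>, and numbering plus indentation for nested ordered lists.

```python
import re
from html.parser import HTMLParser

# Assumed subset of tags treated as block-level in this sketch.
BLOCK = {"p", "div", "h1", "h2", "h3", "blockquote", "pre", "ol", "li", "tr"}


class ToyTextSerializer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = [""]    # output lines; the last one is still being built
        self.pre_depth = 0   # > 0 while inside <pre>
        self.counters = []   # one item counter per open <ol>

    def _break(self):
        # Start a new line unless the current one is still empty.
        if self.lines[-1] != "":
            self.lines.append("")

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            self._break()
        if tag == "pre":
            self.pre_depth += 1
        elif tag == "ol":
            self.counters.append(0)
        elif tag == "li" and self.counters:
            self.counters[-1] += 1
            indent = "  " * (len(self.counters) - 1)
            self.lines[-1] = f"{indent}{self.counters[-1]}. "
        elif tag == "br":
            self.lines.append("")

    def handle_endtag(self, tag):
        if tag == "pre":
            self.pre_depth = max(0, self.pre_depth - 1)
        elif tag == "ol" and self.counters:
            self.counters.pop()
        if tag in BLOCK:
            self._break()

    def handle_data(self, data):
        if self.pre_depth:
            # Keep line breaks and leading whitespace as written.
            parts = data.split("\n")
            self.lines[-1] += parts[0]
            self.lines.extend(parts[1:])
            return
        # Outside <pre>: collapse runs of whitespace to one space.
        text = re.sub(r"\s+", " ", data)
        if self.lines[-1] == "" or self.lines[-1].endswith(" "):
            text = text.lstrip(" ")
        self.lines[-1] += text

    def text(self):
        # Trailing whitespace is dropped from every line, including <pre>.
        lines = [ln.rstrip() for ln in self.lines]
        while lines and lines[-1] == "":
            lines.pop()
        return "\n".join(lines)


def to_plaintext(html: str) -> str:
    parser = ToyTextSerializer()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, to_plaintext("<ol><li>x<ol><li>y</li></ol></li><li>z</li></ol>") 
yields three lines: "1. x", "  1. y" and "2. z". Even this toy needs 
per-tag state, which hints at why the real code has several modes.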

I should note that it's not clear to me how much we want to standardize 
what browsers actually copy when the user copies.  This seems like 
something that users may want to configure and where we want to let 
browsers experiment with heuristics and such; I have a really hard time 
believing that the current browser behavior here is the best we can do.

That leaves the question of whether Selection.toString should produce 
the same string as the user copying and pasting would, of course. 
Perhaps it shouldn't.  I'm not sure we'd want to make what toString 
produces depend on new CSS layout modes, for example, since that could 
break scripts... but the user-facing copied text might want to depend on 
those.  Something like CSS3 Template Layout is even less amenable to 
having selections represented as a range than what people do now with 
out-of-flows.

-Boris