[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())
bzbarsky at MIT.EDU
Wed Feb 2 14:30:36 PST 2011
On 2/2/11 4:51 PM, Aryeh Gregor wrote:
> I've based the spec entirely on CSS, with no reference to
> specific HTML elements, because this matches what the user sees.
This doesn't work for disconnected subtrees. Or rather, it presupposes
certain things about the browser's architecture that I don't think we
want to presuppose.
That may be OK for Selection (though I'm not sure it is for programmatic
ones; see https://bugzilla.mozilla.org/show_bug.cgi?id=585229), but I
fail to see why it's OK for a DOM property like innerText.
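To make the disconnected-subtree concern concrete, here is a minimal TypeScript sketch. It is a toy model, not any browser's real architecture: ToyNode, computedDisplay and cssBasedText are hypothetical names, and the assumption that style is resolved only for nodes attached to a document is just one possible engine design. Under that assumption, a purely CSS-based algorithm has no input to work from until the subtree is inserted.

```typescript
// Toy model only: every name below is hypothetical, not a real DOM API.
type Display = "block" | "inline" | "none";

interface ToyNode {
  text: string | undefined;      // set for text nodes only
  display: Display | undefined;  // declared display, if any
  children: ToyNode[];
  parent: ToyNode | null;
  isDocumentRoot: boolean;
}

function makeElement(display?: Display, isDocumentRoot = false): ToyNode {
  return { text: undefined, display, children: [], parent: null, isDocumentRoot };
}

function makeText(text: string): ToyNode {
  return { text, display: undefined, children: [], parent: null, isDocumentRoot: false };
}

function append(parent: ToyNode, child: ToyNode): void {
  child.parent = parent;
  parent.children.push(child);
}

// A node is connected if walking up its parents reaches a document root.
function isConnected(node: ToyNode): boolean {
  let cur: ToyNode | null = node;
  while (cur !== null) {
    if (cur.isDocumentRoot) return true;
    cur = cur.parent;
  }
  return false;
}

// ASSUMPTION: this toy engine resolves style only for connected nodes.
function computedDisplay(node: ToyNode): Display | undefined {
  if (!isConnected(node)) return undefined;
  return node.display ?? "inline";
}

// A text conversion defined purely in terms of computed style.
// Returns undefined when an element has no style data available.
function cssBasedText(node: ToyNode): string | undefined {
  if (node.text !== undefined) return node.text;
  const display = computedDisplay(node);
  if (display === undefined) return undefined;
  if (display === "none") return "";
  let out = "";
  for (const child of node.children) {
    const part = cssBasedText(child);
    if (part === undefined) return undefined;
    out += part;
  }
  return out;
}

const doc = makeElement(undefined, true);
const div = makeElement();
append(div, makeText("visible "));
const hidden = makeElement("none");
append(hidden, makeText("hidden"));
append(div, hidden);

const detachedResult = cssBasedText(div); // subtree not yet in the document
append(doc, div);
const attachedResult = cssBasedText(div); // same subtree, now connected
```

In this model the detached call has nothing to compute from, while the attached call can skip the display: none child. An engine could instead resolve style for detached nodes on demand, but the spec would then be requiring that architecture.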
> A CSS dependency is unavoidable anyway because of things like display:
> none, so I see no reason to minimize it.
Note that until recently Gecko had no such dependency in
selection.toString(). We made some changes because of the "it's not
what the user sees" issue, but it's a pretty complicated problem,
because due to CSS out-of-flows "what the user sees" and "a DOM range"
might have very little to do with each other.
You may want to read https://bugzilla.mozilla.org/show_bug.cgi?id=39098
for some background on this part.
> * Currently the algorithm ignores generated content, matching all
> browsers. I think it should generally include generated content,
> because that's what's visible to the user. Would implementers be
> willing to do this?
Generated content is tough, because there's no way to capture it with
DOM ranges. So if you're using DOM ranges to represent your selections,
there's just no sane way to handle generated content.
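The mismatch can be sketched in TypeScript as follows. A DOM range boundary point is a (node, offset) pair, and generated content has no DOM node behind it. Everything else here is a toy model with hypothetical names (RenderedRun, boundaryForRenderedIndex), not how any engine actually lays out text.

```typescript
// Toy model only: type and function names are hypothetical.
interface TextNodeRef {
  id: string;
  data: string;
}

// A piece of rendered text: either backed by a DOM text node,
// or produced by CSS generated content (e.g. ::before).
type RenderedRun =
  | { kind: "dom"; node: TextNodeRef }
  | { kind: "generated"; content: string };

// A DOM range boundary point is a (node, offset) pair.
interface BoundaryPoint {
  node: TextNodeRef;
  offset: number;
}

function runText(run: RenderedRun): string {
  return run.kind === "dom" ? run.node.data : run.content;
}

// Map an index into the rendered text back to a boundary point.
// Returns null when the index is out of bounds or falls inside
// generated content, since there is no node for the boundary to name.
function boundaryForRenderedIndex(
  runs: RenderedRun[],
  index: number
): BoundaryPoint | null {
  if (index < 0) return null;
  let start = 0;
  for (const run of runs) {
    const len = runText(run).length;
    if (index < start + len) {
      return run.kind === "dom"
        ? { node: run.node, offset: index - start }
        : null;
    }
    start += len;
  }
  return null;
}

// Rendered text is "Note: hello"; only "hello" exists in the DOM.
const hello: TextNodeRef = { id: "t1", data: "hello" };
const runs: RenderedRun[] = [
  { kind: "generated", content: "Note: " },
  { kind: "dom", node: hello },
];

const insideGenerated = boundaryForRenderedIndex(runs, 0);
const insideDomText = boundaryForRenderedIndex(runs, 8);
```

A selection that visually starts inside "Note: " has no boundary point to start from, so a range-based selection can only include the generated text all-or-nothing via some extra rule, or not at all.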
I assume you also read the non-noise parts of
https://bugzilla.mozilla.org/show_bug.cgi?id=12460 (but they mostly say
what I said above).
> * Are there any important special cases in how browsers behave that my
> tests omit? I haven't tested all the display types yet, for example.
Looking briefly over the code we use to serialize to text for copy/paste
(but also for other purposes, so this code has several different modes,
which complicates things), there's stuff there to deal specially with
tabs, nested ordered lists, <h*> vertical spacing and indentation,
non-breaking spaces, blockquote (especially of type="cite"),
noscript/noframes/iframe, <p>, <pre> (especially inside blockquotes),
<tr>, <td>/<th>, <dl>/<dt>, <span> (nesting level affects whether pretty
line-wrapping happens or something like that), <q>, tags that are
"block-level" in the HTML4 sense, <sup> and <sub>, <code>, <strong> and
<b>, <em> and <i>, <u>.
Plus there's the black magic about when to rewrap things and when to
preserve the original whitespace or whatnot.
I should note that it's not clear to me how much we want to standardize
what browsers actually copy when the user copies. This seems like
something that users may want to configure and where we want to let
browsers experiment with heuristics and such; I have a really hard time
believing that the current browser behavior here is the best we can do.
That leaves the question of whether Selection.toString should produce
the same string as the user copying and pasting would, of course.
Perhaps it shouldn't. I'm not sure we'd want to make what toString
produces depend on new CSS layout modes, for example, since that could
break scripts... but the user-facing copied text might want to depend on
those. Something like CSS3 Template Layout is even less amenable to
having selections represented as a range than what people do now with