[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

Boris Zbarsky bzbarsky at MIT.EDU
Thu Feb 3 12:15:38 PST 2011


On 2/2/11 7:50 PM, Aryeh Gregor wrote:
> On Wed, Feb 2, 2011 at 5:30 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
>> This doesn't work for disconnected subtrees.  Or rather, it presupposes
>> certain things about the browser's architecture that I don't think we want
>> to presuppose.
>
> Specifically what?  That browsers might not resolve CSS for
> disconnected subtrees?

Indeed.

> Note that AFAICT, WebKit treats innerText like
> textContent for such subtrees

OK...  See, that's the sort of behavior change for a DOM API that I 
don't think we should have.  Why do we want a DOM API which looks like a 
way to serialize the DOM but actually works totally differently in 
disconnected subtrees and a displayed document?
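
To make the concern concrete, here is a toy sketch of an API whose result changes depending on whether styles have been resolved. It is not any browser's algorithm: the Element class, the hidden flag (standing in for display:none), and the styles_resolved parameter are all made up for illustration. The fallback to a textContent-like walk mirrors the WebKit behavior described above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """Toy stand-in for a DOM element; 'hidden' models display:none."""
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)
    hidden: bool = False


def text_content(node: Union[Element, str]) -> str:
    """Pure DOM serialization: concatenate all descendant text."""
    if isinstance(node, str):
        return node
    return "".join(text_content(child) for child in node.children)


def rendered_text(node: Union[Element, str], styles_resolved: bool) -> str:
    """Hypothetical style-dependent serialization.

    Without resolved styles (the disconnected case) it falls back to
    text_content; with them, hidden subtrees are skipped.
    """
    if not styles_resolved:
        return text_content(node)
    if isinstance(node, str):
        return node
    if node.hidden:
        return ""
    return "".join(rendered_text(child, True) for child in node.children)


tree = Element("div", ["visible ", Element("span", ["secret"], hidden=True)])

connected = rendered_text(tree, styles_resolved=True)       # "visible "
disconnected = rendered_text(tree, styles_resolved=False)   # "visible secret"
```

The same tree yields two different strings, which is exactly the kind of split behavior being objected to.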

> and Gecko returns the empty string when
> you stringify a Selection that's not displayed.  This seems
> unreasonable from an author perspective

Well, what exactly would "reasonable" be?

>> That may be ok for Selection (though not sure it is for programmatic ones;
>> see https://bugzilla.mozilla.org/show_bug.cgi?id=585229), but I fail to see
>> why it's OK for a DOM property like innerText.
>
> In WebKit, innerText is essentially the same as selecting the node and
> stringifying the Selection

Yes, I understand that's what WebKit does.  I just think it's a terrible 
idea.

>> Note that until recently Gecko had no such dependency in
>> selection.toString().   We made some changes because of the "it's not what
>> the user sees" issue, but it's a pretty complicated problem, because due to
>> CSS out-of-flows "what the user sees" and "a DOM range" might have very
>> little to do with each other.
>>
>> You may want to read https://bugzilla.mozilla.org/show_bug.cgi?id=39098 for
>> some background on this part.
>
> What do you mean by "out-of-flows"?

Floating and absolutely positioned (in the CSS spec sense) elements.
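
As a rough illustration of how DOM order and visual order can diverge, consider the toy model below. It assumes a made-up Box type where abs_top is a hypothetical vertical offset for an absolutely positioned box, and in-flow boxes simply stack at a fixed line height. Real layout is far more involved; this only shows the shape of the mismatch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    text: str
    # Hypothetical: vertical offset if the box is absolutely positioned,
    # None if the box is in normal flow.
    abs_top: Optional[int] = None


def dom_order_text(boxes: List[Box]) -> str:
    """What a DOM range covering all boxes would serialize to."""
    return " ".join(box.text for box in boxes)


def visual_order_text(boxes: List[Box], line_height: int = 20) -> str:
    """What a reader scanning top to bottom would see."""
    placed: List[Tuple[int, int, str]] = []
    y = 0
    for index, box in enumerate(boxes):
        if box.abs_top is None:
            # In-flow boxes stack in DOM order and advance the flow position.
            placed.append((y, index, box.text))
            y += line_height
        else:
            # Out-of-flow boxes take no space in the flow.
            placed.append((box.abs_top, index, box.text))
    placed.sort()
    return " ".join(text for _, _, text in placed)


boxes = [Box("footer note", abs_top=100), Box("first"), Box("second")]

dom_text = dom_order_text(boxes)        # "footer note first second"
visual_text = visual_order_text(boxes)  # "first second footer note"
```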

> Clearly we can't do better than
> just an approximation here, since we're not going to handle stuff like
> absolute positioning and so on.

Whyever not?  I think browsers should be allowed to try to handle it in 
their selection implementations if they want to try!

>> Generated content is tough, because there's no way to capture it with DOM
>> ranges.  So if you're using DOM ranges to represent your selections, there's
>> just no sane way to handle generated content.
>
> From a UI perspective it's weird, yeah, but it doesn't seem hard.
> You'd have to have the selection jump, so that it includes either the
> whole stretch of generated content or none of it.

I'm not sure how well that would work with some of the CSS3 generated 
content proposals.

> This is the way the
> UI looks in Gecko right now for images that are displaying their alt
> text

Note that the "UI" you're looking at there is basically an accident.  ;)

> From a programmatic perspective, it's also fairly straightforward to
> see how it would work, as long as you don't demand that it be possible
> to partially select generated content.

Why wouldn't you, if you can select it at all?

> This occurred to me too.  It seems like a must to standardize how
> innerText and Selection.toString() behave, because those are visible
> to script and pretty widely used, and the interop story right now is
> terrible.  Of course, there's nothing to stop implementations from
> experimenting and passing the improvements back to the spec.

I suspect that once we standardize those, they will be frozen (see 
below).  So at that point, Selection.toString and actually copying will 
diverge.

I should note that Gecko doesn't support innerText, and we haven't had a 
single bug report about it not working or request to implement it in the 
last 4 years.  So I question how widely used it is...  Maybe it's 
useful, but I'd need to understand the use cases first.  What are they?

>> That leaves the question of whether Selection.toString should produce the
>> same string as the user copying and pasting would, of course. Perhaps it
>> shouldn't.  I'm not sure we'd want to make what toString produces depend on
>> new CSS layout modes, for example, since that could break scripts... but the
>> user-facing copied text might want to depend on those.
>
> I'm not sure why it would break many existing pages if it only kicks
> in with new layout modes.

Sure they will.  The issue will be that browsers that support the new 
layout mode will return one thing while browsers that are getting the 
fallback layout will return something else.  So you'll get lack of 
interoperability, and breakage in whichever browsers the page author 
happened not to test in.
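
A minimal sketch of that failure mode, assuming a toString whose output follows layout (which is the very thing under debate, not established behavior). The order value stands in for a hypothetical reordering layout property, and supports_new_layout is a made-up switch for whether a given browser implements it.

```python
from typing import List, Tuple

# (text, order) pairs; 'order' models a hypothetical property of a new
# layout mode that rearranges content visually.
Item = Tuple[str, int]


def selection_to_string(items: List[Item], supports_new_layout: bool) -> str:
    """Serialize in visual order if the layout mode is supported,
    otherwise in DOM order (the fallback layout)."""
    if supports_new_layout:
        items = sorted(items, key=lambda item: item[1])
    return " ".join(text for text, _ in items)


items = [("world", 2), ("hello", 1)]

new_browser = selection_to_string(items, supports_new_layout=True)   # "hello world"
old_browser = selection_to_string(items, supports_new_layout=False)  # "world hello"
```

A script that compares the result against "hello world" would pass in one browser and fail in the other, even though neither browser has a bug.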

At least assuming anyone actually cares about the details of the values 
Selection.toString() produces.  And if no one does, then we shouldn't be 
standardizing them, imo.

-Boris


