[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

Aryeh Gregor Simetrical+w3c at gmail.com
Wed Feb 2 13:51:51 PST 2011


I'm currently writing a specification for converting part of a DOM to
plaintext, with the eventual goal of precisely specifying innerText
and Selection.toString().  (So far it's written only for innerText,
but adapting it to Selection is a detail.)  I'd appreciate feedback on
what I've got so far, particularly from Mozilla and WebKit people who
know about their respective implementations of this.  (I've CC'd two
Mozillians that roc recommended.)

Spec: http://aryeh.name/spec/innertext/innertext.html
Tests: http://aryeh.name/spec/innertext/test/innerText.html
Latest committed version:
http://aryeh.name/gitweb.cgi?p=innertext;a=tree (the live versions
might break sometimes as I change them)

The tests aren't really meant to be the sort of thing you'd find in a
test suite; they're just a way to inspect how browsers behave.  I've
written the algorithm to match browser behavior where possible,
particularly Gecko, WebKit, and IE8 (IE9 and Opera have fairly insane
implementations of innerText).  In addition to general feedback, I'd
particularly like input on these:

* It seems like WebKit works mostly based on CSS, while Gecko more
often bases behavior on the element.  E.g., WebKit treats <div
style=white-space:pre> much like <pre>, while Gecko treats it like
<div>.  I've based the spec entirely on CSS, with no reference to
specific HTML elements, because this matches what the user sees.  A
CSS dependency is unavoidable anyway because of things like display:
none, so I see no reason to minimize it.

* Currently the algorithm ignores generated content, matching all
browsers.  I think it should generally include generated content,
because that's what's visible to the user.  Would implementers be
willing to do this?  (If generated content is included, I won't need
to special-case <br>, given the HTML5 default stylesheet.)

* Are there any important special cases in how browsers behave that my
tests omit?  I haven't tested all the display types yet, for example.
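
As a rough illustration of the first two points, here is a minimal
sketch of a conversion that looks only at computed style and never at
the element name.  It is not the algorithm from the draft spec: the
StyledNode model, the toPlaintext function, and the includeGenerated
flag are all hypothetical names made up for this example, the "before"
field is only a stand-in for ::before generated content, and ending
every block with a single newline is a deliberate simplification.
Whitespace collapsing across node boundaries is ignored as well.

```typescript
// Hypothetical, simplified node model (NOT the real DOM): each element
// carries only the computed-style values this sketch looks at.
type StyledNode =
  | { kind: "text"; data: string }
  | {
      kind: "element";
      display: "block" | "inline" | "none";
      whiteSpace?: "normal" | "pre"; // undefined means "inherit from parent"
      before?: string;               // stand-in for ::before generated content
      children: StyledNode[];
    };

// Convert a subtree to plaintext using style only; no element names are consulted.
function toPlaintext(
  node: StyledNode,
  includeGenerated = false,
  inheritedPre = false,
): string {
  if (node.kind === "text") {
    // Preserve whitespace under white-space: pre, otherwise collapse runs of it.
    return inheritedPre ? node.data : node.data.replace(/\s+/g, " ");
  }
  // display: none removes the whole subtree from the output.
  if (node.display === "none") return "";

  // white-space is inherited, so pass the effective value down to children.
  const pre =
    node.whiteSpace === undefined ? inheritedPre : node.whiteSpace === "pre";

  let out = includeGenerated && node.before !== undefined ? node.before : "";
  for (const child of node.children) {
    out += toPlaintext(child, includeGenerated, pre);
  }
  // Simplifying assumption: every block ends with exactly one newline.
  return node.display === "block" ? out + "\n" : out;
}

// Small helper and sample trees for trying the function out.
const text = (data: string): StyledNode => ({ kind: "text", data });

// Models <div style=white-space:pre>: whitespace is preserved, as for <pre>.
const styledDiv: StyledNode = {
  kind: "element", display: "block", whiteSpace: "pre",
  children: [text("a  b\nc")],
};
// Models an ordinary <div>: whitespace collapses.
const plainDiv: StyledNode = {
  kind: "element", display: "block", children: [text("a  b\nc")],
};
// Models an element with display: none.
const hidden: StyledNode = {
  kind: "element", display: "none", children: [text("x")],
};
// Models an inline element with ::before { content: "> " }.
const quoted: StyledNode = {
  kind: "element", display: "inline", before: "> ", children: [text("q")],
};
```

With this model, toPlaintext(styledDiv) keeps the double space and the
line break, toPlaintext(plainDiv) collapses them, and
toPlaintext(quoted, true) includes the generated "> " prefix while
toPlaintext(quoted) does not.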

Keep in mind that the spec is probably not close to finished, and it
could be rewritten from scratch if necessary.

