[whatwg] converting word (was <code> attributes)

Adrian Sutton adrian.sutton at ephox.com
Fri May 1 04:22:32 PDT 2009

On 01/05/2009 12:03, "Charles McCathieNevile" <chaals at opera.com> wrote:
> This is an oversimplification to the point of being misleading.
> There are many ways to use Word, and many people and organisations with
> half a clue use it in such a way that automatic conversion can be
> relatively easily used to generate highly semantically rich and valid
> markup - much better than the sort of tripe one typically finds on the Web
> today.

I can verify this with a lot of real-world experience.  Ephox has been
selling a WYSIWYG editor for around 10 years now and the single most popular
feature is its ability to copy and paste clean HTML from Microsoft Word.
The resulting HTML brings over the structure of the document (headings,
tables, lists, images etc) but not the inline formatting so the content then
matches the CSS.  The formatting and styles can optionally be brought over
as well but in the last 3-5 years popular demand has been to match the site
stylesheet rather than the original formatting.
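
As a rough illustration of the "keep the structure, drop the inline
formatting" idea (a minimal sketch using only Python's standard library,
not the code any real editor uses): the names WordCleaner and
clean_word_html, and the choice of which tags and attributes to drop, are
assumptions made for this example.

```python
from html import escape
from html.parser import HTMLParser

# Assumed policy: tags that are unwrapped (markup dropped, text kept).
UNWRAP_TAGS = {"span", "font"}
# Assumed policy: attributes that carry presentation rather than structure.
DROP_ATTRS = {"style", "class", "lang", "align"}
# Void elements never get a closing tag in the output.
VOID_TAGS = {"br", "img", "hr"}


class WordCleaner(HTMLParser):
    """Re-emit HTML, keeping structural tags and dropping inline styling."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _skip(self, tag):
        # Namespaced tags such as <o:p> are treated as Word-specific.
        return ":" in tag or tag in UNWRAP_TAGS

    def handle_starttag(self, tag, attrs):
        if self._skip(tag):
            return
        kept = [(k, v) for k, v in attrs
                if k not in DROP_ATTRS and ":" not in k]
        rendered = "".join(
            f' {k}="{escape(v or "", quote=True)}"' for k, v in kept
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> like <br>; no separate closing tag is emitted.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if self._skip(tag) or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean_word_html(html):
    parser = WordCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Headings, paragraphs, tables, lists and links pass through unchanged, so
the site stylesheet decides how they look. A real implementation would need
a far longer list of Word-specific cases than this sketch covers.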

The biggest challenge in this is actually removing the huge amount of inline
formatting and proprietary tags/attributes that Microsoft Word adds.  In the
latest versions it's also a challenge to put lists back together as actual
HTML lists since Word has started exporting them as paragraphs with a bullet
from the symbol font and lots of nbsps.
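
To give a feel for the list problem, here is a minimal sketch under
simplifying assumptions: the paragraphs have already been reduced to plain
text, only one level of bulleted list exists, and the bullet shows up as
one of a few guessed glyphs followed by non-breaking spaces. The function
name rebuild_lists and the glyph set are hypothetical, chosen for this
example only.

```python
import re
from html import escape

# Assumed marker: a bullet-like glyph (middle dot, bullet, or a
# private-use Symbol-font code point) followed by nbsps or other whitespace.
BULLET_RE = re.compile(r"^[\u00b7\u2022\uf0b7][\u00a0\s]+")


def rebuild_lists(paragraphs):
    """Group consecutive bullet-like paragraphs into a single <ul>."""
    out = []
    in_list = False
    for text in paragraphs:
        match = BULLET_RE.match(text)
        if match:
            if not in_list:
                out.append("<ul>")
                in_list = True
            # Drop the fake bullet and its padding, keep the item text.
            item = text[match.end():].strip()
            out.append(f"<li>{escape(item)}</li>")
        else:
            if in_list:
                out.append("</ul>")
                in_list = False
            out.append(f"<p>{escape(text.strip())}</p>")
    if in_list:
        out.append("</ul>")
    return "".join(out)
```

Numbered and nested lists are much harder than this, since the nesting
level has to be inferred from indentation or other hints in the markup.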

Pretty much every editor I know of can preserve the semantic information
from Word when copying and pasting.  Where they vary is in how well they
strip out the inline formatting and proprietary tags, with most doing a
fairly poor job of this second part.


Adrian Sutton.
Adrian Sutton, CTO
UK: +44 1 753 27 2229  US: +1 (650) 292 9659 x717
Ephox <http://www.ephox.com/>
Ephox Blogs <http://planet.ephox.com/>, Personal Blog
