MS Word copy-pasting

by Gisle Hannemyr

This chapter is a case study of how to implement copy-paste from MS Word into a Drupal node.

Drupal projects discussed in this chapter: CKEditor, HNM MS Word, Paste Format.

Introduction

This chapter describes how to configure CKEditor to filter and fix text copy-pasted from MS Word.

I do this with a custom module named HNM MS Word.

HNM MS Word uses JSON to clean up text you copy-paste into your content from MS Word documents, web pages, email clients, etc. The main clean-up happens on paste. There is also a secondary clean-up that takes place when the node is saved.

HNM MS Word may be set up to clean pasted text with a different text format than the one used when saving the node.

Other tools

Drupal tools:
    Drupal project: DOC to HTML: https://www.drupal.org/project/doc_to_html
    CKEditor plugin: paste from Word: https://ckeditor.com/docs/ckeditor4/latest/examples/pastefromword.html

Independent solutions:
    Documentconverter Pro: Word cleaner: https://wordcleaner.com/
    Pandoc.org: Pandoc: https://pandoc.org/

MS Word copy-paste challenges

Just copy-pasting from MS Word into a CKEditor body field produces a result with two problems:

  1. The HTML is full of garbage markup.
  2. Multi-level numbered headings are transformed into ordered lists.

The garbage markup can be cleaned up by filtering.

The six heading styles of MS Word (Heading 1-6) copy-paste as HTML h1-h6 if they are unnumbered. However, if MS Word is set up to use multilevel numbered headings, copy-pasting directly from MS Word into CKEditor produces some really weird markup that abuses ordered lists, so …

6.5.4.3.2.1 Heading level 6

becomes

<ol>
 <li>
 <ol>
  <li>
  <ol>
   <li>
   <ol>
    <li>
    <ol>
     <li>
     <ol>
      <li>Heading level 6</li>
     </ol>
     </li>
    </ol>
    </li>
   </ol>
   </li>
  </ol>
  </li>
 </ol>
 </li>
</ol>

This is hard to reverse. However, saving the MS Word file as .htm (“Web Page, Filtered”) produces a format that translates Word headings into standard HTML headings. The HNM MS Word module relies on opening this file in a browser and copy-pasting from there. Copy-pasting directly from MS Word does not work.
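For illustration only, the sketch below shows what reversing the nested list markup above would involve in the simplest possible case: a single chain of nested lists ending in one item. It is not part of HNM MS Word, and the function name is hypothetical. It counts how many ol elements enclose the innermost li and uses that depth as the heading level. The numbering is lost, and any list with sibling items is returned unchanged, which suggests why starting from the filtered .htm file is the easier route.

```php
<?php
/**
 * Hypothetical helper (not from HNM MS Word): turn a single chain of
 * nested <ol><li> elements into a heading whose level is the nesting depth.
 * Returns the input unchanged if it is not such a chain.
 */
function example_nested_list_to_heading(string $html): string {
  $doc = new DOMDocument();
  libxml_use_internal_errors(true);
  // The XML declaration prefix makes loadHTML() treat the input as UTF-8.
  $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
  libxml_clear_errors();

  $xpath = new DOMXPath($doc);
  // Innermost items are <li> elements that contain no further <ol>.
  $items = $xpath->query('//li[not(.//ol)]');
  if ($items->length !== 1) {
    return $html;
  }
  $li = $items->item(0);

  // Heading level = number of <ol> ancestors of the innermost item.
  $depth = (int) $xpath->evaluate('count(ancestor::ol)', $li);
  if ($depth < 1 || $depth > 6) {
    return $html;
  }
  $text = htmlspecialchars(trim($li->textContent));
  return sprintf('<h%1$d>%2$s</h%1$d>', $depth, $text);
}
```

Applied to the six-level example above, the helper would yield a single h6 element containing the heading text.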

Configuring the input text filter

Start by installing and enabling the modules CKEditor and HNM MS Word.

Create a text format named “MS Word Filter” by navigating to Configuration » Content authoring » Text formats » Add text format.

The “MS Word Filter” will clean out most of the MS Word gunk, but will leave just enough to allow us to do the processing described below.

HTML elements and attributes:

a[!href|title],
div[align<center],
p[align<center],
h1,h2,h3,h4,h5,h6,
em,i,strong,b,u,strike,s,code,
blockquote,pre,address,sub,sup,
ul,ol,li,hr,br,
table,tbody,caption,tr,td
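The element list above acts as an allowlist. As a rough illustration of the idea, and not the text format's actual implementation, plain PHP strip_tags() can drop every tag that is not named in the list. The function name is hypothetical, and unlike the rules above this sketch does not restrict attributes, so class or style attributes on allowed elements would survive.

```php
<?php
/**
 * Hypothetical helper: keep only the elements named in the rule list.
 * Attribute rules such as a[!href|title] are NOT enforced here.
 */
function example_msword_allowlist(string $html): string {
  $allowed = [
    'a', 'div', 'p',
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
    'em', 'i', 'strong', 'b', 'u', 'strike', 's', 'code',
    'blockquote', 'pre', 'address', 'sub', 'sup',
    'ul', 'ol', 'li', 'hr', 'br',
    'table', 'tbody', 'caption', 'tr', 'td',
  ];
  // strip_tags() accepts the allowlist as a string such as "<a><p>".
  return strip_tags($html, '<' . implode('><', $allowed) . '>');
}
```

The text inside removed elements such as span or font is kept; only the tags themselves are dropped.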

Additional settings:

Then set up HNM MS Word:

To configure the module, navigate to Configuration » Content authoring » HNM MS Word.

Select the “MS Word Filter” text format to clean up pasted text.

Enable the “Clean-up alert” to confirm that the setup is working.

Press “Save configuration”.

Note: This setup assumes that content creators can be trusted, as there is no limitation on the content that may be created. To secure the format, install the WYSIWYG module (which seems to filter style no matter what).

About HNM MS Word

HNM MS Word is a modified version of Paste Format. The modifications are in the file hnm_msword.pvn.inc. It contains three functions:

/**
 * Implements hook_form_FORM_ID_alter() for the node form.
 *
 * Alters form so that title is not a required field (we are going to
 * pull it from what is copy-pasted).
 */
function hnm_msword_form_node_form_alter(&$form, &$form_state, $form_id)

/**
 * Implements hook_node_presave().
 *
 * Main cleanup functions.
 */
function hnm_msword_node_presave($node)

/**
 * Called from function hnm_msword_cleanup() to clean up just before
 * completing cleanup.
 */
function hnm_msword_mswordclean(&$txt)
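The bodies of these functions are not reproduced in this chapter. As a rough sketch of what the first two could involve in Drupal 7, based only on their doc comments and not on the module's actual code: the form alter clears the #required flag on the title element, and a helper (hypothetical, as is the "example" prefix) pulls a title from the first heading in the pasted body so that a presave hook could fill it in.

```php
<?php
/**
 * Sketch of a node form alter that makes the title optional.
 * "example" is a placeholder module name, not the real module.
 */
function example_form_node_form_alter(array &$form, array &$form_state, $form_id) {
  $form['title']['#required'] = FALSE;
}

/**
 * Hypothetical helper a presave hook could call: return the text of the
 * first HTML heading in $body, or an empty string if there is none.
 */
function example_title_from_body(string $body): string {
  if (preg_match('#<h([1-6])\b[^>]*>(.*?)</h\1>#is', $body, $m)) {
    // Drop inline markup and decode entities to get a plain-text title.
    return trim(html_entity_decode(strip_tags($m[2]), ENT_QUOTES, 'UTF-8'));
  }
  return '';
}
```

An empty return value signals that no heading was found, in which case a real implementation would need some fallback for the title.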

The file hnm_msword.module is also slightly modified:

Tests

Adapting Article for Personvernnemnda

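Rename “Article” to “Vedtak” (Norwegian for “Decision”). Description: “Denne innholdstypen skal benyttes for vedtak som skal legges ut på Internett” (“This content type is to be used for decisions that will be published on the Internet”).

Change title label to “Saksnummer og navn” (“Case number and name”).

Explanation: “La feltene «Saksnummer og navn», «Datatilsynets referanse» og «År» stå tomme eller uendret dersom du har kopiert teksten direkte fra MS Word. De vil bli fyllt ut automatisk når du lagrer.” (“Leave the fields «Case number and name», «Data Protection Authority's reference» and «Year» empty or unchanged if you have copied the text directly from MS Word. They will be filled in automatically when you save.”)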

The table below shows the fields in this content type:

The fields in “Vedtak”

Name (human/machine)                    Field type            Widget          Hidden
Påtegninger / field_anonymisert         List (text)           Radio buttons
Saksnummer og navn / title              Node module element
Klage / field_klagesammendrag           Text                  Text field
Datatilsynets referanse / field_dtref   Text                  Text field
År / field_year                         Integer               Text field      Yes
Body / body                             Long text             Text area
Tags / field_tags                       Term reference        Autocomplete

The tags field is retained from “Article”. It is currently not used for anything, but may be used for a taxonomy later.

Final word

While this is a custom module, about 60 % of the source code is lifted from the Paste Format project.

See also the DO forum post “How do I import a MS Word document into a Drupal node”.


Last update: 2023-06-11 [gh].