MS Word copy-pasting
This chapter is a case-study of how to implement copy-paste from MS Word into a Drupal node.
Drupal projects discussed in this chapter: CKEditor, HNM MS Word, Paste Format.
Introduction
This chapter describes how to configure CKEditor to filter and fix text copy-pasted from MS Word.
I do this using the custom module named HNM MS Word.
HNM MS Word uses JSON to clean up text that you copy-paste into your content from MS Word documents, web pages, email clients, etc. The main clean-up happens on paste. A secondary clean-up takes place when the node is saved.
HNM MS Word may be set up to use a different text format than the format used when saving the node.
Other tools
Drupal tools:
- Drupal project: DOC to HTML (https://www.drupal.org/project/doc_to_html)
- CKEditor plugin: paste from Word (https://ckeditor.com/docs/ckeditor4/latest/examples/pastefromword.html)
Independent solutions:
- Documentconverter Pro: Word cleaner (https://wordcleaner.com/)
- Pandoc.org: Pandoc (https://pandoc.org/)
MS Word copy-paste challenges
Copy-pasting directly from MS Word into a CKEditor body field causes two problems:
- The HTML is full of garbage markup.
- Multilevel numbered headings are transformed into ordered lists.
The garbage markup can be cleaned up by filtering.
The six heading styles of MS Word (Heading 1-6) paste as HTML h1-h6 if they are unnumbered. However, if MS Word is set up to use multilevel numbered headings, copy-pasting directly from MS Word into CKEditor produces markup that abuses nested ordered lists. For example,
6.5.4.3.2.1 Heading level 6
becomes
    <ol>
      <li>
        <ol>
          <li>
            <ol>
              <li>
                <ol>
                  <li>
                    <ol>
                      <li>
                        <ol>
                          <li>Heading level 6</li>
                        </ol>
                      </li>
                    </ol>
                  </li>
                </ol>
              </li>
            </ol>
          </li>
        </ol>
      </li>
    </ol>
This is hard to reverse. However, saving the MS Word file as .htm (“Web Page, Filtered”) produces a format that translates Word headings into standard HTML headings. The HNM MS Word module relies on showing this file in a browser and copy-pasting from it. Copy-pasting directly from MS Word does not work.
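To illustrate why the list markup is awkward to reverse, here is a minimal sketch that handles only the trivial case shown above: a chain of single-item ordered lists wrapping plain text. The function name `sketch_nested_ol_to_heading()` is hypothetical and not part of HNM MS Word. Real MS Word output also carries attributes, sibling items and body text inside the lists, which this sketch deliberately leaves untouched.

```php
<?php
/**
 * Hypothetical helper: collapse a chain of single-item ordered lists
 * into one heading whose level equals the nesting depth.
 */
function sketch_nested_ol_to_heading($html) {
  // A run of "<ol><li>" openers, plain text, then a run of "</li></ol>" closers.
  $pattern = '~^\s*((?:<ol>\s*<li>\s*)+)([^<]+)((?:\s*</li>\s*</ol>)+)\s*$~';
  if (!preg_match($pattern, $html, $m)) {
    // Anything more complex than the trivial case is returned unchanged.
    return $html;
  }
  $depth = substr_count($m[1], '<ol>');
  // HTML only has h1-h6, and the openers must balance the closers.
  if ($depth > 6 || $depth !== substr_count($m[3], '</ol>')) {
    return $html;
  }
  return sprintf('<h%d>%s</h%d>', $depth, trim($m[2]), $depth);
}

$pasted = '<ol> <li> <ol> <li> <ol> <li>Heading level 3</li> </ol> </li> </ol> </li> </ol>';
echo sketch_nested_ol_to_heading($pasted), "\n";
```

Note that the heading number itself is not present in the pasted markup, so even this simple reversal cannot restore it.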
Configuring the input text filter
Start by installing and enabling the modules CKEditor and HNM MS Word.
Create a text format named “MS Word Filter” by navigating to .
The “MS Word Filter” will clean out most of the MS Word gunk, but will leave just enough to allow us to do the processing described below.
HTML elements and attributes:
a[!href|title], div[align<center], p[align<center], h1,h2,h3,h4,h5,h6, em,i,strong,b,u,strike,s,code, blockquote,pre,address,sub,sup, ul,ol,li,hr,br, table,tbody,caption,tr,td
Additional settings:
- HTML comments: disabled.
- Policy for rel: disabled.
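As a rough illustration of what an element whitelist does, the sketch below strips every tag that is not in the element list above, using PHP's built-in `strip_tags()`. This is not how the text format is implemented: the helper names are hypothetical, and the attribute rules (such as `a[!href|title]`) are not modelled, so attributes on allowed elements survive.

```php
<?php
// Element names copied from the allowed-elements list above.
$allowed = array(
  'a', 'div', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'em', 'i', 'strong', 'b', 'u', 'strike', 's', 'code',
  'blockquote', 'pre', 'address', 'sub', 'sup',
  'ul', 'ol', 'li', 'hr', 'br',
  'table', 'tbody', 'caption', 'tr', 'td',
);

/**
 * Hypothetical helper: build the "<a><div>..." string strip_tags() expects.
 */
function sketch_allowed_tags_string(array $tags) {
  return '<' . implode('><', $tags) . '>';
}

/**
 * Hypothetical helper: remove all tags that are not whitelisted.
 */
function sketch_strip_disallowed($html, array $tags) {
  return strip_tags($html, sketch_allowed_tags_string($tags));
}

$pasted = '<p class="MsoNormal"><span style="color:red">Hello</span> <font size="2"><b>world</b></font></p>';
echo sketch_strip_disallowed($pasted, $allowed), "\n";
```

The `span` and `font` wrappers are removed while their text is kept. The `class` attribute on `p` remains, which is why a real filter also needs per-attribute rules.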
Then set up HNM MS Word:
- Enable “HNM MS Word: Plugin to cleanup pasted text” on needed CKEditor profiles.
- Grant “Use HNM MS Word” permission to user roles that will be using the above CKEditor profiles.
To configure the module, navigate to .
Select the “MS Word Filter” text format to clean up pasted text.
Enable the “Clean-up alert” to confirm that the setup is working.
Press “Save configuration”.
This setup assumes that content creators can be trusted, as there is no limitation on the content that may be created. To secure the format, install the WYSIWYG module (which seems to filter style no matter what).
About HNM MS Word
HNM MS Word is a modified version of Paste Format.
The modifications are in the file hnm_msword.pvn.inc. It contains three functions:
    /**
     * Implements hook_form_FORM_ID_alter() for the node form.
     *
     * Alters form so that title is not a required field (we are going to
     * pull it from what is copy-pasted).
     */
    function hnm_msword_form_node_form_alter(&$form, &$form_state, $form_id)

    /**
     * Implements hook_node_presave().
     *
     * Main cleanup functions.
     */
    function hnm_msword_node_presave($node)

    /**
     * Called from function hnm_msword_cleanup() to clean up just before
     * completing cleanup.
     */
    function hnm_msword_mswordclean(&$txt)
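The bodies of these functions are not shown in this chapter. The following is a minimal sketch of what the first two hooks might look like, under stated assumptions: the `mymodule_` prefix is hypothetical, and taking the title from the first heading in the body is a guess at how the title is "pulled from what is copy-pasted". `LANGUAGE_NONE` is a Drupal 7 constant and is defined here only so that the sketch runs outside Drupal.

```php
<?php
// Drupal 7 defines LANGUAGE_NONE as 'und'; defined here for standalone use.
if (!defined('LANGUAGE_NONE')) {
  define('LANGUAGE_NONE', 'und');
}

/**
 * Hypothetical form alter: make the node title optional.
 */
function mymodule_form_node_form_alter(&$form, &$form_state, $form_id) {
  if (isset($form['title'])) {
    $form['title']['#required'] = FALSE;
  }
}

/**
 * Hypothetical presave: fill an empty title from the first heading in the
 * body. Which element the real module reads is an assumption.
 */
function mymodule_node_presave($node) {
  $title = isset($node->title) ? trim($node->title) : '';
  if ($title !== '' || empty($node->body[LANGUAGE_NONE][0]['value'])) {
    // Keep a title the editor typed, and do nothing without body text.
    return;
  }
  $body = $node->body[LANGUAGE_NONE][0]['value'];
  if (preg_match('~<h[1-6][^>]*>(.*?)</h[1-6]>~is', $body, $m)) {
    // Drop inline markup and decode entities before using it as a title.
    $node->title = trim(html_entity_decode(strip_tags($m[1]), ENT_QUOTES, 'UTF-8'));
  }
}
```

Because objects are passed by handle in PHP, assigning to `$node->title` inside the presave function changes the node that is about to be saved.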
The file hnm_msword.module is also slightly modified:
- There is a module_load_include to include hnm_msword.pvn.inc.
- The function hnm_msword_cleanup() is modified to call hnm_msword_mswordclean().
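The by-reference signature of hnm_msword_mswordclean() suggests a final pass that edits the text in place just before the cleanup function returns. The sketch below shows that calling pattern only. The function names and the three cleanup rules are hypothetical, since the actual rules are not listed in this chapter.

```php
<?php
/**
 * Hypothetical last-pass cleanup that modifies $txt in place.
 */
function sketch_mswordclean(&$txt) {
  // Non-breaking spaces (entity or UTF-8 bytes) become ordinary spaces.
  $txt = str_replace(array('&nbsp;', "\xC2\xA0"), ' ', $txt);
  // Drop paragraphs that contain only whitespace.
  $txt = preg_replace('~<p[^>]*>\s*</p>~i', '', $txt);
  // Collapse runs of spaces and tabs into a single space.
  $txt = preg_replace('~[ \t]{2,}~', ' ', $txt);
  $txt = trim($txt);
}

/**
 * Hypothetical stand-in for the main cleanup function.
 */
function sketch_cleanup($txt) {
  // Other cleanup steps would run here before the final pass.
  sketch_mswordclean($txt);
  return $txt;
}

echo sketch_cleanup('<p>&nbsp;</p><p>Some  text&nbsp;here.</p>'), "\n";
```

In a Drupal 7 module, the include file would typically be loaded with something like `module_load_include('inc', 'hnm_msword', 'hnm_msword.pvn')` before the final-pass function is called.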
Tests
- a_enkel_utennr.docx: Very simple template without numbers.
- a_enkel_mednr.docx: Very simple template with numbers.
Adapting Article for Personvernnemnda
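Rename “Article” to “Vedtak” (“Decision”). Description: “Denne innholdstypen skal benyttes for vedtak som skal legges ut på Internett” (“This content type is to be used for decisions that will be published on the Internet”).
Change title label to “Saksnummer og navn” (“Case number and name”).
Explanation: “La feltene «Saksnummer og navn», «Datatilsynets referanse» og «År» stå tomme eller uendret dersom du har kopiert teksten direkte fra MS Word. De vil bli fyllt ut automatisk når du lagrer.” (“Leave the fields ‘Case number and name’, ‘Datatilsynet's reference’ and ‘Year’ empty or unchanged if you have copied the text directly from MS Word. They will be filled in automatically when you save.”)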
The table below shows the fields in this content type:
Name (human / machine) | Field type | Widget | Hidden |
---|---|---|---|
Påtegninger / field_anonymisert | List (text) | Radio buttons | |
Saksnummer og navn / title | Node module element | | |
Klage / field_klagesammendrag | Text | Text field | |
Datatilsynets referanse / field_dtref | Text | Text field | |
År / field_year | Integer | Text field | Yes |
Body / body | Long text | Text area | |
Tags / field_tags | Term reference | Autocomplete | |
The tags field is retained from “Article”. It is currently not used for anything, but may be used for a taxonomy later.
Final word
While this is a custom module, about 60 % of the source code is lifted from the Paste Format project.
See also the drupal.org (DO) forum post “How do I import a MS Word document into a Drupal node”.
Last update: 2023-06-11 [gh].