Archiving Drupal
This chapter describes the steps required to produce a static mirror of a Drupal website. The typical use case is a once-dynamic website for an event that becomes static after the event has passed. A dynamic website needs ongoing maintenance for security reasons. An alternative to taking it down is to archive it as a static site.
Table of contents
- Introduction
- Preparing site
- Use wget to create a static clone of the site
- Clean up the static site
- Verify that the static version works
- Final word
Introduction
This note documents the workflow used to convert the Drupal 7 NordiChi 2018 website to static HTML to be served up on a simple Apache web server without PHP, MySQL, or any other special software.
It also describes some of the steps required to keep CSS, JS and images in order, and how to avoid duplicates.
Sources:
There exist Drupal modules that export some or all of your site as static HTML:
- HTML Export (D7)
- Static Generator (D7)
- Tome (D10)
Non-Drupal tool:
However, my past experience (2020) is that these do not work as well as the CLI tools described below.
The CLI tools used that are not part of the standard Unix environment (i.e. wget, fdupes, webcheck) can be installed using apt-get on Ubuntu 16.04 LTS.
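Before starting, it can help to check which of these tools are already present. The following is a minimal sketch; the function name missing_tools is made up for this example.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in this note; prints nothing when all are installed.
missing_tools wget fdupes webcheck
```

Anything it prints can then be installed with sudo apt-get install followed by the printed names, assuming the package names match the command names.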
Preparing site
If you are working from a staging site (as opposed to a production site), make sure it is fully updated. This means not only the database but also the public file system, the theme, and any custom modules.
- Disable forms and any non-static-friendly modules
Make sure that all the dynamic aspects of the site are disabled. Turn off search. Remove the login form. Delete all webform nodes. Turn off all forms, and turn off modules that use AJAX requests (like Fivestar voting). Make sure AJAX and exposed filters are disabled in all aggregates created by Views on the site.
- Delete all content that is unpublished or not public facing.
There will be no login and no access control on a static website.
Do not disable:
- Scald (loses images).
Disable this core module:
- Search
Disable these contributed modules:
- CKeditor
- Fivestar
- MotherMayI
- Webform
Review other enabled modules and disable those that serve no purpose on a static site.
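If Drush is available on the Drupal 7 site, the modules listed above could be disabled from the command line instead of through the admin pages. This is only a sketch: Drush is not mentioned in the original workflow, the machine names below are assumptions that should be checked with drush pm-list, and the DRY_RUN switch is invented for this example.

```shell
#!/bin/sh
# Assumed machine names for the modules listed above; verify them
# with "drush pm-list" before running anything.
MODULES="search ckeditor fivestar mothermayi webform"

# Print the Drush command by default; set DRY_RUN=0 to execute it.
disable_modules() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "drush pm-disable -y $MODULES"
    else
        # Word splitting of $MODULES is intentional here.
        drush pm-disable -y $MODULES
    fi
}

disable_modules
```

Run it from the Drupal root. Leaving the dry run on by default makes it easy to review the module list before anything is changed.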
Use wget to create a static clone of the site
The following script (named staticclone.sh) crawls the site with the URL SITE as the anonymous user and creates a static clone of it in the directory TARGET.
#!/bin/sh
# Script to create a static clone of a site
# Source: https://www.drupal.org/node/27882
SITE=http://s.nordichi18.org
TARGET=./static

# Mirror the site into $TARGET as a flat directory of files
wget -q --mirror -p --adjust-extension -e robots=off --base=./ -nd -k -P "$TARGET" "$SITE"

# Strip query strings (e.g. ?itok=...) from the downloaded file names
cd "$TARGET" || exit 1
find . -name "*.*\?*" | while IFS= read -r filename; do
    mv "$filename" "${filename%%\?*}"
done
Here's what each argument to wget means:
-q
- Don't write any wget output messages.
--mirror
- Turn on options suitable for mirroring, i.e. -r -N -l inf --no-remove-listing.
-p
- Download images, scripts and stylesheets so that everything works offline.
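--no-check-certificate
- Ignore certificate warnings. Not used in the script above; add it only if the site is served over HTTPS with a certificate that wget rejects.
--adjust-extension
- Append .html to downloaded HTML files that lack it so that they can be viewed offline. E.g. www.example.com/example becomes example.html. Older versions of wget call this option --html-extension.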
-e robots=off
- Disable robot exclusion so that you get everything Drupal needs.
--base=./
- Set the base URL to best resolve relative links.
-nd
- Do not create a hierarchy of directories.
-k
- Convert links to make them suitable for local viewing.
-P $TARGET
- Download into this directory.
Use the -nd option with caution, as name collisions will bite. It will work without, but you will get a hierarchical site.
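One way to gauge the collision risk is to mirror once without -nd and list the file names that occur in more than one directory. The sketch below assumes such a hierarchical mirror already exists; the function name colliding_names and the directory ./static-tree are made up for this example.

```shell
#!/bin/sh
# List file names that occur in more than one directory below $1.
# These are the names that would collide if the tree were flattened.
colliding_names() {
    find "$1" -type f | sed 's|.*/||' | sort | uniq -d
}

# Example call, assuming a hierarchical mirror in ./static-tree:
# colliding_names ./static-tree
```

An empty result suggests that -nd is safe for that site. Any name that is printed needs a closer look before flattening.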
The wget command will append query strings such as ?itok=qRoiFlnG to the filenames for images and JavaScript. The find command will recursively remove them.
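The renaming relies on the shell parameter expansion ${filename%%\?*}, which removes the longest suffix beginning with a literal question mark. A small sketch of the expansion on its own, where the helper strip_query and the file name logo.png are invented for illustration:

```shell
#!/bin/sh
# ${1%%\?*} removes the longest suffix that starts with a literal "?".
strip_query() {
    printf '%s\n' "${1%%\?*}"
}

strip_query 'logo.png?itok=qRoiFlnG'   # prints: logo.png
strip_query 'style.css'                # prints: style.css (nothing to strip)
```

Because %% removes the longest match, a name containing several question marks is cut at the first one.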
This will preserve aggregates output using Views, but exposed filters will not work (so make sure there are none).
The script takes no arguments. For a large site, it will take some time to complete.
$ staticclone.sh
You will now have a directory containing all the files required for a static site, including images, linked files, CSS and JavaScript. Move it to a location where Apache may serve it.
Clean up the static site
First inspect one of the HTML files and determine the query string used. Edit fixclone.pl and set this to the right value in the “custom” section and in two lines in the “main” section as well. Then run it on all the HTML files.
$ fixclone.pl *.html
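Instead of reading an HTML file by eye, the query strings in use can be listed with grep. This is a sketch that assumes a grep supporting -o (such as GNU grep) and query strings made up of letters, digits and the characters = . _ -; the function name asset_query_strings is hypothetical.

```shell
#!/bin/sh
# Print each distinct query string attached to a .css or .js link
# in the given HTML files.
asset_query_strings() {
    grep -Eoh '\.(css|js)\?[A-Za-z0-9=._-]+' "$@" | sed 's/^[^?]*?//' | sort -u
}

# Example call from the directory holding the static files:
# asset_query_strings *.html
```

Version strings on jQuery files show up in the same listing, so expect more than one line of output on some sites.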
fixclone.pl is a custom Perl script developed specifically for NordiChi 2018. Currently it does the following:
- Remove the query string from links to .css and .js files.
- Remove the version string from links to jquery….js.
- Delete single lines:
  - Generator (Drupal is no longer the generator)
  - shortlink (absolute URL)
  - application/rss+xml (absolute URL)
  - jQuery.extend (stops fdupes from working)
- Delete the skip-link <div> (pointless on a static site).
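The Perl script itself is not reproduced here. As a rough illustration of the first few steps only, the following sed filter does something similar. It is not the author's script, and it assumes that query strings consist of letters, digits and = . _ - and that the Generator and shortlink tags each sit on a single line with the attribute order shown; clean_html is a made-up name.

```shell
#!/bin/sh
# Filter: read HTML on stdin, write cleaned HTML on stdout.
clean_html() {
    sed -e 's/\.css?[A-Za-z0-9=._-]*/.css/g' \
        -e 's/\.js?[A-Za-z0-9=._-]*/.js/g' \
        -e '/<meta name="Generator"/d' \
        -e '/<link rel="shortlink"/d'
}

# Example call, writing to a new file rather than editing in place:
# clean_html < index.html > index.html.new
```

Writing to a new file keeps the original available for comparison with diff before it is replaced.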
Finally, perform the following tasks:
- Remove rss.xml and feed as well as all links to them.
- Search for the pattern itok=, as it may prevent image links from working. Remove by hand.
- Search for absolute URLs to the legacy website and convert them to relative URLs by hand. Check that the links work.
- Look for duplicate nodes. This will typically be the case when a node has been aliased.
  - Check if there are any links to the duplicate, and fix them to link to the canonical version (there usually are no links).
  - Remove the duplicate not designated as canonical.
- Check if there are any email links and remove them.
- Run webcheck to look for remaining problems. You may want to do this first with a setting to avoid external links, and then again to check for bad external links.
Here are the CLI commands to use:
$ rm rss.xml feed
$ fgrep 'itok=' *
$ fgrep http://example.net *
$ fdupes .
$ fgrep '"N.html"' *
$ fgrep 'mailto:' *
$ mkdir webcheck; cd webcheck
$ webcheck -a http://example.net
$ rm *
$ webcheck http://example.net
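To track progress while fixing things by hand, the three text searches can be wrapped in a small summary that counts the affected files. This is a sketch only: it assumes a grep that supports -r, and the function names count_files_with and audit_static are invented here.

```shell
#!/bin/sh
# Count the files below directory $1 that contain the fixed string $2.
count_files_with() {
    grep -rlF -e "$2" "$1" | wc -l | tr -d ' '
}

# Report the leftovers that still need fixing by hand.
# $1 = directory with the static site, $2 = URL of the legacy site.
audit_static() {
    echo "itok=: $(count_files_with "$1" 'itok=')"
    echo "mailto: $(count_files_with "$1" 'mailto:')"
    echo "absolute URLs: $(count_files_with "$1" "$2")"
}

# Example call, using the placeholder URL from the commands above:
# audit_static . http://example.net
```

When every count is zero, the remaining checks are fdupes and webcheck.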
Link to the NordiChi 2018 webcheck report.
Verify that the static version works
Verify that the static HTML version works in a browser. Test to make sure that you properly turned off any interactive elements that will now confuse visitors. Check that images and linked files work.
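Alongside the manual check in a browser, a quick search can flag pages that still contain a form. The sketch below assumes a grep that supports -r and --include (GNU grep does); the function name files_with_forms is made up for this example.

```shell
#!/bin/sh
# List the HTML files below directory $1 that still contain a <form> tag.
files_with_forms() {
    grep -rli --include='*.html' '<form' "$1"
}

# Example call from the directory holding the static files:
# files_with_forms .
```

No output means no form tags were found. It does not catch interactive elements built purely in JavaScript, so the browser test is still needed.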
Final word
The reason one may want to create a static site archive boils down to one of these:
- Over time the website has essentially become static. Because a WCMS-based website still requires security administration, an administrator has to keep upgrading the site with security updates. Making it static removes this burden.
- When you're unable to do daily maintenance, maintain a regular Drupal site inside a firewall and copy a static HTML version of the site to a public web server before leaving or going offline.
- You may want to produce an offline copy for archiving or reference when you don't have access to the Internet.
Last update: 2019-07-27 [gh].