Archiving Drupal

by Gisle Hannemyr

This chapter describes the steps required to produce a static mirror of a Drupal website. The typical use case is a once-dynamic website for an event that becomes static after the event has passed. A dynamic website needs maintenance for security reasons; an alternative to taking it down is archiving it as a static site.

Introduction

This note documents the workflow used to convert the Drupal 7 NordiChi 2018 website to static HTML to be served up on a simple Apache web server without PHP, MySQL, or any other special software.

It also describes some of the steps required to keep CSS, JS and images in order, and how to avoid duplicates.

Sources:

There exist Drupal modules to export some or all of your site as static HTML:

Non-Drupal tool:

However, my past experience (2020) is that these do not work as well as the CLI tools described below.

The CLI tools used that are not part of the standard Unix environment (i.e. wget, fdupes, and webcheck) can be installed using apt-get on Ubuntu 16.04 LTS.
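For example, on Ubuntu the tools can be installed as follows. The package names below are assumptions based on the Ubuntu 16.04 archive; verify with apt-cache search if a package is not found.

```shell
# Install the CLI tools used in this note (Ubuntu 16.04 LTS).
# Package names are assumptions; verify with `apt-cache search` first.
sudo apt-get update
sudo apt-get install -y wget fdupes webcheck
```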

Preparing site

If you are working from a staging site (as opposed to a production site), make sure it is fully updated. This means not only the database but the public file system, the theme, and any custom modules.

  1. Disable forms and any non-static-friendly modules
    Make sure all the dynamic aspects of the site are disabled. Turn off search. Remove the login form. Delete all webform nodes. Turn off all forms, and turn off modules that use AJAX requests (like Fivestar voting). Make sure AJAX and exposed filters are disabled in all aggregates created by Views on the site.
  2. Delete all content that is unpublished or not public-facing.
    There will be no login and no access control on a static website.
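If Drush is available on the site, modules can be disabled from the command line. This is a sketch using Drupal 7 Drush syntax; the module machine names are examples drawn from the text above, not a complete list for any particular site.

```shell
# Disable dynamic modules with Drush (Drupal 7 syntax).
# Module names below are examples; adjust to the modules enabled on your site.
drush pm-disable -y search fivestar
# Clear caches so the disabled forms disappear from rendered pages.
drush cache-clear all
```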

Do not disable:

Disable this core module:

Disable these contributed modules:

Review other enabled modules and disable those that serve no purpose on a static site.

Use wget to create a static clone of the site

The following script (named staticclone.sh) crawls the site as the anonymous user and creates a static clone of the site with the URL SITE in the directory TARGET.

#!/bin/sh
# Script to create a static clone of a site
# Source: https://www.drupal.org/node/27882

SITE=http://s.nordichi18.org
TARGET=./static
wget -q --mirror -p --adjust-extension -e robots=off --base=./ -nd -k -P "$TARGET" "$SITE"
cd "$TARGET" || exit 1
# Strip query strings (e.g. ?itok=...) from downloaded file names.
find . -name "*.*\?*" | while read filename; do mv "$filename" "${filename%%\?*}"; done

Here's what each argument to wget means:

-q
Don't write any wget output messages.
--mirror
Turn on options suitable for mirroring, i.e. -r -N -l inf --no-remove-listing.
-p
Download images, scripts and stylesheets so that everything works offline.
--no-check-certificate
Ignore certificate warnings. Not used in the script above, since the site is served over plain HTTP; add it if you mirror an HTTPS site with an invalid certificate.
--adjust-extension
Append .html to downloaded HTML files so that they can be viewed offline. E.g. www.example.com/example becomes example.html. (In older versions of wget this option was called --html-extension.)
-e robots=off
Disable robot exclusion so that you get everything Drupal needs.
--base=./
Set the base URL to best resolve relative links.
-nd
Do not create a hierarchy of directories.
-k
Convert links to make them suitable for local viewing.
-P $TARGET
Download into this directory.

Use the -nd option with caution, as name collisions will bite. The command works without it, but you will then get a hierarchical site.

The wget command will append query strings such as ?itok=qRoiFlnG to the file names of images and JavaScript. The find command will recursively remove them.
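The effect of the find/mv loop can be demonstrated in isolation. The sketch below creates a throwaway directory containing a file whose name carries an itok query string (the name and token are invented for illustration) and strips it the same way the script does:

```shell
# Demonstrate stripping a query string from a downloaded file name.
rm -rf /tmp/clone-demo
mkdir -p /tmp/clone-demo
cd /tmp/clone-demo
touch 'logo.png?itok=qRoiFlnG'
# Same loop as in staticclone.sh: keep everything before the first '?'.
find . -name "*\?*" | while read filename; do
  mv "$filename" "${filename%%\?*}"
done
ls
# → logo.png
```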

This will preserve aggregates output using Views, but exposed filters will not work (so make sure there are none).

The script takes no arguments. For a large site, it will take some time to complete.

$ staticclone.sh

You will now have a directory containing all the files required for a static site, including images, linked files, CSS, and JavaScript. Move it to a location where Apache can serve it.

Clean up the static site

First, inspect one of the HTML files and determine the query string used. Edit fixclone.pl and set this to the right value in the “custom” section, and adjust the two corresponding lines in the “main” section as well. Then run it on all the HTML files.

$ fixclone.pl *.html

This is a custom Perl script developed specifically for NordiChi 2018. Currently it does the following:

  1. Remove query strings from links to .css and .js files.
  2. Remove version strings from links to jquery….js.
  3. Delete single lines:
    • Generator (Drupal is no longer the generator)
    • shortlink (absolute URL)
    • application/rss+xml (absolute URL)
    • jQuery.extend (stops fdupes from working)
  4. Delete skip-link <div> (pointless on a static site).
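fixclone.pl itself is not reproduced here, but the first of its tasks can be sketched with sed. The HTML fragment and query string below are invented for illustration; the regular expression simply cuts everything between a .css or .js suffix and the closing quote:

```shell
# Strip query strings from .css and .js links in an HTML file (sketch).
# Writes the cleaned markup to stdout; use sed -i to edit in place.
echo '<link href="style.css?abc123" rel="stylesheet">' > /tmp/page-demo.html
sed -E 's/\.(css|js)\?[^"]*/.\1/g' /tmp/page-demo.html
# → <link href="style.css" rel="stylesheet">
```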

Finally, perform the following tasks:

  1. Remove rss.xml and feed as well as all links to them.
  2. Search for the pattern itok= as it may prevent image links from working. Remove by hand.
  3. Search for absolute URLs to the legacy website and convert to relative by hand. Check that the link works.
  4. Look for duplicate nodes. This will typically be the case when a node has been aliased.
  5. Check if there are any links to the duplicate, and fix them to link to the canonical version (there usually are no links).
  6. Remove the duplicate not designated as canonical.
  7. Check if there are any email links and remove them.
  8. Run webcheck to look for remaining problems. You may want to do this first with a setting to avoid external links, and then again to check for bad external links.

Here are the CLI commands to use:

$ rm rss.xml feed
$ fgrep 'itok=' *
$ fgrep http://example.net *
$ fdupes .
$ fgrep '"N.html"' *
$ fgrep 'mailto:' *
$ mkdir webcheck; cd webcheck
$ webcheck -a http://example.net
$ rm *
$ webcheck http://example.net
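Removing email links by hand scales poorly on a larger site. A hedged sed sketch that unwraps mailto anchors, leaving only the link text, is shown below; the HTML fragment and address are invented examples, and the pattern assumes a plain href with no extra attributes:

```shell
# Replace <a href="mailto:...">text</a> with just the link text (sketch).
echo 'Contact <a href="mailto:info@example.net">the organisers</a>.' > /tmp/mail-demo.html
sed -E 's|<a href="mailto:[^"]*">([^<]*)</a>|\1|g' /tmp/mail-demo.html
# → Contact the organisers.
```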

Link to the NordiChi 2018 webcheck report.

Verify that the static version works

Verify that the static HTML version works in a browser. Test to make sure that you properly turned off any interactive elements that would now confuse visitors. Check that images and linked files work.

Final word

The reason one may want to create a static site archive boils down to one of these:


Last update: 2019-07-27 [gh].