Archiving Drupal

by Gisle Hannemyr

This chapter describes the steps required to produce a static mirror of a Drupal website. The typical use case is a once-dynamic website for an event that becomes static after the event has passed. A dynamic website needs maintenance for security reasons; an alternative to taking it down is archiving it as a static site.

Introduction

This note documents the workflow used to convert the Drupal 7 NordiChi 2018 website to static HTML to be served up on a simple Apache web server without PHP, MySQL, or any other special software.

It also describes some of the steps required to keep CSS, JS and images in order, and how to avoid duplicates.

Sources:

There exist Drupal modules to export some or all of your site as static HTML:

Non-Drupal tool:

However, my past experience (2020) is that these do not work as well as the CLI tools described below.

The CLI tools used that are not part of the standard Unix environment (i.e. wget, fdupes, and webcheck) can be installed using apt-get on Ubuntu 16.04 LTS.
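For example, on Ubuntu the tools can be installed as follows. The package names below are assumptions based on the Ubuntu 16.04 archive; verify with apt-cache search if a package is not found.

```shell
# Install the CLI tools used in this note (Ubuntu 16.04 LTS).
# Package names are assumptions; verify with `apt-cache search` first.
sudo apt-get update
sudo apt-get install -y wget fdupes webcheck
```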

Preparing site

If you are working from a staging site (as opposed to a production site), make sure it is fully updated. This means not only the database but the public file system, the theme, and any custom modules.

  1. Disable forms and any non-static-friendly modules
    Make sure all the dynamic aspects of the site are disabled. Turn off search. Remove the login form. Delete all webform nodes. Turn off all forms, and turn off modules that use AJAX requests (like Fivestar voting). Make sure AJAX and exposed filters are disabled in all aggregates created by Views on the site.
  2. Delete all content that is unpublished or not public-facing.
    There will be no login and no access control on a static website.
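If Drush is available on the site, modules can be disabled from the command line. This is a sketch using Drupal 7 Drush syntax; the module machine names are examples drawn from the text above, not a complete list for any particular site.

```shell
# Disable dynamic modules with Drush (Drupal 7 syntax).
# Module names below are examples; adjust to the modules enabled on your site.
drush pm-disable -y search fivestar
# Clear caches so the disabled forms disappear from rendered pages.
drush cache-clear all
```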

Do not disable:

Disable this core module:

Disable these contributed modules:

Review other enabled modules and disable those that serve no purpose on a static site.

Use wget to create a static clone of the site

The following script (named staticclone.sh) crawls the site as the anonymous user and creates a static clone of the site with the URL SITE in the directory TARGET.

#!/bin/sh
# Script to create a static clone of a site
# Source: https://www.drupal.org/node/27882

SITE=http://s.nordichi18.org
TARGET=./static
wget -q --mirror -p --adjust-extension -e robots=off --base=./ -nd -k -P "$TARGET" "$SITE"
cd "$TARGET" || exit 1
# Strip query strings (e.g. ?itok=...) from downloaded file names.
find . -name "*.*\?*" | while read filename; do mv "$filename" "${filename%%\?*}"; done

Here's what each argument to wget means:

-q
Don't write any wget output messages.
--mirror
Turn on options suitable for mirroring, i.e. -r -N -l inf --no-remove-listing.
-p
Download images, scripts and stylesheets so that everything works offline.
--no-check-certificate
Ignore certificate warnings. Not used in the script above, since the site is served over plain HTTP; add it if you mirror an HTTPS site with an invalid certificate.
--adjust-extension
Append .html to downloaded HTML files so that they can be viewed offline. E.g. www.example.com/example becomes example.html. (In older versions of wget this option was called --html-extension.)
-e robots=off
Disable robot exclusion so that you get everything Drupal needs.
--base=./
Set the base URL to best resolve relative links.
-nd
Do not create a hierarchy of directories.
-k
Convert links to make them suitable for local viewing.
-P $TARGET
Download into this directory.

Use the -nd option with caution, as name collisions will bite. The command works without it, but you will then get a hierarchical site.

The wget command will append query strings such as ?itok=qRoiFlnG to the file names of images and JavaScript. The find command will recursively remove them.
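The effect of the find/mv loop can be demonstrated in isolation. The sketch below creates a throwaway directory containing a file whose name carries an itok query string (the name and token are invented for illustration) and strips it the same way the script does:

```shell
# Demonstrate stripping a query string from a downloaded file name.
rm -rf /tmp/clone-demo
mkdir -p /tmp/clone-demo
cd /tmp/clone-demo
touch 'logo.png?itok=qRoiFlnG'
# Same loop as in staticclone.sh: keep everything before the first '?'.
find . -name "*\?*" | while read filename; do
  mv "$filename" "${filename%%\?*}"
done
ls
# → logo.png
```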

This will preserve aggregates output using Views, but exposed filters will not work (so make sure there are none).

The script takes no arguments. For a large site, it will take some time to complete.

$ staticclone.sh

You will now have a directory containing all the files required for a static site, including images, linked files, CSS, and JavaScript. Move it to a location where Apache can serve it.

Clean up the static site

First, inspect one of the HTML files and determine the query string used. Edit fixclone.pl and set this to the right value in the “custom” section, and adjust the two corresponding lines in the “main” section as well. Then run it on all the HTML files.

$ fixclone.pl *.html

This is a custom Perl script developed specifically for NordiChi 2018. Currently it does the following:

  1. Remove query strings from links to .css and .js files.
  2. Remove version strings from links to jquery….js.
  3. Delete single lines:
    • Generator (Drupal is no longer the generator)
    • shortlink (absolute URL)
    • application/rss+xml (absolute URL)
    • jQuery.extend (stops fdupes from working)
  4. Delete skip-link <div> (pointless on a static site).
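fixclone.pl itself is not reproduced here, but the first of its tasks can be sketched with sed. The HTML fragment and query string below are invented for illustration; the regular expression simply cuts everything between a .css or .js suffix and the closing quote:

```shell
# Strip query strings from .css and .js links in an HTML file (sketch).
# Writes the cleaned markup to stdout; use sed -i to edit in place.
echo '<link href="style.css?abc123" rel="stylesheet">' > /tmp/page-demo.html
sed -E 's/\.(css|js)\?[^"]*/.\1/g' /tmp/page-demo.html
# → <link href="style.css" rel="stylesheet">
```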

Finally, perform the following tasks:

  1. Remove rss.xml and feed as well as all links to them.
  2. Search for the pattern itok= as it may prevent image links from working. Remove by hand.
  3. Search for absolute URLs to the legacy website and convert to relative by hand. Check that the link works.
  4. Look for duplicate nodes. This will typically be the case when a node has been aliased.
  5. Check if there are any links to the duplicate, and fix them to link to the canonical version (there usually are no links).
  6. Remove the duplicate not designated as canonical.
  7. Check if there are any email links and remove them.
  8. Run webcheck to look for remaining problems. You may want to do this first with a setting to avoid external links, and then again to check for bad external links.

Here are the CLI commands to use:

$ rm rss.xml feed
$ fgrep 'itok=' *
$ fgrep http://example.net *
$ fdupes .
$ fgrep '"N.html"' *
$ fgrep 'mailto:' *
$ mkdir webcheck; cd webcheck
$ webcheck -a http://example.net
$ rm *
$ webcheck http://example.net
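Removing email links by hand scales poorly on a larger site. A hedged sed sketch that unwraps mailto anchors, leaving only the link text, is shown below; the HTML fragment and address are invented examples, and the pattern assumes a plain href with no extra attributes:

```shell
# Replace <a href="mailto:...">text</a> with just the link text (sketch).
echo 'Contact <a href="mailto:info@example.net">the organisers</a>.' > /tmp/mail-demo.html
sed -E 's|<a href="mailto:[^"]*">([^<]*)</a>|\1|g' /tmp/mail-demo.html
# → Contact the organisers.
```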

Link to the NordiChi 2018 webcheck report.

Verify that the static version works

Verify that the static HTML version works in a browser. Test to make sure that you properly turned off any interactive elements that would now confuse visitors. Check that images and linked files work.

Final word

The reason one may want to create a static site archive boils down to one of these:


Last update: 2019-07-27 [gh].