Content inventory

by Gisle Hannemyr

This chapter first defines the term “content type” from the perspective of an information architect and a site builder. It then describes how to create a content inventory in order to identify the content types that will make up the website, and how to make sure those content types are supported by the website's software. By way of example, this chapter is about conducting a content survey in order to create a Drupal version 10 website. However, the more general principles may be adapted for any WCMS platform.

Introduction
Content types
- Predefined content types in Drupal
- A custom content type
Separating content from clutter
Doing a content survey
- Low fidelity content survey
- High fidelity content survey
Content audit
- ROT-analysis
- Automatic tools
Final word

Introduction

A content inventory is an inventory of content that exists on a website. It is usually registered in some tabular format (e.g. in a spreadsheet, or in a suitable database).

There are two main types of content inventory:

Content survey: Identifying content types that need to be supported (or not) in the website.
Content audit: A review of all website content in order to be able to refactor it.

In this chapter, we distinguish between design and redesign. Design is a process that is carried out when no previous version of the website exists, and an outline of what types of content it shall contain must be created. Redesign is when a website is converted from some existing platform to another platform. In many cases, the starting point for a redesign may be static HTML, another WCMS (e.g. WordPress) or a legacy version of the target WCMS (e.g. Drupal version 7).

A content inventory created for the purposes of doing a content survey will only analyse pages that contain representative content types.

A content inventory created for the purposes of doing a content audit will often include all pages of an existing website. It is usually at least partially created by means of automatic tools.

Content types

To understand the purpose of doing a content survey as part of the process of building a Drupal website, you need to understand the concept of content types in the context of a web content management system (WCMS) in general, and in the context of Drupal version 10 in particular.

A WCMS is a computer system that allows an organisation or a group of authors to manage and present content on a website.

The infamous three circles of information architecture.

From the perspective of an information architect, content is structured data objects. From the perspective of a site builder, structured data objects must be defined in a way that can be represented internally in the website's database. In order to reconcile these two perspectives, we use content types.

A content type is a pre-defined collection of data types that relate to each other by being components of “something” that, from a content creator's perspective, should be considered as a correlated whole.

In Drupal, the term entity is used to refer to a structured data object that is managed by the WCMS. The particular collection of data types that make up an entity is called a bundle, and the container for a single data type in the bundle is called a field. The main entity used to hold content in Drupal is simply called Content.
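The relationship between a bundle and its fields can be pictured with a small data model. The following is only an illustrative sketch in Python, not Drupal's internal representation: the class names `Field` and `Bundle` and the example bundle are assumptions made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Field:
    """Container for a single data type in a bundle (illustrative only)."""
    name: str
    data_type: str  # e.g. "text", "long text", "date", "taxonomy term"


@dataclass
class Bundle:
    """The collection of fields that defines one content type."""
    content_type: str
    fields: list[Field] = field(default_factory=list)

    def field_names(self) -> list[str]:
        """Return the names of the fields, in the order they were defined."""
        return [f.name for f in self.fields]


# Hypothetical example bundle with a single-line title and a long text body.
basic_page = Bundle(
    content_type="Basic page",
    fields=[Field("Title", "text"), Field("Body", "long text")],
)

print(basic_page.field_names())
```

Each instance of content created by a content creator would then hold one value per field in the bundle.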

note The content entity is Drupal's main container for content. Since “content” is a generic term, and the content entity is stored in the {node} base table in the database, the term “node content type” is sometimes used to refer to types of the content entity in order to distinguish them from other types of entities, such as users, comments, taxonomy terms and files. However, in the Drupal community, when you see the term “content type” it always means “node content type” (i.e. a type of the content entity, not a type of some other entity). Also, in the Drupal community, the term “node” always means “instance of node content type”. However, when surveying content, an information architect may use the term “content type” to refer to managed content that is not a “node content type”. For instance a “user” (i.e. a user profile) may also be viewed as a content type by an information architect.

When a content creator creates content for a website, they do so by creating an instance of content of a particular content type.

When content is presented to users, it may be in a format that corresponds closely to what the content creator created. However, a WCMS lets the site builder arrange for any number of other ways to present content (e.g. aggregate lists, or the content of an individual field may be extracted from one content type and joined into another).

Not everything that is visible on a web page is a content type. When doing a content survey to identify the content types on a particular website, the information architect must use discretion to distinguish between instances of content types and auxiliary content.

This list of things that are not instances of content types may be handy until you have learnt from experience how to identify instances of content types:

A website home page or a subsite foundation page is not by itself an instance of a content type (but may contain one or more instances of content types).
A form (e.g. a login form, or the form used by the content creator to create content). However, the content that results from filling in a form to create content is an instance of a content type, and the fields in the form will tell the information architect the data types that constitute that particular content type.
Navigation elements (e.g. a menu; global, local, contextual, integrated, footer or supplemental navigation; a search field).
An aggregation (but aggregations are frequently lists of instances of content types: “blog” is not a content type, but each individual “blog entry” is an instance of the content type “blog entry”).
A region (but a region may be a container for one or more instances of a content type). More frequently, a region will contain blocks, and a block is not a content type.
Decorative elements that are placed on the page by the page's template (i.e. a part of theming), but not managed as content in the database of the WCMS.

note You should also not do a content survey of teasers. A teaser is a rendering of a content type instance that only shows an extract or a summary, with a link to an expanded rendering of the same content type instance. Click through the teaser to the expanded rendering, and survey that instead. Most teasers are part of an aggregation and serve as contextual navigation. Both these criteria are grounds for excluding teasers from the content survey.

If there is only a single instance of “something” on a website, it makes no sense for an information architect to create a content type for it. It is much simpler to create singular content as a block or as a decorative element. If you are only able to find a single instance of “something” on a website, it is very unlikely that this is an instance of a content type.

Predefined content types in Drupal

Drupal core comes with some predefined content types.

Predefined content types in Drupal version 7

Six different node content types are predefined by the Drupal version 7 core. These are listed below, along with a comment that mentions some of their characteristics. Note that the first five are very similar, with long text as their main field. What makes them different are the default settings and the aggregation structure they belong to by default.

Article: Main field: long text. Default settings: byline/date visible, comments open, promoted to front page. Aggregation: Main content.
Basic page: Main field: long text. Default settings: byline/date hidden, comments closed, not promoted to front page. Aggregation: None.
Blog entry: Main field: long text. Default settings: byline/date visible, comments open, promoted to front page. Aggregation: blog.
Book: Main field: long text. Default settings: byline/date visible, comments open. Aggregation: book.
Forum topic: Main field: long text. Default settings: byline/date visible, comments open. Aggregation: forum.
Poll: This is different from all the other content types. It allows you to create multiple choice questions and to capture votes on these.

Of those six, only Article and Basic page are enabled by default. When doing a content survey of a website, expect to come across these two content types a lot. If a content item mainly consists of text with inline images, without a visible byline and date, it is very likely that it is a Basic page. If it carries a byline and date, it is very likely that it is an Article.

tip Long text is a text field that consists of several lines of text, and may also contain embedded markup for inline images and inline (integrated) links. Just text is a field that consists of a single line of text without embedded markup.
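The rule of thumb for telling Article and Basic page apart can be written down as a tiny decision function. This is only a sketch of the heuristic described above: the function name and its two parameters are made up for illustration, and the result is a guess that the information architect still needs to confirm.

```python
def guess_predefined_type(has_long_text: bool, shows_byline_and_date: bool) -> str:
    """Guess which predefined content type a surveyed content item is.

    Returns a guess, not a verdict: custom content types may look similar.
    """
    if not has_long_text:
        # The heuristic only covers items whose main field is long text.
        return "unknown"
    return "Article" if shows_byline_and_date else "Basic page"


print(guess_predefined_type(has_long_text=True, shows_byline_and_date=False))
```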

Predefined content types in Drupal version 10

In Drupal 10, there are only three node content types:

Article
Basic page
Book

The three other node content types that were available in Drupal 7 (Blog entry, Forum topic and Poll) are provided by extensions.

A custom content type

When doing a content survey, expect to come across the content types Article and Basic page frequently.

One of the tasks of a site builder is to create custom content types that use specific fields to enrich the information architecture of the site. By breaking out specific data types from a catch-all long text field, the information architect can make content more findable by providing faceted search for the site, and use specific fields to create contextual navigation links.

The screenshot below shows an instance of a content type that is taken from the website of a community of practice. This community is interested in keeping track of events (workshops, conferences) and publications (journals, monographs) where members of the community may want to participate and contribute. This particular content instance is about a conference named “NordiCHI 2018”. The date when papers are due is the 15th of April 2018, and the conference is scheduled to take place from October 1st to October 3rd the same year. The venue location is Oslo in Norway, and there is a link (shown in blue) to the conference website. There is also some text describing the conference.

The fields that make up this instance of a custom content type.

The screenshot above shows a content instance, and an information architect has (with red pen) deconstructed all specific fields that make up the bundle of this particular content type.

tip Deconstruction of the fields that make up the content is done when you are doing a high fidelity content survey. Early in the design process, you will probably only do a low fidelity content survey, and you will not have to break down the content type into individual fields.

The deconstruction tells the information architect that the bundle for this particular content type consists of the following six fields:

Name of conference, monograph, etc.
Paper due date
From-to dates
Location
Website
Description

The information architect names this content type “Call for Contributions” and creates a sketch that roughly shows all the fields that make up its bundle:

Sketch of the fields in this custom content type.
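The six deconstructed fields can also be written down as a typed record. The sketch below is one possible way to do this in Python and is not how Drupal stores the content type. The class and attribute names are assumptions, the from-to dates are split into two attributes for simplicity, and the website URL and description are placeholders because the actual values are not given in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CallForContributions:
    name: str          # Name of conference, monograph, etc.
    paper_due: date    # Paper due date
    starts: date       # From-to dates: first day
    ends: date         # From-to dates: last day
    location: str      # Location
    website: str       # Website (link)
    description: str   # Description (long text)

    def duration_days(self) -> int:
        """Number of days the event lasts, counting both the first and last day."""
        return (self.ends - self.starts).days + 1


nordichi = CallForContributions(
    name="NordiCHI 2018",
    paper_due=date(2018, 4, 15),
    starts=date(2018, 10, 1),
    ends=date(2018, 10, 3),
    location="Oslo, Norway",
    website="https://example.org/",  # placeholder, not the real conference URL
    description="(descriptive text omitted in this sketch)",
)

print(nordichi.duration_days())
```

Storing the dates as dates rather than as free text is what later makes it possible to sort or filter on them.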

Given that we want to create a website for another community of practice that provides its members with a similar content type, it is now fairly simple for the site builder to “translate” this sketch into an actual content type on the Drupal website being built, by means of Drupal's built-in function to create content types (see the chapter with the title Creating content types using fields for a description of this). The result of the site builder's work will look like this when the newly created content type is examined in Drupal's administrative GUI:

The fields as they appear in the Drupal GUI.

Notice that the column with the heading “Field type” assigns a specific data type to the field. The column with the heading “Widget” specifies what user interaction widget to use when the content creator creates new content instances of the content type.

Separating content from clutter

When starting out surveying content, it is easy to be confused by the clutter of content that appears on most web pages. The ability to separate the content type instance from the clutter comes with experience.

In the example below, two different pages taken from the website of “Netfonds Bank” are surveyed. The first of those pages is shown below:

A web page that is simple to survey.

Even an inexperienced information architect should manage to call this one correctly. They know that the navigation at the top and bottom of the page should be excluded from the content. What remains is a long text field. I have highlighted this content by putting a red frame around it. The long text does not have a visible byline or date, so it should be obvious that the content type of the web page above is Basic page.

The second page in this example is much more cluttered. When surveying its contents, you know that you should exclude the navigation at the top and bottom, but what about the image of the girl and the speech bubble? What about the other “stuff”, such as the links to other geographical regions and links to most traded shares (in the left sidebar), the graphs and the key numbers (“nøkkeltall”) in the right sidebar?

Cluttered web page whose content is of the type basic page.

In fact, none of the stuff in the two sidebars, nor the speech bubble that extends over the first sidebar and the main content region, is “content” in the context of a content survey:

The girl and the speech bubble are purely decorative. They are not managed content and should not be included in the content survey.
The links to other geographical regions are contextual navigation.
The links to most traded shares are contextual navigation.
All the three graphs that appear on the page are teasers. You may survey the instances of content types the teasers link to, but you should not include the teasers in the survey.
The block with key numbers is an aggregation.

What remains of the page is an area filled with long text. It carries no byline or date, so we recognize the content as an instance of our old friend Basic page. I have highlighted it by putting a red box around it.

Aggregates

Some web pages simply display an aggregate where several instances of a content type are displayed.

For instance, the main content field of the web page shown below is an aggregate that shows teasers linking to all the players on Manchester United's first team:

The main content region gives a view of an aggregate of all the players on the first team.

Since the page is made up of global navigation (header region), local navigation (left sidebar) and an aggregate (main content region), there is no content to survey on the web page shown above.

The teaser showing an image of the player, his jersey number and his last name is a link, linking to the profile of the individual player. This means that this page also provides contextual navigation to help visitors navigate to individual player profiles. When you click through to an individual player profile, you see a page like this:

An individual player profile.

On this page the main content area holds an instance of a content type which we can name “player profile”. The information architect is now able to deconstruct the fields that constitute the player profile content type. They are:

Given name (text)
Team-class (taxonomy term)
Jersey number (integer)
Birthdate (date)
Birthplace (text)
Position (taxonomy term)
Joined United (date)
Joined From (taxonomy term)
International (taxonomy term)
United Debut (text)
Appearances (integer)
Goals Scored (integer)
Biography (long text)

The team-class (the taxonomy term “first team”) does not appear as a field in the player's profile, but it must be part of the player bundle to let the site builder create the view of all players of the team-class “first team” on the aggregate page shown above.

Note that four of the fields are of the data type “taxonomy term” rather than “text”. This is because the site builder may want to use these to create contextual navigation. We've already seen that by making “Team-class” a taxonomy term, the site builder is able to create a page to view all players on a team and provide contextual navigation to individual player profiles. By making “Position” a taxonomy term, the site builder can provide a page to view all players that play at a particular position in this sports club. By making “International” a taxonomy term, the site builder can provide a page to view all players of a particular nationality.

When deconstructing fields, always think about whether it will be useful to use a taxonomy term as data type for a text or image field.
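To see why a taxonomy term is more useful than free text here, consider how an aggregate page could be derived from the profiles. The following Python sketch groups profiles by the value of one term field. It only illustrates the idea and is not how Drupal builds views; the player names, the term value “reserves” and the function name are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical player profiles, reduced to the fields needed for this sketch.
players = [
    {"name": "Player A", "team_class": "first team", "position": "Goalkeeper"},
    {"name": "Player B", "team_class": "first team", "position": "Defender"},
    {"name": "Player C", "team_class": "reserves", "position": "Defender"},
]


def aggregate_by_term(profiles, term_field):
    """Group profile names by the value of one taxonomy term field."""
    groups = defaultdict(list)
    for profile in profiles:
        groups[profile[term_field]].append(profile["name"])
    return dict(groups)


by_team = aggregate_by_term(players, "team_class")
by_position = aggregate_by_term(players, "position")

print(by_team["first team"])
print(by_position["Defender"])
```

Because every profile uses the same controlled term (rather than free text that may be spelled in different ways), the grouping is reliable.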

Also, as an information architect, you should understand that the links to the right (“Order your De Gea shirt”, “Read exclusive player interviews”, “Browse player photo galleries”, “Download free united wallpapers”) and the Adidas advert are not part of the player profile. The links provide contextual navigation, and the advert is just decoration.

In the example above, it is obvious that the aggregate player page is not a content type because it lets visitors navigate to individual player profiles. So the main content area of the aggregate page is made up entirely of teasers that are used for contextual navigation. You've already learnt that neither a teaser nor navigation should be surveyed, so it should be obvious that there is no point in including this particular page in a content survey.

Now, look at a less clear cut case. Should the web page below, showing a list of fixtures and results, be surveyed?

Aggregate of fixtures and results.

The main content region lists four fixtures. There is a link associated with each fixture. It goes to the foundation page for match reports for a fixture in the past, and links to the online ticket office if the match is in the future. But neither of those links leads to something that can be regarded as a content type from the perspective of an information architect. So unlike the previous example, where following links took you to a player profile page which you could deconstruct as a content type, this page offers no similar link. I.e. there are no teasers on this web page.

However, you should recognize this page as an aggregate of an underlying content type, named “fixture”. Below is how an information architect would deconstruct this type.

Matchdate (date)
League (taxonomy term)
Home team (taxonomy term)
Away team (taxonomy term)
Venue (taxonomy term: Home, Away, Neutral)
Result (text)
Kickoff (time)
Live TV provider (taxonomy term)
Match report (link)
Buy ticket (link)

There are a few things to note here:

First, note that the information architect has chosen to use a taxonomy term to designate both the home team and the away team. By making this choice, the site builder can use the taxonomy term to pull the team's logo from a logo database, add the textual representation for the team's name, and render the field showing the logos and the text. This saves the content editor from having to create a juxtaposition of the logos and the text by writing HTML markup.

Secondly, note that two of the columns in the aggregate show different fields depending upon context (i.e. whether the fixture is in the past or in the future). For fixtures in the past, the fifth column shows the result and the last column links to match reports. For fixtures in the future, the fifth column shows the kickoff time, and the last column links to the ticket office.

These two notes explain why the information architect is able to deconstruct ten separate fields from seven columns.
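The context-dependent columns can be illustrated with a short Python sketch that renders one row of the aggregate from a fixture record. The class and function names are assumptions, the team names and links in the example are placeholders, and the order of the columns other than the fifth and the last is a guess made for this illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fixture:
    match_date: date
    league: str
    home_team: str
    away_team: str
    venue: str                          # "Home", "Away" or "Neutral"
    result: Optional[str] = None        # only known for past fixtures
    kickoff: Optional[str] = None       # only shown for future fixtures
    tv_provider: Optional[str] = None
    match_report: Optional[str] = None  # link, past fixtures
    buy_ticket: Optional[str] = None    # link, future fixtures


def aggregate_row(fixture: Fixture, today: date) -> list:
    """Build the seven columns of one row; two of them depend on context."""
    played = fixture.match_date < today
    fifth = fixture.result if played else fixture.kickoff
    last = fixture.match_report if played else fixture.buy_ticket
    return [
        fixture.match_date.isoformat(), fixture.league,
        fixture.home_team, fixture.away_team,
        fifth, fixture.tv_provider, last,
    ]


today = date(2018, 4, 1)
past = Fixture(date(2018, 3, 1), "League X", "Team A", "Team B", "Home",
               result="2-1", match_report="/reports/1")
future = Fixture(date(2018, 5, 1), "League X", "Team B", "Team A", "Away",
                 kickoff="15:00", buy_ticket="/tickets/2")

print(aggregate_row(past, today))
print(aggregate_row(future, today))
```

The record holds all ten fields, while each rendered row only ever shows seven values.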

Doing a content survey

The purpose of doing a content survey is to identify the content types that need to be supported on the website you are designing, redesigning or upgrading.

This content survey can be low fidelity or high fidelity. Both involve going on a mind-bogglingly detailed odyssey through the website and/or physical documents you are surveying. A low fidelity content survey only identifies the content types. A high fidelity content survey deconstructs the individual fields that constitute the content type's bundle.

The process of creating a content survey, in the case of a redesign project, is the relatively straightforward process of clicking through the legacy website and recording what you find in a simple spreadsheet.

If you are designing a new website, or if the legacy website is very sparse and does not have the content your client wants to put on the new website, your content survey needs to take a slightly different route:

Find one or more websites in the same genre as the website you are designing, and use that as the starting point for your content survey. This will usually give you some idea about the types of content that will be needed for the new website.
You may also survey physical content from the organisation you are designing a new website for: Paper newsletters, annual reports, white papers, brochures, product sheets, etc.

To do a content survey for a website, start at the home page. Identify the major sections of the site, and dip down into those sections. See what's linked from them. At first, you will probably just see subsite foundation pages and aggregates, and very few actual instances of content types (see the list above to learn how to identify “stuff” you may come across when you click through a website that are not instances of content types).

For each page that you visit where you are able to identify distinct instances of content types, make a note of the characteristics of the content type in the spreadsheet you use to record your content survey. Follow links and navigate through the website until you have a fairly good overview of what content types are used to populate the site.

Doing a complete walkthrough of a complete website is usually not practical (unless the website is very sparse). For the purposes of content surveying, find, examine, analyse and record a representative sample of the system's content (e.g. two of each kind). This is sometimes referred to as the “Noah's ark” approach. As for how many pages to analyse, and how much time to spend, there are no clear rules. You need to use some intuition and judgment, balancing the size of your sample against the time and budget constraints of the project.
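If the pages you have visited are recorded as you go, the “Noah's ark” sample can be expressed as a simple filter that keeps at most two pages of each kind. The Python sketch below assumes that each visited page has already been classified by hand; the function name and the example URLs are hypothetical.

```python
from collections import defaultdict


def noahs_ark_sample(pages, per_type=2):
    """Keep at most `per_type` pages of each content type, in visiting order."""
    kept = defaultdict(list)
    for url, content_type in pages:
        if len(kept[content_type]) < per_type:
            kept[content_type].append(url)
    return dict(kept)


# Hypothetical visiting log: (URL, content type identified by the surveyor).
visited = [
    ("https://example.org/about", "Basic page"),
    ("https://example.org/news/1", "Article"),
    ("https://example.org/news/2", "Article"),
    ("https://example.org/news/3", "Article"),
    ("https://example.org/contact", "Basic page"),
]

sample = noahs_ark_sample(visited)
print(sample)
```

The sample size is a parameter because, as noted above, how many pages to analyse is a matter of judgment.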

Link to a spreadsheet with content inventory templates:

MS Excel (17 Kbyte Excel file).

Low fidelity content survey

When an information architect is designing, redesigning or upgrading a website, he or she needs to tell the site builder what content types the site is going to need. A low fidelity content survey is a design document (typically a spreadsheet) that represents the first step towards defining the content types the website will require.

Example spreadsheet for a low fidelity content survey:

Spreadsheet for a low fidelity content survey.

Here's a description of the things to consider putting in the inventory when you create a content survey:

Page title: The content you are evaluating needs to be called something. If it is meaningful, you usually just use the title of the HTML document (from the <title> tag inside the <head>). If that's not specific enough, then use the page headline (usually inside an <h1> tag) from the content. If neither is meaningful, you create a new title (and make a note in Comments). Make sure the title you put in the content inventory is unique and descriptive.
Content type: What unique content bundle of fields does the page use, or which should it use? Is it a product page, or a legal brief, or a press release? Every site will have different types of field bundles, but most have fewer than a couple dozen. Give the content type a meaningful name, to the best of your ability.
Comments: Anything else you want the site builder to know about this content type.
Source: If you are inventorying a publicly available web page, record the URL of the piece of content you're looking at. This lets you and anyone reviewing your inventory get directly back to the source web page from the spreadsheet. If the web page is not publicly available, your design document needs to show what you've inventoried by means of a screenshot. If you are inventorying a physical document, write down the source in plain English (e.g. “ABC Newsletter, 2018-03-09, pp. 42-43”). To make the document available to a reviewer, include a facsimile of it in the design documentation.
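The four columns described above map directly onto a CSV file that any spreadsheet program can open. The following Python sketch shows one way to record such rows, using only the standard library. The function names and the example row are made up for illustration, and the title rule is simplified: the code can only check whether a title is present, while judging whether it is meaningful remains a human task.

```python
import csv
import io

COLUMNS = ["Page title", "Content type", "Comments", "Source"]


def pick_title(html_title, headline, fallback):
    """Prefer the <title> text, then the headline, then a title you create.

    Returns the title and a note for the Comments column.
    """
    for candidate in (html_title, headline):
        if candidate and candidate.strip():
            return candidate.strip(), ""
    return fallback, "Title created by surveyor"


def survey_to_csv(rows):
    """Serialise survey rows (dicts keyed by COLUMNS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example: the page has no usable <title>, but it has a headline.
title, note = pick_title("", "About the organisation", "Untitled page 1")
rows = [{
    "Page title": title,
    "Content type": "Basic page",
    "Comments": note,
    "Source": "https://example.org/about",
}]
csv_text = survey_to_csv(rows)
print(csv_text)
```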

High fidelity content survey

A high fidelity content survey deconstructs the fields that make up the bundle of each content type surveyed.

Example spreadsheet for a high fidelity content survey:

Spreadsheet for a high fidelity content survey.

In row 1, record:
- Source: If you are inventorying a publicly available web page, record the URL of the piece of content you're looking at. This lets you and anyone reviewing your inventory get directly back to the source web page from the spreadsheet. If the web page is not publicly available, your design document needs to show what you've inventoried by means of a screenshot. If you are inventorying a physical document, write down the source in plain English (e.g. “ABC Newsletter, 2018-03-09, pp. 42-43”). To make the document available to a reviewer, include a facsimile of it in the design documentation.
In row 2, record the headings for the columns below row 2.
In row 3 and below, record your content survey data. Put the content type in the first column, and enumerate its fields in the second column. To improve readability, separate one content type from the next with an empty row. Record at least the following data:
- Content type: What unique content bundle of fields does the page use, or which should it use? Is it a product page, or a legal brief, or a press release? Every site will have different types of field bundles, but most have fewer than a couple dozen. Give the content type a meaningful name, to the best of your ability.
- Fields: A content type is made up of fields. Record all the fields you think go into this content type's bundle.
- Datatype: A field is a container for a single data type in the bundle. Write down the data type you think the field in the same row has.
- Comments: Anything else you want the site builder to know about this content type.
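The row layout described above (source in row 1, headings in row 2, data from row 3, with an empty row between content types) can be generated programmatically. The Python sketch below is one possible way to do this and is not a reproduction of the template: the function names and the source URL are assumptions, and the example uses a few of the fields deconstructed earlier in this chapter.

```python
import csv
import io

HEADINGS = ["Content type", "Fields", "Datatype", "Comments"]


def high_fidelity_rows(source, content_types):
    """Lay out a high fidelity survey as a list of spreadsheet rows."""
    rows = [[f"Source: {source}"], HEADINGS]
    for index, (type_name, fields) in enumerate(content_types.items()):
        if index > 0:
            rows.append([])  # empty row between two content types
        for position, (field_name, data_type, comment) in enumerate(fields):
            # Only the first field row repeats the name of the content type.
            first = type_name if position == 0 else ""
            rows.append([first, field_name, data_type, comment])
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


survey = {
    "Player profile": [
        ("Given name", "text", ""),
        ("Jersey number", "integer", ""),
    ],
    "Fixture": [
        ("Matchdate", "date", ""),
    ],
}

# The source URL is a placeholder.
sheet = high_fidelity_rows("https://example.org/", survey)
print(rows_to_csv(sheet))
```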

Content audit

A content audit is only done when some website already exists. The purpose is threefold:

To have a complete inventory of all existing content to make sure that all useful content is migrated.
To prune out from migration content that is redundant, outdated or trivial (ROT-analysis).
To refactor the migrated content, to improve the information architecture of the website.

Similar to a content survey, a content audit is conducted by inventorying content in a spreadsheet or database.

ROT-analysis

Identifying ROT is an essential part of a content audit. It helps spot obvious content problems. When creating the spreadsheet to identify page titles, links, content types, keywords and other facts about your content (nodes), add a column for ROT in your content audit spreadsheet or database.

Examples of ROT include:

Lower-level pages that repeat content that is already presented on welcome and overview pages.
Outdated news or events that are represented as upcoming, but are in the past.
Useless information and unrelated links.
Broken links and missing content.
Mislabeled headers and page titles.
Missing or duplicate meta page descriptions and keywords.
Outdated contact information.

Fix and prune ROT before migrating the website.

Source: MeetContent.com.
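Some of the items in the list above can be flagged mechanically once the inventory exists as structured data. The Python sketch below fills in a ROT column for three simple checks. It is a rough illustration only: the dictionary keys, the function name and the example pages are hypothetical, a duplicate title is used as a crude stand-in for repeated content, and the remaining checks still require human review.

```python
from collections import Counter
from datetime import date


def flag_rot(pages, today):
    """Return copies of the page records with a 'rot' column filled in."""
    title_counts = Counter(page["title"] for page in pages)
    flagged = []
    for page in pages:
        reasons = []
        if title_counts[page["title"]] > 1:
            reasons.append("duplicate title")
        event_date = page.get("event_date")
        if page.get("listed_as_upcoming") and event_date and event_date < today:
            reasons.append("outdated event")
        if not page.get("meta_description"):
            reasons.append("missing meta description")
        flagged.append({**page, "rot": "; ".join(reasons)})
    return flagged


# Hypothetical inventory rows.
inventory = [
    {"title": "Welcome", "meta_description": "Front page"},
    {"title": "Welcome", "meta_description": "Old front page"},
    {"title": "Spring workshop", "meta_description": "",
     "listed_as_upcoming": True, "event_date": date(2018, 3, 9)},
]

audited = flag_rot(inventory, today=date(2023, 1, 15))
for row in audited:
    print(row["title"], "|", row["rot"])
```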

Automatic tools

There are automatic tools that will crawl a website and report the URL of every page that makes up the site. See, for example:

Arthur de Jong: webcheck
Tilman Hausherr: Xenu's Link Sleuth
Leandro H. Fernández: DRKSpider
Screaming Frog: SEO Spider

This is usually what you want to start with if you are conducting a content audit. Using an automatic crawler will ensure that you don't miss any pages. After the site has been crawled, transfer the result to a spreadsheet or database for analysis.

You may also use one of these tools to generate a list of URLs and use that as a starting point for the content survey, rather than embarking on a manual odyssey of the website. However, as you've already discovered, not all pages on a website are instances of content types. To translate the result of an automated crawl into a content survey, you must go through the pages found by the automated tool by hand, and identify those that contain instances of content types.
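To give an idea of what such a crawler does, here is a minimal breadth-first sketch in Python using only the standard library. It is not how any of the tools listed above are implemented. The `fetch` parameter is an assumption introduced to keep the example self-contained: it is any function that returns the HTML for a URL (or None if the page is missing), and the example uses a small in-memory site instead of real network requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, limit=500):
    """Return the URLs of all pages on the same host reachable from start_url."""
    host = urlparse(start_url).netloc
    seen, queue, found = {start_url}, deque([start_url]), []
    while queue and len(found) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # broken link: nothing to record or parse
            continue
        found.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return found


# Hypothetical in-memory site standing in for real HTTP requests.
site = {
    "https://example.org/": '<a href="/about">About</a> <a href="https://other.example/">Elsewhere</a>',
    "https://example.org/about": '<a href="/">Home</a> <a href="#top">Top</a> <a href="/missing">Gone</a>',
}

urls = crawl("https://example.org/", site.get)
print(urls)
```

The resulting list of URLs is only the starting point. Each page must still be inspected by hand to decide whether it contains an instance of a content type.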

Final word

Surveying content is a human task. In fact, you will find that the process can often be as valuable as the final spreadsheet. If you invest the time in analyzing the website and deconstructing each page into the content inventory (or at least an inventory of a representative selection of pages, i.e. the “Noah's ark” approach), you will gain invaluable insight into how it all goes together. That's important knowledge to possess when designing, redesigning or upgrading a website.

Last update: 2023-01-15 [gh].