Search engines

by Gisle Hannemyr

This chapter shows some basic organic SEO techniques, and outlines some other interactions between the webmaster and search engines.

Drupal projects discussed in this chapter: Node noindex.

Introduction

To use most of the operations below, you must already have registered a Google account and registered the URLs you want to manage with Google Webmaster Tools. This is a Google-provided cloud service that lets you analyze several aspects of your website.

It provides some of the functionality of Google Analytics, but extracts its data from search, not from embedded JavaScript tracking.

For example, it lets you examine search queries, crawl errors, and the index status of your pages.

See also: Top 6 SEO modules for Drupal 8, Google structured data testing tool, SEO basics.

Organic SEO

[This section is from 2014, and could use some love.]

Site map

Generate a sitemap and upload it to the root directory of the site. Use Google Webmaster Tools to inform Google about its existence.

Submitting a sitemap does not make Google crawl your site immediately.
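
For reference, here is a minimal sitemap following the sitemaps.org protocol (the URL and the values are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2018-04-21</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>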

Metadata

You should use a unique and relevant title, and a unique and relevant description meta tag, on every page you want indexed.

The page title is the single most important on-page SEO factor. It's rare to rank highly for a primary term (2-3 words) without that term being part of the page title.

The meta description tag won't help you rank, but it will often appear as the text snippet below your listing, so it should include the relevant keyword(s) and be written so as to encourage searchers to click on your listing.

<title>Website title</title>
<meta name="description" content="This text may be used by Google to generate
the snippet displayed in the SERP.">

You may ignore the keywords meta tag, as no major search engine today cares about it.

Directories

Get some foundational links by submitting your site to respected directories, but don't waste time worrying about directory submissions: submit and forget.

The canonical directory, DMOZ, shut down on March 17, 2017. Here are some alternatives.

Seek links from authority sites in your industry. If local search matters to you (more on that coming up), seek links from trusted sites in your geographic area: the Chamber of Commerce, local business directories, etc. Analyze the inbound links to your competitors to find links you can acquire, too.

Create great content on a consistent basis and use social media to build awareness and links.

Remove or alter a page indexed by Google

When Google has indexed a page, it sticks around for some time.

To remove or alter a page indexed by Google, visit Google Webmaster Tools and sign in. You should now see the “Search Console” and a list of “properties” (websites that you've registered with Google).

Fetch as Google

To check how Google perceives a web page, proceed as follows:

  1. Click on the URL of the resource you want to manage.
  2. Expand Crawl » Fetch as Google. You will now see a list of previous fetch requests and their timestamps. These correspond to the copies of the page held in Google's cache.
  3. Fill in the URL of the page you want to examine, and press “Fetch and render”. This makes Google fetch the page again, which only takes a few minutes.
  4. Press “Request indexing”, confirm that you are not a robot, and select “Crawl only this URL”. The request will appear in the list, along with its timestamp.

The result of a request does not immediately affect the SERP (2 hours was not enough), but I observed a changed SERP after five hours.

Force recrawling

To try to force recrawling of a page after changing content, use this URL: www.google.com/webmasters/tools/removals.

Google will ask you to provide a word that no longer appears on the page. However, it does not

Request made 2018-05-09 at 06:35; still pending at 08:13.

Temporarily hide page

To temporarily hide a page that is part of a site you've registered with Google Webmaster Tools, use the following procedure while signed in:

  1. Click on the URL of the resource you want to manage.
  2. Expand Google Index » Remove URLs.
  3. Expand Temporarily hide and enter the URL on your site that you'd like to hide.

A URL linking directly to the temporary URL removal tool is: www.google.com/webmasters/tools/url-removal.

This is only a temporary measure and will expire after a short time (about 7 days). However, the page is not automatically recrawled, and if it is still online, it may reappear with its old content.

Permanently hide page

To permanently remove the page from Google's search index, delete the page, put it behind a login wall, or put the robots meta tag in its <head>, setting it to noindex:

<meta name="robots" content="noindex" />

However, Drupal 7 core automatically adds the canonical relation to every page. Example:

<link rel="canonical" href="/alias" />

Experiments show that the canonical relation makes Google ignore the instruction not to index the page; see “Do Not Noindex Pages With Rel Canonical Tags”.

To prevent the page from being indexed, you may use the Drupal 7 Node noindex project. It provides a checkbox where the administrator can request noindex for any node. Checking the box will insert an appropriate robots meta tag and remove the canonical relation from the markup.
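
For illustration, here is a minimal sketch of how a Drupal 7 module can do the same thing programmatically, using hook_html_head_alter(). The module name “mymodule” is hypothetical, and this is not the actual Node noindex code:

<?php
/**
 * Implements hook_html_head_alter().
 *
 * Adds a robots noindex meta tag to the page, and removes the
 * canonical link relation that Drupal 7 core adds.
 */
function mymodule_html_head_alter(&$head_elements) {
  // Insert <meta name="robots" content="noindex" />.
  $head_elements['mymodule_noindex'] = array(
    '#type' => 'html_tag',
    '#tag' => 'meta',
    '#attributes' => array(
      'name' => 'robots',
      'content' => 'noindex',
    ),
  );
  // Core adds the canonical link with a key of this form; drop it
  // so that Google does not ignore the noindex instruction.
  foreach (array_keys($head_elements) as $key) {
    if (strpos($key, 'drupal_add_html_head_link:canonical') === 0) {
      unset($head_elements[$key]);
    }
  }
}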

Note: Do not use robots.txt to prevent prominent pages from being indexed. You may use robots.txt to prevent crawling of all or parts of your site, but search engines will still index a page if it is linked to from somewhere else. In fact, robots.txt may hide the robots meta tag (described above) from robots, which means it will be ignored.
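
For example (the path is hypothetical), with the following lines in robots.txt, obedient crawlers will never fetch pages under /private/, so a robots noindex meta tag on those pages is never seen, and the pages may still be indexed if other sites link to them:

User-agent: *
Disallow: /private/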

Identify and block bad robots

Drupal and WordPress projects:

Blackhole by Jeff Starr (aka Perishable)

Blackhole is a PHP script that automatically traps and blocks bad bots that do not obey robots.txt rules. It is available as a WordPress plugin and as a standalone PHP library. This note discusses setting up the standalone library and its adaptation to Drupal.

Its license is “GPLv2 or later”, so it is compatible with Drupal.

To download the .zip archive containing the library, you may have to use the Chrome or Iridium browser.

The blackhole lives in a subdirectory of your site. In the original version, this must be named blackhole. In the Drupal version, the name can be set by the user.

The subdirectory contains the following two files: index.php (the trap script) and blackhole.dat (the log of trapped bots).

The principle is this: when a bad bot loads index.php in the blackhole subdirectory, it gets logged. The log is then used to ban it on subsequent visits.
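
Here is a simplified sketch of that principle. This is not the actual Blackhole code, and the assumption that blackhole.dat stores one banned IP address per line is mine:

<?php
// Simplified sketch of the blackhole principle.
$dat = __DIR__ . '/blackhole.dat';
$ip  = $_SERVER['REMOTE_ADDR'];
$banned = file_exists($dat) ? file($dat, FILE_IGNORE_NEW_LINES) : array();

// Ban check: runs on every page that includes this script.
if (in_array($ip, $banned)) {
  header('HTTP/1.1 403 Forbidden');
  exit('Access denied.');
}

// The trap: log the IP, but only when the trap URL itself is
// requested, not when this script is included by a regular page.
if (strpos($_SERVER['REQUEST_URI'], '/blackhole/') === 0) {
  file_put_contents($dat, $ip . "\n", FILE_APPEND | LOCK_EX);
}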

Here are the usage instructions (adapted from the original). In these instructions, the subdirectory name “blackhole” is used. You should change this to your own subdirectory name.

  1. Add the blackhole subdirectory to the root directory of your site and copy the two files into it.
  2. Open index.php in the blackhole subdirectory and edit the variables in the "EDIT HERE" section. You may alter the name of the blackhole subdirectory, and the settings for email alerts about bad bots.
  3. Change file permissions for blackhole.dat to make it writable by the server.
  4. Add a line similar to the one below to the beginning of all pages. This will ban bad bots using PHP. Leave out this line to just log, not ban.
<?php include(realpath(getenv('DOCUMENT_ROOT')) . '/blackhole/index.php'); ?>
  5. Add a line similar to this to the footer of some or all pages. It will give the bad robot an invisible link to follow (there are other methods for communicating this URL to a bad robot).

Do not go <a href="/blackhole/" rel="nofollow" style="display:none;">here!</a>
  6. Add these lines to your site's robots.txt file:
User-agent: *
Disallow: /blackhole/

Testing:

  1. Test your robots.txt for proper syntax (for example, you can use the robots checker in Google Webmaster Tools).
  2. Visit the link from step 5, and then try visiting other pages on your site.

Tip: To reset the Blackhole list, clear the contents of the blackhole.dat file.

The function blackhole_whitelist() contains a hardwired whitelist. Default: aolbuild, baidu, bingbot, bingpreview, msnbot, duckduckgo, adsbot-google, googlebot, mediapartners-google, teoma, slurp, yandex. It is disabled for now, but may be reintroduced as a user setting in a production version.
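
The check itself may look something like this sketch (the actual implementation may differ):

<?php
// Sketch of a whitelist check: return TRUE if the user agent
// matches one of the hardwired good bots, so it is never banned.
function blackhole_whitelist() {
  $whitelist = array(
    'aolbuild', 'baidu', 'bingbot', 'bingpreview', 'msnbot',
    'duckduckgo', 'adsbot-google', 'googlebot',
    'mediapartners-google', 'teoma', 'slurp', 'yandex',
  );
  $ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
  foreach ($whitelist as $bot) {
    if (strpos($ua, $bot) !== FALSE) {
      return TRUE;
    }
  }
  return FALSE;
}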

My adaptations:

  1. Deleted MacOS junk from the archive.
  2. Made the blackhole subdirectory name configurable by the user.
  3. Disabled whitelist.

To do:

  1. Determine whether the nofollow relation is useful.
  2. Fix email alerts.

Final word

[TBA]


Last update: 2018-04-21 [gh].