Character sets
This chapter contains miscellaneous notes about locales and character sets, including how to set up utf-8 as the default character set for Apache and how to set up utf8mb4 for Drupal 7. It also has a section about troubleshooting character set problems.
Table of contents
- Introduction
- Apache
- Locale
- Setting up a new Drupal site with utf8mb4
- Converting an existing Drupal site
- Troubleshooting
- Final word
Drupal projects discussed in this chapter: UTF8MB4 Convert.
Introduction
The MySQL utf8mb4 encoding (4-byte UTF-8 Unicode encoding) is a superset of the MySQL utf8 encoding. It is the recommended character set for Drupal 7.50 and later. See Drupal.org: documentation on adding 4 byte UTF-8 support.
It is not the same as UTF-32, which always uses 4 bytes to store a character. The MySQL utf8mb4 encoding will use between 1 and 4 bytes, depending on the character. Here is a simplified summary:
- 1 byte: ASCII characters.
- 2 bytes: ISO-8859-X – Western European accented characters, Greek, Hebrew, etc.
- 3 bytes: Japanese, Korean and most Chinese characters.
- 4 bytes: The rest, including some Chinese characters and emojis.
For the details, see diagnosing charset issues.
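A quick way to see this in practice, assuming a terminal that uses UTF-8, is to count the bytes a single character occupies with wc:
$ echo -n "A" | wc -c
1
$ echo -n "é" | wc -c
2
$ echo -n "漢" | wc -c
3
$ echo -n "😁" | wc -c
4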
Apache
The default character set for all content served by PHP is handled correctly by Drupal. However, when a response content-type is text/plain or text/html (e.g. a README.txt file served directly by Apache), the default character set depends on the browser settings. To set an explicit default character set of “utf-8” in the .conf file for the host, use the following directive:
AddDefaultCharset utf-8
However, the exact behavior may still depend on the user's browser configuration. The following setting disables this functionality:
AddDefaultCharset Off
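For example, the directive may be placed in a virtual host configuration like this (the server name and paths below are just placeholders):
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
    # Serve text/plain and text/html responses without an explicit charset as utf-8
    AddDefaultCharset utf-8
</VirtualHost>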
Locale
In the context of web servers, a locale is a set of parameters that defines the user's language, country and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language identifier and a region identifier.
On the Unix family of operating systems, the format of locale identifiers is similar to IETF language tags, but the locale variant modifier is defined differently, and the character set is included as a part of the identifier. It is defined in this format:
[language[_territory][.codeset][@modifier]]
For example, Norwegian (bokmål) using the UTF-8 encoding is nb_NO.utf8.
The PHP scripting language has a function (setlocale) that is used to set the locale. If successful, it returns the new current locale, or FALSE if the locale functionality is not implemented on your platform, the specified locale does not exist or the category name is invalid.
Here are some typical calls. The comment at the end of each line is the locale we expect to be set.
$loc = setlocale(LC_ALL, "nb_NO.utf8");     // UTF-8
$loc = setlocale(LC_ALL, "nb_NO.iso88591"); // ISO-8859-1
$loc = setlocale(LC_ALL, "el_GR");          // ISO-8859-7
$loc = setlocale(LC_ALL, "pl_PL");          // ISO-8859-2
$loc = setlocale(LC_ALL, "pl_PL@euro");     // ISO-8859-15
The last example illustrates the use of the modifier @euro to enforce usage of the ISO-8859-15 character set, which includes a character for the € currency sign, rather than the ISO-8859-2 that is normally used for Eastern European languages written in the Latin script.
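To check from the command line what a given locale string resolves to, PHP can be invoked directly (this assumes the nb_NO.utf8 locale is installed, see below; if it is missing, the output is bool(false)):
$ php -r 'var_dump(setlocale(LC_ALL, "nb_NO.utf8"));'
string(10) "nb_NO.utf8"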
To see the locales currently installed, do:
$ locale -a
If the locale you need to use is not installed, it can be added like this (source: SO: How do I get a locale recognized?):
$ sudo dpkg-reconfigure locales
I then get to pick a locale from a long list using the spacebar, and it now appears in the OS:
$ locale -a
…
nb_NO.utf8
Setting up a new Drupal site with utf8mb4
To have Drupal 7 use utf8mb4, you may first need to change the database configuration. To learn how to do this, see the database documentation.
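As a sketch, on MySQL 5.5 and 5.6 this boils down to settings along these lines in my.cnf (on MySQL 5.7 and later these are already the defaults; check the database documentation mentioned above for your version):
[mysqld]
innodb_large_prefix = true
innodb_file_format = barracuda
innodb_file_per_table = true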
If the database is configured correctly, you need to alter the database connection array (in settings.php) by adding the following keys and values (provided you want the collation used in Denmark and Norway):
$databases['default']['default'] = array(
  …
  'charset' => 'utf8mb4',
  'collation' => 'utf8mb4_danish_ci',
);
If you want to use a more general collation (this will collate "å" with "a"), the following may be used: utf8mb4_general_ci.
At present (2017-07-22) drush does not allow the charset and collation to be set. This means that when installing a new site with drush, you get the old defaults. On bar, I hack this by having a per-site prepared version of settings.php sitting in /home/gisle/configfiles/. The clean installation script copies it to the new site before the drush install script is run. Drush will not overwrite it.
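As a sketch, the relevant part of such a clean installation script might look like this (the site path and database URL are placeholders):
# Copy the prepared settings.php (with the utf8mb4 keys) into place first.
cp /home/gisle/configfiles/settings.php /var/www/example/sites/default/settings.php
# Drush keeps the existing settings.php, so the charset and collation survive.
drush site-install standard --db-url=mysql://user:password@localhost/example -y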
Converting an existing Drupal site
If the SQL database has not been configured to enable the utf8mb4 character set, the Drupal status report shows the following advisory:
4 byte UTF-8 for mysql is disabled.
This means you need to change the database configuration first. To learn how to do this, see the database documentation.
The following advisory is shown if the database has been configured to enable the utf8mb4 character set:
4 byte UTF-8 for mysql is not activated, but it is supported on your system. It is recommended that you enable this to allow 4-byte UTF-8 input such as emojis, Asian symbols and mathematical symbols to be stored correctly.
In both cases the advisory also says:
See the documentation on adding 4 byte UTF-8 support for more information.
After making sure that the utf8mb4 character set is supported, you may convert the existing database to utf8mb4. To do this you may want to install the Drush script UTF8MB4 Convert by typing:
$ drush @none dl utf8mb4_convert-7.x
Project utf8mb4_convert (7.x-1.0) downloaded to …
Project utf8mb4_convert contains 0 modules: .
The downloaded file is stored in the directory “~/.drush/utf8mb4_convert/”.
Then navigate below the webroot of the site you want to convert, and type:
$ drush utf8mb4-convert-databases --collation=utf8mb4_danish_ci
Note: The script is not fast, so put a production site in maintenance mode before starting it.
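For a Drupal 7 site, maintenance mode can be toggled with drush like this:
$ drush vset maintenance_mode 1 --yes   # before the conversion
$ drush vset maintenance_mode 0 --yes   # after the conversion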
Note 2: On Ubuntu 16.04 (PHP 7.0.33) it seems to crash when converting the tables for the Search module. Uninstall it before converting, then reinstall it afterwards. This is not a problem on Ubuntu 20.04.
[As far as I am able to tell, rerunning the script on an already converted database (e.g. to change the collation, but not the charset) does not break anything, but this has not been extensively tested.]
Finally, alter the database connection array (in settings.php) as described in the section about setting up a new site.
Troubleshooting
Below is what I've found to be best practices for converting files and databases.
The HTTP response header
Make sure that the correct charset is indicated in the HTTP response header. The charset of the page is usually indicated like this:
Content-Type: text/html; charset=utf-8
To examine the HTTP response header in Firefox, open the Network Monitor (Ctrl+Shift+E), reload the page, and click on the row of the page loaded to select it. On the right side of the screen, select the “Headers” tab. There are also online tools that let you examine the HTTP response header, for example: webconfs.com.
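From the command line, the same check can be done with curl (the URL is a placeholder):
$ curl -sI https://www.example.com/ | grep -i content-type
Content-Type: text/html; charset=utf-8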
There is usually no need to do this, but just in case: Setting the HTTP charset parameter.
Converting HTML files with emacs
When non-ASCII characters in an HTML file show up as single question marks on a black diamond, or are replaced by two or more non-ASCII characters, the web server is confused about the character set used. Here is an example:
Wilhelm Röntgen
Wilhelm R�ntgen
Wilhelm RÃ¶ntgen
The first line in the example above shows the German name of the discoverer of X-rays correctly.
The second line is what you will see if iso-8859-1-encoded text is rendered as utf-8.
The third line is what you will see if utf-8-encoded text is rendered as iso-8859-1.
You may use this command to check the character set encoding of one or more files:
$ file -i example.html
example.html: text/html; charset=iso-8859-1
If the character set attribute is set correctly, but the rendering is wrong because the web server ignores the attribute and is configured to always serve utf-8 (this is the default on my sites), the quickest fix is to edit the file with emacs and change the charset attribute of the file to utf-8.
I.e. change the following line in the header from:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
To:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
And save the file.
Then emacs will automagically also recode the file's characters to use utf-8 encoding.
Converting text files with iconv
The CLI utility iconv is used to convert between different character encodings, based upon Unicode conversion.
Below are three examples of using it. The first converts from Windows-1252 (aka CP-1252) encoding to UTF-8. The second converts from UTF-8 to plain US-ASCII, transliterating the characters that do not exist in US-ASCII. The third converts from ISO-8859-1 (aka Latin1) to UTF-8, ignoring all errors.
$ iconv -f WINDOWS-1252 -t UTF-8 cp1252.txt > utf8.txt
$ iconv -f UTF-8 -t US-ASCII//TRANSLIT utf8.txt > ascii.txt
$ iconv -f ISO-8859-1 -t UTF-8//IGNORE latin1.txt > utf8.txt
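To convert a whole directory of files in one go, a small shell loop may be used (a sketch; adjust the source encoding and the file pattern to your situation):
$ for f in *.html; do
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
  done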
See also: AtomicObject.com.
Database configuration
Modern databases such as MySQL can handle multiple character sets and have powerful capabilities to convert between them. However, this may create a great deal of confusion, as it is not always obvious how MySQL handles this, leading to the dreaded “double conversion” where perfectly valid multi-byte utf-8 encoded text is converted byte by byte from “latin-1” to utf-8. The best way out of this madness is to use utf-8 on all levels.
Here are some commands that may help figure out what is going on. They can be entered through the SQL tab in phpMyAdmin or the mysql shell interface.
To find out what character sets are set in your configuration, do:
mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
You don't want to see latin1 mentioned in there. If you do, check the configuration. You can change the value of character_set_database with:
alter database DBNAME charset=utf8;
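The corresponding collation variables can be inspected the same way (the values shown are just an example):
mysql> show variables like 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_danish_ci  |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)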
Importing legacy SQL data
Importing legacy data into a database that uses some version of utf8 may lead to encoding hell. The reason is that MySQL:
- has a default per schema (database) charset and collation
- has a per table charset and collation
- has a per text column charset and collation
To see what the character sets actually are at each level, read this answer at SE, or use queries like the ones shown below.
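As a sketch, the per-table and per-column settings can be listed with queries like these (replace DBNAME with the name of your database):
SELECT table_name, table_collation
FROM information_schema.tables WHERE table_schema = 'DBNAME';

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'DBNAME' AND character_set_name IS NOT NULL;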
If any of these are out of sync with what is actually stored in the database, things tend to go very wrong.
Use the CLI mysql client, custom PHP programs or CLI tools like pep, enca, od or hd for examining the data. Avoid using phpMyAdmin or any elaborate client for this, since these tools may be clever enough to hide the problem.
Trying to fix these at the source may make matters worse, as explained in Whitesmith.co: MySQL encoding hell.
First, make sure that the schema character set is utf-8.
To see default for your schemas (databases):
SELECT * FROM information_schema.SCHEMATA;
To determine how non-ASCII characters actually are encoded in the SQL dump, pep is handy:
$ pep -x -b dump.sql > dump2.txt
Examine the dump to determine the character set used. The table below shows the hex values that will show up in the expanded output for some punctuation characters, two special characters (the no-break space and the soft hyphen), the three extra Norwegian characters and one emoji in four different encodings.
Ch | latin1 | windows-1252 | utf8 | utf8mb4 |
---|---|---|---|---|
… | NA | 0x85 | - | 0xe2 0x80 0xa6 |
‘ | NA | 0x91 | - | 0xe2 0x80 0x98 |
’ | NA | 0x92 | - | 0xe2 0x80 0x99 |
“ | NA | 0x93 | - | 0xe2 0x80 0x9c |
” | NA | 0x94 | - | 0xe2 0x80 0x9d |
– | NA | 0x96 | 0xe2 0x80 0x93 | 0xe2 0x80 0x93 |
nbsp | 0xa0 | 0xa0 | 0xc2 0xa0 | 0xc2 0xa0 |
shy | 0xad | 0xad | 0xc2 0xad | 0xc2 0xad |
æ | 0xe6 | 0xe6 | 0xc3 0xa6 | 0xc3 0xa6 |
ø | 0xf8 | 0xf8 | 0xc3 0xb8 | 0xc3 0xb8 |
å | 0xe5 | 0xe5 | 0xc3 0xa5 | 0xc3 0xa5 |
😁 | NA | NA | NA | 0xf0 0x9f 0x98 0x81 |
If pep is not available, od can be used:
$ od -c dump.sql
If the dump contains 4-byte values for Norwegian and Western European accented letters, the problem is probably that the data is utf8, but mysqldump thinks it is latin1 and converts it. To fix this, run mysqldump with the following two flags: --skip-set-charset and --default-character-set=latin1.
This prevents reconversion and setting a charset when creating the dump:
$ mysqldump -u username -p \
    --skip-set-charset --default-character-set=latin1 \
    database > dump.sql
Then remove and replace the erroneous information in the dump. I use the following sed command:
sed -e "s;CHARSET=latin1;CHARSET=utf8;g" \ -e "s; COLLATE latin1_danish_ci ; ;g" \ -e "s;latin1_danish_ci;utf8_danish_ci;g" < dump.sql > fixeddump.sql
The file fixeddump.sql should now import correctly.
Useful links:
- Table: UTF-8 encoding table and Unicode characters
- How to convert a MySQL database to UTF-8 encoding
- SO: Change MySQL default character set to UTF-8 in my.cnf
If you change the configuration of mysql, you need to restart it. The command to restart mysql on Ubuntu is:
$ sudo /etc/init.d/mysql restart
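On Ubuntu releases that use systemd, this should also work:
$ sudo systemctl restart mysql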
Illegal mix of collations
If you get the following PDOException:
PDOException: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8mb4_danish_ci,IMPLICIT) for operation '=':
you first need to identify the table causing the problem.
If you want to sort using utf8mb4_danish_ci, you first need to identify the tables that have been set up with utf8mb4_general_ci. The following query will do that:
SELECT table_schema, table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE collation_name='utf8mb4_general_ci' AND table_schema='database'
ORDER BY table_name;
Then change the collation to the one you want:
USE database;
ALTER TABLE table CONVERT TO CHARACTER SET 'utf8mb4' COLLATE 'utf8mb4_danish_ci';
You always need to include the CONVERT TO CHARACTER SET clause, even if conversion is not required.
Source: interworks.com.
Sending utf-8 email
The following headers need to be present to send utf-8 email:
Content-Type: text/plain; charset="UTF-8";
Content-Transfer-Encoding: 8Bit
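As a minimal sketch in PHP, using the built-in mail() function with these headers (the addresses and the message text are placeholders, and a working local MTA is assumed):
<?php
// Build headers for an utf-8 encoded plain text message.
$headers  = "MIME-Version: 1.0\r\n";
$headers .= "Content-Type: text/plain; charset=\"UTF-8\";\r\n";
$headers .= "Content-Transfer-Encoding: 8Bit\r\n";

// Non-ASCII subjects must be RFC 2047 encoded separately.
$subject = "=?UTF-8?B?" . base64_encode("Prøvemelding") . "?=";

mail("user@example.com", $subject, "Hei på deg 😁", $headers);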
Final word
[TBA]
Last update: 2018-08-08 [gh].