Character sets

This chapter contains miscellaneous notes about locales and character sets, including how to set up utf-8 as the default character set for Apache and how to set up utf8mb4 for Drupal 7. It also has a section about troubleshooting character set problems.

Drupal projects discussed in this chapter: UTF8MB4 Convert.

Introduction

The MySQL utf8mb4 encoding (4-Byte UTF-8 Unicode Encoding) is a superset of the MySQL utf8 encoding. It is the recommended character set for Drupal 7.50 and later. See Drupal.org: documentation on adding 4 byte UTF-8 support.

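It is not the same as UTF-32, which always uses 4 bytes to store a character. The MySQL utf8mb4 encoding uses between 1 and 4 bytes, depending on the character. Here is a simplified summary: ASCII characters take 1 byte; most accented Latin letters and most Greek, Cyrillic, Hebrew and Arabic letters take 2 bytes; the remaining characters of the Basic Multilingual Plane (including most Chinese, Japanese and Korean characters) take 3 bytes; characters outside it, such as emoji, take 4 bytes.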

For the details, see diagnosing charset issues.
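
To make the byte counts concrete, here is a minimal Python sketch (not part of the original notes) that compares the storage needed for a few sample characters in UTF-8 and UTF-32. It assumes that Python's "utf-8" codec, which covers the full 1 to 4 byte range, is a fair stand-in for MySQL's utf8mb4. The helper names utf8_len and utf32_len are made up for this illustration.

```python
# Compare how many bytes a character occupies in UTF-8 and in UTF-32.
# Assumption: Python's "utf-8" codec stands in for MySQL utf8mb4 here.

def utf8_len(ch: str) -> int:
    """Number of bytes used to store ch in UTF-8 (1 to 4)."""
    return len(ch.encode("utf-8"))


def utf32_len(ch: str) -> int:
    """Number of bytes used to store ch in UTF-32 without a byte order mark."""
    return len(ch.encode("utf-32-le"))


if __name__ == "__main__":
    # a, a-ring, euro sign, grinning-face emoji; ascii() keeps the output
    # printable on any terminal.
    for ch in ["a", "\u00e5", "\u20ac", "\U0001f601"]:
        print(ascii(ch), utf8_len(ch), utf32_len(ch))
```

The UTF-8 column grows from 1 to 4 bytes across the four samples, while the UTF-32 column stays at 4 bytes throughout.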

Apache

The default character set for all content served by PHP is handled correctly by Drupal. However, when the content type of a response is text/plain or text/html and no character set is given (e.g. for a static file such as README.txt), the character set assumed depends on the browser settings. To set an explicit default character set of “utf-8” in the .conf file for the host, use the following directive:

AddDefaultCharset utf-8

However, the exact behavior may be dependent on the user's browser configuration.

This setting disables this functionality:

AddDefaultCharset Off

Locale

In the context of web servers, a locale is a set of parameters that defines the user's language, country and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language identifier and a region identifier.

On the Unix family of operating systems, the format of locale identifiers is similar to IETF language tags, but the locale variant modifier is defined differently, and the character set is included as a part of the identifier. It is defined in this format:

[language[_territory][.codeset][@modifier]]

For example, Norwegian (bokmål) using the UTF-8 encoding is nb_NO.utf8.
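
As an illustration of the identifier format, here is a small Python sketch that splits a locale identifier into its four parts. The regular expression is my own simplified reading of the format above, and the names LocaleId and parse_locale are hypothetical; it is not how the C library itself parses locale names.

```python
import re
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# Simplified pattern for language[_territory][.codeset][@modifier].
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale(identifier: str) -> Optional[LocaleId]:
    """Split a Unix locale identifier into parts; None if it does not match."""
    match = _LOCALE_RE.match(identifier)
    if match is None:
        return None
    # Optional parts that are absent come back as None.
    return LocaleId(**match.groupdict())
```

For nb_NO.utf8 this yields the language nb, the territory NO and the codeset utf8, with no modifier.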

The PHP scripting language has a function (setlocale) that is used to set the locale. If successful, it returns the new current locale. It returns FALSE if the locale functionality is not implemented on your platform, the specified locale does not exist or the category name is invalid.

Here are some typical calls. The comment at the end of each line is the character set we expect to be used.

$loc = setlocale(LC_ALL, "nb_NO.utf8");     // UTF-8
$loc = setlocale(LC_ALL, "nb_NO.iso88591"); // ISO-8859-1
$loc = setlocale(LC_ALL, "el_GR");          // ISO-8859-7
$loc = setlocale(LC_ALL, "pl_PL");          // ISO-8859-2
$loc = setlocale(LC_ALL, "pl_PL@euro");     // ISO-8859-15

The last example illustrates the use of the modifier @euro to enforce usage of the ISO-8859-15 character set, which includes a character for the € currency sign, rather than ISO-8859-2, which is normally used for Eastern European languages that are written in the Latin script.
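
For comparison, here is a minimal sketch of the same check in Python rather than PHP. Python's locale.setlocale raises locale.Error where PHP returns FALSE, so the hypothetical wrapper try_locale below converts the exception into None. Whether a given name is accepted depends on the locales installed on the system (and on the C library in use).

```python
import locale
from typing import Optional


def try_locale(name: str) -> Optional[str]:
    """Try to switch LC_ALL to name; return the new locale, or None on failure."""
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        # The locale is unknown or not installed on this system.
        return None


if __name__ == "__main__":
    for candidate in ["nb_NO.utf8", "nb_NO.iso88591", "el_GR", "pl_PL"]:
        print(candidate, "->", try_locale(candidate))
```

A result of None for a locale you need is the cue to install it.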

To see the locales currently installed, do:

$ locale -a

If the locale you need to use is not installed, you can add it on Debian or Ubuntu by reconfiguring the locales package (source: SO: How do I get a locale recognized?):

$ sudo dpkg-reconfigure locales

I then get to pick a locale from a long list using the spacebar. It now appears in the list of installed locales:

$ locale -a
…
nb_NO.utf8

Setting up a new Drupal site with utf8mb4

To have Drupal 7 use utf8mb4, you may first need to change the database configuration. To learn how to do this, see the database documentation.

If the database is configured correctly, you need to alter the database connection array (in settings.php) by adding the following keys and values (provided you want the collation used in Denmark and Norway):

$databases['default']['default'] = array(
  …
  'charset' => 'utf8mb4',
  'collation' => 'utf8mb4_danish_ci',
);

If you want to use a more general collation (this will collate "å" with "a"), the following may be used: utf8mb4_general_ci.

Note: At present (2017-07-22) drush does not allow the charset and collation to be set. This means that when installing a new site with drush, you get the old defaults. On bar, I hack this by having a per-site prepared version of settings.php sitting in /home/gisle/configfiles/. The clean installation script copies it to the new site before the drush install script is run. Drush will not overwrite it.

Converting an existing Drupal site

If the SQL database has not been configured to enable the utf8mb4 character set, the Drupal status report shows the following advisory:

4 byte UTF-8 for mysql is disabled.

This means you need to change the database configuration first. To learn how to do this, see the database documentation.

The following advisory is shown if the database has been configured to enable the utf8mb4 character set:

4 byte UTF-8 for mysql is not activated, but it is supported on your system. It is recommended that you enable this to allow 4-byte UTF-8 input such as emojis, Asian symbols and mathematical symbols to be stored correctly.

In both cases the advisory also says:

See the documentation on adding 4 byte UTF-8 support for more information.

After making sure that the utf8mb4 character set is supported, you may convert the existing database to utf8mb4. To do this you may want to install the Drush script UTF8MB4 Convert by typing:

$ drush @none dl utf8mb4_convert-7.x
Project utf8mb4_convert (7.x-1.0) downloaded to …
Project utf8mb4_convert contains 0 modules: .

The downloaded files are stored in the directory “~/.drush/utf8mb4_convert/”.

Then navigate below the webroot of the site you want to convert, and type:

$ drush utf8mb4-convert-databases --collation=utf8mb4_danish_ci

Note: The script is not fast, so put a production site in maintenance mode before starting it.

Note 2: On Ubuntu 16.04 (PHP 7.0.33) it seems to crash when converting the tables for the Search module. Uninstall the module before converting, then reinstall it afterwards. This is not a problem on Ubuntu 20.04.

[As far as I am able to tell, rerunning the script on an already converted database (e.g. to change the collation, but not the charset) does not break anything, but this has not been extensively tested.]

Finally, alter the database connection array (in settings.php) as described in the section about setting up a new site.

Troubleshooting

Below are what I've found to be the best practices for converting files and databases.

The HTTP response header

Make sure that the correct charset is indicated in the HTTP response header. The charset of the page is usually indicated like this:

Content-Type: text/html; charset=utf-8

To examine the HTTP response header in Firefox, navigate to Web Developer Tools » Network. Click on the Reload button. Click on the row of the page loaded to select it. On the right side of the screen, select the “Headers” tab.

There are also online tools that let you examine the HTTP response header, for example: webconfs.com.
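
The same check can be scripted. Below is a minimal Python sketch using only the standard library. The function names charset_from_content_type and fetch_charset are hypothetical, and any URL you pass in is your own; the sketch simply reads the charset parameter from the Content-Type header.

```python
import urllib.request
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def fetch_charset(url: str) -> Optional[str]:
    """Request url and return the charset announced in the response header."""
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message subclass.
        return response.headers.get_content_charset()
```

A return value of None means that the header carries no charset parameter, in which case the browser will fall back on its own settings.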

There is usually no need to do this, but just in case: Setting the HTTP charset parameter.

Converting HTML files with emacs

When non-ASCII characters in an HTML file show up as single question marks on a black diamond, or are replaced by two or more non-ASCII characters, the character set announced by the web server does not match the one actually used in the file. Here is an example:

Wilhelm Röntgen
Wilhelm R�ntgen
Wilhelm RÃ¶ntgen

The first line in the example above shows the German name of the discoverer of X-rays correctly. The second line is what you will see if iso-8859-1-encoded text is rendered as utf-8. The third line is what you will see if utf-8-encoded text is rendered as iso-8859-1.
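
Both kinds of garbling can be reproduced in Python by encoding the text with one character set and decoding it with the other. This is only a sketch to show the mechanism; the two function names are made up for the illustration.

```python
# "Wilhelm Roentgen" with o-umlaut, written with an escape to keep this ASCII.
NAME = "Wilhelm R\u00f6ntgen"


def latin1_shown_as_utf8(text: str) -> str:
    """iso-8859-1 bytes decoded as utf-8: invalid bytes become U+FFFD."""
    return text.encode("iso-8859-1").decode("utf-8", errors="replace")


def utf8_shown_as_latin1(text: str) -> str:
    """utf-8 bytes decoded as iso-8859-1: each byte becomes one character."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    print(ascii(latin1_shown_as_utf8(NAME)))
    print(ascii(utf8_shown_as_latin1(NAME)))
```

The first function produces the replacement character (the question mark on a black diamond), the second produces the two-character sequence that replaces the single letter ö.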

You may use this command to check the character set encoding of one or more files:

$ file -i example.html
example.html: text/html; charset=iso-8859-1

If the character set attribute is set correctly, but the rendering is still wrong because the web server is configured to always serve utf-8 and the attribute is therefore ignored (this is the default on my sites), the quickest fix is to open the file in emacs and change its charset attribute to utf-8. I.e. change the following line in the header from:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

To:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

And save the file.

Then emacs will automagically also recode the file's characters to use utf-8-encoding.

Converting text files with iconv

The CLI utility iconv converts between different character encodings, based upon Unicode conversion.

Below are three examples of using it. The first converts from Windows-1252 (aka. CP-1252) encoding to UTF-8. The second converts from UTF-8 to plain US-ASCII while transliterating characters that do not exist in US-ASCII. The third converts from ISO-8859-1 (aka. Latin1) to UTF-8, ignoring all errors.

$ iconv -f WINDOWS-1252 -t UTF-8  cp1252.txt > utf8.txt
$ iconv -f UTF-8 -t US-ASCII//TRANSLIT utf8.txt > ascii.txt
$ iconv -f ISO-8859-1 -t UTF-8//IGNORE latin1.txt > utf8.txt

See also: AtomicObject.com.
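
If iconv is not at hand, a similar conversion can be sketched in Python. The helper names recode and recode_file are hypothetical. Note the limits of the analogy: errors="ignore" silently drops characters that the target encoding cannot represent, which roughly corresponds to //IGNORE, but the Python standard library has no direct counterpart to //TRANSLIT.

```python
from pathlib import Path


def recode(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Decode data from the source encoding and re-encode it in the target."""
    text = data.decode(source)
    # errors="ignore" drops characters the target encoding cannot represent.
    return text.encode(target, errors=errors)


def recode_file(src: str, dst: str, source: str, target: str,
                errors: str = "strict") -> None:
    """Convert the file src to the file dst, e.g. from cp1252 to utf-8."""
    Path(dst).write_bytes(recode(Path(src).read_bytes(), source, target, errors))
```

For instance, recode_file("cp1252.txt", "utf8.txt", "cp1252", "utf-8") would mirror the first iconv example above.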

Database configuration

Modern databases such as MySQL can handle multiple character sets and have powerful capabilities to convert between them. However, this may create a great deal of confusion as it is not always obvious how MySQL handles this, leading to the dreaded “double conversion” where perfectly valid multi-byte utf-8 encoded text is converted byte by byte from “latin-1” to utf-8. The best way out of this madness is to use utf-8 on all levels.

Here are some commands that may help you figure out what is going on. They can be entered through the SQL tab in phpMyAdmin or the mysql shell interface.

To find out what character sets are set in your configuration, do:

mysql> show variables like 'char%'; 
+--------------------------+----------------------------+ 
| Variable_name            | Value                      | 
+--------------------------+----------------------------+ 
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+ 
8 rows in set (0.00 sec)

You don't want to see latin1 mentioned in there. If you do, check the configuration. You can change the value of character_set_database with:

alter database DBNAME charset=utf8;

Importing legacy SQL data

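Importing legacy data into a database that uses some version of utf8 may lead to encoding hell. The reason is that MySQL keeps track of character sets at several levels: the server, the database (schema), the table, the column, and the client connection.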

To see what the character sets are, read this answer at SE.

If any of these are out of sync with what is actually stored in the database, things tend to go very wrong.

Note: Use the CLI mysql client, custom PHP programs or CLI tools like pep, enca, od or hd for examining the data. Avoid using phpMyAdmin or any elaborate client for this, since these tools may be clever enough to hide the problem.

Trying to fix these at the source may make matters worse, as explained in Whitesmith.co: MySQL encoding hell.

First, make sure that the schema character set is utf-8.

To see the default character set for your schemas (databases):

SELECT * FROM information_schema.SCHEMATA;

Next, determine how non-ASCII characters are actually encoded in the SQL dump. For this purpose pep is handy.

$ pep -x -b dump.sql > dump2.txt

Examine the dump to determine the character set used. The table below shows the hex values that will show up in the expanded output for some punctuation characters, two special characters, the three extra Norwegian characters and one emoji in four different encodings.

Ch     | latin1 | windows-1252 | utf8              | utf8mb4
…      | NA     | 0x85h        | -                 | 0xe2h 0x80h 0xa6h
‘      | NA     | 0x91h        | -                 | 0xe2h 0x80h 0x98h
’      | NA     | 0x92h        | -                 | 0xe2h 0x80h 0x99h
“      | NA     | 0x93h        | -                 | 0xe2h 0x80h 0x9ch
”      | NA     | 0x94h        | -                 | 0xe2h 0x80h 0x9dh
–      | NA     | 0x96h        | 0xe2h 0x80h 0x93h | 0xe2h 0x80h 0x93h
&nbsp; | 0xa0h  | 0xa0h        | 0xc2h 0xa0h       | 0xc2h 0xa0h
&shy;  | 0xadh  | 0xadh        | 0xc2h 0xadh       | 0xc2h 0xadh
æ      | 0xe6h  | 0xe6h        | 0xc3h 0xa6h       | 0xc3h 0xa6h
ø      | 0xf8h  | 0xf8h        | 0xc3h 0xb8h       | 0xc3h 0xb8h
å      | 0xe5h  | 0xe5h        | 0xc3h 0xa5h       | 0xc3h 0xa5h
😁     | NA     | NA           | NA                | 0xf0h 0x9fh 0x98h 0x81h
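
To look up the byte values for a character that is not in the table, a few lines of Python will do. This is a sketch under two assumptions: Python's "utf-8" codec is used as a stand-in for MySQL's utf8mb4, and the output uses the plain 0x85 notation rather than the notation printed by pep. The function name hex_bytes is made up.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the bytes of ch in the given encoding as hex, or "NA"."""
    try:
        raw = ch.encode(encoding)
    except UnicodeEncodeError:
        # The character does not exist in this encoding.
        return "NA"
    return " ".join(f"0x{byte:02x}" for byte in raw)


if __name__ == "__main__":
    # Ellipsis, a-ring and the grinning-face emoji.
    for ch in ["\u2026", "\u00e5", "\U0001f601"]:
        cells = [hex_bytes(ch, enc) for enc in ("latin-1", "cp1252", "utf-8")]
        print(ascii(ch), *cells, sep=" | ")
```

Comparing the byte sequences found in the dump with the output of hex_bytes for a character you know should be there is a quick way to narrow down the encoding.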

If pep is not available, od can be used:

$ od -c dump.sql

If the dump contains 4 byte values for Norwegian and Western European accented letters (for example 0xc3 0x83 0xc2 0xa5 instead of 0xc3 0xa5 for å), the problem is probably that the data is utf8, but mysqldump thinks it is latin1 and converts it.

To fix this, run mysqldump with the following two flags: --skip-set-charset and --default-character-set=latin1. This prevents reconversion and setting a charset when creating the dump:

$ mysqldump -u username -p \
  --skip-set-charset --default-character-set=latin1 \
  database > dump.sql

Then remove and replace the erroneous information from the dump. I use the following sed-command:

sed -e "s;CHARSET=latin1;CHARSET=utf8;g" \
    -e "s; COLLATE latin1_danish_ci ; ;g" \
    -e "s;latin1_danish_ci;utf8_danish_ci;g" < dump.sql > fixeddump.sql

The file fixeddump.sql should now import correctly.
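
The double conversion behind the 4 byte values can be reproduced, and reversed, in Python. This is a sketch of the mechanism only, with made-up function names. It uses Python's "latin-1" codec, whereas MySQL's latin1 is closer to windows-1252, so for bytes in the range 0x80 to 0x9f the result in a real dump may differ.

```python
def double_encode(text: str) -> bytes:
    """Simulate the bug: utf-8 bytes misread as latin1, then encoded as utf-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def repair(data: bytes) -> str:
    """Undo one round of double encoding."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    damaged = double_encode("\u00e5")  # a-ring
    print(damaged.hex(" "))            # four bytes instead of two
    print(ascii(repair(damaged)))
```

I would only use repair for examining samples; fixing the dump as described above avoids the problem at the source.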

Useful links:
Table: UTF-8 encoding table and Unicode characters
How to convert a MySQL database to UTF-8 encoding
SO: Change MySQL default character set to UTF-8 in my.cnf

If you change the configuration of mysql, you need to restart it. The command to restart mysql on Ubuntu is:

$ sudo /etc/init.d/mysql restart

Illegal mix of collations

If you get the following PDOException:

PDOException: SQLSTATE[HY000]: General error: 1267
Illegal mix of collations
(utf8mb4_general_ci,IMPLICIT) and
(utf8mb4_danish_ci,IMPLICIT) for operation '=':

you first need to identify the tables causing the problem. If you want to sort using utf8mb4_danish_ci, these are the tables that have been set up with utf8mb4_general_ci. The following query will find them:

SELECT table_schema, table_name, column_name, character_set_name, collation_name
  FROM information_schema.columns
  WHERE collation_name='utf8mb4_general_ci'
    AND table_schema='database'
  ORDER BY table_name; 

Then change the collation to the one you want:

USE database;
ALTER TABLE table CONVERT TO CHARACTER SET 'utf8mb4' COLLATE 'utf8mb4_danish_ci';

You need to always include the CONVERT TO CHARACTER SET clause, even if conversion is not required.

Source: interworks.com.
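
When the query returns many tables, typing the ALTER TABLE statements by hand gets tedious. Here is a minimal Python sketch that generates them from a list of table names. The function name alter_statements and the example table names are hypothetical; the generated SQL follows the statement shown above, without the optional quotes around the charset and collation. Review the output before running it against a database.

```python
from typing import Iterable, List


def alter_statements(tables: Iterable[str],
                     charset: str = "utf8mb4",
                     collation: str = "utf8mb4_danish_ci") -> List[str]:
    """Build one ALTER TABLE ... CONVERT TO CHARACTER SET statement per table."""
    statements = []
    for table in tables:
        # Quote the identifier with backticks, doubling any embedded backtick.
        quoted = "`" + table.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical table names, as they might come from the query above.
    for statement in alter_statements(["node", "users"]):
        print(statement)
```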

Email

The following headers need to be present to send utf-8 email.

Content-Type: text/plain; charset="UTF-8";
Content-Transfer-Encoding: 8Bit
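
As an illustration, here is a minimal Python sketch that builds a message carrying these two headers with the standard email package. The function name build_utf8_message is made up and the addresses are placeholders; actually sending the message (e.g. with smtplib) is left out.

```python
from email.message import EmailMessage


def build_utf8_message(sender: str, recipient: str,
                       subject: str, body: str) -> EmailMessage:
    """Build a plain-text message with a utf-8 body and 8bit transfer encoding."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Sets Content-Type (text/plain with charset utf-8) and
    # Content-Transfer-Encoding (8bit) on the message.
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg


if __name__ == "__main__":
    message = build_utf8_message("from@example.com", "to@example.com",
                                 "Test", "Bl\u00e5b\u00e6rsyltet\u00f8y")
    print(message["Content-Type"])
    print(message["Content-Transfer-Encoding"])
```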

Final word

[TBA]


Last update: 2018-08-08 [gh].