Character sets
This chapter contains miscellaneous notes about locales and character sets, including how to set up utf-8 as the default character set for Apache and how to set up utf8mb4 for Drupal 7. It also has a section about troubleshooting character set problems.
Table of contents
- Introduction
- Apache
- Locale
- Setting up a new Drupal site with utf8mb4
- Converting an existing Drupal site
- Troubleshooting
- Final word
Drupal projects discussed in this chapter: UTF8MB4 Convert.
Introduction
The MySQL utf8mb4 encoding (4-byte UTF-8 Unicode encoding) is a superset of the MySQL utf8 encoding. It is the recommended character set for Drupal 7.50 and later. See Drupal.org: documentation on adding 4 byte UTF-8 support.
It is not the same as UTF-32, which always uses 4 bytes to store a character. The MySQL utf8mb4 encoding will use between 1 and 4 bytes, depending on the character. Here is a simplified summary:
- 1 byte: ASCII characters.
- 2 bytes: ISO-8859-X – Western European accented characters, Greek, Hebrew, etc.
- 3 bytes: Japanese, Korean and most Chinese characters.
- 4 bytes: The rest, including some Chinese characters and emojis.
For the details, see diagnosing charset issues.
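A quick way to see this in practice, assuming a terminal that uses UTF-8, is to count the bytes a single character occupies with wc:
$ echo -n "A" | wc -c
1
$ echo -n "é" | wc -c
2
$ echo -n "漢" | wc -c
3
$ echo -n "😁" | wc -c
4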
Apache
The default character set for all content served by PHP is handled correctly by Drupal. However, when a response content-type is text/plain or text/html (e.g. a README.txt file served directly by Apache), the default character set depends on the browser settings. To set an explicit default character set of “utf-8” in the .conf file for the host, use the following directive:
AddDefaultCharset utf-8
However, the exact behavior may still depend on the user's browser configuration. The following setting disables this functionality:
AddDefaultCharset Off
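For example, the directive may be placed in a virtual host configuration like this (the server name and paths below are just placeholders):
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
    # Serve text/plain and text/html responses without an explicit charset as utf-8
    AddDefaultCharset utf-8
</VirtualHost>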
Locale
In the context of web servers, a locale is a set of parameters that defines the user's language, country and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language identifier and a region identifier.
On the Unix family of operating systems, the format of locale identifiers is similar to IETF language tags, but the locale variant modifier is defined differently, and the character set is included as a part of the identifier. It is defined in this format:
[language[_territory][.codeset][@modifier]]
For example, Norwegian (bokmål) using the UTF-8 encoding is nb_NO.utf8.
The PHP scripting language has a function (setlocale) that is used to set the locale. If successful, it returns the new current locale, or FALSE if the locale functionality is not implemented on your platform, the specified locale does not exist or the category name is invalid.
Here are some typical calls. The comment at the end of each line is the locale we expect to be set.
$loc = setlocale(LC_ALL, "nb_NO.utf8");     // UTF-8
$loc = setlocale(LC_ALL, "nb_NO.iso88591"); // ISO-8859-1
$loc = setlocale(LC_ALL, "el_GR");          // ISO-8859-7
$loc = setlocale(LC_ALL, "pl_PL");          // ISO-8859-2
$loc = setlocale(LC_ALL, "pl_PL@euro");     // ISO-8859-15
The last example illustrates the use of the modifier @euro to enforce usage of the ISO-8859-15 character set, which includes a character for the € currency sign, rather than the ISO-8859-2 that is normally used for Eastern European languages written in the Latin script.
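To check from the command line what a given locale string resolves to, PHP can be invoked directly (this assumes the nb_NO.utf8 locale is installed, see below; if it is missing, the output is bool(false)):
$ php -r 'var_dump(setlocale(LC_ALL, "nb_NO.utf8"));'
string(10) "nb_NO.utf8"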
To see the locales currently installed, do:
$ locale -a
If the locale you need to use is not installed, it can be added like this (source: SO: How do I get a locale recognized?):
$ sudo dpkg-reconfigure locales
I then get to pick a locale from a long list using the spacebar, and it now appears in the OS:
$ locale -a
…
nb_NO.utf8
Setting up a new Drupal site with utf8mb4
To have Drupal 7 use utf8mb4, you may first need to change the database configuration. To learn how to do this, see the database documentation.
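As a sketch, on MySQL 5.5 and 5.6 this boils down to settings along these lines in my.cnf (on MySQL 5.7 and later these are already the defaults; check the database documentation mentioned above for your version):
[mysqld]
innodb_large_prefix = true
innodb_file_format = barracuda
innodb_file_per_table = true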
If the database is configured correctly, you need to alter the database connection array (in settings.php) by adding the following keys and values (provided you want the collation used in Denmark and Norway):
$databases['default']['default'] = array(
  …
  'charset' => 'utf8mb4',
  'collation' => 'utf8mb4_danish_ci',
);
If you want to use a more general collation (this will collate "å" with "a"), the following may be used: utf8mb4_general_ci.
At present (2017-07-22) drush does not allow the charset and collation to be set. This means that when installing a new site with drush, you get the old defaults. On bar, I hack this by having a per-site prepared version of settings.php sitting in /home/gisle/configfiles/. The clean installation script copies it to the new site before the drush install script is run. Drush will not overwrite it.
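As a sketch, the relevant part of such a clean installation script might look like this (the site path and database URL are placeholders):
# Copy the prepared settings.php (with the utf8mb4 keys) into place first.
cp /home/gisle/configfiles/settings.php /var/www/example/sites/default/settings.php
# Drush keeps the existing settings.php, so the charset and collation survive.
drush site-install standard --db-url=mysql://user:password@localhost/example -y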
Converting an existing Drupal site
If the SQL database has not been configured to enable the utf8mb4 character set, the Drupal status report shows the following advisory:
4 byte UTF-8 for mysql is disabled.
This means you need to change the database configuration first. To learn how to do this, see the database documentation.
The following advisory is shown if the database has been configured to enable the utf8mb4 character set:
4 byte UTF-8 for mysql is not activated, but it is supported on your system. It is recommended that you enable this to allow 4-byte UTF-8 input such as emojis, Asian symbols and mathematical symbols to be stored correctly.
In both cases the advisory also says:
See the documentation on adding 4 byte UTF-8 support for more information.
After making sure that the utf8mb4 character set is supported, you may convert the existing database to utf8mb4. To do this you may want to install the Drush script UTF8MB4 Convert by typing:
$ drush @none dl utf8mb4_convert-7.x
Project utf8mb4_convert (7.x-1.0) downloaded to …
Project utf8mb4_convert contains 0 modules: .
The downloaded file is stored in the directory “~/.drush/utf8mb4_convert/”.
Then navigate below the webroot of the site you want to convert, and type:
$ drush utf8mb4-convert-databases --collation=utf8mb4_danish_ci
Note: The script is not fast, so put a production site in maintenance mode before starting it.
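For a Drupal 7 site, maintenance mode can be toggled with drush like this:
$ drush vset maintenance_mode 1 --yes   # before the conversion
$ drush vset maintenance_mode 0 --yes   # after the conversion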
Note 2: On Ubuntu 16.04 (PHP 7.0.33) it seems to crash when converting the tables for the Search module. Uninstall it before converting, then reinstall it afterwards. This is not a problem on Ubuntu 20.04.
[As far as I am able to tell, rerunning the script on an already converted database (e.g. to change the collation, but not the charset) does not break anything, but this has not been extensively tested.]
Finally, alter the database connection array (in settings.php) as described in the section about setting up a new site.
Troubleshooting
Below is what I've found to be best practices for converting files and databases.
The HTTP response header
Make sure that the correct charset is indicated in the HTTP response header. The charset of the page is usually indicated like this:
Content-Type: text/html; charset=utf-8
To examine the HTTP response header in Firefox, open the Network Monitor (Ctrl+Shift+E), reload the page, and click on the row of the page loaded to select it. On the right side of the screen, select the “Headers” tab. There are also online tools that let you examine the HTTP response header, for example: webconfs.com.
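From the command line, the same check can be done with curl (the URL is a placeholder):
$ curl -sI https://www.example.com/ | grep -i content-type
Content-Type: text/html; charset=utf-8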
There is usually no need to do this, but just in case: Setting the HTTP charset parameter.
Converting HTML files with emacs
When non-ASCII characters in an HTML file show up as single question marks on a black diamond, or are replaced by two or more non-ASCII characters, the web server is confused about the character set used. Here is an example:
Wilhelm Röntgen
Wilhelm R�ntgen
Wilhelm RÃ¶ntgen
The first line in the example above shows the German name of the discoverer of X-rays correctly.
The second line is what you will see if iso-8859-1-encoded text is rendered as utf-8.
The third line is what you will see if utf-8-encoded text is rendered as iso-8859-1.
You may use this command to check the character set encoding of one or more files:
$ file -i example.html
example.html: text/html; charset=iso-8859-1
If the character set attribute is set correctly, but the rendering is wrong because the web server ignores the attribute and is configured to always serve utf-8 (this is the default on my sites), the quickest fix is to edit the file with emacs and change the charset attribute of the file to utf-8.
I.e. change the following line in the header from:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
To:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
And save the file.
Then emacs will automagically also recode the file's characters to use utf-8 encoding.
Converting text files with iconv
The CLI utility iconv is used to convert between different character encodings, based upon Unicode conversion.
Below are three examples of using it. The first converts from Windows-1252 (aka CP-1252) encoding to UTF-8. The second converts from UTF-8 to plain US-ASCII, transliterating the characters that do not exist in US-ASCII. The third converts from ISO-8859-1 (aka Latin1) to UTF-8, ignoring all errors.
$ iconv -f WINDOWS-1252 -t UTF-8 cp1252.txt > utf8.txt
$ iconv -f UTF-8 -t US-ASCII//TRANSLIT utf8.txt > ascii.txt
$ iconv -f ISO-8859-1 -t UTF-8//IGNORE latin1.txt > utf8.txt
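To convert a whole directory of files in one go, a small shell loop may be used (a sketch; adjust the source encoding and the file pattern to your situation):
$ for f in *.html; do
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
  done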
See also: AtomicObject.com.
Database configuration
Modern databases such as MySQL can handle multiple character sets and have powerful capabilities to convert between them. However, this may create a great deal of confusion, as it is not always obvious how MySQL handles this, leading to the dreaded “double conversion” where perfectly valid multi-byte utf-8 encoded text is converted byte by byte from “latin-1” to utf-8. The best way out of this madness is to use utf-8 on all levels.
Here are some commands that may help figure out what is going on. They can be entered through the SQL tab in phpMyAdmin or the mysql shell interface.
To find out what character sets are set in your configuration, do:
mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
You don't want to see latin1 mentioned in there. If you do, check the configuration. You can change the value of character_set_database with:
alter database DBNAME charset=utf8;
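The corresponding collation variables can be inspected the same way (the values shown are just an example):
mysql> show variables like 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_danish_ci  |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)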
Importing legacy SQL data
Importing legacy data into a database that uses some version of utf8 may lead to encoding hell. The reason is that MySQL:
- has a default per schema (database) charset and collation
- has a per table charset and collation
- has a per text column charset and collation
To see what the character sets actually are at each level, read this answer at SE, or use queries like the ones shown below.
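As a sketch, the per-table and per-column settings can be listed with queries like these (replace DBNAME with the name of your database):
SELECT table_name, table_collation
FROM information_schema.tables WHERE table_schema = 'DBNAME';

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'DBNAME' AND character_set_name IS NOT NULL;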
If any of these are out of sync with what is actually stored in the database, things tend to go very wrong.
Use the CLI mysql client, custom PHP programs or CLI tools like pep, enca, od or hd for examining the data. Avoid using phpMyAdmin or any elaborate client for this, since these tools may be clever enough to hide the problem.
Trying to fix these at the source may make matters worse, as explained in Whitesmith.co: MySQL encoding hell.
First, make sure that the schema character set is utf-8.
To see default for your schemas (databases):
SELECT * FROM information_schema.SCHEMATA;
To determine how non-ASCII characters actually are encoded in the SQL dump, pep is handy:
$ pep -x -b dump.sql > dump2.txt
Examine the dump to determine the character set used. The table below shows the hex values that will show up in the expanded output for some punctuation characters, two special characters (the no-break space and the soft hyphen), the three extra Norwegian characters and one emoji in four different encodings.
Ch | latin1 | windows-1252 | utf8 | utf8mb4 |
---|---|---|---|---|
… | NA | 0x85 | - | 0xe2 0x80 0xa6 |
‘ | NA | 0x91 | - | 0xe2 0x80 0x98 |
’ | NA | 0x92 | - | 0xe2 0x80 0x99 |
“ | NA | 0x93 | - | 0xe2 0x80 0x9c |
” | NA | 0x94 | - | 0xe2 0x80 0x9d |
– | NA | 0x96 | 0xe2 0x80 0x93 | 0xe2 0x80 0x93 |
nbsp | 0xa0 | 0xa0 | 0xc2 0xa0 | 0xc2 0xa0 |
shy | 0xad | 0xad | 0xc2 0xad | 0xc2 0xad |
æ | 0xe6 | 0xe6 | 0xc3 0xa6 | 0xc3 0xa6 |
ø | 0xf8 | 0xf8 | 0xc3 0xb8 | 0xc3 0xb8 |
å | 0xe5 | 0xe5 | 0xc3 0xa5 | 0xc3 0xa5 |
😁 | NA | NA | NA | 0xf0 0x9f 0x98 0x81 |
If pep is not available, od can be used:
$ od -c dump.sql
If the dump contains 4-byte values for Norwegian and Western European accented letters, the problem is probably that the data is utf8, but mysqldump thinks it is latin1 and converts it. To fix this, run mysqldump with the following two flags: --skip-set-charset and --default-character-set=latin1.
This prevents reconversion and setting a charset when creating the dump:
$ mysqldump -u username -p \
    --skip-set-charset --default-character-set=latin1 \
    database > dump.sql
Then remove and replace the erroneous information in the dump. I use the following sed command:
sed -e "s;CHARSET=latin1;CHARSET=utf8;g" \ -e "s; COLLATE latin1_danish_ci ; ;g" \ -e "s;latin1_danish_ci;utf8_danish_ci;g" < dump.sql > fixeddump.sql
The file fixeddump.sql should now import correctly.
Useful links:
- Table: UTF-8 encoding table and Unicode characters
- How to convert a MySQL database to UTF-8 encoding
- SO: Change MySQL default character set to UTF-8 in my.cnf
If you change the configuration of mysql, you need to restart it. The command to restart mysql on Ubuntu is:
$ sudo /etc/init.d/mysql restart
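On Ubuntu releases that use systemd, this should also work:
$ sudo systemctl restart mysql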
Illegal mix of collations
If you get the following PDOException:
PDOException: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8mb4_danish_ci,IMPLICIT) for operation '=':
you first need to identify the table causing the problem.
If you want to sort using utf8mb4_danish_ci, you first need to identify the tables that have been set up with utf8mb4_general_ci. The following query will do that:
SELECT table_schema, table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE collation_name='utf8mb4_general_ci' AND table_schema='database'
ORDER BY table_name;
Then change the collation to the one you want:
USE database;
ALTER TABLE table CONVERT TO CHARACTER SET 'utf8mb4' COLLATE 'utf8mb4_danish_ci';
You always need to include the CONVERT TO CHARACTER SET clause, even if conversion is not required.
Source: interworks.com.
Sending utf-8 email
The following headers need to be present to send utf-8 email:
Content-Type: text/plain; charset="UTF-8";
Content-Transfer-Encoding: 8Bit
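As a minimal sketch in PHP, using the built-in mail() function with these headers (the addresses and the message text are placeholders, and a working local MTA is assumed):
<?php
// Build headers for an utf-8 encoded plain text message.
$headers  = "MIME-Version: 1.0\r\n";
$headers .= "Content-Type: text/plain; charset=\"UTF-8\";\r\n";
$headers .= "Content-Transfer-Encoding: 8Bit\r\n";

// Non-ASCII subjects must be RFC 2047 encoded separately.
$subject = "=?UTF-8?B?" . base64_encode("Prøvemelding") . "?=";

mail("user@example.com", $subject, "Hei på deg 😁", $headers);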
Final word
[TBA]
Last update: 2018-08-08 [gh].