
Postgres Migration Pitstop: Collations


Jesse Soyland


At Crunchy Data we spend a lot of time helping customers migrate their databases. Migrating Postgres tends to be a straightforward process, yet there are a few gotchas that can catch you off guard if you are not prepared for them. In recent migrations to Crunchy Bridge, we found that most customers had not considered the underlying collations, and they were running a real risk of data corruption by not handling collation review and updates as part of the migration. A mismatched glibc is one of those details that can turn into a big gotcha and quite a headache if you are unaware of it - so we wanted to cover a few quick details.

Why should I care about mismatched glibc?

Using mismatched glibc versions carries a risk of:

  • Missing data when you query it
  • Inconsistent sorting between versions
  • Undetected unique constraint violations

These can all result in data corruption. For example, if you have a unique constraint on email addresses and the sort order differs across versions, duplicate values can slip past the constraint and you may end up with two accounts for the same user. You may also get empty results when you query. Reconciling data corruption may be simple if it is a single record, but the longer it lives, the bigger the cleanup - it can easily turn into weeks of pain.
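If you suspect a unique constraint has been bypassed this way, a plain aggregate query can surface the duplicates, since every row is still present in the table even when index lookups miss them. A minimal sketch, assuming a hypothetical users table with an email column:

-- Any rows returned are duplicates the "unique" index let through.
-- (Table and column names are hypothetical; adjust for your schema.)
SELECT email, count(*)
FROM users
GROUP BY email
HAVING count(*) > 1;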

We’ve seen differences between glibc versions when:

  • Using physical replication to migrate databases (e.g., wal-e, wal-g, pgbackrest) from one host to a new one.
  • Restoring a binary backup (e.g., pg_basebackup) on a system with a different OS configuration.
  • Upgrading the Linux distribution to a new major release while keeping the PostgreSQL data directory. In this case, the glibc version may have changed but your underlying data did not.

Not all types of migrations or replication are affected by this inconsistency. Situations where the data is transported in a logical (not binary) way are quite safe, including:

  • Backups and restore processes using pg_dump, since these use logical data only
  • Logical replication, which copies the data itself rather than the physical files (see the sketch below)
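For instance, a logical migration can be as simple as publishing from the old host and subscribing from the new one. A minimal sketch, using hypothetical publication, subscription, and connection-string values:

-- On the source (old glibc) database:
CREATE PUBLICATION migration_pub FOR ALL TABLES;

-- On the target (new glibc) database:
CREATE SUBSCRIPTION migration_sub
    CONNECTION 'host=old-host dbname=app user=replicator'
    PUBLICATION migration_pub;

Because the target rebuilds its indexes from scratch as the data arrives, they are created with the target system's collations and never inherit the old sort order.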

How the sorting works

For a very simple but practical example: on glibc versions older than 2.28, we can run this query and see how the data sorts.

old-glibc::DATABASE=> SELECT * FROM (values ('a'), ('$a'), ('a$'), ('b'), ('$b'), ('b$'), ('A'), ('B')) AS l(x) ORDER BY x;
 x
----
 a
 $a
 a$
 A
 b
 $b
 b$
 B
(8 rows)

Then run the same on a newer version:

new-glibc::DATABASE=> SELECT * FROM (values ('a'), ('$a'), ('a$'), ('b'), ('$b'), ('b$'), ('A'), ('B')) AS l(x) ORDER BY x;
 x
----
 $a
 $b
 a
 A
 a$
 b
 B
 b$
(8 rows)

Thanks to @DanielVerite for a great example in his write-up on glibc and Postgres. Let’s dig in a bit more though.

What is glibc?

Libc is the main C library for the Linux system, and many Linux programs, including Postgres, use the glibc implementation. It provides many fundamental software operations, and Postgres relies on it for things like sorting text and comparing data when building indexes.

A major update released with glibc 2.28 in 2018 brought its localization and collation data into compliance with the fourth edition (2016) of the ISO 14651 standard. With that update, indexes created under a prior version of the collations can exhibit corruption when read by a system using the updated collations. If there is a mismatch, the indexes must be rebuilt to avoid issues.

What collations are you using?

You can find the collation each of your databases is using via the datcollate field of pg_database:

SELECT datname, datcollate FROM pg_database;
  datname  | datcollate
-----------+-------------
 postgres  | en_US.UTF-8
 demo      | en_US.UTF-8
 template1 | en_US.UTF-8
 template0 | en_US.UTF-8
(4 rows)
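Individual columns can also override the database-level collation. One quick way to spot any per-column overrides (in Postgres, information_schema.columns reports a collation name only when it differs from the database default):

-- List columns with an explicit, non-default collation:
SELECT table_schema, table_name, column_name, collation_name
FROM information_schema.columns
WHERE collation_name IS NOT NULL;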

And to check the glibc version your collations were recorded with (this query's output is environment dependent):

SELECT collname, collversion FROM pg_collation WHERE collprovider = 'c';
     collname     | collversion
------------------+-------------
 C                |
 POSIX            |
 ucs_basic        |
 C.utf8           |
 en_AG            | 2.28
 en_AU            | 2.28
 en_AU.utf8       | 2.28
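On Postgres 13 and newer you can also compare the version recorded in the catalog against what the running glibc actually provides, using the built-in pg_collation_actual_version function. Any rows returned by a query like this indicate collations whose underlying rules may have changed underneath you:

-- Collations whose recorded version no longer matches the running glibc:
SELECT collname,
       collversion AS recorded_version,
       pg_collation_actual_version(oid) AS actual_version
FROM pg_collation
WHERE collprovider = 'c'
  AND collversion IS DISTINCT FROM pg_collation_actual_version(oid);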

How do I fix it?

Fix during a migration

Since this issue arises when binary data is moved between operating systems with different glibc versions, it generally shows up during a migration. Migrating via logical replication or a logical backup (e.g., pg_dump) eliminates the issue, since any affected indexes are recreated from scratch on the new system. So switching to a logical restore might be worth considering.

For large databases, in excess of 100GB, logical backup migrations can take longer than desirable. In those cases, a WAL-based migration followed by rebuilding the affected indexes is generally the method we prefer, since it minimizes downtime and engineering effort.
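To scope that rebuild, you need to know which indexes actually depend on libc collations. A query along these lines (adapted from the PostgreSQL wiki's guidance on locale data changes) lists indexes on collatable columns that use the default or libc provider; C and POSIX are excluded because they sort by byte value and are unaffected:

-- Indexes whose sort order depends on libc collation data:
SELECT DISTINCT indrelid::regclass::text AS table_name,
       indexrelid::regclass::text AS index_name,
       collname
FROM (SELECT indexrelid, indrelid, indcollation[i] AS coll
      FROM pg_index, generate_subscripts(indcollation, 1) g(i)) s
JOIN pg_collation c ON coll = c.oid
WHERE collprovider IN ('d', 'c')
  AND collname NOT IN ('C', 'POSIX');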

On a live database

If you think you might have a collation issue post-migration, the amcheck extension can help identify any data inconsistencies.

-- amcheck ships with Postgres but must be installed in the database first:
CREATE EXTENSION IF NOT EXISTS amcheck;

SELECT bt_index_check(index => c.oid, heapallindexed => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages;

 bt_index_check |                    relname                    | relpages
----------------+-----------------------------------------------+----------
                | pg_publication_pubname_index                  |        1
                | pg_largeobject_loid_pn_index                  |        1
                | pg_largeobject_metadata_oid_index             |        1

If bt_index_check comes back clean (the function raises an error as soon as it finds an inconsistency), there are no indexes whose contents would change if a REINDEX were run. If the check does report a problem, you'll likely need to REINDEX.

Side note: amcheck can be somewhat resource intensive to run, both in terms of I/O and time. If you have a large or critical database, consider running it on a physical replica so you don't disrupt production workloads - since the inconsistency travels with any binary copy of the database, it can be detected there just as well.

Reindex

If you've found an issue with the above steps, you'll need to REINDEX or REINDEX CONCURRENTLY. (Note: If you are using Postgres 14, we recommend 14.4 or higher, since earlier 14 releases had a bug in REINDEX CONCURRENTLY that carried its own risk of corruption.)
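As a sketch of what that looks like (the index and database names here are placeholders):

-- Rebuild a single suspect index without blocking writes:
REINDEX INDEX CONCURRENTLY users_email_idx;

-- Or rebuild every index in the current database (Postgres 12+).
-- System catalogs are skipped when CONCURRENTLY is used:
REINDEX DATABASE CONCURRENTLY app;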

Have questions?

Data migrations can often be straightforward, but you want to make sure you ask the right questions instead of assuming things will just work. We hope you've found this helpful as you consider a migration, and if you have additional questions, please reach out - we may be able to help.

co-authored with Elizabeth Christensen, David Christensen, and Craig Kerstiens