Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

Quick and Easy Postgres Data Compare

Avatar for Brian Pace

Brian Pace

9 min read

If you're checking archives or working with Postgres replication, data reconciliation can be a necessary task. Row counts can be one of the go to comparison methods but that does not show data mismatches. You could pull table data across the network and then compare each row and each field, but that can be a demand on resources. Today we'll walk through a simple solution for your Postgres toolbox - using Foreign Data Wrappers to connect and compare the two source datasets. With the foreign data wrapper and a little sql magic, we can compare data quickly and easily.

Creating Environments

To keep the environment simple so even with limited resources it can be practiced, we will use a single PostgreSQL cluster with two databases (hrprod, hrreport) connected via PostgreSQL Foreign Data Wrapper. The simulation here is a production database (hrprod) with a reporting database (hrreport). Keep in mind that the source and target do not have to be within the same PostgreSQL cluster.

For speed of creating the environment, the Crunchy Postgres for Kubernetes was used and a simple PostgreSQL cluster deployed using the Postgres Operator Examples repository.

The rest of the steps will only show the steps performed within psql from the database containers.

Production Setup (hrprod)

The steps to create the simulated production database is simple: create the database, create the postgres_fdw extension, create the employee table and lastly populate the employee table with three rows of data.

postgres=> create database hrprod;
CREATE DATABASE

postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> create extension postgres_fdw;
CREATE EXTENSION

hrprod=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE

hrprod=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3

Reporting Setup (hrreport)

The steps are then repeated to create the simulated reporting database.

postgres=> create database hrreport;
CREATE DATABASE

postgres=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> create extension postgres_fdw;
CREATE EXTENSION

hrreport=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE

hrreport=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3

With this, the setup is complete and the data in the employee table match in both databases.

Data Compare

The compare will be performed from the reporting database side (hrreport). To start, a temporary table named data_compare is created. The data_compare table is to store three pieces of information:

  • source_name column that identifies where the data came from (hrprod or hrreport in this example).
  • id column that will store the value(s) of the primary key from the table.
  • hash_value column that stores the hash value of all the non-key fields in the table.

Note that if the table has a composite key, the id column would be populated by joining the values into a single string. The hash occurs on the source side and only the hashed value is used for the comparison, greatly reducing network traffic, transfer time, etc.

Setup Data Compare

Create the data_compare table in both the production (hrprod) and target (hrreport) databases.

hrreport=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> CREATE TABLE data_compare
        (source_name VARCHAR(140),
	    id VARCHAR(1000),
	    hash_value varchar(100)
        );
CREATE TABLE

hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> CREATE TABLE data_compare
        (source_name VARCHAR(140),
	    id VARCHAR(1000),
	    hash_value varchar(100)
        );
CREATE TABLE

An INSERT statement will be executed on both the source and target to populate the data_compare table and then the contents of the tables compared to identify differences. To reduce time and transfer for multiple compare passes, the data_compare table contents can be transferred via the foreign table or pg_dump, etc.

The following steps were used to create the foreign table.

hrreport=> CREATE SERVER hrprod FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'hrprod', port '5432');
CREATE SERVER

hrreport=> CREATE USER MAPPING FOR current_user SERVER hrprod options (user 'postgres', password 'welcome1');
CREATE USER MAPPING

CREATE FOREIGN TABLE hrprod_data_compare (source_name varchar(140), id varchar(1000), hash_value varchar(100)) SERVER hrprod OPTIONS (table_name 'data_compare');

Perform Initial Compare

Populate the data_compare table in both the source (hrprod) and target (hrreport) databases.

hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrprod' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3


hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrreport' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3

At this point we know that the data is exactly the same so let's look at the SQL that is used to perform the actual comparison.

hrreport=> SELECT COALESCE(s.id,t.id) id,
              s.hash_value source_hash_value, t.hash_value target_hash_value,
              CASE WHEN s.hash_value = t.hash_value THEN 'equal'
                    WHEN s.id IS NULL THEN 'row not on source'
                    WHEN t.id IS NULL THEN 'row not on target'
                    ELSE 'difference'
              END compare_result
        FROM hrprod_data_compare s
            FULL JOIN data_compare t ON s.id=t.id;


 id |        source_hash_value         |        target_hash_value         |  compare_result
----+----------------------------------+----------------------------------+-------------------
 1  | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
 2  | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
 3  | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
(3 rows)

The compare_result confirms that two sets of data are equal. An alternate compare SQL is included at the end of this article to show various ways the data can be compared when the two data_compare tables are combined.

Create an Out-Of-Sync Condition and Compare

At this stage, three rows exist in the table and the data matches.

hrprod=> SELECT * FROM employee;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
(3 rows)

To create the out of sync, the following changes will be performed:

  • In hrprod, add CS Lewis with id 4, Charles Babbage with id 5, Blaise Pascal with id 6.
  • In hrreport, add Charles Babbage with id 4, CS Lewis with id 5, Kenny Rogers with id 7.

Notice that the ids for CS Lewis and Charles Babbage have been swapped and a unique record added to each database (Blaise Pascal to hrprod and Kenny Rogers to hrreport). The compare should show that 3 rows match, 2 rows have differences and 2 rows are in one database but not the other.

Up first, changes to source (hrprod).

hrprod=> INSERT INTO employee (id, first_name, last_name, department)
        VALUES (4,'CS','Lewis','author'),(5,'Charles','Babbage','math'),(6,'Blaise','Pascal','math');

hrprod=> SELECT * FROM employee ORDER BY id;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
  4 | CS         | Lewis      | author
  5 | Charles    | Babbage    | math
  6 | Blaise     | Pascal     | math
(6 rows)

Now the changes to the target (hrreport).

hrreport=> INSERT INTO employee (id, first_name, last_name, department)
        VALUES (5,'CS','Lewis','author'),(4,'Charles','Babbage','math'),(7,'Kenny','Rogers','music');

hrreport=> SELECT * FROM employee ORDER BY id;
 id | first_name | last_name  | department
----+------------+------------+------------
  1 | John       | Smith      | explorer
  2 | George     | Washington | government
  3 | Thomas     | Edison     | inventor
  4 | Charles    | Babbage    | math
  5 | CS         | Lewis      | author
  7 | Kenny      | Rogers     | music
(6 rows)

To summarize the current state:

  • Three rows that match (id=1, 2, 3)
  • Two rows that do not match (id=4, id=5)
  • Two rows that exist in one but not the other (id=6, id=7)

Let's now clear the data_compare tables and perform the compare again.

postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".

hrprod=> DELETE FROM data_compare;
DELETE 3

hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrprod' source_name, id::text id, md5(textin(record_out(e))) FROM employee e);
INSERT 0 6

hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".

hrreport=> DELETE FROM data_compare;
DELETE 3

hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
  (SELECT 'hrreport' source_name, id::text id, md5(textin(record_out(e))) FROM employee e);
INSERT 0 6

Now for the compare and the results.

hrreport=> SELECT COALESCE(s.id,t.id) id,
              s.hash_value source_hash_value, t.hash_value target_hash_value,
              CASE WHEN s.hash_value = t.hash_value THEN 'equal'
                    WHEN s.id IS NULL THEN 'row not on source'
                    WHEN t.id IS NULL THEN 'row not on target'
                    ELSE 'difference'
              END compare_result
        FROM hrprod_data_compare s
            FULL JOIN data_compare t ON s.id=t.id;


 id |        source_hash_value         |        target_hash_value         |  compare_result
----+----------------------------------+----------------------------------+-------------------
 1  | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
 2  | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
 3  | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
 4  | bbee9d6cccbeac4e9125ec78507c4eb7 | 57acef6ed228a52b8c42f0a6c155e62b | difference
 5  | 57acef6ed228a52b8c42f0a6c155e62b | bbee9d6cccbeac4e9125ec78507c4eb7 | difference
 6  | 047742fb256df0b78cebc3fbbc3ca4ad |                                  | row not on target
 7  |                                  | 66e5e35673780bd392d2f81d589fbb52 | row not on source
 (7 rows)

The above output indicates that rows with id = 1 thru 3 exists in both databases and the content of the rows match. Rows with id 4 and 5 exists in each database but the contents of the row is different. Going a step further, one could see that the hash values are the same between the two different rows but associated to the wrong id. Row with id 6 only exist on the target (hrreport) while the row with id 7 only exists on the source (hrprod). In total, there are 4 rows that are out of sync.

With the rows identified, proper steps can be performed to sync the appropriate rows. Last thought, imagine for a moment that logical replication was in place between the two databases and changes were pending on the target due to lag. The INSERT into the data_compare could be performed only on the rows flagged as out of sync to verify just those rows once replication lag is gone.

Conclusion

Comparing data can be a monumental task. However, this little trick has come in handy over the years when expensive data compare software packages were not an option. There is still room for some creativity with the compare SQL to meet the exact needs of the compare. For example, only show rows that are missing from one side or the other.

Alternate Compare SQL:

SELECT id, hash_value,
       count(src1) src1,
       count(src2) src2
 FROM
     ( SELECT a.*,
              1 src1,
              null src2
        FROM data_compare a
        WHERE source_name='hrprod'
        UNION ALL
        SELECT b.*,
               null src1,
               2 src2
        FROM data_compare b
        WHERE source_name='hrreport'
    ) c
 GROUP BY id, hash_value
 HAVING count(src1) <> count(src2);

So by setting up postgres_fdw, hashing the non-key fields, and writing a sql query to see if any rows are different - you can do a quick and simple Postgres data comparison. Have another solution you like for data compare? Let us know at @crunchydata.