Quick and Easy Postgres Data Compare
If you're checking archives or working with Postgres replication, data reconciliation can be a necessary task. Row counts are one of the go-to comparison methods, but they do not reveal data mismatches. You could pull table data across the network and then compare each row and each field, but that can be demanding on resources. Today we'll walk through a simple solution for your Postgres toolbox: using Foreign Data Wrappers to connect and compare the two source datasets. With the foreign data wrapper and a little SQL magic, we can compare data quickly and easily.
Creating Environments
To keep the environment simple enough to practice with limited resources, we will use a single PostgreSQL cluster with two databases (hrprod, hrreport) connected via the PostgreSQL Foreign Data Wrapper. The simulation here is a production database (hrprod) with a reporting database (hrreport). Keep in mind that the source and target do not have to be within the same PostgreSQL cluster.
To create the environment quickly, Crunchy Postgres for Kubernetes was used, and a simple PostgreSQL cluster was deployed using the Postgres Operator Examples repository.
The remaining steps show only the commands run within psql from the database containers.
Production Setup (hrprod)
The steps to create the simulated production database are simple: create the database, create the postgres_fdw extension, create the employee table, and lastly populate the employee table with three rows of data.
postgres=> create database hrprod;
CREATE DATABASE
postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".
hrprod=> create extension postgres_fdw;
CREATE EXTENSION
hrprod=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE
hrprod=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
Reporting Setup (hrreport)
The steps are then repeated to create the simulated reporting database.
postgres=> create database hrreport;
CREATE DATABASE
postgres=> \c hrreport
You are now connected to database "hrreport" as user "postgres".
hrreport=> create extension postgres_fdw;
CREATE EXTENSION
hrreport=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE
hrreport=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
With this, the setup is complete and the data in the employee table matches in both databases.
Data Compare
The compare will be performed from the reporting database side (hrreport). To start, a working table named data_compare is created. The data_compare table stores three pieces of information:
- The source_name column identifies where the data came from (hrprod or hrreport in this example).
- The id column stores the value(s) of the primary key from the table.
- The hash_value column stores the hash value of all the non-key fields in the table.
Note that if the table has a composite key, the id column would be populated by joining the key values into a single string. The hash is computed on the source side and only the hashed value is used for the comparison, greatly reducing network traffic and transfer time.
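As a rough sketch of the composite-key case, assume a hypothetical table employee_project keyed on (employee_id, project_id) with non-key columns role and hours. None of these names appear in the example environment. The key columns are joined into the id string and only the remaining columns are hashed:

```sql
-- Hypothetical table and columns, for illustration only.
-- Key columns are joined into one string; non-key columns are hashed.
INSERT INTO data_compare (source_name, id, hash_value)
SELECT 'hrprod',
       concat_ws('|', employee_id, project_id),
       md5(concat_ws('|', role, hours))
FROM employee_project;
```

The separator should be a character that cannot occur in the key values, otherwise two different keys could produce the same id string.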
Setup Data Compare
Create the data_compare table in both the production (hrprod) and target (hrreport) databases.
hrreport=> \c hrprod
You are now connected to database "hrprod" as user "postgres".
hrprod=> CREATE TABLE data_compare
(source_name VARCHAR(140),
id VARCHAR(1000),
hash_value varchar(100)
);
CREATE TABLE
hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".
hrreport=> CREATE TABLE data_compare
(source_name VARCHAR(140),
id VARCHAR(1000),
hash_value varchar(100)
);
CREATE TABLE
An INSERT statement will be executed on both the source and target to populate the data_compare table, and then the contents of the two tables are compared to identify differences. To reduce time and transfer for multiple compare passes, the data_compare table contents can be transferred via the foreign table, pg_dump, etc.
The following steps were used to create the foreign table.
hrreport=> CREATE SERVER hrprod FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'hrprod', port '5432');
CREATE SERVER
hrreport=> CREATE USER MAPPING FOR current_user SERVER hrprod options (user 'postgres', password 'welcome1');
CREATE USER MAPPING
hrreport=> CREATE FOREIGN TABLE hrprod_data_compare (source_name varchar(140), id varchar(1000), hash_value varchar(100)) SERVER hrprod OPTIONS (table_name 'data_compare');
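If several compare passes are planned, one option is to copy the source hashes across the foreign table once and run the compare queries against the local copy. This is a minimal sketch to run in hrreport after the data_compare table in hrprod has been populated; the table name hrprod_data_compare_local is an assumption, not part of the original setup:

```sql
-- Pull the hashed rows from hrprod once, through the foreign table.
-- hrprod_data_compare_local is a hypothetical name.
CREATE TABLE hrprod_data_compare_local AS
SELECT source_name, id, hash_value
FROM hrprod_data_compare;

-- Later passes can join against the local copy instead of the foreign table.
SELECT count(*) FROM hrprod_data_compare_local;
```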
Perform Initial Compare
Populate the data_compare table in both the source (hrprod) and target (hrreport) databases.
hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrprod' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3
hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrreport' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3
At this point we know that the data is exactly the same so let's look at the SQL that is used to perform the actual comparison.
hrreport=> SELECT COALESCE(s.id,t.id) id,
s.hash_value source_hash_value, t.hash_value target_hash_value,
CASE WHEN s.hash_value = t.hash_value THEN 'equal'
WHEN s.id IS NULL THEN 'row not on source'
WHEN t.id IS NULL THEN 'row not on target'
ELSE 'difference'
END compare_result
FROM hrprod_data_compare s
FULL JOIN data_compare t ON s.id=t.id;
id | source_hash_value | target_hash_value | compare_result
----+----------------------------------+----------------------------------+-------------------
1 | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
2 | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
3 | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
(3 rows)
The compare_result column confirms that the two sets of data are equal. An alternate compare SQL is included at the end of this article to show another way the data can be compared when the two data_compare tables are combined.
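One caveat with the hash expression used above: concat_ws skips NULL arguments, so rows whose NULLs sit in different columns can produce the same string and therefore the same hash. The employee data here has no NULLs, so the results above are unaffected. A possible variant, assuming the placeholder text '<NULL>' never occurs as real data, replaces each NULL before concatenating:

```sql
-- Both calls return 'a|b' because concat_ws ignores NULL arguments.
SELECT concat_ws('|', 'a', NULL, 'b') AS null_in_middle,
       concat_ws('|', 'a', 'b', NULL) AS null_at_end;

-- Variant that keeps NULL positions visible in the hashed string.
-- '<NULL>' is an assumed placeholder; pick one that cannot appear in the data.
SELECT id::text AS id,
       md5(concat_ws('|',
                     coalesce(first_name, '<NULL>'),
                     coalesce(last_name,  '<NULL>'),
                     coalesce(department, '<NULL>'))) AS hash_value
FROM employee;
```

Whichever expression is chosen, it has to be identical on the source and the target, or every row will be reported as a difference.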
Create an Out-Of-Sync Condition and Compare
At this stage, three rows exist in the table and the data matches.
hrprod=> SELECT * FROM employee;
id | first_name | last_name | department
----+------------+------------+------------
1 | John | Smith | explorer
2 | George | Washington | government
3 | Thomas | Edison | inventor
(3 rows)
To create the out-of-sync condition, the following changes will be performed:
- In hrprod, add CS Lewis with id 4, Charles Babbage with id 5, and Blaise Pascal with id 6.
- In hrreport, add Charles Babbage with id 4, CS Lewis with id 5, and Kenny Rogers with id 7.
Notice that the ids for CS Lewis and Charles Babbage have been swapped and a unique record added to each database (Blaise Pascal to hrprod and Kenny Rogers to hrreport). The compare should show that 3 rows match, 2 rows have differences, and 2 rows are in one database but not the other.
Up first, changes to the source (hrprod).
hrprod=> INSERT INTO employee (id, first_name, last_name, department)
VALUES (4,'CS','Lewis','author'),(5,'Charles','Babbage','math'),(6,'Blaise','Pascal','math');
hrprod=> SELECT * FROM employee ORDER BY id;
id | first_name | last_name | department
----+------------+------------+------------
1 | John | Smith | explorer
2 | George | Washington | government
3 | Thomas | Edison | inventor
4 | CS | Lewis | author
5 | Charles | Babbage | math
6 | Blaise | Pascal | math
(6 rows)
Now the changes to the target (hrreport).
hrreport=> INSERT INTO employee (id, first_name, last_name, department)
VALUES (5,'CS','Lewis','author'),(4,'Charles','Babbage','math'),(7,'Kenny','Rogers','music');
hrreport=> SELECT * FROM employee ORDER BY id;
id | first_name | last_name | department
----+------------+------------+------------
1 | John | Smith | explorer
2 | George | Washington | government
3 | Thomas | Edison | inventor
4 | Charles | Babbage | math
5 | CS | Lewis | author
7 | Kenny | Rogers | music
(6 rows)
To summarize the current state:
- Three rows that match (id=1, 2, 3)
- Two rows that do not match (id=4, id=5)
- Two rows that exist in one but not the other (id=6, id=7)
Let's now clear the data_compare tables and perform the compare again.
postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".
hrprod=> DELETE FROM data_compare;
DELETE 3
hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrprod' source_name, id::text id, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 6
hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".
hrreport=> DELETE FROM data_compare;
DELETE 3
hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrreport' source_name, id::text id, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 6
Now for the compare and the results.
hrreport=> SELECT COALESCE(s.id,t.id) id,
s.hash_value source_hash_value, t.hash_value target_hash_value,
CASE WHEN s.hash_value = t.hash_value THEN 'equal'
WHEN s.id IS NULL THEN 'row not on source'
WHEN t.id IS NULL THEN 'row not on target'
ELSE 'difference'
END compare_result
FROM hrprod_data_compare s
FULL JOIN data_compare t ON s.id=t.id;
id | source_hash_value | target_hash_value | compare_result
----+----------------------------------+----------------------------------+-------------------
1 | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
2 | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
3 | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
4 | bbee9d6cccbeac4e9125ec78507c4eb7 | 57acef6ed228a52b8c42f0a6c155e62b | difference
5 | 57acef6ed228a52b8c42f0a6c155e62b | bbee9d6cccbeac4e9125ec78507c4eb7 | difference
6 | 047742fb256df0b78cebc3fbbc3ca4ad | | row not on target
7 | | 66e5e35673780bd392d2f81d589fbb52 | row not on source
(7 rows)
The above output indicates that rows with id 1 through 3 exist in both databases and the content of the rows matches. Rows with id 4 and 5 exist in each database but the contents of the rows are different. Going a step further, one can see that the hash values are the same between the two rows but associated with the wrong id. The row with id 6 exists only on the source (hrprod), while the row with id 7 exists only on the target (hrreport). In total, there are 4 rows that are out of sync.
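The swapped ids can also be surfaced directly. As a sketch, joining the two compare tables on hash_value instead of id lists content that exists on both sides under different keys. Unrelated rows with identical non-key values would also pair up, so the result is a hint to investigate rather than proof of a swap:

```sql
-- Run in hrreport: same content hash, different key.
SELECT s.id AS source_id,
       t.id AS target_id,
       s.hash_value
FROM hrprod_data_compare s
JOIN data_compare t ON s.hash_value = t.hash_value
WHERE s.id <> t.id;
```

With the hashes shown in the compare output, this should pair source id 4 with target id 5 and source id 5 with target id 4.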
With the rows identified, proper steps can be performed to sync the appropriate rows. One last thought: imagine for a moment that logical replication was in place between the two databases and changes were pending on the target due to lag. The INSERT into the data_compare table could be repeated for only the rows flagged as out of sync, to verify just those rows once the replication lag is gone.
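A minimal sketch of that targeted re-check, shown for the hrprod side and using the ids flagged in the compare above. The same statements would be run in hrreport with 'hrreport' as the source_name:

```sql
-- Refresh only the rows that were flagged as out of sync.
DELETE FROM data_compare WHERE id IN ('4', '5', '6', '7');

INSERT INTO data_compare (source_name, id, hash_value)
SELECT 'hrprod', id::text,
       md5(concat_ws('|', first_name, last_name, department))
FROM employee
WHERE id IN (4, 5, 6, 7);
```

The compare query can then be rerun, optionally filtered to the same ids.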
Conclusion
Comparing data can be a monumental task. However, this little trick has come in handy over the years when expensive data compare software packages were not an option. There is still room for some creativity with the compare SQL to meet the exact needs of the compare. For example, the query could show only rows that are missing from one side or the other.
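As one example of such a variation, the compare query used earlier can be narrowed to rows that are missing from one side. This sketch reuses the hrprod_data_compare foreign table and the local data_compare table from the walkthrough:

```sql
-- Run in hrreport: list only rows present on one side.
SELECT COALESCE(s.id, t.id) AS id,
       CASE WHEN s.id IS NULL THEN 'row not on source'
            ELSE 'row not on target'
       END AS compare_result
FROM hrprod_data_compare s
FULL JOIN data_compare t ON s.id = t.id
WHERE s.id IS NULL OR t.id IS NULL;
```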
Alternate Compare SQL:
SELECT id, hash_value,
count(src1) src1,
count(src2) src2
FROM
( SELECT a.*,
1 src1,
null src2
FROM data_compare a
WHERE source_name='hrprod'
UNION ALL
SELECT b.*,
null src1,
2 src2
FROM data_compare b
WHERE source_name='hrreport'
) c
GROUP BY id, hash_value
HAVING count(src1) <> count(src2);
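The query above expects rows from both sources in a single data_compare table. One way to get there, assuming the hrprod_data_compare foreign table from the walkthrough, is to copy the source rows into the local table in hrreport before running it:

```sql
-- Run in hrreport: add the hrprod hashes next to the local hrreport hashes.
INSERT INTO data_compare (source_name, id, hash_value)
SELECT source_name, id, hash_value
FROM hrprod_data_compare;
```

After this, the FULL JOIN compare would see the hrprod rows on both sides, so the copied rows should be removed again (DELETE FROM data_compare WHERE source_name = 'hrprod') before going back to it.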
So by setting up postgres_fdw, hashing the non-key fields, and writing a SQL query to find rows that differ, you can do a quick and simple Postgres data comparison. Have another solution you like for data compare? Let us know at @crunchydata.