A Look at Postgres 15: MERGE Command with Examples
With PostgreSQL 15 comes a new SQL Command called MERGE. MERGE
has been in the SQL standard for quite a while, but it just made its way into the PostgreSQL codebase. Prior to MERGE
, if you wanted to refresh a target table from a source table, prior to Postgres 15, you could use the "upsert" method with the ON CONFLICT
clause.
Now, MERGE
can be used instead! Some situations where MERGE
makes a lot of sense are:
- data loading from external sources, thru foreign data wrappers
- staged and batched process jobs
About MERGE
Let's look at the synopsis in the documentation:
[[ WITH with_query [, ...] ]
MERGE INTO target_table_name [ [ AS ] target_alias ]
USING data_source ON join_condition
when_clause [...]
where data_source is:
{ source_table_name | ( source_query ) } [ [ AS ] source_alias ]
and when_clause is:
{ WHEN MATCHED [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } |
WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING } }
and merge_insert is:
INSERT [( column_name [, ...] )]
[ OVERRIDING { SYSTEM | USER } VALUE ]
{ VALUES ( { expression | DEFAULT } [, ...] ) | DEFAULT VALUES }
and merge_update is:
UPDATE SET { column_name = { expression | DEFAULT } |
( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] ) } [, ...]
and merge_delete is:
DELETE
We can read that the source of the data_source
can be either a table or some data brought to the command thanks to the prior WITH
with_query
, or inside the MERGE
after USING
source_query
. When the data matches, i.e. there's a matching record in the target_table_name
, we can specify what to do:
- do an
UPDATE
as defined in themerge_update
(seeUPDATE SET...
) part or - do a
DELETE
as defined in themerge_delete
or - do nothing
When the data doesn't matches, i.e. there's no matching record in the target_table_name
, then we do an INSERT
(see merge_update
).
MERGE
example - remote sensors & batches
In working with clients, I have seen a need for data loading with remote sensors or stations. Anything “sending regularly data” and loading that data into a dataset is a common need. I’m particularly excited about using MERGE
to solve these issues for Postgres users.
In order to explain this data loading use case of MERGE
, I’m going to set up an example. We’ll have a database of stations that’s a remote measurement tool with data coming in intermittently.
- Streaming data, data arriving continuously
- Stations emit periodic and intermittent data measures
- There's a batch collecting this data
station_data_new
is a temp table for common storage- Data is stored long term
station_data_actual
with the last possible values - We want to keep track of when the station has been
created
Let's create some tables for testing purpose. Beware that the temporary table station_data_new
will only exist in the context of a given session.
create temporary table station_data_new (
station_id integer
, a integer
, b integer
);
create table station_data_actual (
station_id integer primary key
, a integer
, b integer
, created timestamp default current_timestamp
, updated timestamp default current_timestamp
);
Let's create some sample data into station_data_new
: our 1st 5 stations are up and sent the 1st batch of data:
with measures as (
select *
from generate_series(1,5)
)
insert into station_data_new (
station_id
, a
, b
)
select
generate_series
, round(random()*100+1)
, round(random()*100+1)
from
measures;
Basic MERGE
At this point, we could just do a plain INSERT
with SELECT
from station_data_new
to station_data_actual
. Instead we’ll use MERGE
because we’re planning for cases when the data already exists, an UPDATE
will be issued and not an INSERT
.
merge into station_data_actual sda
using station_data_new sdn
on sda.station_id = sdn.station_id
when matched then
update set a = sdn.a, b = sdn.b, updated = default
when not matched then
insert (station_id, a, b)
values (sdn.station_id, sdn.a, sdn.b);
If you execute it once, you'll have in return MERGE 5
, and the data is inserted into station_data_actual
, where the timestamps in created
and updated
have the same value.
To do even more testing you could:
truncate table station_data_new;
- Re-create data in it, by changing
generate_series(1,5)
withgenerate_series(1,10)
- Issue again the very same
MERGE
as we ran before
You'll see the data in station_data_actual
updated. Stations 1 to 5 will have updated
status and stations 6 to 10 will be created
and updated
.
Conclusion
MERGE
was a long awaited feature for PostgreSQL fans! Now we have it. I encourage you to review your processes when it's about merging existing data in your database. In this example, new data entering with existing data can be a good place to use this.
MERGE
opens new usages of aggregating and/or merging data from many databases. In distributed models, where all data exists across different locations, using foreign data wrappers with MERGE
could be a really elegant solution. MERGE
may simplify your processing of data all along the database's life. With strong and fast SQL statements as opposed to functions or other complex operations.
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read