Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

Tuple shuffling: Postgres CTEs for Moving and Deleting Table Data

Avatar for David Christensen

David Christensen

5 min read

Recently we published an article about some of the best sql subquery tools and we were talking about all the cool things you can do with CTEs. One thing that doesn’t get mentioned near enough is the use of CTEs to do work in your database moving things around. Did you know you can use CTEs for tuple shuffling? Using CTEs to update, delete, and insert data can be extremely efficient and safe for your Postgres database.

PostgreSQL 15 included the MERGE statement, which can be similar. There are however some cases which cannot be covered by this, or if you need to use PostgreSQL versions before MERGE was introduced, this technique may come in handy.

Deleting rows and inserting to another table

A common use case where this technique can come in handy is to move rows from one table to another in a single statement. Imagine that you have a schema with a single table and an archive table and you want to move data to the archive table from the source table when it gets to be a year inactive.

This can be accomplished via something like:

WITH
  deleted AS (
    DELETE FROM table_a
    WHERE
      last_modified < now () - interval '1 year' RETURNING *
  )
INSERT INTO
  archive
SELECT
  *
FROM
  deleted;

This straightforward approach simply returns all rows from table_a which were deleted, then inserts them into our archive table (which is assumed to have the same structure as table_a).

What happens if this transaction gets interrupted? Fortunately due to the magic of MVCC, these actions all table place in a single transaction. You can of course use explicit transaction control if you were splitting this up into multiple statements run interactively or by an app, but having it be a single statement means you are guaranteed to have this be a single transaction from the get-go.

Filtering data

Sometimes you might want to delete all of the rows, but only archive some of them; you could accomplish this by sticking the qual in the WHERE clause of the INSERT statement, for example:

WITH
  deleted AS (
    DELETE FROM table_a
    WHERE
      last_modified < now () - interval '1 year' RETURNING *
  )
INSERT INTO
  archive
SELECT
  *
FROM
  deleted
WHERE
  priority = 'important';

Here we apply a filter to the rows which were returned by the DELETE statement and only archive those which were already marked as important. For a deeper dive into adding filters with LIMIT to UPDATE and DELETE, see my previous post.

More complicated examples

Diving in deeper, what if we had multiple tables that we wanted to archive from. Each had the same table structure. Imagine that we wanted to track the original record’s source table and modified the archive table to include this as the first field in the archive table.

We can use more CTE clauses and handle this still in one go:

WITH
  deleted_a AS (
    DELETE FROM table_a
    WHERE
      last_modified < now () - interval '1 year' RETURNING *
  ),
  deleted_b AS (
    DELETE FROM table_b
    WHERE
      last_modified < now () - interval '1 year' RETURNING *
  ),
  deleted_c AS (
    DELETE FROM table_c
    WHERE
      last_modified < now () - interval '1 year' RETURNING *
  )
INSERT INTO
  archive
SELECT
  'table_a',
  *
FROM
  deleted_a
UNION ALL
SELECT
  'table_b',
  *
FROM
  deleted_b
UNION ALL
SELECT
  'table_c',
  *
FROM
  deleted_c;

Since our INSERT statement includes all of the DELETE clauses, we will be pulling in all of the rows that were deleted from each of them and inserting them into a single table.

Update

This also works for UPDATE as well. UPDATE RETURNING will return the contents of the modified row, so we could simplify some logic to handle some more complex cases in a single query, for instance:

WITH
  target_accounts AS (
    SELECT
      id
    FROM
      accounts
    WHERE
      type = 'savings'
  ),
  balance_update AS (
    UPDATE balances
    SET
      amount = amount + 100
    FROM
      target_accounts
    WHERE
      account_id = target_accounts.id RETURNING *
  )
INSERT INTO
  awards (account_id, award)
SELECT
  account_id,
  'met savings goal'
FROM
  balance_update
WHERE
  amount >= 1000;

Partitioning

CTEs can be used to do more complicated things, like split up a table for partitioning:

WITH
  source_rows AS (
    DELETE FROM movies_unsorted RETURNING *
  ),
  action_movie_rows AS (
    INSERT INTO
      action_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'action' RETURNING id
  ),
  comedy_movie_rows AS (
    INSERT INTO
      comedy_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'comedy' RETURNING id
  ),
  romance_movie_rows AS (
    INSERT INTO
      romance_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'romance' RETURNING id
  ),
  horror_movie_rows AS (
    INSERT INTO
      horror_movies
    SELECT
      *
    FROM
      source_rows
    WHERE
      category = 'horror' RETURNING id
  )
INSERT INTO
  other_movies
SELECT
  *
FROM
  source_rows
WHERE
  id NOT IN (
    SELECT
      id
    FROM
      action_movie_rows
    UNION ALL
    SELECT
      id
    FROM
      comedy_movie_rows
    UNION ALL
    SELECT
      id
    FROM
      romance_movie_rows
    UNION ALL
    SELECT
      id
    FROM
      horror_movie_rows
  );

With this example, we delete our source rows for all of the movie data in “movies_unsorted”, then use multiple CTE clauses to categorize the data into the appropriate movie type, inserting in the partition that corresponds to the movies type that we’ve determined with our query, with a final catch-all that both serves to provide a way to insert any non-classified data as well as force the evaluation of the underlying CTEs (so in fact perform the INSERT into the appropriate tables).

Summary

CTEs are an important part of your toolkit and can be used for data manipulations and more complex tuple routing. Being able to name individual query pieces - including data manipulating ones like INSERT, UPDATE, or DELETE - and treating as an independent tuple source unlocks a lot of power and can be a source of creativity and problem solving.