Parallel Queries in Postgres
Many folks are surprised to hear that Postgres has parallel queries out of the box. This was released in small batches across a half dozen versions of Postgres, so the major fanfare for having parallelism got a little bit lost.
By default Postgres is configured for two parallel workers. The Postgres query planner will assemble a few plans for any given query and will estimate the additional overhead of performing parallel queries, and make a go or no-go decision. Depending on the settings and the calculations of the query planner, parallel queries are typically used by large and long running queries — like warehouse or analytical workloads.
Below is the output of a sample EXPLAIN
when the query planner uses 2 parallel workers:
EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
SELECT * FROM orders WHERE order_date > '2024-01-01';
QUERY PLAN
-------------------------------------------------------------------------------------
Gather (cost=1000.00..11000.00 rows=10000 width=123) (actual time=0.051..12.345 rows=8500 loops=1)
Output: order_id, customer_id, order_date, order_amount
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=50 read=1000
-> Parallel Seq Scan on public.orders (cost=0.00..9000.00 rows=4167 width=123) (actual time=0.028..10.045 rows=2833 loops=3)
Output: order_id, customer_id, order_date, order_amount
Filter: (orders.order_date > '2024-01-01'::date)
Rows Removed by Filter: 11667
Buffers: shared hit=50 read=1000
Planning Time: 0.123 ms
Execution Time: 12.567 ms
Parallel queries were first released in Postgres 10 and we are currently at 16, with version 17 right around the corner. Every major release since version 10 has included performance improvements for parallel queries and has implemented parallelism into indexes, joins, and vacuum. Starting small and expanding is a common theme of Postgres development. Even PostGIS and pgvector take advantage of parallelism for queries and index builds.
When are parallel queries important
First off, parallel queries are not a substitution for optimizing queries with indexes. If your application is running the same CRUD operations over and over and over, use indexes appropriately. See Demystifying Database Performance for the ins-and-outs of indexes or Postgres Indexes for Newbies.
However, if you have long running queries against a warehouse (or a bigger-slower-database), you may want to tune your database for parallel query work. Any query where your database is performing comparisons or aggregations on a large set of data, it works. Such as:
- unpredictable queries on a table that contains JSONB columns
- aggregations on JSONB columns
- select queries with many different conditionals and distinct
OR
statements
The official types of operations where the Postgres Query Planner will attempt to include parallel queries is:
- parallel sequential scans (i.e. returning a row from storage and filtering it by a conditional)
- parallels index scans (i.e. returning a row from an index and filtering it by a conditional)
- parallel aggregate functions (i.e. performing a calculation on the filtered rows)
- parallel joins (i.e. finding the matching rows of a joined table)
The caveat is that parallel only works on select queries — not transactions that include inserts, updates, and deletes.
Parallel workers for queries, autovacuum, and indexes
Postgres has implemented parallel workers for the autovacuum system and for index building too. Query, maintenance, and index workers come out of the same pool because all of these workers require resources — and servers have a limited number of resources.
Parallel queries and the Postgres planner
Under the hood Postgres knows about a couple things:
parallel_setup_cost
an estimated cost that it takes to launch parallel worker processesparallel_tuple_cost
an estimated cost for transferring a tuple between worker processesmin_parallel_table_scan_size
- minimum data size that Postgres will consider for parallel table scan, defaults to 8mbmin_parallel_index_scan_size
- minimum data size that Postgres will consider for an index scan, defaults to 512kb (index scans are faster than table scans)
These are configurable, but generally only for testing. For real database workloads, it is best to let the planner do everything for you.
Tuning Postgres parallel queries
max_worker_processes: Sets the maximum number of total worker processes allowed for the entire PostgreSQL instance, the default value is 8. This number includes any workers used for parallel queries. The general rule of thumb is to make this 25% of total vCPU count or greater. Some set this to the CPU count of the machine to take advantage of the most parallel workers.
SET max_worker_processes = 8;
max_parallel_workers: Sets the maximum number of parallel query worker processes allowed for the entire PostgreSQL instance, the default value is 8. By setting this value equal to the max_worker_processes
, when no maintenance work is being run, Postgres will use all workers for queries. Conversely, high-transaction-rate systems limit the parallel workers to allocate workers for maintenance.
SET max_parallel_workers = 8;
max_parallel_workers_per_gather: Specifies the maximum number of workers that can be started by a single query. The default value is 2. The general rule of thumb here is that 2 might not always be enough. You can safely set this to half of your CPU count, or even your full CPU count if you want to make the most out of parallel queries.
SET max_parallel_workers_per_gather = 4;
Each parallel worker gets its own work_mem
Postgres DBAs track working memory (work_mem
) to make sure the processes have enough memory. If a worker needs to allocate memory each backend will allocate work_mem
. For example, if a query spawns 4 worker processes for parallel execution, and work_mem
is set to 32MB, each worker process will allocate 32MB, and this single operation would allocate 128MB of memory.
Parallel query workers use a connection slot
Parallel workers contribute to the overall number of backends that Postgres knows about. On startup, Postgres allocates memory and resources for the number of connections and auxiliary processes that it will use during its lifetime. Normal client connections are configured using the max_connections
server parameter, so the total number of client backends would be the sum of max_connections + max_parallel_workers
.
When tuning memory limits you will need to consider the number of parallel workers and the max_connections
parameter. The general rule of thumb is that 25% of your total system memory be dedicated to that work_mem
. You should ensure (max_connections + max_parallel_workers) * work_mem
does not exceed more than 25% of total memory.
Parallel query scaling
To see how effectively parallel queries can speed things up on a larger server, my colleague Greg Smith setup his benchmark lab with a pair of 32 core servers with all the limits removed:
max_worker_processes = 32
max_parallel_workers = 32
max_parallel_workers_per_gather = 32
max_parallel_maintenance_workers = 16
His CBC workload setup a 384GB data set to overflow the 128GB of system RAM. max_parallel_workers
was changed for each query to show how the process scaled upwards. Best results were from the Intel i9-14900K, which scanned just over 400GB in 290 seconds with one process, while finishing in only 88 seconds at the peak of 10 parallel workers:
The full details of the EXPLAIN (ANALYZE ON, BUFFERS ON) plan for the fastest run is available at https://explain.depesz.com/s/vFjf ; it achieved 2.4GB/s running the parallel sequential scan across the 10 processes.
Here are the highlights demonstrating how the worker memory can pile up:
Finalize HashAggregate (cost=59618234.67..59658351.42 rows=4011675 width=12) (actual time=87567.233..87974.729 rows=4800001 loops=1)
Group Key: seq
Batches: 1 Memory Usage: 499729kB
Buffers: shared hit=2080 read=52723028
I/O Timings: shared read=168932.639
-> Gather (cost=55365859.17..59417650.92 rows=40116750 width=12) (actual time=72785.273..74464.012 rows=52800011 loops=1)
Workers Planned: 10
Workers Launched: 10
Buffers: shared hit=2080 read=52723028
I/O Timings: shared read=168932.639
-> Partial HashAggregate (cost=55364859.17..55404975.92 rows=4011675 width=12) (actual time=72713.940..73299.108 rows=4800001 loops=11)
Group Key: seq
Batches: 1 Memory Usage: 499729kB
Buffers: shared hit=2080 read=52723028
I/O Timings: shared read=168932.639
Worker 0: Batches: 1 Memory Usage: 499729kB
Worker 1: Batches: 1 Memory Usage: 499729kB
Worker 2: Batches: 1 Memory Usage: 499729kB
Worker 3: Batches: 1 Memory Usage: 499729kB
Worker 4: Batches: 1 Memory Usage: 499729kB
Worker 5: Batches: 1 Memory Usage: 499729kB
Worker 6: Batches: 1 Memory Usage: 499729kB
Worker 7: Batches: 1 Memory Usage: 499729kB
Worker 8: Batches: 1 Memory Usage: 499729kB
Worker 9: Batches: 1 Memory Usage: 499729kB
-> Parallel Seq Scan on settings_loop (cost=0.00..54484942.11 rows=175983411 width=4) (actual time=0.172..22603.500 rows=159709124 loops=11)
Buffers: shared hit=2080 read=52723028
I/O Timings: shared read=168932.63
Real complicated queries with joins and sorts can use serious amounts of memory when unleashed in parallel. It’s better to slowly increase the turning parameters and review how queries run than to go too wild before you know your server.
Parallelism is not …
Parallelism is not multi-threaded Postgres. You may have heard of parallel computing popularized by GPU servers like those from Nvidia. While multi-threading runs multiple threads within a single process sharing the same memory, parallelism in Postgres splits a query into independent tasks executed by multiple processes, leveraging multi-core processors for efficiency. Multi-threading in Postgres is still under discussion due to the complexity of safely sharing memory and state across threads. I just saw an great keynote at pgconf.dev on this topic by Dr. Margo Seltzer. Some discussion about multi-threading during the conference as well if you’re curious.
Parallelism is not vectorized. When databases process data in batches, rather than row by row, that is called vectorized execution. When a vectorized query is run, data is processed in chunks, applying the same operation across multiple rows simultaneously. This method not only speeds up computation but also improves memory access patterns, reducing the number of CPU cache misses. Postgres does not currently have this natively, though by fusing Postgres with with DuckDB this is possible.
Summary
- Postgres has parallel queries out of the box and will initiate a two worker parallel query without any changes or settings.
- For many analytical workloads, tuning parallel querying and adding additional workers can help with performance.
- In benchmark testing, effectiveness parallel workers per query peaked at 10.
- Slowly changing the parallel query tuning is recommended, rather than a set tuning requirement.
Co-authored with Greg Smith
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read