
Real World Performance Gains With Postgres 17 B-tree Bulk Scans


Brandur Leach

8 min read

With RC1 freshly cut, the release of Postgres 17 is right on the horizon, giving us a host of features, improvements, and optimizations to look forward to.

As a backend developer, one in particular pops off the page, distinguishing itself amongst the dozens of new release items:

Allow btree indexes to more efficiently find a set of values, such as those supplied by IN clauses using constants (Peter Geoghegan, Matthias van de Meent)

The B-tree is by far Postgres' most common and best optimized index, used for lookups on a table's primary key or secondary indexes, and undoubtedly powering all kinds of applications all over the world, many of which we interact with on a daily basis.

During lookups, a B-tree is scanned, with Postgres descending through its hierarchy from the root until it finds a target value on one of its leaf pages. Previously, multi-value lookups like id IN (1, 2, 3) or id = any(array[1, 2, 3]) would require that process be repeated multiple times, once for each of the requested values. Although not perfectly efficient, it wasn't a huge problem because B-tree lookups are very fast. It'd take an extremely performance-sensitive user to even notice the deficiency.

As of a Postgres 17 enhancement to nbtree's ScalarArrayOp execution, that's no longer the case. A scan with multiple scalar inputs now considers all of those inputs as it traverses the B-tree, and where multiple values land on the same leaf page, they're retrieved together to avoid repeated traversals.
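To picture what that means at the query level, here's a minimal sketch (using a hypothetical table, not our schema) of the kind of statement affected. The plan shape is the same on 16 and 17; what changes is how many times the tree is descended to satisfy the array:

-- Hypothetical table standing in for any primary-key lookup.
CREATE TABLE item (id bigint PRIMARY KEY, payload text);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM item WHERE id = any(array[1, 2, 3]);

-- Plan (abbreviated):
--   Index Scan using item_pkey on item
--     Index Cond: (id = ANY ('{1,2,3}'::bigint[]))
--
-- On Postgres 17, values that fall on the same leaf page are fetched in
-- one traversal, which shows up as fewer index page reads in the
-- "Buffers" line than on 16.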

A narrowly focused script demonstrating the original problem shows a dramatic performance increase before and after the ScalarArrayOp improvement, so we already know the gains are very real. With Postgres 17 so close at hand, we wanted to measure what kind of gain a realistic web app might expect from the optimization by testing it against the real API service that powers Crunchy Bridge.

In our experiment we saw roughly a 30% improvement in throughput and a 20% drop in average request time -- promising to say the least. Read on for details.

List endpoints and eager loading

The API is a production-grade program written in Go (i.e. one with bells and whistles like auth, telemetry, and defensive hardening). I chose its GET /teams/:id/members endpoint (team member list) as a test dummy because it's a good middle ground between performance and sophistication: substantial enough to benefit from the index improvements, but simple enough to stay easy to understand.

It returns a list of team member API resources:

// A team member.
type TeamMember struct {
    apiresource.APIResourceBase

    // Primary ID of the team member record.
    ID eid.EID `json:"id" validate:"-"`

    // Properties of the account associated with the team member.
    Account *Account `json:"account" validate:"-"`

    // The role assigned to the team member.
    Role dbsqlc.TeamMemberRole `json:"role" validate:"required,teammemberrole"`
}

// Account information nested in a team member.
type Account struct {
    // Primary ID of the account.
    ID eid.EID `json:"id" validate:"-"`

    // Email associated with the account.
    Email string `json:"email" validate:"required,email,apistring200"`

    // Indicates that the account has a password set, as opposed to being
    // SSO-only with no usable password. It's possible for an account to have
    // both a password and a federated identity through a provider like Google
    // or Microsoft.
    HasPassword *bool `json:"has_password" validate:"required"`

    // Indicates that the account has a federated identity for single sign-on
    // through an identity provider like Google or Microsoft.
    HasSSO *bool `json:"has_sso" validate:"required"`

    // Whether the account has at least one activated multi-factor source.
    MultiFactorEnabled *bool `json:"multi_factor_enabled" validate:"required"`

    // Full name associated with the account.
    Name string `json:"name" validate:"required,apistring200"`
}

The team member itself is minimal, containing only ID and role as properties of its own, but embedding an account with detail on the user that's linked to the team member (a team member in this instance can be thought of as a row in a join table between an account and a team).

An account has obvious properties like an email and name, but also a few less common ones like has_password and multi_factor_enabled. These are used while rendering a list of team members in the UI to show badges next to each person for security features like "SSO-only (password-less) account" or "multi-factor enabled", thereby letting an admin vet the security posture of everyone on their team and reach out to team members who, for example, don't have MFA enabled.

These specifics aren't important, but they demonstrate a common pattern in which multiple database records are needed to render a final product. Team members and accounts are backed directly by their own database models, but although they render as simple booleans, properties like has_sso and multi_factor_enabled require loading federated identity and multi-factor credential records for the associated account.

The simplest possible version of loading a page of team members and rendering API resources is roughly:

fetch_team_member_page().map do |team_member|
  # One query per team member -- the classic N+1 pattern.
  account = fetch_account(team_member.account_id)
  render_team_member(team_member,
    account: render_account(account))
end

Our version looks more like:

team_members = fetch_team_member_page()
bundle = fetch_load_bundle(team_members)
team_members.map do |team_member|
  # No query here -- the account was preloaded into the bundle
  # (accessor name illustrative).
  account = bundle.account(team_member.account_id)
  render_team_member(bundle, team_member,
    account: render_account(bundle, account))
end

The key difference is that no data is loaded (e.g. fetch_account) inside the loop. Instead, we use Two-phase Load and Render, a technique where all the database records needed to render a set of API resources are loaded in bulk in a single pass, making N+1s difficult to write.
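In query-log terms, the difference looks something like this (a sketch with placeholder IDs):

-- Naive loop: one account lookup per team member (the classic N+1).
SELECT * FROM account WHERE id = <account1>;
SELECT * FROM account WHERE id = <account2>;
-- ...and so on, once per row in the page.

-- Two-phase load: a single bulk query for the whole page.
SELECT * FROM account WHERE id = any(array[<account1>, <account2>, ...]);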

Bulk query patterns

Fetching a page of team members looks exactly how you'd expect (queries have been simplified for brevity):

SELECT * FROM team_member WHERE team_id = <team_id>;

A set of account IDs is extracted from the result, and a few more lookups are made with them:

  • Select account records for each account ID extracted from team members:
SELECT * FROM account WHERE id = any(array[<account1>, <account2>, ...]);
  • Fetch federated identities for the accounts so that we can populate properties like has_sso:
SELECT * FROM account_federated_identity WHERE account_id = any(array[<account1>, <account2>, ...]);
  • And likewise, multi factors for setting a value on multi_factor_enabled:
SELECT * FROM multi_factor WHERE account_id = any(array[<account1>, <account2>, ...]);
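In the application these run as ordinary parameterized statements, with the whole ID set bound as a single array argument, along the lines of the sketch below (the uuid[] cast is illustrative; the real ID type is our own):

-- One round trip and one parameter, however many IDs the page produced.
SELECT * FROM account WHERE id = any($1::uuid[]);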

It's okay if the details specific to our app are a little fuzzy, but notice broadly that:

  • Like any web app or API, a number of different database models are interwoven to render the final product. We're using only four on this endpoint, but for a complex app, rendering even one page might require the use of hundreds of different models.

  • Lookups make heavy use of id = any(...), where the set being queried might be fairly large. Our API's default page size is 100, so given a full page each any(...) contains 100 account IDs.

And while we're using two-phase load and render, eager loading as found in frameworks like Ruby on Rails generates similar query patterns.
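For instance, a Rails includes-style eager load on a page of team members would typically emit queries of this shape (illustrative, not generated by our app):

SELECT "team_members".* FROM "team_members" WHERE "team_members"."team_id" = 1;
SELECT "accounts".* FROM "accounts" WHERE "accounts"."id" IN (10, 11, 12);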

Inducing load

We'll use the excellent go-wrk to benchmark the API, making sure to run it over a sustained period (60 seconds) to smooth out cold starts and give caches time to warm.

In a typical web app it's common for database calls to make up the lion's share of the time spent servicing a request, and that's true of our team member list endpoint, but there's a reasonable amount of non-database work happening too. The incoming request is parsed, sent through a middleware stack, its auth checked, telemetry/logging emitted, a response serialized, and so on.

We've left in this extra overhead on purpose. It's possible to demonstrate extreme performance benefits using large quantities of synthetic data combined with carefully crafted queries, but we're trying to demonstrate how the index lookup improvements will benefit a realistic use case.

In pursuit of having a reasonable set of data to test with, I generated a team with 100 (our default page size) team members/accounts along with associated records like federated identities and activated multi factors.
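As a rough sketch of what that seeding looks like (assumed minimal columns and literal IDs, not the real Crunchy Bridge schema or fixture code):

-- Generate 100 accounts and attach each to the test team as a member.
INSERT INTO account (id, email, name)
SELECT i, format('user%s@example.com', i), format('User %s', i)
FROM generate_series(1, 100) AS i;

INSERT INTO team_member (team_id, account_id, role)
SELECT 1, i, 'member'
FROM generate_series(1, 100) AS i;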

Benchmarked with Postgres 16:

$ go-wrk -d 60 -H 'Authorization: Bearer cbkey_dbGR3HgJkeFyJ8VUXAXeQHlnb5gIlZdoNYoNI51jmCVH6V' -M GET http://localhost:5222/teams/matjsvug6vb7javsjsugxbjtiy/members
Running 60s test @ http://localhost:5222/teams/matjsvug6vb7javsjsugxbjtiy/members
  10 goroutine(s) running concurrently
74272 requests in 59.977486758s, 2.54GB read
Requests/sec:           1238.33
Transfer/sec:           43.35MB
Overall Requests/sec:   1237.71
Overall Transfer/sec:   43.33MB
Fastest Request:        2.427ms
Avg Req Time:           8.074ms
Slowest Request:        147.039ms
Number of Errors:       0
10%:                    2.841ms
50%:                    3.105ms
75%:                    3.206ms
99%:                    3.283ms
99.9%:                  3.285ms
99.9999%:               3.285ms
99.99999%:              3.285ms
stddev:                 3.934ms

And on Postgres 17:

$ go-wrk -d 60 -H 'Authorization: Bearer cbkey_4SgqjRk3B9lcZp7sIb8vWZiJQRtT2MUr4cn7SBapnC2tTX' -M GET http://localhost:5222/teams/matjsvug6vb7javsjsugxbjtiy/members
Running 60s test @ http://localhost:5222/teams/matjsvug6vb7javsjsugxbjtiy/members
  10 goroutine(s) running concurrently
94484 requests in 59.978741362s, 3.23GB read
Requests/sec:           1575.29
Transfer/sec:           55.14MB
Overall Requests/sec:   1574.54
Overall Transfer/sec:   55.12MB
Fastest Request:        1.943ms
Avg Req Time:           6.347ms
Slowest Request:        97.279ms
Number of Errors:       0
10%:                    2.424ms
50%:                    2.713ms
75%:                    2.806ms
99%:                    2.877ms
99.9%:                  2.879ms
99.9999%:               2.88ms
99.99999%:              2.88ms
stddev:                 2.441ms

Highlights in graph form:

[Graph: requests/sec and average request time, Postgres 16 vs. Postgres 17]

The jump from Postgres 16 to 17 shows a ~30% improvement in throughput (1,238 RPS to 1,575 RPS) and a 20% drop in average request time (8 ms to 6.3 ms). That's not full multiples like a synthetic benchmark would produce, but for a real world application, a 20% across-the-board drop in request time is a big deal. Many developers over the years, this author included, have spent many more hours on optimizations that yielded far less.

The design and implementation of our Go-based API is admittedly pretty bespoke, but I'd expect to see gains not too far off this in applications making heavy use of eager loading in frameworks like Rails.

The flagship features of any new release tend to get the most glory, but these sorts of invisible but highly impactful optimizations are just as good. There's nothing quite like the satisfaction of seeing your entire stack get faster from just pushing an upgrade button!