Introducing Crunchy Data Warehouse: A next-generation Postgres-native data warehouse. Crunchy Data Warehouse Learn more

Scaling Vector Data with Postgres

Avatar for Christopher Winslett

Christopher Winslett

12 min read

Note: We have additional articles in this Postgres AI series.

Vector data has made its way into Postgres and I’m seeing more and more folks using it by the day. As I’ve seen use cases trickle in, I have been thinking a lot about scaling data and how to set yourself up for performance success from the beginning. The two primary trade-offs are performance versus accuracy. When seeking performance with vector data, we are using nearest neighbor algorithms, and those algorithms are built around probability of proximity. If your use-case requires 100% accuracy on nearest neighbor, performance will be sacrificed.

After choosing between performance versus accuracy, the next tools in the toolbox are caching and partitioning. Caching is obvious in some situations, if your product is finding “similar meals” or “similar products” or “similar support questions”, then the similarities will not change rapidly.

For the most part, the keys to scaling AI data are the same as scaling any other data type: reduce the number of rows in index and reduce the concurrent queries hitting the database. Once the index has done its work, CPU becomes the primary constraint: how fast can you calculate and compare distances between vectors? Scaling vector data is currently about performance mitigation as much as it is overpowering the data.

In the next few weeks, the Postgres pg_vector extension is launching HNSW indexes (see the commit history for pgvector loaded with HNSW commits)— a graph-based algorithm for finding nearest neighbor. When HNSW indexes launch, these scaling practices will continue to be applicable. HNSW has better performance and scale than the current list-based indexes, but it too incurs costs in terms of storage and build time.

Physical separation of data

My biggest advice here is to use a physically separate database for your vector data. Store your application data on one database, and store vector, lookup values and filter values on a separate database. You can easily connect the two with the Postgres foreign data wrapper.

Why separate your AI data? It gives you more flexibility when working with vector data out of AI models. Should you retrain your model, the vector output will be different, and you’ll need to rebuild the data and rebuild the indexes. When you want to experiment with the soon-to-launch HNSW indexing strategy, for example, you can fork your Postgres database and build the new index.

Vector data has sizable RAM requirements, so you can keep your operational Postgres database and vector databases on appropriately sized instances. We’ve recommended physical separation of data previously in Thinking Fast and Slow with your Data.

Performance first

If your use case can benefit from nearest-neighbor vector queries, then start with performance before scaling — concentrate on finding the right tuning for the index. To help with that process, I wrote about making vector data performant. Query tuning sounds simple, but just make sure your databases are using the available indexes — small changes to a query will change the inference of the query optimizer, which will not use the index. When it comes to vector queries, keep it simple.

Small setting changes to the index’s list values may affect query performance. The time spent tuning the vector data will be much quicker if you are working with physically separate data from the application.

Categorizing and Separating Data

Separating data can be done with conditional logic or it can be done at the schema level. For logical separation one easy way to do this is to tag data and index those tags. Then, indexes are more constrained and already have predefined similarities — like maybe a tag with the value “dessert” for restricting the nearest neighbor to other desserts.

In the code samples, we are going to build on the code-samples that we previously used. The entire code-base can be found here: https://github.com/CrunchyData/Postgres-AI-Tutorial/.

To setup the walk-through, first, clone the github project using git clone git@github.com:CrunchyData/Postgres-AI-Tutorial.git. Then, create a Postgres database with pgvector extension included -- if you build your Postgres on Crunchy Bridge, I can guarantee that you will have the pgvector extension.

Next, load the recipe-tracker.sql file into your Postgres database. Below, we use connection url for all connections, checkout our guide to Postgres connection URLs.

bash> cat recipe-tracker.sql | psql 'postgres://username:password@host:port/database'

Now, in the Github repo, I’ve created a script that uses a KMeans algorithm to find the 30 different clusters of meals from the data. To run the categorization file (see the full file here), run the following:

bash> pip install psycopg2 sklearn
bash> DATABASE_URL=postgres://username:password@host:port/database python categorizer.py

When running, it will return all the vectors in the database, and organize them into 6 different clusters. Iteratively, it will ask “how would you categorize the following?”, to which supply a tag, and it will update the respective rows with those categories.

Name one of your categories as “entree”, as we will use that in the next code example.

This is an aside, but the combination of OpenAI embedding data + the ease of k-means clustering in Python is 🧑‍🍳😘. Check out the categorizer.py file in the Github repo above to see how easy it is. Experiment change values in the file to see how cluster changes — change n_clusters on line 26 to have the algorithm generate more clusters of recipes.

Separation by index - Code Sample

Separating vector indexes based on conditionals using partial indexes is as you’ve used in other situations. If you wanted to index based on one of your customer’s ids, you can do the following:

CREATE INDEX recipes_embedding_category_entree ON recipes
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100)
WHERE category = 'entree';

In my test, I reduced the index size from 720 to 288 rows. On my machine, this reduces the size of the index from 6.5MB to 3.1MB. As expected, reduction of rows is linear. Given the above categorization process has many moving parts, expect a difference.

Then, when you run the following query, you’ll see that the index is used:

EXPLAIN
SELECT
	*
FROM recipes
WHERE category = 'entree'
ORDER BY embedding <-> (SELECT embedding FROM recipes WHERE id = 151)
LIMIT 1;

Where B-Tree supports multi-column indexes, vector indexes do not. When using conditionals, you’ll need to create indexes for each category. But, given different data sizes will need different lists sizes, you’ll need to use an algorithm for each value in the category to match the data size for the specific customer.

Use different tables - Code Sample

But, if you are running a multi-tenant database, why would you build data as such? Instead, you could use multiple tables, create tables to house the vector data for specific customers:

CREATE TABLE recipes_customer_1 (
	recipe_id BIGINT,
	embedding vector(1555)
)

CREATE INDEX ON recipes_customer_1 USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);

This would allow you to better measure the performance of each table, and thus each customer quite easily using pg_stat_statement. For instance, if you saw performance issues with recipes_customer_2, you can attempt to target the performance issues for that specific customer. As with conditional partitioning, separating by database results in a linear change in resource allocation based on sizes.

Caching

Caching can be pre-caching or post-query caching.

  • Pre-caching is caching at storage time.
  • Post-query caching is caching after running a query for a customer.

If you have an application that delivers the same recipes to all customers each week, then pre-caching each recipe at time of creation is pretty obvious. Find the nearest 5 - 10 recipes, then recommend. Use, post-query caching if you can’t anticipate the queries that will be run. Run the query, find the result, store the result for the next request.

For caching of these types of recommendation systems, store the data in Postgres on a lookup table in the application database. Then, track views and down votes to those recommendations, which can then be sorted later for refinement of future recommendations. Using this method, it’s not so much a naive cache system as a cache system that can learn from behaviors.

The largest implementations will have a combination of partitioning and caching. But, given a limited amount of time, which should you do first? It all depends. If I were serving a consumer app that prioritizes quick responses with close-accuracy, I would favor post-query caching to mitigate performance. If I were serving rapidly changing data on a multi-tenant application that prioritizes precision, I would favor partitioning over caching.

Rank your customer’s priorities when making the choice.

Pre-query Caching - Code Sample

A pre-query cache can be as simple as a lookup table. Below, we’ll create the lookup table:

CREATE TABLE recipe_recommendations (
	recipe_id integer not null,
	closest_recipe_id integer not null,
	rank integer not null
);

Then, we can use SQL’s merge command to populate this table:

WITH ranked_closest_recipes AS (
  SELECT
    original_recipes.id AS original_recipe_id,
    original_recipes.name AS original_recipe_name,
    closest_recipes.id AS closest_recipe_id,
    closest_recipes.name AS closest_recipe_name,
    rank
  FROM recipes AS original_recipes,
    LATERAL (
      SELECT
        compared_recipes.id,
        compared_recipes.name,
        RANK() OVER (ORDER BY compared_recipes.embedding <-> original_recipes.embedding) AS rank
      FROM recipes AS compared_recipes
      WHERE compared_recipes.id != original_recipes.id
      ORDER BY compared_recipes.embedding <-> original_recipes.embedding
      LIMIT 5
    ) AS closest_recipes
)

-- use merge command to find and insert new records
MERGE INTO recipe_recommendations AS target
  USING ranked_closest_recipes AS source
ON target.recipe_id = source.original_recipe_id AND target.rank = source.rank
WHEN NOT MATCHED THEN
  INSERT (recipe_id, closest_recipe_id, rank)
  VALUES (source.original_recipe_id, source.closest_recipe_id, source.rank)
WHEN MATCHED THEN
  UPDATE SET closest_recipe_id = source.closest_recipe_id;

Similar to the queries above, we are using LATERAL JOIN to find the 5 closest recipes. Then, we use merge to upsert values into the recipe_recommendations table. Instead of relying on vector data, now we are relying on classic B-tree indexes for optimization. To find the 5 closest recipes, it’s as simple as the following:

SELECT
	recipes.name
FROM recipe_recommendations
INNER JOIN recipes ON recipe_recommendations.closest_recipe_id = recipes.id
WHERE recipe_id = 151
ORDER BY rank;

For additional optimizations, add indexes to the recipe_recommendations table.

Dimensional Reduction

Since performance of vector search binds on CPU usage, a simple way to speed things up is just to shrink the size of the vectors. Comparing 50-dimensional vectors will be faster than comparing 1000-dimensional vectors.

When working with OpenAI, vector embeddings are a set of 1536 values. Those 1536 values represent the universe of potential differences. But, if you are just working with say food recipes, then 1536 values is not really that necessary. Instead, you could find a limited number of values in the set with the largest range, and reduce the number of values.

Dimensional Reduction - Code Sample

Below, we will extract the 50 vectors with the largest range. This is naive because we assume that the 50 vectors with the largest range will have the greatest impact on nearest neighbor distance. If there are a total of 1536 vectors in OpenAI embeddings, then all of them are not valuable, but let’s see how the top 50 shake out.

The Github project has a Python script that will compact the vectors properly. To run the file on the sample SQL data, execute the following:

DATABASE_URL=postgres://localhost:5432/postgres python reducer.py

The Python code for the compressor looks like this (see the full file here):

# Initialize vector_summation list
vector_summation = []

cursor.execute("SELECT id, embedding FROM recipes WHERE embedding IS NOT NULL")

# Fetch data from the recipes table
for row in cursor:
    recipe_id, embedding = row[0], row[1]

    # Evaluate the embedding string into a list of vectors
    vectors = ast.literal_eval(embedding)

    # Iterate through vectors and update vector_summation with the id, min, and max
    for index, vector in enumerate(vectors):
        if len(vector_summation) <= index:
            vector_summation.append({'index': index})

        vector_summation[index]['max'] = max(vector_summation[index].get('max', vector), vector)
        vector_summation[index]['min'] = min(vector_summation[index].get('min', vector), vector)

# Sort vector_summation by max - min difference and select the last 50 and 10 elements
vector_summation.sort(key=lambda x: x['max'] - x['min'])
vector_index = [v['index'] for v in vector_summation[-50:]]
super_compressed = [v['index'] for v in vector_summation[-10:]]

# Update recipes with compacted_embedding and super_compacted_embedding
cursor.execute("SELECT id, embedding FROM recipes WHERE python_compacted_embedding IS NULL AND embedding IS NOT NULL")

# Open an update cursor so we can update while keeping the select cursor open
update_cursor = conn.cursor()

for row in cursor:
    recipe_id, embedding = row[0], row[1]
    vectors = ast.literal_eval(embedding)

    compacted_vector = [vectors[index] for index in vector_index]
    super_compressed_vector = [vectors[index] for index in super_compressed]

Now, let’s add an index on the vectors:

CREATE INDEX ON recipes
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);

CREATE INDEX ON recipes
USING ivfflat (compacted_embedding vector_l2_ops)
WITH (lists = 100);

CREATE INDEX ON recipes
USING ivfflat (super_compacted_embedding vector_l2_ops)
WITH (lists = 100);

Now, let’s look at the index sizes by running \di+, and we’ll see that the compacted embedding for this use case has a 800kb index size, and the full embedding has a 6.5MB index size. While smaller, due to the nature of the index, the drop in size is not directly proportional as it is affected by list count. The super compacted index will fail to build because it does not have enough data.

What’s the trade-off? We’ve shrunk the index, but how much fidelity did we lose? Let’s compare the top 5 nearest neighbor recipes to chocolate chip cookies using the embedding and the compacted_embedding field. When running the following query on the embedding with 1536 vectors, the results are below:

SELECT
	name
FROM recipes
ORDER BY embedding <-> (SELECT embedding FROM recipes WHERE id = 151)
LIMIT 5;
name
--------------------------
 Cookies, chocolate chip
 Cookies, crisp chocolate
 Cookies, chocolate drop
 Bar, toffee, crisp
 Cookies, peanut butter
(5 rows)

Then, when running the query on the following compacted embedding, we get similar results:

SELECT
	id, name
FROM recipes
ORDER BY compacted_embedding <-> (SELECT compacted_embedding FROM recipes WHERE id = 151)
LIMIT 5;
name
--------------------------
 Cookies, chocolate chip
 Cookies, crisp chocolate
 Cookies, chocolate drop
 Cookies, coconut cereal
 Cookies, peanut butter
(5 rows)

If your results are like mine, there may be a few trade-offs between the top-5. Not all results will have equal trade-offs, so it is up to you to determine if this fits your use case. Broad datasets will have more detail in smaller differences. Common datasets, like recipes, will have the majority of difference in the vectors with the largest ranges.

Conclusion

There are a few paths to scaling vector data. Start with the end in mind. Pencil in where your use-case lands on the path between accuracy and performance. First, seek performance, then seek scale. The toolbox is a common one.

As you start building and scaling your pg_vector database:

  • Physically separate the vector db from your other production resources. They have different resource requirements.
  • Look at query tuning and performance.
  • Think about easy ways to categorize and separate data, separate at the table level if possible.
  • Add indexing for your categorizations.
  • Use a lookup table for caching.
  • Look at dimensional reduction to reduce the overall data set.