Ruby on Rails Neighbor Gem for AI Embeddings
Over the past 12 months, AI has taken over budgets and initiatives. Postgres is a popular store for AI embedding data because it can store, calculate, optimize, and scale using the pgvector extension. A recently introduced gem to the Ruby on Rails ecosystem, the neighbor gem, makes working with pgvector and Rails even better.
Background on AI in Postgres
An “embedding” is a set of floating point values that represent the characteristics of a thing (nothing new, we’ve had these since the 70s). Using the OpenAI API or any of their competitors, you can send over blocks of text, images, and PDFs, and OpenAI will return an embedding with 1536 values representing the characteristics. With the pgvector extension, you can store that embedding in a vector column type on Postgres. Then, using nearest neighbor calculations, you can find the most-similar objects. For a deeper review of AI with Postgres, see my previous posts in this series.
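To make "nearest neighbor" concrete, here is a minimal plain-Ruby sketch of the idea: a brute-force search over a few made-up three-dimensional vectors. The recipe names, the values, and the helper names are all hypothetical, and this is not how pgvector does it internally; it only illustrates the calculation.

```ruby
# Brute-force nearest neighbor search over tiny, made-up embeddings.
# Real embeddings have far more dimensions (e.g. 1536); these are illustrative only.

# Euclidean (L2) distance between two equal-length vectors.
def euclidean_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Returns the label of the stored vector closest to the query vector.
def nearest_neighbor(query, items)
  items.min_by { |_label, vector| euclidean_distance(query, vector) }.first
end

# Hypothetical data: label => embedding
recipes = {
  "pancakes" => [0.9, 0.1, 0.0],
  "waffles"  => [0.8, 0.2, 0.1],
  "chili"    => [0.0, 0.9, 0.7]
}

puts nearest_neighbor([0.88, 0.12, 0.0], recipes)
```

The query vector sits closest to the "pancakes" vector, so that label is printed. A database does the same comparison, only over many more rows and dimensions, which is where indexes come in.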
The neighbor gem
By default, Ruby on Rails does not know about the "vector" data type. If you've used Ruby on Rails + Postgres + pgvector, you've probably written SQL queries in your migrations, and implemented some other janky-code. The neighbor gem will remove the janky-code, and take you back to a native ActiveRecord experience.
At a minimum, all you have to do is add the following to your Gemfile:
gem 'neighbor'
Side note: I can't overstate the impact Andrew Kane has had on embedding data in Postgres. He's also making it easy for developers to use those vector data types with Ruby on Rails and Node.
Fixed schema dump
The biggest risk of not using Neighbor is that ActiveRecord will create a broken db/schema.rb file. Because ActiveRecord does not understand the vector data type, running rails db:schema:dump does not fail; instead, it omits any table with that data type and writes this error into your db/schema.rb:
# Could not dump table "recipe_embeddings" because of following StandardError
# Unknown type 'vector(1536)' for column 'embedding'
With Neighbor, you'll get a fully-functional schema like the following:
create_table "recipe_embeddings", primary_key: "recipe_id", id: :bigint, default: nil, force: :cascade do |t|
  t.vector "embedding", limit: 1536, null: false
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.index ["embedding"], name: "recipe_embeddings_embedding", opclass: :vector_l2_ops, using: :hnsw
  t.index ["recipe_id"], name: "index_recipe_embeddings_on_recipe_id"
end
Notice that Neighbor also understands the [hnsw index type](https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector) released with pgvector 0.5.
Side note: for projects that go all-in on Postgres, I opt to use the following to dump to a db/structure.sql:
SCHEMA_FORMAT=sql rails db:schema:dump
Easier migrations + data type handling
Without Neighbor, ActiveRecord is not informed of vector, and that affects your migrations just as it affects your db/schema.rb file. With Neighbor, a typical migration would look something like the following:
create_table :recipe_embeddings, primary_key: [:recipe_id] do |t|
  t.references :recipe, null: false, foreign_key: true
  t.vector :embedding, limit: 1536, null: false
  t.timestamps
end
Additionally, you get improved handling of the vector data type. Without Neighbor, working with embedding data required to_s to manipulate the values when inserting into Postgres. But, with Neighbor, it simplifies to a native process:
RecipeEmbedding.create!(recipe_id: Recipe.last.id, embedding: [-0.078427136, 0.0014401458, ...])
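For comparison, the manual approach mentioned above boils down to building pgvector's text literal (for example '[1,2,3]') by hand before handing it to SQL. A minimal sketch of that conversion, where to_vector_literal is a hypothetical helper name and not part of Neighbor or Rails:

```ruby
# Hypothetical helper: formats a Ruby array as a pgvector text literal,
# e.g. [1, 2.5, -3] becomes "[1.0,2.5,-3.0]".
# Float() raises on non-numeric input, so bad values fail early.
def to_vector_literal(values)
  "[#{values.map { |v| Float(v) }.join(',')}]"
end

puts to_vector_literal([-0.078427136, 0.0014401458])
```

With Neighbor, this string-building step is handled for you, which is why the create! call above can take a plain Ruby array.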
But, wait! There's more …
The nearest_neighbors method
After you add the embedding column to a table, you can use has_neighbors to define your nearest neighbor queries:
class RecipeEmbedding < ApplicationRecord
  has_neighbors :embedding
end
Then, you can find the nearest neighbors like so:
recipe_embedding.nearest_neighbors(:embedding, distance: "euclidean").first
The distance calculations include euclidean and cosine.
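To see how the two distance options differ, here is a small plain-Ruby sketch using the standard definitions: Euclidean distance is the straight-line distance, and cosine distance is 1 minus the cosine similarity. The helper names and sample vectors are made up for illustration; the gem delegates the real work to pgvector.

```ruby
# Euclidean (L2) distance: sensitive to both direction and magnitude.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine distance: 1 - cosine similarity, so it only reflects direction.
# Note: undefined for zero-length vectors (division by zero norm).
def cosine_distance(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  1.0 - dot / (norm_a * norm_b)
end

a = [1.0, 0.0]
b = [2.0, 0.0] # same direction as a, twice the magnitude
c = [0.0, 1.0] # orthogonal to a

puts euclidean_distance(a, b) # 1.0: the points are one unit apart
puts cosine_distance(a, b)    # 0.0: identical direction
puts cosine_distance(a, c)    # 1.0: orthogonal vectors
```

Two vectors pointing the same way are "identical" under cosine distance even when their lengths differ, while Euclidean distance still separates them. Which one fits depends on how your embeddings were produced.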
Conclusion
Launching a project to use embeddings with Ruby on Rails?
Step 1: use the neighbor gem
Step 2: provision your database on Crunchy Bridge with pgvector
Step 3: profit