Query Hugging Face Datasets from Postgres
If you missed the database news lately, you could have missed that we just fused DuckDB with Postgres to build a really fast analytics platform based on Postgres.
There’s so many interesting things you can do with this platform so expect to hear from me again 😉. Today I just want to show off one really simple trick for getting big data sets or training data into Postgres through Hugging Face.
Hugging Face is a community repository of datasets, LLMs, models and other resources for Machine Learning and AI. In addition to all the cool stuff they have going on with ML - the Hugging Face dataset repository has quite a few things that are publicly accessible and ready for data science or use in whatever application or capacity you might dream up.
Query a Hugging Face Table with Postgres
To get us started, we’re using a Crunchy Bridge for Analytics instance.
In order to create a table backed by a Hugging Face dataset, we need to first identify the URL to access the underlying parquet data.
Every Hugging Face dataset gets converted to parquet and is available if you use the proper URL incantation: hf://datasets/<username>/<datasetname>@~parquet/**/*.parquet
.
For any username/dataset combination, you will create a foreign table and path like this.
CREATE FOREIGN TABLE imdb_movies () SERVER crunchy_lake_analytics OPTIONS (
path 'hf://datasets/ShubhamChoksi/IMDB_Movies@~parquet/**/*.parquet'
);
One or many parquet files with wildcard
One really cool thing about this is that you can call a single parquet file or a whole batch with the wildcard. This gives you a lot of options for processing larger batches of data with a single foreign table. Or you can use a single parquet file from the data directory.
Postgres queries on Hugging Face data
I created a foreign table from the IMDB dataset so let’s query this data to see our movie rating trends:
SELECT
FLOOR(rating) AS whole_number_rating,
COUNT(name) AS movie_count
FROM
imdb_movies
GROUP BY 1
ORDER BY 1;
whole_number_rating | movie_count
---------------------+-------------
1 | 8
2 | 39
3 | 154
4 | 459
5 | 1193
6 | 2084
7 | 1477
8 | 312
9 | 24
| 841
Ok, that looks like what I’d expect. Although now I want to watch all 24 movies that are 9+.
Local data analytics
Now the cool thing with having Postgres data locally and other data elsewhere, is that I can use all my SQL tools to do analysis. Let’s do something simple like sentiment analysis on the movie descriptions and ratings.
I can make a function to look for positive and negative words in the text and give each one a sentiment rating.
CREATE OR REPLACE FUNCTION basic_sentiment_analysis(text) RETURNS INTEGER AS $$
DECLARE
pos_words TEXT[] := ARRAY['good', 'great', 'excellent', 'amazing', 'wonderful', 'positive', 'loved', 'like', 'best'];
neg_words TEXT[] := ARRAY['bad', 'terrible', 'awful', 'worst', 'negative', 'hated', 'dislike'];
word TEXT;
sentiment INTEGER := 0;
BEGIN
FOR word IN SELECT unnest(string_to_array($1, ' ')) LOOP
IF word = ANY (pos_words) THEN
sentiment := sentiment + 1;
ELSIF word = ANY (neg_words) THEN
sentiment := sentiment - 1;
END IF;
END LOOP;
RETURN sentiment;
END;
$$ LANGUAGE plpgsql;
And create a local table and insert my sentiment data
CREATE TABLE local_movie_sentiments (
movie_name TEXT,
sentiment_score INTEGER
);
INSERT INTO
local_movie_sentiments (movie_name, sentiment_score)
SELECT
name,
basic_sentiment_analysis (movie_info) AS sentiment_score
FROM
imdb_movies;
And join my local sentiment analysis with the Hugging Face data to get sentiment scores, ratings, and number of movies.
SELECT
lms.sentiment_score,
AVG(im.rating) AS average_rating,
COUNT(*) AS number_of_movies
FROM
local_movie_sentiments lms
JOIN imdb_movies im ON lms.movie_name = im.name
GROUP BY
lms.sentiment_score
ORDER BY
lms.sentiment_score;
sentiment_score | average_rating | number_of_movies
-----------------+-------------------+------------------
-5 | 5.199999999999999 | 2
-4 | 4.35 | 4
-3 | 4.36 | 15
-2 | 4.973684210526313 | 38
-1 | 5.743292682926828 | 164
0 | 6.231737493275953 | 2706
1 | 6.38562925170068 | 1176
2 | 6.338143289606467 | 991
3 | 6.519264705882353 | 682
4 | 6.548873873873866 | 444
5 | 6.653658536585366 | 246
6 | 6.522137404580157 | 131
7 | 6.811111111111113 | 81
8 | 6.794594594594596 | 37
9 | 6.976470588235294 | 34
10 | 7.051999999999997 | 25
11 | 6.749999999999999 | 12
12 | 6.738461538461538 | 13
13 | 6.25 | 2
14 | 7.066666666666667 | 3
16 | 5.6 | 1
One surprising trend here is that the largest number of movies have a neutral sentiment score of 0. There are also very few with a sentiment score with a negative number. So you could draw a conclusion here that there are very few movies with really negative words in descriptions, and no matter what the plot, folks are keeping their descriptions pretty upbeat.
Summary
With Crunchy Bridge for Analytics and the Hugging Face connector:
- You can query any public Hugging Face project from Postgres.
- You can keep the data remotely and run queries on it.
- You can easily join local analysis data with the foreign data.
Related Articles
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read
- Smarter Postgres LLM with Retrieval Augmented Generation
6 min read
- Postgres Partitioning with a Default Partition
16 min read