Query Hugging Face Datasets from Postgres

Looking for Postgres to connect to your data lake? Crunchy Data Warehouse is the latest Postgres-native tool with full Iceberg support.

If you missed the database news lately, you could have missed that we just fused DuckDB with Postgres to build a really fast analytics platform based on Postgres.

There’s so many interesting things you can do with this platform so expect to hear from me again 😉. Today I just want to show off one really simple trick for getting big data sets or training data into Postgres through Hugging Face.

Hugging Face is a community repository of datasets, LLMs, models and other resources for Machine Learning and AI. In addition to all the cool stuff they have going on with ML - the Hugging Face dataset repository has quite a few things that are publicly accessible and ready for data science or use in whatever application or capacity you might dream up.

Query a Hugging Face Table with Postgres

To get us started, we’re using a Crunchy Bridge for Analytics instance.

In order to create a table backed by a Hugging Face dataset, we need to first identify the URL to access the underlying parquet data.

Every Hugging Face dataset gets converted to parquet and is available if you use the proper URL incantation: hf://datasets/<username>/<datasetname>@~parquet/**/*.parquet.

For any username/dataset combination, you will create a foreign table and path like this.

CREATE FOREIGN TABLE imdb_movies () SERVER crunchy_lake_analytics OPTIONS (
	path 'hf://datasets/ShubhamChoksi/IMDB_Movies@~parquet/**/*.parquet'
);

One or many parquet files with wildcard

One really cool thing about this is that you can call a single parquet file or a whole batch with the wildcard. This gives you a lot of options for processing larger batches of data with a single foreign table. Or you can use a single parquet file from the data directory.

Postgres queries on Hugging Face data

I created a foreign table from the IMDB dataset so let’s query this data to see our movie rating trends:

SELECT
	FLOOR(rating) AS whole_number_rating,
	COUNT(name) AS movie_count
FROM
	imdb_movies
GROUP BY 1
ORDER BY 1;

 whole_number_rating | movie_count
---------------------+-------------
                   1 |           8
                   2 |          39
                   3 |         154
                   4 |         459
                   5 |        1193
                   6 |        2084
                   7 |        1477
                   8 |         312
                   9 |          24
                     |         841

Ok, that looks like what I’d expect. Although now I want to watch all 24 movies that are 9+.

Local data analytics

Now the cool thing with having Postgres data locally and other data elsewhere, is that I can use all my SQL tools to do analysis. Let’s do something simple like sentiment analysis on the movie descriptions and ratings.

I can make a function to look for positive and negative words in the text and give each one a sentiment rating.

CREATE OR REPLACE FUNCTION basic_sentiment_analysis(text) RETURNS INTEGER AS $$
DECLARE
    pos_words TEXT[] := ARRAY['good', 'great', 'excellent', 'amazing', 'wonderful', 'positive', 'loved', 'like', 'best'];
    neg_words TEXT[] := ARRAY['bad', 'terrible', 'awful', 'worst', 'negative', 'hated', 'dislike'];
    word TEXT;
    sentiment INTEGER := 0;
BEGIN
    FOR word IN SELECT unnest(string_to_array($1, ' ')) LOOP
        IF word = ANY (pos_words) THEN
            sentiment := sentiment + 1;
        ELSIF word = ANY (neg_words) THEN
            sentiment := sentiment - 1;
        END IF;
    END LOOP;
    RETURN sentiment;
END;
$$ LANGUAGE plpgsql;

And create a local table and insert my sentiment data

CREATE TABLE local_movie_sentiments (
    movie_name TEXT,
    sentiment_score INTEGER
);

INSERT INTO
	local_movie_sentiments (movie_name, sentiment_score)
SELECT
	name,
	basic_sentiment_analysis (movie_info) AS sentiment_score
FROM
	imdb_movies;

And join my local sentiment analysis with the Hugging Face data to get sentiment scores, ratings, and number of movies.

SELECT
	lms.sentiment_score,
	AVG(im.rating) AS average_rating,
	COUNT(*) AS number_of_movies
FROM
	local_movie_sentiments lms
	JOIN imdb_movies im ON lms.movie_name = im.name
GROUP BY
	lms.sentiment_score
ORDER BY
	lms.sentiment_score;

sentiment_score |  average_rating   | number_of_movies
-----------------+-------------------+------------------
              -5 | 5.199999999999999 |                2
              -4 |              4.35 |                4
              -3 |              4.36 |               15
              -2 | 4.973684210526313 |               38
              -1 | 5.743292682926828 |              164
               0 | 6.231737493275953 |             2706
               1 |  6.38562925170068 |             1176
               2 | 6.338143289606467 |              991
               3 | 6.519264705882353 |              682
               4 | 6.548873873873866 |              444
               5 | 6.653658536585366 |              246
               6 | 6.522137404580157 |              131
               7 | 6.811111111111113 |               81
               8 | 6.794594594594596 |               37
               9 | 6.976470588235294 |               34
              10 | 7.051999999999997 |               25
              11 | 6.749999999999999 |               12
              12 | 6.738461538461538 |               13
              13 |              6.25 |                2
              14 | 7.066666666666667 |                3
              16 |               5.6 |                1

One surprising trend here is that the largest number of movies have a neutral sentiment score of 0. There are also very few with a sentiment score with a negative number. So you could draw a conclusion here that there are very few movies with really negative words in descriptions, and no matter what the plot, folks are keeping their descriptions pretty upbeat.

Summary

With Crunchy Bridge for Analytics and the Hugging Face connector:

You can query any public Hugging Face project from Postgres.
You can keep the data remotely and run queries on it.
You can easily join local analysis data with the foreign data.

Latest Articles