Exploring PL/Python: Turn Postgres Table Data Into a NumPy Array
PL/Python can be a convenient and powerful way to take advantage of your PostgreSQL database. In an earlier post, I talked a little bit about how you can use modules included in the Python standard library with PL/Python functions. In this post, we'll try running NumPy in a simple user-defined function which also takes advantage of PL/Python database access functions. The function will show a working example of how to easily convert a data table in Postgres to a NumPy array.
Perhaps you've got a data file, and you may be using standard Python modules like `csv` and/or functions such as NumPy's `genfromtxt()` to ingest the data for analysis. But if you already have the data in Postgres, you have the option to carry out some processing on the database level as well.
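For comparison, the file-based route might look something like this minimal sketch; the file name, semicolon delimiter, and header row here are assumptions about how the file is formatted, not details from the database:

```python
import numpy as np

# Minimal sketch of file-based ingestion; assumes "winequality-red.csv"
# has a header row and semicolon-delimited columns (both are assumptions)
data = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
print(data.shape)  # e.g. (1599, 12) for the red wine quality dataset
```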
In playing around with NumPy in PL/Python, I wondered if it was possible to take an entire Postgres table and "transform" it into a NumPy array. For this exercise I used data on red wine quality (available from the UCI Machine Learning Repository), stored in a Postgres table called `winequality_r`. Here's a sampling of the table data (total rows = 1599):
| id | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfurdioxide | total_sulfurdioxide | density | pH | sulphates | alcohol | quality |
|----|---------------|------------------|-------------|----------------|-----------|--------------------|---------------------|---------|------|-----------|---------|---------|
| 1  | 7.4           | 0.7              | 0           | 1.9            | 0.076     | 11                 | 34                  | 0.9978  | 3.51 | 0.56      | 9.4     | 5       |
| 2  | 7.8           | 0.88             | 0           | 2.6            | 0.098     | 25                 | 67                  | 0.9968  | 3.2  | 0.68      | 9.8     | 5       |
I wanted this table to end up looking like the following with NumPy - an array with a `shape` value of `(1599, 12)`:

```python
array([[7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4, 5.0],
       [7.8, 0.88, 0.0, 2.6, 0.098, 25.0, 67.0, 0.9968, 3.2, 0.68, 9.8, 5.0],
       ...
      ])
```
I figured that I would need to execute some SQL inside the PL/Python function, so here's what I eventually decided to do:
1. Initialize an empty list, which would eventually become a list of lists that can then be turned into a two-dimensional NumPy array:

    ```python
    new_list = []
    ```

2. Select all table rows from `winequality_r`, but exclude the `id` values:

    ```sql
    SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM winequality_r
    ```

3. Find a way to append each row as a list within `new_list`. Basically, if I could get the results of that `SELECT` statement to be an iterable, I could presumably convert each row into a list and then add all of them to the "outer" `new_list`; perhaps something that looks like this:

    ```python
    for r in results:
        new_list.append(r)
    ```

4. Finally, convert `new_list` to a NumPy array (a quick standalone check of this step is sketched right after this list):

    ```python
    np.asarray(new_list, dtype='float')
    ```
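Here's that last step tried standalone, outside Postgres, using the two sample rows from the table above - just to confirm that a list of lists becomes a two-dimensional array of the expected shape:

```python
import numpy as np

# Two sample rows as plain Python lists
rows = [[7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4, 5.0],
        [7.8, 0.88, 0.0, 2.6, 0.098, 25.0, 67.0, 0.9968, 3.2, 0.68, 9.8, 5.0]]

arr = np.asarray(rows, dtype='float')
print(arr.shape)  # (2, 12) - the full table would give (1599, 12)
```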
Database access functions in `plpy`
In checking the official docs, I saw that PL/Python automatically imports a module called `plpy`, which comes with database access functions that allow you to execute Postgres commands. Two options looked promising: `plpy.execute()` and `plpy.cursor()`. Both take in a SQL query string and return a result object.
According to the docs, calling `plpy.execute()` causes the entire result set to be read into memory; on the other hand, `plpy.cursor()` returns a cursor object with a fetch method, so you can process the result in smaller batches.
At this point, all I needed was to run a simple `SELECT`, and it wasn't a massive dataset, so I tried `plpy.execute()` first. This function returns an object that "emulates a list or dictionary object" - lists and dictionaries are iterables, so it seemed like it should work.
So I now have the beginnings of my function:
```sql
CREATE OR REPLACE FUNCTION table_to_narray ()
RETURNS numeric[]
AS $$
import numpy as np

new_list = []
wine_r = plpy.execute("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM winequality_r")
for r in wine_r:
    new_list.append(r)
return np.asarray(new_list, dtype='float')
$$ LANGUAGE 'plpython3u';
```
But running `SELECT table_to_narray();` returns an error indicating that I don't quite have the right data type in my list:
```
ERROR: TypeError: float() argument must be a string or a number, not 'dict'
CONTEXT: Traceback (most recent call last):
  PL/Python function "table_to_narray", line 19, in <module>
    return np.asarray(new_list,dtype='float')
  PL/Python function "table_to_narray", line 491, in asarray
PL/Python function "table_to_narray"
```
Utility functions in `plpy` to the rescue
From the error, `'dict'` seemed to refer to the return object from `plpy.execute()`. I'll spare you the gory details of the various things I tried to get the for loop to build the list that I wanted, but my colleague Steve Pousty was kind enough to point out the `plpy` utility functions, which allow PL/Python to send messages and exceptions to the client - helping you debug Python in Postgres!
You'll want to take note of the `client_min_messages` runtime config variable: the default level is `NOTICE`, but `INFO` level messages are always sent to the client as well. I'll go with `plpy.info()` here, but `plpy.notice()` works just fine too.
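For reference, the setting can be changed per session with a plain `SET` (shown here with its default value):

```sql
-- NOTICE is the default level; INFO messages reach the client regardless
SET client_min_messages = notice;
```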
I wanted to look at what exactly I was getting from `plpy.execute()`, so I just added this immediately after the SQL statement execution:

```python
plpy.info(wine_r)
```

```
INFO: <PLyResult status=5 nrows=1599 rows=[{'fixed_acidity': Decimal('7.4'), 'volatile_acidity': Decimal('0.70'), … }, … }]>
```
That does indeed look like something that emulates a dictionary object. I know that I only want the dictionary values (e.g. `7.4`) and not the dictionary keys or anything else from the result object. `PLyResult` isn't well documented (though you can take a look at the source code), but it does appear that, like the cursor object returned from `plpy.cursor()`, iterating over `PLyResult` gives you each table row already in dictionary format. In any case, Python's built-in `dict.values()` method worked well enough for extracting just the values to add to `new_list`.
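In isolation, the difference looks like this (a toy two-column row for brevity; the exact TypeError wording can vary across NumPy versions):

```python
import numpy as np

# A row the way PLyResult yields it: a dict mapping column names to values
row = {'fixed_acidity': 7.4, 'volatile_acidity': 0.7}

# Passing the dict itself to NumPy reproduces the earlier error:
# np.asarray([row], dtype='float')  # TypeError: ... not 'dict'

# Extracting just the values gives NumPy something it can coerce to floats
values = list(row.values())
print(values)                               # [7.4, 0.7]
print(np.asarray([values], dtype='float'))  # [[7.4  0.7]]
```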
From table to NumPy array in a user-defined function
Here's the version that I was happy with:
```sql
CREATE OR REPLACE FUNCTION table_to_narray ()
RETURNS numeric[]
AS $$
import numpy as np

new_list = []
wine_r = plpy.execute("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM public.winequality_r")
for r in wine_r:
    # Each row in the result is converted to a list that gets appended to the outer list
    new_list.append(list(r.values()))
# Convert the list of lists to a 2D NumPy array,
# but to make the function return a SQL array, a Python list must be returned
return np.asarray(new_list, dtype='float').tolist()
$$ LANGUAGE 'plpython3u';
```
Running `SELECT table_to_narray();` now gives:

```
table_to_narray
-----------------
{{7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5.0},{7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5.0},
...
{6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6.0}}
```
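Since the function returns an ordinary SQL array, you can work with the result using standard array operations on the database side. Postgres arrays are 1-indexed, so the first cell comes out like this:

```sql
-- First column of the first row; returns 7.4 for this dataset
SELECT (table_to_narray())[1][1];
```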
The main thing I'll point out is that I'm not doing anything else with the NumPy array here, so for the function to return the result as a SQL array, the array has to be converted back to a list. This is a rather superficial example; for a real use case, I'd imagine you'd go on to actually operate on and process the array, eventually having your PL/Python function return a different final (and usable!) value.
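For instance, here's a quick sketch of what that might look like - a variant (I'll call it `table_column_means`; the name is just illustrative) that computes the per-column means with NumPy and returns those as a SQL array instead of the raw data:

```sql
CREATE OR REPLACE FUNCTION table_column_means ()
RETURNS numeric[]
AS $$
import numpy as np

new_list = []
wine_r = plpy.execute("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM winequality_r")
for r in wine_r:
    new_list.append(list(r.values()))

# This time, actually operate on the array: compute the mean of each column,
# then convert back to a Python list so Postgres gets a SQL array
arr = np.asarray(new_list, dtype='float')
return arr.mean(axis=0).tolist()
$$ LANGUAGE 'plpython3u';
```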
So this little exercise answered my question: you can indeed create a function that takes a table in Postgres and converts it into a NumPy array. This was a great way to learn a bit about the database access and utility functions too. The docs do recommend the `plpy.cursor()` method for larger datasets, so that may be something for you to keep in mind.
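Here's a rough sketch of how that batched approach could look, following the fetch-loop pattern from the docs (the function name and the batch size of 200 are just illustrative):

```sql
CREATE OR REPLACE FUNCTION table_to_narray_batched ()
RETURNS numeric[]
AS $$
import numpy as np

new_list = []
cur = plpy.cursor("SELECT fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfurdioxide, total_sulfurdioxide, density, ph, sulphates, alcohol, quality FROM winequality_r")
while True:
    # Pull rows in batches instead of reading the whole result set into memory
    rows = cur.fetch(200)
    if not rows:
        break
    for r in rows:
        new_list.append(list(r.values()))
return np.asarray(new_list, dtype='float').tolist()
$$ LANGUAGE 'plpython3u';
```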
What else might you do with Python embedded in Postgres? It seems like there are a lot of possibilities - for example, my colleague Craig Kerstiens dives into setting up a recommendation engine with Python and Pandas inside Postgres in this blog post. You might also want to give Crunchy Bridge a try: Crunchy Data's new Postgres cloud service that includes PL/Python (and packages like Pandas and NumPy) out of the box. In any case - if you have ideas, or are already doing something interesting with PL/Python that you'd like to share, feel free to leave a comment and let us know!