Postgres Data Types in PL/Python
I've been digging a little bit into PL/Python recently, inspired by our recent launch of Crunchy Bridge: a multi-cloud Postgres service with PL/Python and PL/R built right in. One thing that has tripped me up a few times is how PostgreSQL data types work with Python, especially when using external packages like NumPy or SciPy.
Data type mapping is something you'll deal with whenever you work across languages or try to get separate systems to talk to one another. It isn't unique to PL/Python; in fact, one colleague once wrote a program in C to interoperate between Ruby and Python, just to make this kind of thing easier. For each of the procedural languages available in Postgres, there's at least some documentation on which data types in that language correspond to which types in Postgres (and vice versa). For PL/Python, there are a few differences to note in the mappings for Python 2 and Python 3; I'd recommend studying the docs, as they're pretty comprehensive. Just a quick reminder that Python 2.7 reached end-of-life earlier in 2020, so the code samples here (as well as our other posts on PL/Python) will be using Python 3 (plpython3u).
With all of that said, I'll share a few interesting things I encountered in PL/Python data type mapping, especially when I added NumPy and SciPy to the picture. I'll assume you have Python knowledge and are also familiar with Postgres user-defined functions, but if not, I've made sure to link to documentation along the way.
Arrays in NumPy/Python vs Postgres
Like I mentioned in an earlier post, you can use external packages in PL/Python as long as the package is installed on the same host as Postgres, alongside your Python installation. Let's do a quick, basic example of a function that has input parameters and uses NumPy:
CREATE OR REPLACE FUNCTION create_array_from_range (start integer, stop integer)
RETURNS int[]
AS $$
import numpy as np
# arange() produces a NumPy array; convert it to a Python list so Postgres gets an int[] back
return np.arange(start, stop).tolist()
$$ LANGUAGE 'plpython3u';
The arange() function in NumPy returns a one-dimensional NumPy array, which behaves like built-in container types in Python such as lists and dictionaries. The PL/Python docs on array types state that if you want to get a Postgres array out of a PL/Python function, you should return a Python list. So we also call tolist() to ensure that we're returning a list value, and ultimately an array back out to Postgres.
Let's see what happens when we use this function with values from a sample table int_table:
id | col1 | col2
---|------|-----
1  | 3    | 7
2  | 5    | 56
3  | 99   | 107
SELECT create_array_from_range(col1, col2) FROM int_table;
create_array_from_range
----------------------------------------
{3,4,5,6}
{5,6,7, ... ,53,54,55}
{99,100,101, ... ,104,105,106}
We get a one-dimensional array for each row. Working with multi-dimensional arrays is fine too: the "inner" arrays on the same level have to be of the same size, and again you'll want to convert the entire array into a nested list, as in the sketch below.
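Here's a minimal sketch of that idea (the function name and the 2x3 shape are just made up for illustration), reshaping a NumPy range into two rows of three before handing it back as a nested list:

CREATE OR REPLACE FUNCTION create_2d_array_from_range (start integer, stop integer)
RETURNS int[]
AS $$
import numpy as np
# reshape(2, 3) assumes exactly six elements between start and stop;
# every "inner" list ends up the same length, which Postgres requires for multi-dimensional arrays
return np.arange(start, stop).reshape(2, 3).tolist()
$$ LANGUAGE 'plpython3u';

Calling SELECT create_2d_array_from_range(1, 7); should give back a two-dimensional array along the lines of {{1,2,3},{4,5,6}}.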
Passing table columns as array arguments
You can also pass in entire table columns as arrays to the function. As you might expect, Postgres arrays are likewise passed into PL/Python as Python lists. In the next function example we'll work with data on red wine quality saved in a table called winequality_r, and this time we'll also play around with SciPy, just for the sake of demonstration.
This data is available from the UCI Machine Learning Repository. The table has columns that store numeric data on attributes of 1599 red wines such as sulphates, residual sugar, pH, etc., and one column for the "quality" score for that wine, an integer from 0 to 10.
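If you'd like to follow along, a table along these lines would do; the column names here are my own assumptions based on the dataset's attributes (only id, alcohol, and quality are actually used below):

CREATE TABLE winequality_r (
    id serial PRIMARY KEY,
    fixed_acidity numeric,
    volatile_acidity numeric,
    citric_acid numeric,
    residual_sugar numeric,
    chlorides numeric,
    free_sulfur_dioxide numeric,
    total_sulfur_dioxide numeric,
    density numeric,
    ph numeric,
    sulphates numeric,
    alcohol numeric,
    quality integer
);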
As a disclaimer, I'm not a data scientist or analyst, but I do like wine (for reds I like cabernets, and you can never go wrong with Spanish wines; something I also recently discovered but haven't had much luck finding locally is orange wine). So I thought I'd try my hand at some simple correlation testing between a given physicochemical attribute and the quality of the wine:
CREATE OR REPLACE FUNCTION correlation_test (attribute_col numeric[], quality_col int[])
RETURNS text
AS $$
import numpy as np
from scipy.stats import pearsonr
# numeric[] arrives as a list of Decimal values, so convert both inputs to float NumPy arrays
attribute = np.array(attribute_col, dtype='float')
quality = np.array(quality_col, dtype='float')
return pearsonr(attribute, quality)
$$ LANGUAGE 'plpython3u';
If I wanted to see whether there was a correlation between alcohol content and quality:
SELECT correlation_test(
array_agg(alcohol ORDER BY id),
array_agg(quality ORDER BY id)
) FROM winequality_r;
correlation_test
---------------------------------------------
(0.4761663239992742, 2.8314769799724706e-91)
In addition to using array_agg(), you could use the ARRAY constructor as well to pass arrays into the function:
SELECT correlation_test(
(SELECT ARRAY(SELECT alcohol FROM winequality_r ORDER BY id)),
(SELECT ARRAY(SELECT quality FROM winequality_r ORDER BY id))
);
The SciPy pearsonr() function takes two input arrays. The Postgres arrays come into the function as Python lists, and the numeric[] column in particular arrives as a list of Decimal values, so we convert both to float NumPy arrays with NumPy's array() function before calling pearsonr(). The output is the Pearson correlation coefficient and p-value for the two arrays.
Out of curiosity, I used the utility function plpy.info() to check what data type was being returned:
plpy.info(type(pearsonr(attribute, quality)))
INFO: <class 'tuple'>
I've just set the function return type to text here, but you could of course choose another return type such as an array (which should be compatible with Python tuples); a quick sketch of that variant follows below. After that, we'll try returning the result as multiple values, i.e. into multiple columns, out of PL/Python.
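As a sketch of the array idea (the function name is made up, and converting the result tuple to a list of floats is my own addition to satisfy the list-to-array mapping):

CREATE OR REPLACE FUNCTION correlation_test_array (attribute_col numeric[], quality_col int[])
RETURNS double precision[]
AS $$
import numpy as np
from scipy.stats import pearsonr
attribute = np.array(attribute_col, dtype='float')
quality = np.array(quality_col, dtype='float')
# pearsonr() returns a (coefficient, p-value) pair; return it as a list of floats to get a Postgres array
return [float(v) for v in pearsonr(attribute, quality)]
$$ LANGUAGE 'plpython3u';

Called the same way as before, this should come back as a two-element double precision array instead of text.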
Output parameters to return composite results
User-defined functions in Postgres can also be defined with output parameters, which means that the function returns those parameters as part of the result.
1. Let's call an additional SciPy function
To add on to our example, let's say we want to make our correlation_test function above do a little bit more: in addition to giving us the Pearson correlation coefficient and corresponding p-value, we'll also get it to calculate the Spearman correlation coefficient and p-value between two datasets. In SciPy, we can use the spearmanr() function for exactly that.
The interesting thing is that when you return only the spearmanr() result as text, the value looks like a different type from what we got with pearsonr():
# (PL/Python function body)
import numpy as np
from scipy.stats import spearmanr
attribute = np.array(attribute_col,dtype='float')
quality = np.array(quality_col,dtype='float')
return spearmanr(attribute, quality)
SpearmanrResult(correlation=0.47853168747024344, pvalue=2.7268377398474203e-92)
And if we combine the two expressions for the return value (i.e. return pearsonr(attribute, quality) + spearmanr(attribute, quality)), it looks like the pearsonr() tuple again:
(0.4761663239992742, 2.8314769799724706e-91, 0.47853168747024344, 2.7268377398474203e-92)
I tried to double-check the spearmanr() return type with plpy.info(), but it wasn't much help for me:

INFO: <class 'scipy.stats.stats.SpearmanrResult'>
It doesn't seem that SpearmanrResult is well-documented, but my googling seems to indicate that this is an example of a Python named tuple. A named tuple consists of fieldname=value pairs. And, since it's still a tuple, you can concatenate it with another tuple to form a new one, which is why the + operator in the return expression above works and simply returns all four values in one new tuple.
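To see that in plain Python terms (the SpearmanrDemo name below is just for illustration, not SciPy's actual class):

from collections import namedtuple

# a named tuple is still a tuple, so concatenating it with another tuple yields a plain tuple
SpearmanrDemo = namedtuple('SpearmanrDemo', ['correlation', 'pvalue'])
spearman = SpearmanrDemo(correlation=0.4785, pvalue=2.73e-92)
pearson = (0.4762, 2.83e-91)
combined = pearson + spearman
print(combined)   # (0.4762, 2.83e-91, 0.4785, 2.73e-92)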
2. Let's use a dictionary to return multiple values
So we get all four statistics in one very simple function, which is all well and good. But you might be thinking, "what if I need to send this value to a client application, and/or an app developer or analyst needs to work with this value?" Sending a string that doesn't explicitly indicate all of the elements it contains isn't particularly helpful.
Let's go ahead and modify our function:
CREATE OR REPLACE FUNCTION correlation_test (
    attribute_col numeric[],
    quality_col int[],
    OUT pearson_c double precision,
    OUT pearson_pv double precision,
    OUT spearman_c double precision,
    OUT spearman_pv double precision
)
AS $$
import numpy as np
from scipy.stats import spearmanr, pearsonr
attribute = np.array(attribute_col, dtype='float')
quality = np.array(quality_col, dtype='float')
pearson_result = pearsonr(attribute, quality)
spearman_result = spearmanr(attribute, quality)
# dictionary keys must match the OUT parameter names
result = {}
result['pearson_c'] = pearson_result[0]
result['pearson_pv'] = pearson_result[1]
result['spearman_c'] = spearman_result[0]
result['spearman_pv'] = spearman_result[1]
return result
$$ LANGUAGE 'plpython3u';
We store the Pearson tuple and Spearman named tuple in two separate variables, then initialize an empty dictionary and populate it with individual key-value pairs for the four data points. A Python object with matching attributes should work for this scenario as well (the PL/Python docs on composite-type results mention returning an object, in addition to a mapping or a sequence).
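For what it's worth, here's a sketch of what that object-based alternative could look like as the tail end of the function body (the CorrelationResult class is defined on the fly purely for illustration):

# (PL/Python function body, alternative ending)
class CorrelationResult:
    pass

result = CorrelationResult()
# attribute names must match the OUT parameter names, just like the dictionary keys
result.pearson_c, result.pearson_pv = pearson_result
result.spearman_c, result.spearman_pv = spearman_result
return result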
If we try calling the function like so:
SELECT correlation_test(
array_agg(alcohol ORDER BY id),
array_agg(quality ORDER BY id)
) FROM winequality_r;
It still looks like the result from the earlier version:
correlation_test
--------------------------------------------------------------------------------------
(0.4761663239992742,2.8314769799724706e-91,0.47853168747024344,2.7268377398474203e-92)
But the output is now actually a record (and if you run the query with a GUI like pgAdmin it'll indicate that).
All we need to do to return this record as multiple values is to write the SELECT statement like this:
SELECT (correlation_test(
array_agg(alcohol ORDER BY id),
array_agg(quality ORDER BY id)
)).* FROM winequality_r;
(Note that the function call itself is wrapped in parentheses.)
pearson_c | pearson_pv | spearman_c | spearman_pv
--------------------+------------------------+---------------------+------------------------
0.4761663239992742 | 2.8314769799724706e-91 | 0.47853168747024344 | 2.7268377398474203e-92
This way, we have our statistics laid out nicely in a table-like format with multiple columns. This will make it a bit easier in your application code as well to just reference individual data points if that's what you need.
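Another equivalent way to expand the record is to call the function in the FROM clause, building the arrays with scalar subqueries instead of aggregating in the select list (just a different spelling of the same query):

SELECT * FROM correlation_test(
    (SELECT array_agg(alcohol ORDER BY id) FROM winequality_r),
    (SELECT array_agg(quality ORDER BY id) FROM winequality_r)
);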
The result dictionary keys in the function match the OUT parameters in the function definition; that's how Postgres knows which dictionary value maps to which parameter (and which column each one goes in).
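As far as I can tell from the PL/Python docs on composite results, you can also return a sequence instead of a mapping, in which case position (the order the OUT parameters are declared in) does the mapping rather than names. The same function could end like this:

# (PL/Python function body, alternative ending)
# order matters here: pearson_c, pearson_pv, spearman_c, spearman_pv
return (pearson_result[0], pearson_result[1], spearman_result[0], spearman_result[1])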
From NumPy, to standard Python, to Postgres… Oh my!
To sum up what we covered in terms of what you might possibly deal with in data type mapping in PL/Python:
- Stick to converting arrays to Python lists if you want to ultimately get back a Postgres array when calling your PL/Python function.
- plpy utility functions are your friend and can help you uncover a little bit of how data is processed in your PL/Python code (especially if you're using external Python packages). I've also started digging a bit into plpy functions if you're keen on reading more.
- You can take advantage of SciPy and NumPy for processing data in your function and get back multiple data points as output, and still make it easy to parse using output parameters.
Keen on trying this all out yourself? Crunchy Bridge is the only cloud PostgreSQL service that includes PL/Python with NumPy and SciPy (among other packages!) built right in. Create an account and let us know what you think.