Holy Sheet! Remote Access CSV Files from Postgres

Paul Ramsey

7 min read

An extremely common problem in fast-moving data architectures is providing a way to feed ad hoc user data into an existing analytical data system.

Do you have time to whip up a web app? No! You have a database to feed, and events are spiraling out of control... what to do?

How about a Google Sheet? The data layout is obvious, you can even enforce things like data types and required columns using locking and protecting, and unlike an Excel or LibreOffice document, it's always online, so you can hook the data into your system directly.

Access Sheets Data Remotely

You can pull data in CSV format from a (public) Google Sheets workbook just by plugging the sheet ID into a magic access URL.

The URL format looks like this:

https://docs.google.com/spreadsheets/d/{sheetId}/gviz/tq?tqx=out:csv&sheet={sheetName}

You can pull the sheetId out of the URL in your browser's address bar, and the sheetName from the sheet tab at the bottom of the page. You can read one sheet at a time out of a Sheets workbook.

This example sheet has 22 rows of population projection data in it.

$ curl "https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections"

"Year","18 to 19","Total"
"2024","120107","5485084"
"2025","123484","5563798"
"2026","128627","5641925"
"2027","132540","5719109"
"2028","134067","5796302"
...

Remote Access with COPY

The following examples all use the Crunchy Bridge database-as-a-service. Not all services will support these methods.

To use the COPY command for remote loading you will first need to create a table to load the data into. Our table structure is just three integer columns.

CREATE TABLE popn_copy (
    year integer,
    age_18_to_19 integer,
    all_ages integer
);

We are going to use the PROGRAM option of COPY to fire up the curl utility and pull the CSV data from Google, then stream it into the default PostgreSQL CSV reader.

To use COPY with the PROGRAM option, you must be logged in as the postgres superuser.

COPY popn_copy FROM PROGRAM
  'curl "https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections"'
WITH (
    FORMAT csv,
    HEADER true,
    ENCODING utf8
);

Just like that, 22 rows loaded!

SELECT Count(*) FROM popn_copy;
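
 count
-------
    22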

The COPY approach is really the simplest one available, and to refresh your data, you can just run a scheduled TRUNCATE and then re-run the COPY.
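
A scheduled refresh can then be as simple as re-running the same COPY after a TRUNCATE:

-- Empty the table, then re-load it with the same COPY as above
TRUNCATE popn_copy;

COPY popn_copy FROM PROGRAM
  'curl "https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections"'
WITH (
    FORMAT csv,
    HEADER true,
    ENCODING utf8
);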

However, it does have the disadvantage of requiring a superuser login.

Remote Access with HTTP

The http extension for PostgreSQL allows users to run web requests and fetch data from any URL. Sounds like exactly what we need!

You can check if you have the extension by querying the pg_available_extensions view.

SELECT *
  FROM pg_available_extensions
 WHERE name = 'http';

name              | http
default_version   | 1.5
installed_version |
comment           | HTTP client for PostgreSQL, allows web page retrieval inside the database.

If you have it, enable the http extension, and create a target table:

CREATE EXTENSION http;

CREATE TABLE popn_http (
    year integer,
    age_18_to_19 integer,
    all_ages integer
);

If we use the http_get() function, we can pull the content from the remote URL in one step.

SELECT content AS row
FROM http_get('https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections');
What we get back is all the data, but in one big string. We would prefer 22 rows of data. Fortunately, PostgreSQL string processing can help us condition the data before inserting it into our table.

First, cut the string up using newline characters as the delimiter.

SELECT unnest(string_to_array(content, E'\n')) AS row
FROM http_get('https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections');

Now we have 23 rows (22 data rows and one header row), from which we can parse the numeric pieces using regexp_match():

INSERT INTO popn_http
WITH rows AS (
  SELECT unnest(string_to_array(content, E'\n')) AS row
  FROM http_get('https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections')
),
cols AS (
  SELECT regexp_match(row, '"([0-9]+)","([0-9]+)","([0-9]+)"') AS col FROM rows
)
SELECT col[1]::integer AS year,
       col[2]::integer AS age_18_to_19,
       col[3]::integer AS all_ages
  FROM cols
 WHERE col[1] IS NOT NULL;

Whenever you want to refresh, just TRUNCATE the table and re-run the population query. Unlike the COPY method, this doesn't require superuser access to implement.
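
If you want the refresh to be atomic, so queries never see a half-empty table, one option is to wrap the TRUNCATE and the reload in a single transaction (a sketch; the INSERT is the same query as above):

BEGIN;

TRUNCATE popn_http;

INSERT INTO popn_http
WITH rows AS (
  SELECT unnest(string_to_array(content, E'\n')) AS row
  FROM http_get('https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections')
),
cols AS (
  SELECT regexp_match(row, '"([0-9]+)","([0-9]+)","([0-9]+)"') AS col FROM rows
)
SELECT col[1]::integer, col[2]::integer, col[3]::integer
  FROM cols
 WHERE col[1] IS NOT NULL;

COMMIT;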

Remote Access with PL/Python

The http extension is simple, but it leaves a lot of work on the server side. PL/Python can do remote HTTP access too, and it also has nice string-parsing tools that can produce a more pleasant result.

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION read_csv_from_url(url TEXT)
RETURNS SETOF RECORD
AS $$
  import csv
  import requests
  from io import StringIO

  # Make a GET request to the URL and retrieve the CSV data
  response = requests.get(url)
  response.raise_for_status()
  csv_data = response.text

  # Parse the CSV data
  reader = csv.reader(StringIO(csv_data))

  # Skip the header row
  next(reader, None)

  # Yield each row as a result
  for row in reader:
      yield tuple(row)
$$ LANGUAGE plpython3u;

Because this is a set-returning function, it is much easier to integrate into the database infrastructure, and we can make the data available via a MATERIALIZED VIEW.

CREATE MATERIALIZED VIEW popn_python AS
SELECT *
  FROM read_csv_from_url('https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections')
    AS f(year integer, age_18_to_19 integer, all_ages integer);
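
A quick sanity check (the sheet has 22 data rows, and the function skips the header):

SELECT Count(*) FROM popn_python;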

Updating the local copy of the remote data just needs a REFRESH MATERIALIZED VIEW, and superuser access is not required.
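
The refresh is then a single statement:

REFRESH MATERIALIZED VIEW popn_python;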

Remote Access with FDW

Our last remote access trick uses a "foreign data wrapper", specifically the OGR FDW which exposes the multi-format access capabilities of the GDAL library to PostgreSQL.

While this example shows CSV file reading, the OGR FDW extension can be used to access a huge number of different formats, both local and remote.

The hardest part of using the OGR FDW driver is figuring out the correct server string to use in setting up the connection. It is best to start by downloading a copy of GDAL to your workstation and trying out various options using the ogrinfo tool.

With some trial and error, I found that a working URL involved:

  • using the vsicurl remote access driver (check out the other "virtual file system" drivers provided to get a feel for just how flexible GDAL is for remote data access),
  • prepending CSV to hint to GDAL what format driver to use, and
  • appending &/popn to the URL to trick GDAL into using "popn" as the layer name instead of something much less attractive.

The resulting connection string lets ogrinfo connect to the remote source and understand the CSV file contents.

$ ogrinfo CSV:"/vsicurl/https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections&/popn"

INFO: Open of `CSV:/vsicurl/https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections&/popn'
      using driver `CSV' successful.
1: popn (None)

For most sources, like remote databases and so on, the URL will be a lot simpler and more obvious. Even a remote CSV file will usually be easier, because it will have a CSV file name at the end of the URL, which GDAL uses as a hint for the correct driver.
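
For instance, a remote file whose URL actually ends in .csv can usually be opened without the CSV: prefix, since GDAL infers the driver from the extension (hypothetical URL, for illustration):

$ ogrinfo /vsicurl/https://example.com/data/popn.csv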

CREATE SERVER myserver
  FOREIGN DATA WRAPPER ogr_fdw
  OPTIONS (
    datasource 'CSV:/vsicurl/https://docs.google.com/spreadsheets/d/1pBbCabAK6u6EIuyu_2XUul4Yxvf2w_Od6QYC_yEc4q4/gviz/tq?tqx=out:csv&sheet=Population_projections&/popn',
    format 'CSV');

Now that we have a "server", we can import the one layer that exists in that server.

If our server were something more sophisticated, like a database, there could be multiple tables to import using this method, as sketched after the import below.

CREATE SCHEMA fdw;

IMPORT FOREIGN SCHEMA ogr_all
  FROM SERVER myserver
  INTO fdw;
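
If the server exposed many layers, a LIMIT TO clause could restrict the import to just the ones you want (a sketch, assuming your ogr_fdw build honors the standard IMPORT FOREIGN SCHEMA options):

IMPORT FOREIGN SCHEMA ogr_all
  LIMIT TO (popn)
  FROM SERVER myserver
  INTO fdw;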

SELECT * FROM fdw.popn;

 fid | year | n18_to_19 |  total
-----+------+-----------+---------
   1 | 2024 | 120107    | 5485084
   2 | 2025 | 123484    | 5563798
   3 | 2026 | 128627    | 5641925
   4 | 2027 | 132540    | 5719109
   5 | 2028 | 134067    | 5796302
...

We have data! Unfortunately, since it comes from a text CSV file with no column type mapping, we need a little type coercion to get a clean table of integers.

CREATE MATERIALIZED VIEW popn_fdw AS
  SELECT year::integer      AS year,
         n18_to_19::integer AS age_18_to_19,
         total::integer     AS all_ages
  FROM fdw.popn;

Using a materialized view keeps our database from hitting the remote source every time we query the FDW table. Since we have a materialized view, the refresh method is a little prettier than truncating and reloading, though it is effectively the same thing: just refresh the view.

REFRESH MATERIALIZED VIEW popn_fdw;

As with the http approach and unlike using COPY, the FDW approach does not require a superuser to do the refresh step.

Conclusions

  • There are lots of ways to access remote data!
    • Using COPY with PROGRAM is simple but requires superuser powers and only reads CSV.
    • Using the http extension is simple to get data but requires parsing it yourself on the database side.
    • Using a PL/Python function results in a short and reusable piece of code, and does not require superuser powers.
    • Using the ogr_fdw extension involves some fiddly setup but is nice and clean once it is up and running. It can also read vastly more file formats and data services.