Validating Data Types from Semi-Structured Data Loads in Postgres with pg_input_is_valid
Working on big data loads and/or data type changes can be tricky - especially finding and correcting individual errors across a large data set. Postgres versions 16 and newer have a function to help with data validation: `pg_input_is_valid`.
`pg_input_is_valid` is a SQL function that determines whether a given input can be parsed into a specific type, like numeric, date, or JSON. Here's a super basic query to ask if '123' is a valid integer.
SELECT pg_input_is_valid('123', 'integer');
pg_input_is_valid
-------------------
t
This function gives a t (true) or f (false) response. So if I asked SELECT pg_input_is_valid('123', 'date');
the answer would be `f`, since that's not a valid date.
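When an input is invalid, you may also want to know why. A companion function, pg_input_error_info (also added in Postgres 16), reports the details of the parse failure instead of just a boolean. A quick sketch:

```sql
-- Ask not just whether the input parses as a date, but why it doesn't.
-- Returns NULL columns when the input is actually valid.
SELECT message, sql_error_code
FROM pg_input_error_info('not-a-date', 'date');
```

The result includes the error message and SQLSTATE code you would have seen from a failed cast, which is handy when reporting bad rows back to whoever produced the data.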
This does not require special error handling or special scripts - it is built right into Postgres and can be used with standard SQL. At Crunchy Data we've seen some nice use cases where you validate data before importing it. Generally this works best with a staging table or a temporary table: validation happens there, and offending rows can be identified before the final copy or import runs. Let's take a look today at a few examples of how this input validation function can help.
Validating data for column changes
There are a lot of occasions when a database administrator needs to change data types. Some conversions, like text to integer, are easy to check. You might want to use newer JSON features and move away from old formatting. For moving columns to JSONB, `pg_input_is_valid` can query existing rows to see if they'd conform:
SELECT pg_input_is_valid(data_column, 'jsonb')
FROM bytea_table;
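Once every row validates (or the invalid ones have been cleaned up), the conversion itself can be done in place. Here's a sketch, reusing the `bytea_table` above and assuming you're willing to null out any values that won't parse:

```sql
-- Null out values that won't parse as JSONB, then convert the column in place.
UPDATE bytea_table
SET data_column = NULL
WHERE NOT pg_input_is_valid(data_column, 'jsonb');

ALTER TABLE bytea_table
    ALTER COLUMN data_column TYPE jsonb
    USING data_column::jsonb;
```

The `USING` clause tells Postgres how to rewrite each existing value; without the preceding cleanup step, a single unparseable row would make the whole `ALTER TABLE` fail.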
You might also want to use `pg_input_is_valid` to check text columns you want to convert to integer or date. You can run a simple validity check, or populate a new date column with only the data that is valid:
UPDATE test_data
SET actual_date = CASE
        WHEN pg_input_is_valid(maybe_date, 'date') THEN maybe_date::date
        ELSE NULL
    END;
SELECT * from test_data ;
name | maybe_date | actual_date
-----------+--------------+-------------
David | 2023-01-02 | 2023-01-02
Elizabeth | Jan 1, 2024 |
Validating data for data load
Let's say you have a CSV file containing customer data that you want to import into a table named `customers`. Before importing, it is a good idea to ensure that the data in the CSV file adheres to the expected format, particularly for the `age` and `signup_date` columns.
The table has the following structure:
customer_id SERIAL PRIMARY KEY,
name TEXT,
email TEXT,
age INTEGER,
signup_date DATE
Create a staging table
Import the CSV data into a staging table without data type casting yet. Everything will go in as text:
CREATE TEMP TABLE staging_customers (
customer_id TEXT,
name TEXT,
email TEXT,
age TEXT,
signup_date TEXT
);
-- copy in the data to the temp table
COPY staging_customers FROM '/path/to/customers.csv' CSV HEADER;
Use pg_input_is_valid to validate data types
Now we can write queries to identify rows with invalid data. For example, validate that the `age` column can be cast to integer and that the `signup_date` column can be cast to date.
SELECT *
FROM staging_customers
WHERE NOT pg_input_is_valid(age, 'integer')
OR NOT pg_input_is_valid(signup_date, 'date');
This query will return all rows with either an invalid `age` or an invalid `signup_date`.
Exclude invalid rows and copy data to your final table
Once the problematic rows have been identified, they can be manually fixed or removed. Sometimes an even cleaner option is to use `pg_input_is_valid` to skip bad rows as the data is copied to the final table, inserting only valid rows:
INSERT INTO customers (name, email, age, signup_date)
SELECT name, email, age::integer, signup_date::date
FROM staging_customers
WHERE pg_input_is_valid(age, 'integer')
AND pg_input_is_valid(signup_date, 'date');
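If you'd rather keep the bad rows around for review instead of silently skipping them, one approach is to route them into a side table in the same pass. A sketch, using a hypothetical `customers_rejects` table:

```sql
-- Capture the invalid rows for later inspection...
CREATE TABLE customers_rejects AS
SELECT *
FROM staging_customers
WHERE NOT pg_input_is_valid(age, 'integer')
   OR NOT pg_input_is_valid(signup_date, 'date');

-- ...then load only the valid ones into the final table.
INSERT INTO customers (name, email, age, signup_date)
SELECT name, email, age::integer, signup_date::date
FROM staging_customers
WHERE pg_input_is_valid(age, 'integer')
  AND pg_input_is_valid(signup_date, 'date');
```

This keeps the import itself clean while leaving an audit trail of exactly which rows were excluded and why they might have failed.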
Conclusion
`pg_input_is_valid` is a great recent addition to the Postgres toolkit for data manipulation - moving data or changing data types. In general, where I've seen the best use of `pg_input_is_valid` is a two-step data import: load into a staging table, run a validation step to check for errors, then do a final migration of the data. Since this is built right into Postgres itself, whether you're working with small datasets or millions of rows, `pg_input_is_valid` is a scalable, performant, and reliable way to clean and validate your data.