Paul Ramsey
PostgreSQL can provide high performance summaries over multi-million record tables, and supports some great SQL sugar to make it concise and readable, in particular aggregate filtering, a feature unique to PostgreSQL and SQLite. A huge amount of reporting is about generating percentages: for a particular condition, what is a value relative to a baseline? Here's a quick "sales table" with three categories ("a" and "b" and "c") and one million random values between 0 and 10: In the bad-old-days,...
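The excerpt cuts off before the queries, but the FILTER syntax it alludes to looks roughly like this (a minimal sketch; the table setup and column names are my assumptions, not necessarily the post's):

    -- Illustrative setup: one million rows across three categories
    CREATE TABLE sales AS
    SELECT (ARRAY['a','b','c'])[floor(random() * 3)::int + 1] AS category,
           random() * 10 AS amount
    FROM generate_series(1, 1000000);

    -- Percentage of rows in each category, via aggregate FILTER
    SELECT 100.0 * count(*) FILTER (WHERE category = 'a') / count(*) AS pct_a,
           100.0 * count(*) FILTER (WHERE category = 'b') / count(*) AS pct_b,
           100.0 * count(*) FILTER (WHERE category = 'c') / count(*) AS pct_c
    FROM sales;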
Greg Sabino Mullane
Data checksums are a great feature in PostgreSQL. They are used to detect any corruption of the data that Postgres stores on disk. Every system we develop at Crunchy Data has this feature enabled by default. It's not only Postgres itself that can make use of these checksums. Some third party tools such as the awesome pgBackRest program can also use them for enhanced data integrity. Sadly, enabling data checksums is still not the default behavior when creating a new Postgres cluster. When you...
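For context on the mechanism: new clusters get checksums via initdb --data-checksums, and on PostgreSQL 12 and later an existing cluster can be converted offline with pg_checksums --enable. A running cluster reports its state like so (standard PostgreSQL, not specific to the post):

    -- Returns 'on' when the cluster was initialized with checksums
    SHOW data_checksums;

    -- Equivalent form, usable inside larger queries
    SELECT current_setting('data_checksums');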
David Christensen
We spend time day in, day out, answering the questions that matter and coming up with solutions that make the most sense. However, sometimes a question comes up that is just so darn…interesting that even if there are sensible solutions or workarounds, it still seems like a challenge just to take the request literally. Thus was born this blog series, Devious SQL. Devious: "longer and less direct than the most straightforward way." The basic question/request that inspired this article (and in fact...
David Christensen
There have certainly been times when using PostgreSQL that I've yearned for an UPDATE or DELETE statement with a LIMIT feature. While the SQL standard itself has no say in the matter as of SQL:2016, there are definite cases of existing SQL database dialects that support this. Sadly, if you try to do something like this in PostgreSQL, the result is a syntax error. Before we dig into the nitty-gritty, let's look at some use cases for such a feature. The primary desire for this behavior is to break large transactions up...
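One classic workaround looks something like this (a sketch under my own assumptions, not necessarily the post's exact approach; big_table and its processed column are hypothetical):

    -- Simulate DELETE ... LIMIT by selecting row ctids in a subquery
    DELETE FROM big_table
    WHERE ctid IN (SELECT ctid
                   FROM big_table
                   WHERE processed
                   LIMIT 1000);

Repeating this in a loop commits work in bounded batches instead of one huge transaction.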
Joe Conway
This is the first in a series of blogs on the topic of using PostgreSQL for "data science". I put that in quotes because I would not consider myself to be a practicing data scientist, per se. Of course I'm not sure there is a universally accepted definition of data scientist. This article provides a nice illustration of my point. I do believe my credentials are such that no one can accuse me of term appropriation. Toward establishment of that end, this first installment is a walk down memory l...
Paul Ramsey
A surprisingly common problem in both application development and analysis is: given an input name, find the database record it most likely refers to. It's common because databases of names and people are common, and it's a problem because names are a very irregular identifying token. The page "Falsehoods Programmers Believe About Names" covers some of the ways names are hard to deal with in programming. This post will ignore most of those complexities, and deal with the problem of matching up...
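One common Postgres approach to this kind of matching is trigram similarity via the pg_trgm extension (a sketch under my own assumptions; the post may well lean on other tools, such as the fuzzystrmatch phonetic functions):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- Five closest stored names to the input, by trigram similarity
    SELECT name, similarity(name, 'Jon Smyth') AS score
    FROM people
    WHERE name % 'Jon Smyth'   -- true above the similarity threshold
    ORDER BY score DESC
    LIMIT 5;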
Steve Pousty
Today we are going to walk through some of the preliminary data shaping steps in data science using SQL in Postgres. I have a long history of working in data science, including my Master's degree (in Forestry) and Ph.D. (in Ecology), and during this work I would often get raw data files that I had to get into shape to run analysis. Whenever you start to do something new there is always some uncomfortableness. That “why is this so hard” feeling often stops me from trying something new, but...
Steve Pousty
Greetings friends! We have come to our final blog post in my series about the data science workflow using PostgreSQL. In the last blog post, we used PL/R to create a function which returns the output from a logistic regression model trained on our fire data. We then took that model object and stored it in a separate table. Today we are going to finish up by showing how to use that stored model to make predictions on new data. By the way, I did all of the Postgres work for the entire blog s...
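As a hedged sketch of the calling pattern only (PL/R assumed installed; the function, tables, and columns here are illustrative, and the bytea-to-R-object handling is an assumption, not the post's verified code):

    -- Rebuild the stored model and score new rows with it
    CREATE OR REPLACE FUNCTION predict_fire(model bytea, temp float8,
                                            humidity float8)
    RETURNS float8 AS $$
      fit <- unserialize(model)  -- model assumed stored via R serialize()
      predict(fit, newdata = data.frame(temp = temp, humidity = humidity),
              type = "response")
    $$ LANGUAGE plr;

    SELECT w.*, predict_fire(m.model, w.temp, w.humidity) AS fire_prob
    FROM new_weather w
         CROSS JOIN (SELECT model FROM fire_models
                     ORDER BY id DESC LIMIT 1) m;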
Steve Pousty
Greetings friends! We have finally come to the point in the Postgres for Data Science series where we are not doing data preparation. Today we are going to do modeling and prediction of fire occurrence given weather parameters… IN OUR DATABASE! Quick recap:
1. We found some data on historical fires and historical weather data for California.
2. We fixed up the data and merged it to make it more appropriate for logistic regression.
3. We randomly sampled non-fire data and split the data int...
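The training side of that workflow can be sketched like so (again hedged: PL/R assumed installed, and the table, columns, and formula are illustrative stand-ins, with fire assumed to be a 0/1 column):

    -- Fit a logistic regression inside the database and serialize it
    CREATE OR REPLACE FUNCTION train_fire_model()
    RETURNS bytea AS $$
      dat <- pg.spi.exec("SELECT fire, temp, humidity FROM fire_training")
      fit <- glm(fire ~ temp + humidity, data = dat, family = binomial)
      serialize(fit, NULL)  -- raw vector, assumed returned to SQL as bytea
    $$ LANGUAGE plr;

    INSERT INTO fire_models (model) VALUES (train_fire_model());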
Steve Pousty
In our last blog post on using Postgres for statistics, I covered some of the decisions on how to handle calculated columns in PostgreSQL. I chose to go with adding extra columns to the same table and inserting the calculated values into these new columns. Today’s post is going to cover how to implement this solution using PL/pgSQL. I bet your first question is going to be: Why did you choose to do this in PL/pgSQL rather than just some quick editor work? I actually started with handcrafti...
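The general shape of that PL/pgSQL is worth a sketch (the table, columns, and calculation below are my own illustrative choices, not the post's):

    -- Add and populate a derived column for each listed source column,
    -- using dynamic SQL so the column names can vary at runtime
    DO $$
    DECLARE
      col text;
    BEGIN
      FOREACH col IN ARRAY ARRAY['temp', 'humidity'] LOOP
        EXECUTE format(
          'ALTER TABLE weather ADD COLUMN %I double precision',
          col || '_scaled');
        EXECUTE format(
          'UPDATE weather w SET %I = (w.%I - s.mu) / s.sd
             FROM (SELECT avg(%I) AS mu, stddev(%I) AS sd
                     FROM weather) s',
          col || '_scaled', col, col, col);
      END LOOP;
    END
    $$;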