
Musings of a PostgreSQL Data Pontiff Episode 1


Joe Conway


Introduction to a PostgreSQL "Data Science" Blog Series

This is the first in a series of blog posts on the topic of using PostgreSQL for "data science". I put that in quotes because I would not consider myself to be a practicing data scientist, per se. Of course I'm not sure there is a universally accepted definition of data scientist. This article provides a nice illustration of my point.

I do believe my credentials are such that no one can accuse me of term appropriation. Toward that end, this first installment is a walk down memory lane.

Sometime around the end of the 1990's I had a boss (hi Roger!) who dubbed me "The Data Pontiff". I always liked that moniker because data collection, storage, and analysis have always been my thing.

In fact, ever since grammar school I have been interested in math and science. In high school and college I additionally became fascinated with computers. My college degree was in Mechanical Engineering, and I still remember one of my early-1980's projects: using a Commodore 64 to do tensor calculations (force and stress) on a crane at incremental positions as it loaded and unloaded a ship. At the time I thought that was the coolest thing since sliced bread.

During the latter half of the 1980's, while I was in the Navy, one of my first duty stations was that of Diving Officer. Among other things, it was my responsibility to calculate the required water level for various tanks such that when the submarine submerged, it would stay at the prescribed depth with neutral buoyancy and a zero "bubble". That is, stay on depth with zero angle. The standard method for doing this would get you a reasonable approximation, but I always wanted to get as close to perfect as I could. So I would take other available information into account, such as the water temperature and salinity gradients versus depth, if they were known (i.e., we had recently been down already) or could be approximated from historical data.

When I landed my first civilian job, I continued, and doubled down on, my thirst for data and analysis. That company produced components for commercial nuclear reactors. As you might imagine, the requirements for testing and measuring were stringent, and a significant amount of data was collected to prove compliance within the specified tolerances. However, when I initially arrived, almost nothing was done with all that data beyond proving compliance. I set out over the course of a few years to change that, ensuring the data was retained in accessible form and that analysis was done which was used to iteratively improve the manufacturing processes. That paid off: it saved us money in scrapped components and allowed us to win contracts with newly tightened specs that our competitors could not meet. The tighter specs allowed greater operating margin for our customers and saved them expensive uranium, so it was a win all around.

It was at my second civilian job, in the late 1990's, that I finally earned my title of "The Data Pontiff". That company produced very expensive, large, complex, and cutting-edge industrial excimer lasers used by semiconductor manufacturers to make integrated circuits (a.k.a. chips). These lasers track a large number of configuration parameters and operational metrics. But for all their sophistication, the data was essentially restricted to an on-board snapshot. I started two important initiatives there. One was to comprehensively store test and measurement data collected during manufacturing (see a trend here?). We called that POD, for Parametric Online Data.

The second project (which actually came first chronologically) involved attaching a device to the RS-232 diagnostic port of the lasers and periodically polling and storing the snapshots centrally. That project was called COL. The result was comprehensive information about each laser frame and excellent aggregate data about the entire fleet. Our more advanced users of this data were able to create predictive models of when major components would fail or reach end of life. In the semiconductor industry, downtime is measured in millions of dollars per hour, so coordinating maintenance in a predictable and scheduled way was a huge benefit, as was reducing downtime by having historical diagnostics to consult when things went awry. The aggregate data was useful for our executive team to keep a finger on the pulse of the entire industry. Our lasers made up something like 75% of the installed base in the free world at the time, and with the aggregate data we collected we could see in almost real time when the industry was ramping up or down.

Finally, this data allowed the creation of an entirely new business model where the lasers were essentially leased and charged based on usage. You might think of it like Excimer Laser as a Service (ELaaS). By the way, the data underpinning all of this was stored in PostgreSQL, and as of about a year ago I was told that POD was still in service!

Sometime around 2003 I wrote PL/R, a procedural language handler for PostgreSQL that allows the use of R functions directly from PostgreSQL. It is essentially the same as PL/Python or PL/Perl in that the R interpreter gets fired up directly inside the PostgreSQL backend process associated with your database connection. As such, the embedded R function has direct access to data stored in tables and can issue SQL statements that make use of any other functions as well. PL/R was initially written specifically because I wanted to use it to analyze the data stored in POD and COL.
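To give a flavor of what that looks like, here is a minimal sketch of a PL/R function. The laser_metrics table and its energy column are hypothetical stand-ins for the kind of measurements POD and COL held; pg.spi.exec() is the PL/R call that issues SQL from inside R and hands the rows back as a data frame.

```sql
-- Minimal sketch: the body between the $$ markers is R code, run by the
-- R interpreter embedded in the backend. The laser_metrics table and its
-- energy column are hypothetical.
CREATE EXTENSION IF NOT EXISTS plr;

CREATE OR REPLACE FUNCTION median_pulse_energy() RETURNS float8 AS $$
  # pg.spi.exec() runs a query in the same backend and returns a data frame
  df <- pg.spi.exec("SELECT energy FROM laser_metrics")
  # the value of the last expression is the function's return value
  median(df$energy)
$$ LANGUAGE plr;

-- called like any other SQL function
SELECT median_pulse_energy();
```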

Anyway, there is much more to the story of each of those experiences but I have already risked boring you with my tales. In the years since I left civilian job #2, I have mainly focused on helping others use PostgreSQL in the most productive and secure way possible. But I have also tried to keep up on the side with trendy forms of data analysis including various statistical methods, machine learning, AI, etc. My goal in sharing all of the above is to illustrate some examples of using data and analysis to produce real world positive results. For me, that was always the allure.

In this blog series I hope to explore the possibilities for analysis presented by PostgreSQL through procedural languages such as PL/R and PL/Python as well as perhaps built-in capabilities of PostgreSQL itself. I hope you will find them as useful to read as I find them fun to write!