Postgres Databases and Schemas
In my career I've worked on teams that ran and managed over a million databases, and I've helped shard and scale database for customers at 100s of terabytes.
Postgres is an incredibly flexible database, and because of it's flexibility you have a lot of options for architecting your data design for your application needs. In talking with current customers and future customers how you design and architect your database is a common topic. We're going to dig into some of the flexibility on architecting your database and terminology within Postgres and the trade-offs for different approaches.
Starting with an instance
For simplicity sake we're going to assume an instance, whether EC2 or in your own data center is how you're running Postgres. Yes, you have plenty of options to carve an instance up into smaller resources using LXC or cgroups, or you can fully containerize and run Kubernetes to further divide up your resources. If you're in that world and using our Kubernetes Operator you can for the sake of the rest of this post assume a container is the equivalent of an instance.
A running Postgres is a cluster
Under the covers this is essentially the running Postgres process. Sometimes you may refer to your running "database", but as you'll see later on calling it a database may lead to potential confusion. The running Postgres typically has a base backup and then associate write-ahead-log (WAL) that will be archived for it with a tool like pgBackRest. The backup and WAL contain everything for your Postgres cluster, it's all grouped together and not separated out by default. We covered streaming replication in our other post on logical replication and the difference between the two.
Within your Postgres cluster is a database
When you start to work with your database you never connect to a "cluster" but rather to a specific database. Let's break it down a little from a Postgres connection string:
postgres://craig.PyLpvXPJUgzUBteEAKY@p.vcrwzi5u5beilpsgyy.db.postgresbridge.com:5432/mydatabase
If we break apart the connection string.
craig
- is the usernamePyLpvXPJUgzUBteEAKY
- is the passwordp.vcrwzi5u5beilpsgyy.db.postgresbridge.com
- is the hostnamemydatabase
- is the database within the cluster you're connecting to5432
- is the port it is connecting on
The database portion can be postgres
which is the default Postgres database. If you connect to that database with the superuser you can also create new databases. Databases cannot be easily joined between each other (I say easily because you can use the Postgres foreign data wrapper as means of connecting between them). When you do a pg_dump
against a Postgres database it is only that database, where-as the base backup and WAL is for all of the databases contained within a cluster.
Once you're connected to your specific database then you start designing your schema.
Inside a Postgres database
The term schema, similar to database, can be equally confusing. Most of the time when you talk about schema you mean the tables and columns you create inside your database. Postgres also has a notion of schemas, schemas are logically separated by a namespace within the database. By default you're mostly working with the public
schema which you get by default, but be careful if you’re on PG15 as there are some changes with public
. Inside Postgres you have what is known as a search_path:
show search_path;
search_path
-----------------
"$user", public
(1 row)
You can update your search path to search across multiple schemas, it will search them in order for matching table names. You can also execute fully qualified queries that include the schema name. The below queries with a table orders
in your public
schema and the default search path are effectively the exact same query:
SELECT *
FROM orders;
SELECT *
FROM public.orders;
Clusters, databases, schemas, so what?
So we’ve told you a bunch of information about clusters, databases, and schemas, but you’re wondering how it fits in practically.
Clusters are the full instance that is running Postgres itself. You’re not getting away from this one, but know that base backups and WAL from a cluster is all grouped together. Thus if you’re needing to do something with smaller sets of backups and querying you’ll want to dig into how you separate things in to separate databases or schemas. As a broad rule 90% of applications have a cluster with a single database and single schema within them.
Databases are useful when entirely isolated. If you have multiple different applications, but a small number of them so you’re not worried about scaling connections to each of them, then they’re a great method of separating your data access. For example, you could have a database for your blog, one for your events calendar, and one for your CRM application. All these are quite distinct and don’t interact with each other.
Schemas can be useful for separating data that may otherwise follow the same table structure. In most cases you may not want to query across the full set of schemas, but for out of band reporting it could be useful. One of the most common practical applications of schemas is multi-tenant applications where you want to create a schema for user1, user2, user3. All the tables within those could look the same, but if you want analytical queries across your users you can easily do that.
Hopefully this gives you a good primer on something that you’ll often hear/see talked about (databases and schemas) in Postgres, both of which can be very overloaded to talk about a generic term of something very technical Postgres specific which can have very different implications.
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read