
PostgreSQL Storage Hierarchy Visualization
Imagine you’re organizing a huge library. You have shelves (files), books (pages), and individual words in those books (tuples). Understanding how PostgreSQL organizes data is similar – and just as important for a database administrator.
In this post, I’ll walk you through PostgreSQL’s storage system using simple comparisons and real-world examples, so you can understand how your database actually stores information on disk.
What We’ll Cover
- How PostgreSQL organizes data (the big picture)
- Files: Where your data lives on disk
- Pages: How data is grouped together
- Tuples: The individual pieces of information
- Why this matters for your daily work
- Common problems and simple solutions
The Big Picture: How PostgreSQL Organizes Data
Let’s start with a simple analogy:
Think of PostgreSQL like an office building:
- The entire building is your database cluster
- Each floor is a separate database
- Each room on a floor represents a table
- Filing cabinets in those rooms are the files on disk
- Each drawer in a cabinet is a page (or block)
- The individual folders in each drawer are tuples (your rows of data)
When you run a query, PostgreSQL needs to find the right building, go to the correct floor, enter the right room, open the proper filing cabinet, pull out the right drawer, and finally grab the folders you need.
Understanding this structure helps you organize your data efficiently and troubleshoot problems when they arise.
Files: Where Your Data Lives
What Are Database Files?
Database files are the actual files on your computer’s disk that store all your PostgreSQL data. When you create a table in PostgreSQL, the system creates one or more files to hold that table’s data.
The Filing System
Let’s say you create a simple customer database:
CREATE DATABASE my_store;
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT,
email TEXT
);
Behind the scenes, PostgreSQL creates:
- A directory for your database (named with a number like
13267
) - A file for your customers table (named with a number like
16384
) - Additional helper files to track available space and other information
If you could look at your PostgreSQL data directory, you’d see something like:
PostgreSQL_data/
├── base/
│ └── 13267/ <- Your database directory
│ ├── 16384 <- Your customers table
│ ├── 16384_fsm <- Helps find free space in your table
│ └── 16384_vm <- Tracks which parts of your table need cleaning
├── global/ <- System-wide tables
└── pg_wal/ <- Transaction logs (in case of crashes)
Real-World Example
Think of it like this: If you run a small bookstore, your filing system might include:
- A cabinet for customer information
- A cabinet for inventory
- A cabinet for sales records
In PostgreSQL, each of these “cabinets” would be a separate file (or set of files) on disk.
Pages: The Building Blocks
What Is a Page?
A page (also called a block) is a fixed-size chunk of data, typically 8KB (8,192 bytes). Think of it as a drawer in your filing cabinet that can hold a certain amount of information.
Why Pages Matter
PostgreSQL reads and writes data one page at a time. It’s like how you might pull out an entire drawer to find a document rather than searching through the cabinet blindly.
Inside a Page
Each page has:
- A header (information about what’s in the page)
- Pointers (like a table of contents showing where each row is located)
- Free space (empty area for new data)
- The actual data rows (your information)
Here’s a simple visualization:
+------------------+
| PAGE HEADER | <- Metadata about this page
+------------------+
| ITEM POINTERS | <- "Table of contents" for this page
| |
+------------------+
| FREE SPACE | <- Room for more data
| |
+------------------+
| ROW 3 | <- Your actual data
+------------------+
| ROW 2 |
+------------------+
| ROW 1 |
+------------------+
Real-World Example
Imagine each page as a form where you can record up to 20 customers (depending on how much information you store for each). If you have 100 customers, you’ll need at least 5 forms (pages).
When you want to find customer #37, PostgreSQL needs to figure out which page contains that customer, then find the specific customer within that page.
Tuples: Your Actual Data
What Is a Tuple?
A tuple is simply a row in your database table. If you have a customers table, each customer record is a tuple.
Inside a Tuple
Each tuple has:
- A header (hidden information about the row)
- Your actual data (the values you inserted)
The header includes important information like:
- Which transaction created this row
- Which transaction deleted this row (if applicable)
- Information about which columns contain NULL values
Example of a Tuple
Let’s say you insert a customer:
INSERT INTO customers (name, email)
VALUES ('Alice Smith', 'alice@example.com');
In the database, the tuple might look conceptually like this:
HEADER: [Created by transaction #1234, Not deleted, No NULLs]
DATA: [1, 'Alice Smith', 'alice@example.com']
The TOAST System (Handling Large Values)
What happens if you need to store something big, like a long document or an image?
PostgreSQL uses something called TOAST (The Oversized-Attribute Storage Technique). Think of it like this:
If you’re organizing paper files and come across a document that’s too big for your folder, you might:
- Try to fold it up smaller (compression)
- If it’s still too big, store it in a separate box and put a note in your folder saying “see box #5” (out-of-line storage)
This is exactly what PostgreSQL does with large values:
- Try to compress the data
- If still too big, store it separately and keep a reference
Real-World Example
Imagine you’re storing product descriptions. Short descriptions fit directly in your main product table. But for products with very lengthy descriptions, PostgreSQL automatically places them in a separate “storage box” (TOAST table) and leaves a small pointer in your main table.
Why This Matters for Database Administrators
Understanding the storage system helps you:
1. Estimate Storage Needs
If you know each page is 8KB and can hold roughly X number of your typical rows, you can estimate how much disk space a million rows will require.
-- Find the average row size
SELECT avg(pg_column_size(t)) FROM customers t;
-- Find how many rows are likely to fit on a page
-- (Approximately 8KB / avg_row_size)
2. Improve Performance
Knowing that PostgreSQL reads data one page at a time explains why:
- Indexes are important (they help PostgreSQL find the right pages quickly)
- Sequential scans (reading every page) are slow for large tables
- Well-organized tables perform better
3. Understand and Fix Bloat
When you update a row, PostgreSQL usually creates a new version and eventually marks the old one as deleted. But the old version still takes up space until a VACUUM operation reclaims it.
This is like having folders with crossed-out information still taking up space in your filing cabinet. Eventually, you need to clean up – that’s what VACUUM does.
-- Basic cleanup
VACUUM customers;
-- Cleanup and update statistics
VACUUM ANALYZE customers;
4. Diagnose Space Issues
When your database is growing too fast, you can check which tables are taking up the most space:
-- See the size of your biggest tables
SELECT
tablename,
pg_size_pretty(pg_table_size(tablename)) AS table_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_table_size(tablename) DESC
LIMIT 5;
Common Problems and Simple Solutions
Problem 1: Database Growing Too Large
Signs:
- Disk space alerts
- Slow performance
- Backups taking too long
Simple Solutions:
- Regular VACUUM to reclaim space from deleted rows
- Archive old data you rarely need
- Check for unnecessarily large columns (like storing entire files in the database)
Example:
-- Schedule regular cleanup
VACUUM ANALYZE;
-- For tables with many deletes/updates
ALTER TABLE frequently_changed_table SET (autovacuum_vacuum_scale_factor = 0.1);
Problem 2: Slow Queries on Large Tables
Signs:
- Queries that were once fast become slow
- CPU usage spikes during queries
Simple Solutions:
- Add appropriate indexes
- Break very large tables into smaller ones (partitioning)
- Make sure VACUUM ANALYZE runs regularly to update statistics
Example:
-- Add an index for commonly searched columns
CREATE INDEX customers_email_idx ON customers(email);
-- Update statistics
ANALYZE customers;
Problem 3: Database Bloat
Signs:
- Database size much larger than the actual data would suggest
- Performance degradation over time
Simple Solutions:
- Regular VACUUM
- Occasional VACUUM FULL (requires downtime)
- Consider the pg_repack extension for online rebuilding
Example:
-- Find tables with bloat
SELECT
schemaname || '.' || tablename AS table_name,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size,
pg_size_pretty((pg_stat_get_live_tuples(relid) *
(SELECT avg(pg_column_size(t)) FROM tablename t))::bigint) AS estimated_actual_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;
Visualizing the Storage Hierarchy
Let’s tie everything together with a concrete example:
Imagine you have a table with customer orders:
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
If you insert 10,000 orders:
- Files: PostgreSQL might store this in a single file (e.g.,
16385
) - Pages: This file would be divided into multiple 8KB pages, perhaps 100-200 pages depending on the row size
- Tuples: Each page contains multiple orders, perhaps 40-80 orders per page
When you run a query like:
SELECT * FROM orders WHERE order_id = 5432;
PostgreSQL:
- Looks at the index to find which page contains order #5432
- Reads that entire page into memory
- Finds the specific tuple within the page
- Returns the data to you
Understanding this process helps you see why proper indexing is crucial – it’s the difference between checking the right drawer immediately versus opening every drawer in every cabinet!
Practical Tips for Beginners
- Monitor Your Database Size
-- Check sizes of your tables
SELECT tablename, pg_size_pretty(pg_relation_size(tablename))
FROM pg_tables WHERE schemaname = 'public';
- Make VACUUM Part of Your Routine
-- Basic cleanup (can run while database is being used)
VACUUM;
-- More thorough cleanup (requires exclusive lock)
VACUUM FULL;
- Choose Appropriate Data Types
- Use the smallest data type that will accommodate your data
- For example, use VARCHAR(50) instead of TEXT for short strings if you know they won’t exceed 50 characters
- Index Thoughtfully
- Add indexes for columns you frequently search or join on
- But remember each index increases the storage and slows down writes
- Regular Maintenance
- Set up autovacuum correctly
- Monitor for bloat
- Check for fragmentation
Conclusion
Understanding PostgreSQL’s storage system doesn’t need to be complicated. Think of it as an organized filing system with:
- Files as your filing cabinets
- Pages as the drawers in those cabinets
- Tuples as the individual folders in each drawer
By understanding this structure, you can make better decisions about:
- How to organize your data
- How to optimize queries
- How to maintain your database for long-term performance
Even if you’re starting out as a PostgreSQL DBA, this fundamental knowledge will serve as a strong foundation as you develop your skills and take on more complex database challenges.
What aspects of PostgreSQL storage would you like to explore next? Understanding how indexes work with this storage system? Diving deeper into VACUUM operations? Let me know what interests you most!