Postgres Internals: How Postgres Stores and Retrieves Your Data

Do you ever wonder what happens between the moment you execute your query and when you get your results? Postgres query execution is a beautifully orchestrated process that involves parsing, planning, and execution, with the query planner making critical decisions about how to retrieve your data efficiently.

In this post, I'm going to show you how this works behind the scenes. But to understand how Postgres retrieves data, we first need to understand how it stores data, and the two are more connected than you might think.

A Simple Users Table

Let's keep everything simple. Imagine we have a basic users table that has this structure:

CREATE TABLE users (
  id       SERIAL PRIMARY KEY,
  username VARCHAR(50) UNIQUE NOT NULL
);
        
CREATE INDEX index_users_username ON users (username);

(Strictly speaking, the UNIQUE constraint already creates an implicit index on username, so this one is redundant. We create an explicitly named index anyway so it's easy to refer to throughout the post.)

We'll use this users table to break down Postgres's inner mechanisms.

1. How Postgres Stores Data on Disk

Before we get into the details of how Postgres fetches data, we should first look at how that data is actually stored on disk. Understanding the storage method is crucial for figuring out why some queries are slow even when you have the right indexes in place.

Postgres uses 8192-byte (8 KB) fixed-size blocks, known as pages, for data storage. This size works well with operating system page sizes (usually 4 KB) and disk sector sizes, providing a good balance between I/O efficiency and space. It's configurable at compile time, but almost nobody changes it.
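You can check the page size on your own installation; it's exposed as a read-only setting:

```sql
-- Reports the compile-time page size, typically 8192 bytes
SHOW block_size;
```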

These 8 KB pages have a limitation: tuples (or rows) can't be split across multiple pages. This presents a problem when dealing with large field values, such as TEXT, JSONB, or bytea. To mitigate this, Postgres uses TOAST (The Oversized-Attribute Storage Technique), which compresses and breaks large fields into smaller chunks, subsequently storing them in a separate TOAST table.
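As a quick aside, you can find the TOAST table associated with a given table through the pg_class catalog; a minimal sketch:

```sql
-- reltoastrelid is 0 when the table has no TOAST table
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'users';
```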

This 8 KB page size allows Postgres to be efficient in most cases. When it reads data from disk, it reads entire pages. When it writes data, it writes entire pages. Since pages are fixed-size blocks, Postgres can efficiently cache frequently accessed ones in shared buffers, reducing disk I/O.
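If you want to see that caching in action, the pg_buffercache contrib extension exposes one row per shared buffer. A sketch (matching on the filenode and the current database to avoid cross-database collisions):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count the 8 KB buffers currently holding pages of the users table
SELECT count(*) AS buffered_pages
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('users')
  AND reldatabase = (SELECT oid FROM pg_database
                     WHERE datname = current_database());
```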

2. The Three Forks: Heap, FSM, and Visibility Map

As we know, Postgres uses 8 KB pages, a design choice that balances efficiency and storage. However, not all pages are equal. A single table is stored as multiple files, called forks: the main fork holds heap pages with your actual row data; the FSM fork tracks free space within the table; and the VM fork tracks row visibility at the page level. While all forks share the same 8 KB page structure, the contents of each are entirely different.

Fork   File Suffix   Purpose
Main   (none)        Heap pages containing actual row data
FSM    _fsm          Free Space Map — tracks free space in each page
VM     _vm           Visibility Map — tracks which pages are fully visible

Heap Pages store the actual table rows. We're going to discuss these in detail in the next section.

Free Space Map (FSM) pages track and locate available space within a table or an index. Postgres uses them to efficiently reuse space for inserts and updates, reducing table bloat. FSM pages are updated during INSERT and UPDATE operations and are kept accurate primarily by VACUUM.
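The FSM itself is inspectable via the pg_freespacemap contrib extension; a minimal sketch:

```sql
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Approximate free bytes recorded for each page of users
SELECT * FROM pg_freespace('users');
```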

Visibility Map (VM) pages accompany the heap pages and track which of them contain only tuples (rows) visible to all active transactions. They're primarily used during VACUUM and autovacuum operations. The VM stores two bits per heap page: the first (the all-visible bit) indicates that every row version on that page is visible to current and future transactions; the second (the all-frozen bit) indicates that all rows on the page are "frozen," meaning they're old enough that Postgres no longer needs to track their transaction IDs.
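Those two bits are directly visible through the pg_visibility contrib extension:

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- One row per heap page, with its all-visible and all-frozen bits
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('users');
```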

The good part? We can actually look inside these pages. We can use the pageinspect extension to inspect the internal, low-level contents and structures of our users table's pages and rows.

CREATE EXTENSION pageinspect;

Let's check our users table disk structure:

blog=# SELECT pg_relation_filepath('users');

pg_relation_filepath 
----------------------
 base/461880/461882
(1 row)
    

The pg_relation_filepath function returns the output relative to the main Postgres data directory: base/{DB_OID}/{REL_FILENODE}. In this case, the database OID is 461880 and the users table filenode is 461882. The FSM and VM forks are located at 461882_fsm and 461882_vm, respectively. As we know the physical location of our data, we can now examine the page structure:

❯ ls -la /Users/leonardodarosa/gdk/postgresql/data/base/461880/461882*

-rw-------  1 leonardodarosa  staff   8192 Jan  7 19:38 /Users/leonardodarosa/gdk/postgresql/data/base/461880/461882
-rw-------  1 leonardodarosa  staff  24576 Jan  7 19:39 /Users/leonardodarosa/gdk/postgresql/data/base/461880/461882_fsm
-rw-------  1 leonardodarosa  staff   8192 Jan  7 19:39 /Users/leonardodarosa/gdk/postgresql/data/base/461880/461882_vm
    

We can interpret this as: base/my_database/users, base/my_database/users_fsm, and base/my_database/users_vm.

We can do the same for the index_users_username index:

blog=# SELECT pg_relation_filepath('index_users_username');

pg_relation_filepath 
----------------------
 base/461880/461890
(1 row)
    

Putting it all together, the users table and its index have the following on-disk structure:

$PGDATA/base/461880/
├── 461882      Users Main (heap pages with row data)
├── 461882_fsm  Users Free Space Map (FSM)
└── 461882_vm   Users Visibility Map (VM)
├── 461890      index_users_username index
└── 461890_fsm  index_users_username Free Space Map (FSM)

Note: Small tables don't immediately have FSM or VM files right after table creation. Instead, Postgres waits until they're needed. You might need to run VACUUM first, or simply populate the table with enough data, for those files to be created.

Each fork is essentially a sequence of 8 KB pages. The main fork contains your data. The FSM and VM are smaller files that track metadata about the heap pages.

2.1 Anatomy of Heap Pages

Heap pages store the actual table rows. Each page has a fixed structure:

Section         Size           Description
Page Header     24 bytes       Metadata: LSN (WAL position), checksum, flags, and pointers to free space.
Line Pointers   4 bytes each   Array of pointers to tuples. Each entry holds the offset and length of a tuple. New pointers are added at the end.
Free Space      Variable       Empty space between line pointers and tuples. Shrinks as data is added.
Tuples (rows)   Variable       Actual row data. Each tuple has a 23-byte header (xmin, xmax, ctid, etc.), an optional null bitmap, and the column values.
Special Space   0 bytes        Used by indexes for extra metadata. Empty for heap pages.
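We can see these sections on a live page with pageinspect's page_header function: lower marks the end of the line-pointer array, upper marks the start of the tuple data, and the gap between them is the free space:

```sql
-- lower/upper delimit the free space; special points at the (empty) special area
SELECT lower, upper, special, pagesize
FROM page_header(get_raw_page('users', 0));
```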

Perfect. Now we know how data is stored on disk and how Postgres structures everything into 8 KB pages. But what's actually inside a tuple? Let's take a closer look at how a row is stored.

3. What Happens During an INSERT

When you execute an INSERT statement, Postgres doesn't just append your values to a file. First, it builds a tuple: each row gets a header with internal metadata, including the transaction that created the row, its location within the page, and other flags. Consider the following statement:

INSERT INTO users (id, username) VALUES (4, 'l.rosa');

When you execute the INSERT statement, Postgres goes through a few steps:

1. Find space for the row. Remember the FSM fork? Postgres checks it to find a page with enough room for our row. If no page has space, it adds a new one.
2. Load the page into memory. If the page isn't already cached, Postgres reads the whole 8 KB page from disk.
3. Build the tuple. Our data, (4, 'l.rosa'), gets a header attached with internal metadata, and the columns are packed as raw bytes.
4. Write to the page. Postgres adds a line pointer (to track where the tuple is) and writes the tuple into the free space.

Finally, the page is marked as "dirty," but it doesn't go to disk right away. This might sound risky: what if Postgres crashes? That's precisely where the Write-Ahead Log (WAL) comes in. Before Postgres even alters the page in memory, it writes a record of the change to the WAL. The WAL is sequential and gets flushed to disk immediately on commit. So the change is preserved, even though the actual page write is deferred.

Postgres optimizes performance by batching page writes (during checkpoints). However, the WAL ensures that committed data isn't lost. This "log first, write later" approach is why it's called write-ahead logging.
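You can watch the WAL advance as changes are logged. A small sketch (the inserted username is made up for the demo):

```sql
SELECT pg_current_wal_insert_lsn();  -- note the current WAL position

INSERT INTO users (username) VALUES ('wal_demo');  -- hypothetical row

SELECT pg_current_wal_insert_lsn();  -- the position moved: the change was logged
```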

Let's see what we actually wrote to disk:

SELECT lp, t_ctid, encode(t_data, 'escape') AS readable_data
FROM heap_page_items(get_raw_page('users', 0))
WHERE t_data IS NOT NULL;
     
 lp | t_ctid |        readable_data        
----+--------+-----------------------------
  4 | (0,4)  | \x01\000\000\000\ralice
  5 | (0,5)  | \x02\000\000\000        bob
  6 | (0,6)  | \x03\000\000\000\rmaria
  7 | (0,7)  | \x04\000\000\000\x0Fl.rosa
(4 rows)

But we're not done yet. Remember we have an index on username? Postgres has to update it too. It finds the proper position within the B-tree and then adds a new entry. This entry holds the indexed value, 'l.rosa', along with a pointer, the TID, that directs back to the tuple's location. Essentially, the index entry looks like this: 'l.rosa' -> (0, 7). The username points to page 0, line 7.

SELECT itemoffset, ctid, data 
FROM bt_page_items('index_users_username', 1);
     
 itemoffset | ctid  |          data           
------------+-------+-------------------------
          1 | (0,4) | 0d 61 6c 69 63 65 00 00
          2 | (0,5) | 09 62 6f 62 00 00 00 00
          3 | (0,7) | 0f 6c 2e 72 6f 73 61 00
          4 | (0,6) | 0d 6d 61 72 69 61 00 00
(4 rows)

The ctid is the tuple ID that points to the location of the row in the heap. While index entries are sorted in alphabetical order, the ctid values reflect the order in which the rows were inserted. That's the whole point of an index: providing sorted access to unsorted data.
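We can follow such a pointer by hand: ctid is a system column available on every table, so fetching the exact tuple an index entry points at looks like this:

```sql
-- Fetch the row directly by its physical location (page 0, line 7)
SELECT id, username FROM users WHERE ctid = '(0,7)';
```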

Keep in mind: every time you INSERT, UPDATE, or DELETE a row, every index on that table gets updated. If you have five indexes, that's five extra writes for each row change. This is why you should be careful about adding indexes to large tables: they make reading data faster, but slow down writing, and it's also why bulk operations can be slow on heavily indexed tables.
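A practical corollary: before paying the write cost of another index, check whether the existing ones are even used. The cumulative statistics views track exactly that:

```sql
-- idx_scan counts how often each index on users has been used
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'users';
```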

4. The Query Lifecycle

Now we know how Postgres stores data, through pages, tuples, indexes, all of it. But what happens when we query that data? When we run SELECT * FROM users WHERE username = 'l.rosa', Postgres processes it through some phases:

Phase     What it does
Parse     Checks the syntax, builds a parse tree
Plan      Chooses the best execution strategy
Execute   Runs the plan, returns results

Note: Postgres actually has a few more phases (like Analyze and Rewrite), but we'll focus on these three to keep things simple.

4.1 Parse

When Postgres receives your query, it first checks if it's a valid SQL statement. Is the syntax correct? Are the keywords in the right place? If you write SELEC * FROM users, it'll catch the typo here. After validation, Postgres converts the SQL string into a parse tree, a structured representation it can use internally.

4.2 Plan

Now Postgres has to figure out how to get the data. Should it scan every row (seq scan)? Use an index (index scan, index-only scan)? If there are joins, what's the order? The planner checks the table statistics and available indexes to estimate the cost of different approaches, and it picks the cheapest one.

The query planner uses table statistics to determine things like how many rows we have in the table, how data is distributed, etc. If the statistics are outdated or inaccurate, the planner can make bad decisions. That's why you might see a slow query getting faster after running ANALYZE my_table;.
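You can peek at the statistics the planner consults through the pg_stats view (populated by ANALYZE):

```sql
-- Per-column statistics for users: distinct-value estimates, null fractions, etc.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'users';
```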

You can check the actual query plan by prepending EXPLAIN to your query. EXPLAIN shows expected costs based on table statistics without executing the query, while EXPLAIN ANALYZE actually runs the query and shows real runtime statistics.

EXPLAIN ANALYZE SELECT * FROM users WHERE username = 'l.rosa';
                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Seq Scan on users (cost=0.00..1.05 rows=1 width=122) (actual time=0.007..0.007 rows=1 loops=1)
   Filter: ((username)::text = 'l.rosa'::text)
   Rows Removed by Filter: 3
 Planning Time: 0.045 ms
 Execution Time: 0.015 ms
(5 rows)

Something curious happened in the query plan: we have an index on username, but the planner chose a sequential scan (Seq Scan on users) anyway. Why? The table has just 4 rows sitting in a single page. Reading that one page and checking 4 rows is faster than traversing the B-tree index, finding the pointer, and then fetching the heap tuple. The planner does the math and picks the less expensive path.

We can compare how the query performs with an index by forcing Postgres to use index_users_username. We can disable sequential scans for the session with SET enable_seqscan = OFF;:

EXPLAIN ANALYZE SELECT * FROM users WHERE username = 'l.rosa';
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_users_username on users (cost=0.13..2.15 rows=1 width=122) (actual time=0.054..0.055 rows=1 loops=1)
   Index Cond: ((username)::text = 'l.rosa'::text)
 Planning Time: 0.068 ms
 Execution Time: 0.070 ms
(4 rows)

Look at the costs: Seq Scan (cost=0.00..1.05, actual time=0.007..0.007) vs. Index Scan (cost=0.13..2.15, actual time=0.054..0.055). The index scan is more expensive for this tiny table, and there's a reason why: the Seq Scan reads one page and checks 4 rows. Done. The Index Scan traverses the B-tree to find 'l.rosa', gets the TID (0,7), then fetches the tuple from the heap. More steps, more overhead.

With just 4 rows, an index is unnecessary. However, as the table grows, the equation changes. Scanning each page sequentially becomes a costly operation, and the index starts to make sense.
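You can watch the planner flip on its own by growing the table. A sketch using generate_series (the generated usernames are made up; the setval call resyncs the id sequence, since we inserted explicit ids earlier):

```sql
-- Resync the SERIAL sequence past the manually inserted ids
SELECT setval('users_id_seq', (SELECT max(id) FROM users));

-- Bulk-insert ~100k rows and refresh the statistics
INSERT INTO users (username)
SELECT 'user_' || g FROM generate_series(1, 100000) AS g;
ANALYZE users;

-- With enough rows, the planner now picks the index on its own
EXPLAIN SELECT * FROM users WHERE username = 'l.rosa';
```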

4.3 Execute

Finally, Postgres runs the plan. But it doesn't process your query in a "top-to-bottom" way. It follows a logical order of execution:

Order   Clause         What it does
1       FROM/JOIN      Identify source tables, combine rows
2       WHERE          Filter rows
3       GROUP BY       Group the results
4       HAVING         Filter groups
5       SELECT         Pick columns, evaluate expressions, apply DISTINCT
6       ORDER BY       Sort the results
7       LIMIT/OFFSET   Restrict output rows

For each step, Postgres reads pages from disk (or shared buffers if cached), finds the tuples, checks if they're visible to the current transaction, and returns the results.

Page access: FROM loads pages, JOIN may re-access them, WHERE checks visibility on buffered data, and SELECT works in memory.

This execution order also explains a common misconception: some people think that adding a LIMIT always makes queries fast by fetching only X rows. It depends. Without an ORDER BY, Postgres can stop early once it has enough rows. But adding an ORDER BY without a supporting index forces Postgres to fetch and sort everything before cutting the result down to LIMIT X.

-- This is fast, because Postgres can stop after finding the first 10 rows
SELECT * FROM users WHERE active = true LIMIT 10;

-- This is slow, because Postgres has to fetch ALL active users, sort them, then limit the results
SELECT * FROM users WHERE active = true ORDER BY created_at LIMIT 10;

In the example above, an index on created_at would let Postgres stop early. It can walk the index in order and quit after 10 rows.
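To make that concrete, assuming our users table actually had the active and created_at columns from the example above (our toy table doesn't), the supporting index would look like this:

```sql
-- Hypothetical: assumes a created_at column exists on users
CREATE INDEX index_users_created_at ON users (created_at);

-- Postgres can now walk the index in created_at order and stop after 10 rows
EXPLAIN SELECT * FROM users WHERE active = true ORDER BY created_at LIMIT 10;
```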

Wrapping Up

We've covered a lot of ground. Let's recap the journey:

Storage: Postgres organizes its data in 8 KB pages. Essentially, your table is just a file on disk, broken down into fixed-size chunks. When dealing with large values, the system uses TOAST to store them in separate chunks.

Forks: A single table has three forks: heap pages, which store your actual data; the Free Space Map (FSM), which monitors available space; and the Visibility Map (VM), which manages data visibility. Though each has a distinct function, they all share the same 8 KB structure.

Tuples: When you perform an INSERT, Postgres finds a page with available space, builds a tuple with a header, and then writes it. Simultaneously, every index associated with the table is also updated.

Query execution: Your query goes through Parse -> Plan -> Execute. The planner picks the cheapest path based on table statistics. Sometimes that's a Seq Scan (reads everything), sometimes an Index Scan (B-tree lookup -> heap fetch).

The tradeoff: Indexes make reading data faster, but they can slow down writing. Sequential scans are better for small tables, whereas index scans are better for large tables. The query planner usually makes the right call, assuming your statistics are up to date.

MVCC: We've mentioned visibility checks, specifically xmin and xmax, a few times now. This is all about MVCC, the way Postgres manages concurrent transactions without locking up reads. It's a complex topic and one that requires a more comprehensive explanation, which I'll get into in the next post.

Until then, try running EXPLAIN ANALYZE on your queries and see what path the planner picks. You might be surprised.

See you in the next article!