Skip to content

MVCC and Concurrency Control

TL;DR

PostgreSQL never makes readers wait for writers because every row carries invisible version stamps -- and understanding this mechanism is the difference between saying "just use transactions" in an interview and actually knowing what that costs.

MVCC row versioning with xmin and xmax transaction IDs


What MVCC Actually Is

Most candidates know that MVCC stands for Multi-Version Concurrency Control. Few know what that means at the row level.

Here's the idea. Every row in PostgreSQL has two hidden columns:

  • xmin: the transaction ID that created this row version
  • xmax: the transaction ID that deleted or updated this row version (0 if still alive)

When you UPDATE a row, PostgreSQL does not modify it in place. It marks the old version with an xmax and inserts a brand new copy with a fresh xmin. The old version stays on disk until VACUUM removes it.

-- You can actually see these hidden columns
SELECT xmin, xmax, * FROM orders WHERE id = 42;

-- Result:
--  xmin  | xmax | id |  status
-- -------+------+----+---------
--  10500 |    0 | 42 | pending

This is radically different from MySQL InnoDB, which stores the current version in-place and keeps old versions in an undo log. PostgreSQL's approach means:

  • Reads never block writes. A reader sees a snapshot, not the live row.
  • Writes never block reads. The writer creates a new version; old readers still see the old one.
  • Writes DO block writes. Two transactions updating the same row -- one waits.

Snapshots: How Transactions See the World

When a transaction starts (or when a statement starts, depending on isolation level), PostgreSQL takes a snapshot. This snapshot records:

  1. The list of all currently in-progress transaction IDs
  2. The next transaction ID that will be assigned

A row version is visible to your transaction if:

  • xmin was committed before your snapshot was taken
  • xmax is either 0 (not deleted) or was not committed at snapshot time

This is why you never see "dirty reads" in PostgreSQL. It's structurally impossible. The row version from an uncommitted transaction simply doesn't pass the visibility check.

Transaction A (xid=100):  UPDATE orders SET status='shipped' WHERE id=42;
                           -- Creates new row version with xmin=100
                           -- Old version gets xmax=100

Transaction B (xid=101):  SELECT * FROM orders WHERE id=42;
                           -- Checks: is xmin=100 committed? No.
                           -- Skips this version, sees the old one.
                           -- No dirty read. Ever.

Transaction Isolation Levels

PostgreSQL isolation levels and their visibility guarantees

PostgreSQL supports three isolation levels. Knowing which one to pick -- and what each one costs -- is a system design interview skill.

READ COMMITTED (Default)

Each statement gets a fresh snapshot. If another transaction commits between your two SELECTs, the second SELECT sees the new data.

-- Transaction A
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees 1000
-- Transaction B commits: UPDATE accounts SET balance = 500 WHERE id = 1;
SELECT balance FROM accounts WHERE id = 1;  -- sees 500 (new snapshot!)
COMMIT;

Good enough for 95% of applications. This is what every major web framework uses by default.

REPEATABLE READ

The snapshot is taken once, at the start of the transaction. Every query in that transaction sees the same data, no matter what other transactions commit.

-- Transaction A
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- sees 1000
-- Transaction B commits: UPDATE accounts SET balance = 500 WHERE id = 1;
SELECT balance FROM accounts WHERE id = 1;  -- still sees 1000!
COMMIT;

The cost: if Transaction A tries to UPDATE a row that Transaction B already modified and committed, PostgreSQL aborts Transaction A with a serialization error. Your application must retry.

SERIALIZABLE

The gold standard. PostgreSQL guarantees that concurrent transactions produce the same result as if they ran one at a time. It uses Serializable Snapshot Isolation (SSI) -- tracking read and write dependencies between transactions.

-- Classic anomaly: write skew
-- Two doctors on call. Rule: at least one must stay.
-- Doctor A checks: "Doctor B is on call? OK, I'll go off call."
-- Doctor B checks: "Doctor A is on call? OK, I'll go off call."
-- Both go off call. Nobody's covering.

-- SERIALIZABLE catches this. One transaction gets aborted.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM on_call WHERE shift = 'tonight';  -- sees 2
UPDATE on_call SET active = false WHERE doctor_id = 'A';
COMMIT;  -- might fail with serialization_failure

The cost: more aborts, more retries, ~20-30% throughput reduction in write-heavy workloads. Worth it for financial systems. Overkill for a comment section.

The Decision

Level Snapshot Scope Anomalies Prevented Retry Logic Needed? Use When
READ COMMITTED Per statement Dirty reads No Default. Most CRUD apps.
REPEATABLE READ Per transaction + Non-repeatable reads Yes (serialization errors) Reports, analytics queries
SERIALIZABLE Per transaction + Phantom reads, write skew Yes (more frequent) Financial transactions, inventory

Spicy opinion: Most teams that pick SERIALIZABLE don't need it. They need an application-level idempotency check. SERIALIZABLE is a database-level hammer for what's usually an application-level nail.


VACUUM: The Janitor You Can't Ignore

Remember those old row versions? They don't clean themselves up. That's VACUUM's job.

When you UPDATE a row, the old version becomes a dead tuple. It sits on disk, invisible to all transactions, wasting space. Multiply this by millions of writes per day and your 10GB table becomes a 50GB table. This is called table bloat.

Before VACUUM:
Page: [live] [dead] [live] [dead] [dead] [live] [dead]

After VACUUM:
Page: [live] [free] [live] [free] [free] [live] [free]

VACUUM marks dead tuple space as reusable but does NOT return it to the operating system. The table file stays the same size. Only VACUUM FULL (which rewrites the entire table with an exclusive lock) actually shrinks it.

Autovacuum

PostgreSQL runs autovacuum by default. It kicks in when:

  • Dead tuples exceed autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor * number of rows)
  • Default: 50 + (20% of row count). So a 10M row table triggers at 2,000,050 dead tuples.

For large tables, 20% is way too high. Uber's PostgreSQL deployment famously had issues with bloat before they tuned autovacuum aggressively (and eventually moved to MySQL for other reasons too).

-- Common production tuning for hot tables
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,   -- 1% instead of 20%
    autovacuum_vacuum_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.005
);

Transaction ID Wraparound

PostgreSQL uses 32-bit transaction IDs. That's about 4 billion. When you approach the limit, PostgreSQL must "freeze" old transaction IDs -- essentially rewriting old tuples to mark them as "visible to everyone forever." If autovacuum can't keep up, PostgreSQL will refuse all writes to prevent corruption. This is the dreaded wraparound shutdown.

Samsara hit this in production. The fix: monitor age(datfrozenxid) and make sure it never approaches 2 billion.

-- Check how close you are to wraparound danger
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY age DESC;
-- If age > 1 billion, you need to investigate immediately

HOT Updates: The Optimization Nobody Talks About

When you update a column that has no index on it, PostgreSQL can perform a Heap-Only Tuple (HOT) update. Instead of creating a completely new row version that must be indexed, it chains the new version to the old one on the same heap page.

Why this matters: a regular update touches the heap AND every index on the table. A HOT update touches only the heap. For a table with 8 indexes, that's 8x less I/O.

-- This table has indexes on (user_id) and (email)
-- Updating 'last_login' (not indexed) qualifies for HOT
UPDATE users SET last_login = NOW() WHERE id = 42;

-- Updating 'email' (indexed) does NOT qualify for HOT
UPDATE users SET email = 'new@example.com' WHERE id = 42;

System design takeaway: When designing your schema, keep frequently-updated columns out of indexes. A "last_seen" timestamp column with an index on it kills HOT updates for every heartbeat write. Remove that index unless you actually query by it.


Locks: What "SELECT FOR UPDATE" Really Does

PostgreSQL has row-level locks, table-level locks, and advisory locks. For system design interviews, you need to know the first and last.

Row-Level Locks

-- Pessimistic locking: grab the row, hold it until COMMIT
BEGIN;
SELECT * FROM inventory WHERE product_id = 42 FOR UPDATE;
-- Now no other transaction can UPDATE or DELETE this row
-- (But they CAN still SELECT it -- MVCC!)
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 42;
COMMIT;

FOR UPDATE is the go-to for inventory deduction, seat reservations, anything where you need to check-then-modify atomically. The variant FOR UPDATE SKIP LOCKED is a hidden gem for job queues:

-- Worker grabs one unclaimed job, skipping any locked by other workers
BEGIN;
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

UPDATE jobs SET status = 'processing', worker_id = 'w-123' WHERE id = ...;
COMMIT;

This is how Que, Delayed Job, and other PostgreSQL-backed job queues work internally. It turns PostgreSQL into a simple but effective message queue.

Advisory Locks

Application-level locks that live entirely in PostgreSQL's lock manager. They don't correspond to any table or row.

-- Acquire a lock on a custom "resource" identified by a bigint
SELECT pg_advisory_lock(12345);
-- Do exclusive work...
SELECT pg_advisory_unlock(12345);

-- Try without blocking (returns true/false)
SELECT pg_try_advisory_lock(12345);

Great for distributed cron jobs, migrations, cache rebuilds -- any time you need exactly-once execution across multiple app servers.

Deadlock Detection

PostgreSQL runs a deadlock detector every deadlock_timeout (default: 1 second). When it detects a cycle, it kills the youngest transaction.

Transaction A: locks row 1, waits for row 2
Transaction B: locks row 2, waits for row 1
-- Deadlock! PostgreSQL kills one of them.

Prevention: always lock rows in a consistent order. If you're updating multiple inventory items, sort by product_id first.


PostgreSQL MVCC vs. MySQL InnoDB

Both use MVCC. The implementation differs in ways that matter at scale.

Aspect PostgreSQL MySQL InnoDB
Where old versions live In the same heap table (as dead tuples) In a separate undo log
Cleanup mechanism VACUUM removes dead tuples Purge thread cleans undo log
Update cost Creates new tuple + updates all indexes (unless HOT) Updates in-place, stores old version in undo
Read performance under load Stable -- snapshot reads from heap Can degrade if undo log grows (long transactions)
Bloat risk Table bloat if VACUUM can't keep up Undo log bloat if purge can't keep up
Index impact on updates Every indexed column change = new index entry Clustered index update in-place, secondary indexes use change buffer

The real difference: PostgreSQL's approach is simpler to reason about but requires more maintenance (VACUUM). MySQL's approach is more complex internally but generally self-managing. Neither is strictly better.

For system design interviews, know that PostgreSQL's MVCC means your reads are fast and consistent, but you must plan for VACUUM overhead. If you're doing millions of updates per hour on a single table, you need autovacuum tuning from day one.


Patterns for System Design Interviews

Pattern 1: "How do you handle concurrent purchases?"

Bad answer: "Use transactions."

Good answer: "Use SELECT FOR UPDATE to lock the inventory row, check quantity, decrement, and commit. Under MVCC, other readers still see the old quantity until we commit. For high-contention items (flash sales), consider SKIP LOCKED with a queue pattern to avoid lock pile-ups."

Pattern 2: "How do you ensure consistency across two tables?"

"PostgreSQL transactions are ACID. Wrap the two writes in a single transaction. Under READ COMMITTED (the default), we won't see partially committed state from other transactions. If we need cross-service consistency, that's a distributed transaction problem and PostgreSQL alone won't solve it."

Pattern 3: "What happens to your database under heavy writes?"

"Dead tuples accumulate. We tune autovacuum to run more frequently on hot tables -- dropping vacuum_scale_factor from 0.2 to 0.01 or lower. We monitor n_dead_tup in pg_stat_user_tables and age(datfrozenxid) to avoid wraparound. For extremely write-heavy workloads (>50K writes/sec sustained), we consider partitioning to keep per-partition VACUUM fast."


Trade-offs Table

Decision Upside Downside
MVCC (tuple versioning) Reads never block writes Dead tuples accumulate; requires VACUUM
READ COMMITTED Simple, no retry logic Can see phantom reads between statements
REPEATABLE READ Consistent view for whole transaction Must handle serialization errors in app code
SERIALIZABLE Strongest correctness guarantee 20-30% throughput hit; frequent retries
SELECT FOR UPDATE Simple pessimistic concurrency Lock contention under high concurrency
SKIP LOCKED No lock pile-ups Starvation possible for unlucky rows
Advisory locks Cheap distributed coordination Participates in deadlock detection, but session-level locks are not released on transaction commit -- must unlock explicitly
Aggressive autovacuum Prevents bloat and wraparound More I/O; can interfere with production queries

Interview Gotchas

Gotcha 1: 'Transactions are free'

They're not. Every open transaction holds a snapshot, preventing VACUUM from cleaning up dead tuples created after that snapshot. A single forgotten BEGIN with no COMMIT in a connection pool can cause runaway bloat. Stripe wrote about this exact issue causing a production incident.

Gotcha 2: 'PostgreSQL can handle any write volume'

PostgreSQL is single-primary. All writes go to one node. At somewhere around 50K-100K writes/sec (depending on row size and indexes), you hit CPU or I/O limits. Beyond that, you need partitioning, sharding (Citus), or a different database.

Gotcha 3: 'MVCC means no locks'

MVCC eliminates read-write conflicts. Write-write conflicts still require locks. Two transactions updating the same row will block. Two transactions inserting rows with the same unique key will block. Concurrent access is not lock-free access.

Gotcha 4: 'Just use SERIALIZABLE everywhere'

SERIALIZABLE aborts transactions that might conflict, not just those that do. In a write-heavy workload, your retry rate can exceed 50%. Build retry logic in your application layer, or you'll cascade failures during traffic spikes.


Key Takeaways

  1. MVCC means readers never block writers. But dead tuples are the price. Tune autovacuum.
  2. READ COMMITTED is fine for 95% of apps. Only escalate to REPEATABLE READ or SERIALIZABLE when you have a specific anomaly to prevent.
  3. SELECT FOR UPDATE is your concurrency primitive. Learn it, love it, use SKIP LOCKED for queue patterns.
  4. Monitor three things: dead tuple count, autovacuum last run time, and transaction ID age. If any of these goes red, you have hours -- not days -- to fix it.
  5. "Just use transactions" is not an answer. Know the isolation level, the lock behavior, and the VACUUM cost. That's what separates senior candidates.