Distributed Transactions — 2PC, 3PC, and Saga

TL;DR

When a single operation spans multiple databases or services, you need a distributed transaction. 2PC (Two-Phase Commit) coordinates participants through prepare and commit phases -- it is correct but the coordinator is a single point of failure. 3PC adds a pre-commit phase to reduce blocking but does not fully solve the problem. Sagas abandon atomicity for eventual consistency by breaking a transaction into a sequence of local transactions with compensating actions. 2PC for database shards within one system. Sagas for microservices across teams. The core tension: stronger guarantees require more coordination, which means more latency and more failure modes.

The Problem

User buys a product. Three things must happen:

Deduct $50 from the user's wallet (Payment Service, database A).
Reduce inventory by 1 (Inventory Service, database B).
Create an order record (Order Service, database C).

If the payment succeeds but the inventory deduction fails (out of stock), you must refund the payment. If the order creation fails, you must refund the payment AND restore inventory. All three must succeed, or none must succeed. This is the atomicity requirement.

Within a single database, transactions handle this with ACID guarantees. Across databases or services, there is no shared transaction log, no shared lock manager, and no shared commit point. You need a protocol that coordinates independent participants to agree on "commit" or "abort."

The Algorithm: Two-Phase Commit (2PC)

Roles

Coordinator: Orchestrates the transaction. Decides commit or abort.
Participants: Execute local transactions. Vote to commit or abort.

Phase 1: Prepare (Voting)

Coordinator → all Participants: "PREPARE transaction T"

Each Participant:
  1. Execute the local transaction (but do not commit yet).
  2. Write a PREPARE record to its local WAL.
  3. Acquire all necessary locks.
  4. Reply VOTE-YES (ready to commit) or VOTE-ABORT (cannot commit).

Phase 2: Commit (Decision)

If ALL participants voted YES:
  Coordinator: write COMMIT to its own WAL.
  Coordinator → all Participants: "COMMIT transaction T"
  Each Participant: commit local transaction, release locks.

If ANY participant voted ABORT (or timed out):
  Coordinator: write ABORT to its own WAL.
  Coordinator → all Participants: "ABORT transaction T"
  Each Participant: rollback local transaction, release locks.

Worked Example

Transaction: Transfer $100 from Account A (DB1) to Account B (DB2)

Phase 1:
  Coordinator → DB1: "PREPARE: deduct $100 from Account A"
  Coordinator → DB2: "PREPARE: add $100 to Account B"

  DB1: checks balance >= $100. Yes. Locks row. Writes PREPARE to WAL.
       Reply: VOTE-YES.
  DB2: locks row. Writes PREPARE to WAL.
       Reply: VOTE-YES.

Phase 2:
  Coordinator: both voted YES. Writes COMMIT to WAL.
  Coordinator → DB1: "COMMIT"
  Coordinator → DB2: "COMMIT"

  DB1: commits. Releases lock. Balance updated.
  DB2: commits. Releases lock. Balance updated.

The Blocking Problem

2PC has a fatal flaw: if the coordinator crashes after sending PREPARE but before sending the commit/abort decision, participants are stuck.

Phase 1:
  Coordinator → DB1: "PREPARE"  → DB1 votes YES, holds locks.
  Coordinator → DB2: "PREPARE"  → DB2 votes YES, holds locks.

  Coordinator crashes.

  DB1 and DB2 are now blocked:
  - They voted YES, so they promised to commit if told to.
  - They cannot unilaterally abort (the coordinator might have committed).
  - They cannot unilaterally commit (the coordinator might have aborted).
  - They hold locks, blocking all other transactions on those rows.
  - They wait. Potentially forever.

This is the blocking problem. Participants who voted YES are in a state of uncertainty. They must wait for the coordinator to recover and replay its WAL to determine the decision. In practice, coordinators have a recovery timeout, but during that window, locks are held and affected data is inaccessible.

Why this matters in practice: A coordinator crash during the 10ms between "receive all votes" and "write COMMIT to WAL" causes all participants to hold locks until the coordinator recovers. If the coordinator is on a machine that takes 5 minutes to reboot, affected rows are locked for 5 minutes. In a high-throughput system, this cascades to timeouts and failures across the entire application.

The Algorithm: Three-Phase Commit (3PC)

3PC adds a pre-commit phase between voting and committing to reduce blocking.

Phase 1: Same as 2PC (Prepare/Vote)

Phase 2: Pre-Commit
  If all voted YES:
    Coordinator → Participants: "PRE-COMMIT"
    Participants acknowledge.
  If any voted ABORT:
    Coordinator → Participants: "ABORT"

Phase 3: Commit
  Coordinator → Participants: "COMMIT"
  Participants commit.

The idea: if a participant has received PRE-COMMIT, it knows the coordinator intended to commit. If the coordinator crashes, participants in the PRE-COMMIT state can elect a new coordinator and proceed with the commit. Participants that have NOT received PRE-COMMIT can safely abort.

Why 3PC Is Not Used in Practice

3PC solves blocking in the absence of network partitions. But in a real network, partitions happen. During a partition, one group of participants might be in PRE-COMMIT state while another group has not received PRE-COMMIT. If the partitioned groups independently elect new coordinators, one group commits and the other aborts. Data inconsistency.

3PC trades the blocking problem for a partition-safety problem. Since network partitions are common in distributed systems, this trade-off is unfavorable. The industry consensus: use 2PC with a highly available coordinator (replicated via Raft/Paxos), or use Sagas.

The Algorithm: Saga Pattern

A Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that undoes its effect.

Structure

Transaction T = T1, T2, T3
Compensations   C1, C2, C3

Success path: T1 → T2 → T3 → done
Failure at T3:  T1 → T2 → T3 (fails) → C2 → C1 → done

Each Ti is a local transaction (committed independently).
Each Ci undoes the effect of Ti.

Worked Example: Order Saga

T1: Reserve inventory          C1: Release inventory
T2: Charge payment             C2: Refund payment
T3: Create order               C3: Cancel order

Happy path:
  T1: Reserve 1 unit of "Widget" → success (inventory: 99)
  T2: Charge $50 from wallet    → success (balance: $450)
  T3: Create order #1234        → success
  Done.

Failure at T2:
  T1: Reserve 1 unit            → success (inventory: 99)
  T2: Charge $50                → FAIL (insufficient funds)
  C1: Release 1 unit            → success (inventory: 100)
  Done. Order never created.

Failure at T3:
  T1: Reserve 1 unit            → success (inventory: 99)
  T2: Charge $50                → success (balance: $450)
  T3: Create order              → FAIL (database error)
  C2: Refund $50                → success (balance: $500)
  C1: Release 1 unit            → success (inventory: 100)
  Done. All effects reversed.

Choreography vs Orchestration

Choreography: Each service listens for events and reacts. No central coordinator.

Inventory Service → emits "InventoryReserved" event
Payment Service   → listens for "InventoryReserved", charges payment
                  → emits "PaymentCharged" event
Order Service     → listens for "PaymentCharged", creates order

Pros: No single point of failure. Services are decoupled. Cons: Hard to understand the flow. Hard to add steps. Hard to debug.

Orchestration: A central orchestrator calls each service in sequence and handles failures.

Saga Orchestrator:
  1. Call Inventory Service: reserve()
  2. Call Payment Service: charge()
  3. Call Order Service: create()
  On failure at step N: call compensations for steps N-1 down to 1.

Pros: Clear flow. Easy to add steps. Easy to debug. Cons: Orchestrator is a coordination point (but not a SPOF if it persists its state).

In practice, orchestration is far more common in large organizations. The debugging advantage alone is worth the coordination cost.

The Compensation Problem

Compensation is not always a clean "undo."

T: Send email notification     C: ??? (cannot unsend email)
T: Ship physical package        C: ??? (cannot un-ship a package en route)
T: Post to external API         C: ??? (external system may not support rollback)

Semantic compensation: Instead of undoing, perform a corrective action.

Cannot unsend email? Send a follow-up: "Please disregard the previous email."
Cannot un-ship? Arrange a return pickup.
External API does not support rollback? Record the discrepancy for manual reconciliation.

This is messy. It is also reality. The Saga pattern acknowledges that distributed systems are messy, and gives you a framework for managing the mess rather than pretending it does not exist.

Proof/Correctness Intuition

2PC Safety

2PC guarantees atomicity because the commit decision is centralized. The coordinator makes one decision (commit or abort) and persists it to its WAL before broadcasting. All participants eventually learn the decision. No participant can commit while another aborts, because both received the same decision from the coordinator.

The liveness concern (blocking) is separate from safety. 2PC is always safe (never inconsistent), but it may not be live (may block).

Saga Consistency

Sagas do NOT provide atomicity. Between T1 completing and T3 completing, the system is in an intermediate state (inventory reserved but order not yet created). This intermediate state is visible to other transactions. This is called lack of isolation.

Sagas provide eventual consistency: either all forward transactions complete, or all compensating transactions complete. The system eventually reaches a consistent state. But during execution, it may be inconsistent.

Real-World Usage

Pattern	System/Use Case
2PC	Google Spanner (across Paxos groups within shards)
2PC	PostgreSQL (across partitions within one instance)
2PC	MySQL XA transactions (across storage engines)
Saga	Uber (trip lifecycle: request → match → ride → payment)
Saga	Amazon (order placement across microservices)
Saga	Netflix (Conductor orchestration framework)
Saga	Airbnb (booking across host, guest, payment services)

Spanner's approach: Google Spanner uses 2PC for transactions that span multiple Paxos groups (shards). The coordinator is itself replicated via Paxos, making it highly available. This eliminates the blocking problem in practice: the coordinator almost never "crashes" because it is a Paxos group that can survive minority failures. This is the gold-standard solution -- 2PC with a fault-tolerant coordinator -- but it requires the infrastructure to run Paxos.

Interview Application

When to mention each pattern:

"Design a payment system across microservices." -- Saga with orchestration. Compensating transactions for refunds.
"How does a distributed database handle multi-shard transactions?" -- 2PC with a replicated coordinator (Spanner model).
"How do you ensure consistency across services?" -- Saga for eventual consistency. 2PC only if you control all participants and need strong consistency.

Decision framework:

All participants are database shards you control?
  → 2PC (with replicated coordinator)

Participants are independent microservices?
  → Saga (orchestration > choreography)

Need strong consistency (financial, regulatory)?
  → 2PC or serializable transactions within a single database

Latency-sensitive, can tolerate temporary inconsistency?
  → Saga

What interviewers want to hear:

You know 2PC blocks if the coordinator fails.
You know Sagas are not atomic -- intermediate states are visible.
You can explain compensating transactions and their limitations.
You understand that 3PC is a theoretical improvement that fails in practice due to partitions.
You have an opinion on choreography vs orchestration (orchestration is usually the right answer for complex flows).

Trade-offs

Saga Pattern

Aspect	2PC	3PC	Saga
Atomicity	Yes	Yes	No (eventual consistency)
Blocking	Yes (coordinator failure)	Reduced (but partition issues)	No
Latency	2 round-trips	3 round-trips	Depends on step count
Complexity	Moderate	High	High (compensation logic)
Isolation	Full (locks held until commit)	Full	None (intermediate states visible)
Practical usage	Databases, Spanner	Almost never	Microservices, workflows
Failure recovery	Coordinator WAL replay	Election of new coordinator	Retry + compensate

The Isolation Problem with Sagas

Sagas lack isolation. During execution, intermediate states are visible. This can cause anomalies:

Saga executing: T1 (deduct balance) → T2 (reserve inventory) → T3 (create order)

Between T1 and T2:
  Another transaction reads the user's balance → sees the deducted amount.
  But the order might not complete (T3 might fail, triggering compensation).
  The other transaction acted on a state that will be rolled back.

Mitigation strategies: - Semantic locking: Add a status field ("pending") during the saga, only change to "committed" when all steps complete. - Commutative operations: Design operations so that intermediate states do not cause problems. - Versioning: Use version numbers to detect and retry conflicting reads.

Common Mistakes

2PC is too slow, always use Sagas

2PC latency is 2 round-trips to participants. For database shards in the same datacenter, this is ~1ms. That is fast enough for most OLTP workloads. Sagas add complexity (compensation logic, intermediate states) that is only justified when participants are independent services with high latency between them.

3PC solves the blocking problem

3PC solves blocking in a crash-only failure model (no network partitions). In a real network with partitions, 3PC can lead to inconsistency. This is why 3PC is not used in production systems.

Saga compensations are just rollbacks

Compensations are forward actions that semantically undo the effect. They are not database rollbacks. A compensation must succeed even if the system state has changed since the original transaction. This requires careful design: compensations must be idempotent (safe to retry), and they must handle the case where the original transaction's effects have been partially overwritten by other transactions.

Choreography is better because there is no single point of failure

Choreography distributes coordination into event handlers, making the transaction flow implicit and hard to debug. When a saga fails halfway through a choreographed flow, determining which compensations to run requires reconstructing the flow from event logs. Orchestration makes the flow explicit and debuggable.

You can use 2PC across microservices

Technically possible, but practically a disaster. 2PC requires all participants to hold locks during the protocol. If one microservice is slow or unavailable, all other participants hold locks until the timeout. In a microservices architecture with independent deployment cycles and varying availability, this creates cascading failures. Use Sagas.

Sagas guarantee all compensations will succeed

Compensations can fail too. What if the refund API is down? You need retry logic with exponential backoff, a dead-letter queue for compensations that repeatedly fail, and eventually, manual intervention. The Saga framework must handle "compensation failure" as a first-class concern.