Paxos and Multi-Paxos — Simplified

TL;DR

Paxos is the foundational consensus algorithm. Leslie Lamport published it in 1998, and it remains the theoretical bedrock of distributed consensus. The core protocol has three roles (Proposer, Acceptor, Learner) and two phases (Prepare/Promise, Accept/Accepted). It guarantees that a single value is chosen even with failures. Multi-Paxos optimizes this by electing a stable leader who skips the Prepare phase for subsequent rounds, turning consensus from two round-trips to one. Google Chubby and Spanner run Paxos. Most engineers should use Raft instead -- but understanding Paxos is understanding why consensus is hard.

The Problem

Same fundamental problem as Raft: multiple machines need to agree on a value (or a sequence of values). Any machine can crash. Messages can be delayed, reordered, or lost. You must never have two different values both "chosen."

The reason Paxos matters beyond Raft is twofold. First, Paxos came first and most academic literature references it. Second, Paxos is more flexible -- it does not mandate a single leader, it does not prescribe a log structure, and it can be adapted to different topologies. Google used this flexibility when building Spanner, which needed consensus across globally distributed datacenters where a single leader would be a latency bottleneck.

The reason most engineers avoid Paxos is also clear: Lamport's original paper used a metaphor of Greek legislators on the island of Paxos, and the reviewers found it so confusing that the paper sat unpublished for years. His follow-up, "Paxos Made Simple," opens with "The Paxos algorithm, when presented in plain English, is very simple." It was not.

The Algorithm

Roles

Proposer: Proposes a value. Drives the protocol forward.
Acceptor: Votes on proposals. The "memory" of the system.
Learner: Learns the chosen value once consensus is reached.

In practice, a single server often plays all three roles.

Phase 1: Prepare/Promise

Paxos Two-Phase Protocol

Proposer:
  1. Choose a proposal number n (unique, monotonically increasing).
  2. Send Prepare(n) to a majority of Acceptors.

Acceptor (on receiving Prepare(n)):
  3. If n > highest proposal number previously seen:
     - Promise: "I will not accept any proposal with number < n."
     - Reply Promise(n, accepted_value, accepted_number)
       where accepted_value/accepted_number are from the highest-numbered
       proposal this Acceptor has previously accepted (or null if none).
  4. If n <= highest seen: ignore or reject.

Phase 2: Accept/Accepted

Proposer (after receiving Promise from a majority):
  5. If any Promise included an accepted value:
     - Use the value from the highest-numbered accepted proposal.
  6. If no Promise included an accepted value:
     - Use the Proposer's own value.
  7. Send Accept(n, value) to a majority of Acceptors.

Acceptor (on receiving Accept(n, value)):
  8. If n >= highest promised proposal number:
     - Accept the proposal. Store (n, value).
     - Send Accepted(n, value) to all Learners.
  9. If n < highest promised: reject.

Learner:
  10. Once a Learner receives Accepted(n, value) from a majority
      of Acceptors, the value is chosen.

Worked Example: Single-Decree Paxos

Acceptors: A1, A2, A3 (majority = 2)
Proposer P1 wants to propose value "X"

Step 1: P1 sends Prepare(n=1) to A1, A2, A3.
Step 2: A1, A2, A3 have seen no prior proposals.
        All reply Promise(1, null, null).
Step 3: P1 receives Promises from A1, A2 (majority).
        No prior accepted values. P1 uses its own value "X".
Step 4: P1 sends Accept(1, "X") to A1, A2, A3.
Step 5: A1, A2 accept. Send Accepted(1, "X") to Learners.
Step 6: Learner sees Accepted from majority. Value "X" is chosen.

What Happens with Competing Proposers

This is where Paxos earns its complexity.

P1 wants "X", P2 wants "Y". Acceptors: A1, A2, A3.

Step 1: P1 sends Prepare(n=1) to A1, A2.
Step 2: A1, A2 reply Promise(1, null, null).
Step 3: P2 sends Prepare(n=2) to A2, A3.
Step 4: A2 sees n=2 > n=1. Promises not to accept proposals < 2.
        A2 replies Promise(2, null, null). A3 replies Promise(2, null, null).
Step 5: P1 sends Accept(1, "X") to A1, A2.
        A1 accepts (1 >= 1). A2 REJECTS (1 < 2, it promised higher).
        P1 only gets 1 acceptance. Not a majority. Fails.
Step 6: P2 sends Accept(2, "Y") to A2, A3.
        A2 accepts. A3 accepts. Majority. "Y" is chosen.

But what if P1 had already gotten "X" accepted before P2 started?

Step 1: P1 gets Accept(1, "X") accepted by A1, A2. "X" is chosen.
Step 2: P2 sends Prepare(n=2) to A2, A3.
Step 3: A2 replies Promise(2, "X", 1) — it already accepted "X" at proposal 1.
        A3 replies Promise(2, null, null).
Step 4: P2 sees that A2 already accepted "X" at proposal 1.
        P2 MUST propose "X" (the highest-numbered accepted value).
Step 5: P2 sends Accept(2, "X"). Value remains "X".

This is the critical insight: once a value is chosen, any future proposer will discover it and re-propose it. The protocol converges on the chosen value, regardless of how many proposers compete.

The Dueling Proposers Problem (Livelock)

Paxos guarantees safety but not liveness. Two proposers can indefinitely preempt each other:

P1: Prepare(1) → succeeds
P2: Prepare(2) → succeeds, invalidates P1's promises
P1: Accept(1) → rejected (promises are for n >= 2)
P1: Prepare(3) → succeeds, invalidates P2's promises
P2: Accept(2) → rejected (promises are for n >= 3)
... forever ...

This is not a bug -- it is an inherent limitation proved by the FLP impossibility result: no deterministic asynchronous consensus protocol can guarantee both safety and liveness. The practical solution is to elect a distinguished proposer (a leader). Which brings us to Multi-Paxos.

Multi-Paxos

Multi-Paxos Stable Leader Optimization

Single-decree Paxos agrees on one value. Real systems need to agree on a sequence of values -- a replicated log. Multi-Paxos runs a separate instance of Paxos for each log slot.

The key optimization: elect a stable leader. Once a leader is established, it can skip Phase 1 entirely for subsequent log slots because it knows its proposal number is already the highest. This reduces consensus from two round-trips to one.

Without Multi-Paxos optimization:
  Each log entry: Prepare → Promise → Accept → Accepted (2 RTTs)

With stable leader:
  First entry: Prepare → Promise → Accept → Accepted (2 RTTs)
  All subsequent: Accept → Accepted (1 RTT)

This is essentially what Raft does -- Raft is a specific, well-defined version of Multi-Paxos with a strong leader, a specific log structure, and explicit membership change rules.

Multi-Paxos vs Raft

Aspect	Multi-Paxos	Raft
Leader	Optional optimization	Mandatory
Log gaps	Allowed (slots can be filled out of order)	No gaps allowed
Leadership transfer	Implicit (any proposer can try)	Explicit (RequestVote RPC)
Specification	Abstract family of protocols	Concrete, complete protocol
Implementation clarity	Ambiguous in many details	Detailed in paper
Membership changes	Not well specified	Joint consensus / single-server

The "log gaps" difference matters in practice. Multi-Paxos can commit slot 5 before slot 3 if different proposers are driving different slots. Raft never allows this -- the leader drives all slots sequentially. This makes Raft simpler but can be less efficient in certain deployment patterns.

Proof/Correctness Intuition

Paxos safety rests on two properties:

Property 1: At most one value is chosen. Once a value v is accepted by a majority at proposal number n, any higher-numbered proposal that reaches Phase 2 will also propose v. The Prepare phase forces the proposer to discover v because it must contact a majority, and that majority overlaps with the majority that accepted v.

Property 2: A chosen value is never overwritten. This follows from Property 1. Any future proposal will re-propose the same value. The system converges.

The majority quorum is the linchpin. Two majorities of the same set always overlap. This overlap is what transfers information from old proposals to new proposals, preventing conflicting values from both being chosen.

Think of it as a relay race. The baton (the chosen value) is passed from majority to majority. No matter how many proposals compete, the baton never gets duplicated.

Real-World Usage

System	How Paxos is Used
Google Chubby	Lock service for loosely-coupled distributed systems
Google Spanner	Paxos per shard for global consistency, combined with TrueTime
Google Megastore	Modified Paxos for cross-datacenter replication
Apache Zookeeper	Uses Zab (Zookeeper Atomic Broadcast), a Paxos-like atomic broadcast protocol
Microsoft Autopilot	Cluster management within Azure datacenters

Google is the major production user of Paxos. Their "Paxos Made Live" paper (2007) describes the engineering challenges: handling disk corruption, implementing master leases, managing group membership, and dealing with the fact that the original Paxos paper left many practical details unspecified.

The Chubby paper contains a famous quote: "There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system... the algorithm description is a compressed summary of the reasoning behind it."

This is why most new systems choose Raft. Not because Paxos is wrong, but because Raft gives you a complete blueprint.

Interview Application

When to mention Paxos:

"How does Spanner achieve global consistency?" -- Paxos per shard, combined with TrueTime for cross-shard transactions.
"What is the difference between Paxos and Raft?" -- Raft is a concrete, opinionated version of Multi-Paxos with a strong leader and no log gaps. Same safety guarantees.
"Why is consensus hard?" -- FLP impossibility. Paxos shows the minimal mechanism needed. Understanding Paxos means understanding why you cannot do better.

What interviewers want to hear:

You understand the two-phase structure (Prepare/Promise, Accept/Accepted).
You know the majority quorum overlap argument.
You can explain why Multi-Paxos (or Raft) is needed for practical systems.
You have an opinion: "I would use Raft because the implementation is better specified."

Do not try to derive Paxos from scratch in an interview. Explain the high-level flow, the majority overlap argument, and pivot to Raft for implementation details.

Trade-offs

Advantage	Disadvantage
Provably correct	Extremely hard to implement correctly
Flexible (no mandatory leader)	Underspecified (many implementation decisions)
Can tolerate leader changes gracefully	Dueling proposers can cause livelock
Foundation of all consensus algorithms	Most engineers never need raw Paxos
Allows log gaps (higher throughput)	Log gaps complicate state machine application

The implementation gap: Lamport's paper describes an algorithm. Building a system requires handling: leader election, membership changes, snapshotting, log compaction, client interaction, reconfiguration, and failure detection. None of these are in the original paper. Google's engineers spent years bridging this gap.

When to choose Paxos over Raft: Almost never, unless you are building a system like Spanner where you need leaderless operation across wide-area networks. In a WAN setting, a single leader creates a latency bottleneck because every write must go through it. Flexible Paxos allows you to have different quorum sizes for Phase 1 and Phase 2, enabling designs where local writes are fast.

Common Mistakes

Paxos and Raft are fundamentally different algorithms

No. Raft is a specific instantiation of Multi-Paxos with a strong leader. The safety guarantees are identical. The difference is in specification completeness and implementation clarity.

The proposer can propose any value it wants

Only if no prior value has been accepted. If the Prepare phase reveals a previously accepted value, the proposer must use that value. This constraint is what prevents conflicting values from being chosen.

Paxos needs all Acceptors to agree

No. Paxos needs a majority. This is the whole point -- it tolerates minority failures. In a cluster of 5 Acceptors, 3 is sufficient. Two can crash and the protocol still works.

Multi-Paxos is just running Paxos multiple times

Technically yes, but the stable-leader optimization changes the practical behavior entirely. Without it, every log entry costs two round-trips. With it, amortized cost is one round-trip. The optimization is what makes the protocol practical.

Paxos guarantees the value will eventually be chosen

Paxos guarantees safety (no two different values chosen) but not liveness (progress). The dueling proposers problem can prevent any value from ever being chosen. You need a leader election mechanism to ensure liveness, at which point you have basically reinvented Raft.

I should implement Paxos from the paper

Unless you are at Google scale and have a team of distributed systems PhDs, use an existing Raft library. The number of correct Paxos implementations in production is very small. The number of incorrect ones that caused data loss is much larger.