Skip to content

Raft — Leader Election, Log Replication, and Safety

TL;DR

Raft splits consensus into three sub-problems -- leader election, log replication, and safety -- and solves each one in a way that a human can actually understand. Every Raft cluster has at most one leader that can successfully commit entries at any time. The leader accepts all client writes, replicates them to followers, and tells followers when it is safe to apply entries. If the leader dies, a new election happens within a few hundred milliseconds. etcd, CockroachDB, TiKV, and Consul all run Raft under the hood. If you are designing any system that needs a coordination layer, you need to understand Raft.


The Problem

You have five machines. You want them to agree on the same ordered sequence of operations -- a replicated log -- so that every machine's state machine ends up identical. One machine can crash, the network can partition, and clocks can drift. You still need the remaining machines to make progress and never disagree about committed entries.

This is the consensus problem. Before Raft, the standard answer was Paxos. Paxos is correct but notoriously difficult to implement. Diego Ongaro and John Ousterhout designed Raft in 2014 explicitly for understandability. Their key insight: if you force a single leader to handle all decisions, the protocol becomes dramatically simpler.

The practical consequence is real. The etcd codebase that powers every Kubernetes cluster runs Raft. If you cannot explain how leader election and log replication work, you cannot reason about why your Kubernetes control plane went down at 3 AM.


The Algorithm

Server States

Every Raft node is in exactly one of three states:

  • Follower: Passive. Accepts log entries from the leader. Redirects client requests to the leader.
  • Candidate: Actively running an election to become leader.
  • Leader: Handles all client requests. Replicates log entries to followers. Sends periodic heartbeats.

All servers start as followers.

Terms

Raft divides time into terms, numbered with consecutive integers. Each term begins with an election. If a candidate wins, it serves as leader for the rest of the term. If the election results in a split vote, the term ends with no leader, and a new term starts immediately.

Terms act as a logical clock. Every RPC includes the sender's current term. If a server receives an RPC with a higher term than its own, it immediately updates its term and reverts to follower. If a server receives an RPC with a stale term, it rejects the RPC.

This is the mechanism that prevents split-brain. Two leaders cannot coexist in the same term.

Leader Election

Raft Leader Election Flow

A follower becomes a candidate when it has not received a heartbeat from the leader within its election timeout (randomized, typically 150-300ms).

The election process:

1. Increment current term
2. Vote for self
3. Reset election timer
4. Send RequestVote RPCs to all other servers
5. Wait for one of three outcomes:
   a. Receive votes from a majority → become leader
   b. Receive AppendEntries from a valid leader → revert to follower
   c. Election timeout elapses with no winner → start new election

RequestVote RPC contains:

Field Purpose
term Candidate's current term
candidateId ID of the candidate requesting the vote
lastLogIndex Index of candidate's last log entry
lastLogTerm Term of candidate's last log entry

A server grants its vote if and only if:

  1. The candidate's term is >= the server's current term
  2. The server has not already voted in this term (or voted for this same candidate)
  3. The candidate's log is at least as up-to-date as the server's log

That third rule is the election restriction -- the key to Raft's safety. "At least as up-to-date" means: compare last log entries. The one with the higher term wins. If terms are equal, the longer log wins.

Why randomized timeouts matter: Without randomization, all followers would time out simultaneously, all become candidates, split the vote, and repeat forever. Randomization ensures one server almost always times out first and wins the election before others even start.

Worked Example: Leader Election

Cluster: S1, S2, S3, S4, S5
Current leader: S1 (term 3)

1. S1 crashes.
2. S3's election timeout fires first (randomized to 180ms).
3. S3 increments term to 4, votes for itself.
4. S3 sends RequestVote(term=4, lastLogIndex=7, lastLogTerm=3) to S2, S4, S5.
5. S2, S4, S5 check: term 4 > their term 3, and S3's log is up-to-date. Grant votes.
6. S3 receives 4 votes (S3 + S2 + S4 + S5). Majority of 5 = 3. S3 wins.
7. S3 immediately sends empty AppendEntries (heartbeats) to all followers.
8. S2, S4, S5 reset their election timers. Election is over.

Total time: roughly 200-400ms from crash to new leader serving requests.

Log Replication

Raft Log Replication Process

Once a leader is elected, it handles all client requests. Each client request becomes a new entry in the leader's log. The leader then replicates that entry to followers via AppendEntries RPCs.

AppendEntries RPC contains:

Field Purpose
term Leader's current term
leaderId So followers can redirect clients
prevLogIndex Index of log entry immediately preceding new ones
prevLogTerm Term of prevLogIndex entry
entries[] Log entries to store (empty for heartbeats)
leaderCommit Leader's commit index

The replication process:

1. Client sends "SET x = 5" to leader.
2. Leader appends entry (term=4, index=8, command="SET x=5") to its local log.
3. Leader sends AppendEntries to all followers in parallel.
4. Each follower checks: does my log entry at prevLogIndex have term == prevLogTerm?
   - Yes: append the new entries. Reply success.
   - No: reply failure. Leader decrements nextIndex and retries.
5. Once a majority of servers have replicated the entry, the leader commits it.
6. Leader applies committed entry to its state machine, returns result to client.
7. Leader includes updated leaderCommit in next AppendEntries (or heartbeat).
8. Followers apply committed entries to their state machines.

The Commit Index

An entry is committed when the leader has replicated it to a majority of servers. The leader tracks a commitIndex -- the highest log entry known to be committed. Followers learn about committed entries via the leaderCommit field in AppendEntries.

Critical rule: a leader never commits entries from previous terms by counting replicas. It only commits entries from its own current term. Once a current-term entry is committed, all preceding entries are implicitly committed too. This prevents a subtle bug where a leader could commit an old entry, get replaced, and the new leader could overwrite that entry.

Worked Example: Log Replication

Leader S3 (term 4), commitIndex = 7

1. Client sends "SET y = 10".
2. S3 appends: log[8] = {term: 4, cmd: "SET y=10"}.
3. S3 sends AppendEntries(prevLogIndex=7, prevLogTerm=3, entries=[log[8]]) to S2, S4, S5.
4. S2 and S4 reply success. S5 is slow but eventually replies.
5. Majority achieved (S3 + S2 + S4 = 3 out of 5).
6. S3 sets commitIndex = 8. Applies "SET y=10" to state machine. Returns success to client.
7. Next heartbeat includes leaderCommit=8.
8. S2, S4 apply log[8]. S5 catches up later and applies too.

Log Matching Property

Raft maintains two invariants:

  1. If two entries in different logs have the same index and term, they store the same command.
  2. If two entries in different logs have the same index and term, all preceding entries are identical.

The first follows from the fact that a leader creates at most one entry per index in a given term. The second is enforced by the prevLogIndex/prevLogTerm consistency check in AppendEntries. This check creates an inductive proof: if the previous entry matches, and we know that entry's predecessor matched, then the entire prefix matches.


Proof/Correctness Intuition

Why the Election Restriction Works

The election restriction ensures that any elected leader already contains all committed entries. Here is the intuition:

For an entry to be committed, it must exist on a majority of servers. For a candidate to win an election, it must receive votes from a majority of servers. These two majorities must overlap in at least one server. That overlapping server will only vote for a candidate whose log is at least as up-to-date. Therefore, the winning candidate must have the committed entry.

This means a new leader never needs to "learn" about committed entries. It already has them. This is much simpler than Paxos, where a newly elected leader must run a recovery protocol to discover what has been committed.

Why Split-Brain Cannot Happen

In any given term, at most one leader can be elected because each server votes at most once per term, and winning requires a majority. Two different candidates cannot both get a majority from the same set of servers.

Across terms, the term number acts as a fencing token. Any server that sees a higher term immediately steps down. A stale leader that was partitioned away will discover it is stale the moment the partition heals and it receives a message with a higher term.


Real-World Usage

System How Raft is Used
etcd Replicated key-value store backing Kubernetes
CockroachDB Consensus per range (each ~512MB range has its own Raft group)
TiKV Storage layer of TiDB, Multi-Raft with one group per Region
Consul Service discovery and configuration, Raft for consistency
RethinkDB Replicated tables use Raft for leader election

CockroachDB and TiKV both use Multi-Raft: they run thousands of independent Raft groups, one per data range. This avoids the bottleneck of a single Raft group handling all writes. The trick is batching Raft messages across groups to reduce network overhead.

etcd's Raft implementation is widely reused. If you see a Go project that needs consensus, it probably imports go.etcd.io/raft.


Interview Application

When to mention Raft:

  • "How does your metadata service stay consistent?" -- Leader-based consensus via Raft.
  • "What happens when the leader of your coordination service dies?" -- Raft election, new leader in ~300ms, committed entries are safe.
  • "Why not just use a database for coordination?" -- Databases use Raft (or Paxos) internally. You are just pushing the problem down a layer.
  • "How does Kubernetes handle control plane failures?" -- etcd uses Raft. As long as a majority of etcd nodes are up, the control plane functions.

What interviewers want to hear:

  1. You understand that Raft requires a majority (2f+1 nodes to tolerate f failures).
  2. You know elections take hundreds of milliseconds, not seconds.
  3. You understand the difference between an entry being replicated and being committed.
  4. You can explain why the system is safe during a partition (minority side stops accepting writes, majority side elects a new leader).

Trade-offs

Advantage Disadvantage
Easy to understand and implement Single leader = write bottleneck
Strong consistency guarantees Requires majority alive (3/5, 2/3)
Fast leader election (~300ms) Not suitable for wide-area (latency)
Mature implementations available Reads from leader by default (can be relaxed)
Deterministic log ordering Membership changes are tricky (joint consensus)

On membership changes: Raft's original paper uses a two-phase joint consensus approach for adding or removing servers. In practice, most implementations (including etcd) use single-server changes -- add or remove one server at a time. This is simpler and avoids the complexity of joint configurations.

On read performance: By default, reads must go through the leader to guarantee linearizability. This makes the leader a read bottleneck. Two common optimizations:

  • Read Index: Leader confirms it is still leader by exchanging heartbeats, then serves the read without committing a log entry.
  • Lease-based reads: Leader assumes it remains leader for a bounded time after the last heartbeat acknowledgment. Faster but relies on bounded clock drift.

Common Mistakes

Raft guarantees availability during any failure

Wrong. Raft requires a majority of nodes to be reachable. In a 3-node cluster, losing 2 nodes means the cluster cannot make progress. This is the fundamental trade-off of consensus: safety over availability.

The client gets a response as soon as the leader writes to its log

No. The leader writes to its log, replicates to a majority, commits, applies to the state machine, and then responds. A premature response before commitment means the entry could be lost if the leader crashes.

Any follower can become leader

Only followers whose logs are at least as up-to-date as a majority can win an election. The election restriction exists precisely to prevent a stale follower from becoming leader and overwriting committed entries.

Raft and Paxos have fundamentally different guarantees

They provide the same safety guarantees. Raft is a specific, opinionated protocol. Paxos is a family of protocols. Raft is essentially Multi-Paxos with a strong leader and a specific log structure. The difference is understandability and implementability, not correctness.

Just add more nodes for better fault tolerance

More nodes means higher write latency (must replicate to a larger majority) and more network traffic. The sweet spot for most systems is 3 or 5 nodes. Going to 7 is rare. Going beyond 7 almost never makes sense for a consensus group.

Raft handles network partitions gracefully

Raft handles partitions safely, not gracefully. The minority partition becomes completely unavailable. Clients connected to the minority side get no responses. This is the correct behavior (better than split-brain), but it is not "graceful" -- it is a hard stop for affected clients.