Chain Replication and Alternatives
TL;DR
Chain replication arranges replicas in a linear chain. Writes enter at the head, propagate through each node in sequence, and are acknowledged from the tail. Reads are served exclusively by the tail. This gives you strong consistency with high read throughput -- the tail serves all reads without coordination. The trade-off is higher write latency (writes traverse the entire chain) and sensitivity to any node failure in the chain. CRAQ (Chain Replication with Apportioned Queries) allows any node to serve reads if it has the latest version, dramatically improving read throughput. Microsoft Azure Storage uses chain replication. HDFS uses pipeline replication, which is chain replication's cousin.
The Problem
You want strong consistency (linearizability) from a replicated storage system. Leader-follower replication achieves this: the leader handles all reads and writes, followers replicate asynchronously.
But leader-follower has a problem: the leader is a bottleneck for both reads AND writes. If your workload is read-heavy (99% reads), the leader is overwhelmed by read traffic while followers sit idle. You could allow reads from followers, but then you sacrifice strong consistency (followers may be stale).
Chain replication separates the read and write paths. Writes go to the head. Reads go to the tail. The tail always has the latest committed data. This doubles your throughput: the head handles writes while the tail handles reads, in parallel.
The Algorithm
Structure
Replicas are arranged in a chain:
Write Path
1. Client sends write request to HEAD.
2. HEAD applies the write to its local state.
3. HEAD forwards the write to the next node in the chain.
4. Each node applies the write and forwards to the next.
5. When the TAIL applies the write, it sends an acknowledgment to the client.
The write is considered committed only when the tail has applied it. This is the key: the tail is the source of truth.
Read Path
Reads are always served by the tail, which has all committed writes. No coordination needed. No stale reads possible.
Worked Example
Chain: [A (head)] → [B] → [C (tail)]
Write "SET x = 1":
1. Client → A: SET x = 1
2. A applies SET x = 1 locally. A's state: {x: 1}
3. A → B: SET x = 1
4. B applies SET x = 1 locally. B's state: {x: 1}
5. B → C: SET x = 1
6. C applies SET x = 1 locally. C's state: {x: 1}
7. C → Client: ACK (write committed)
Read "GET x":
1. Client → C: GET x
2. C returns: x = 1
Write "SET x = 2" (in progress):
1. Client → A: SET x = 2
2. A applies, forwards to B.
3. B applies, forwards to C.
(Not yet at C)
Concurrent read "GET x":
1. Client → C: GET x
2. C returns: x = 1 (the write has not reached C yet)
This is correct! The write is not committed until C applies it.
Once C applies "SET x = 2" and ACKs the client, subsequent reads return 2.
Failure Handling
Chain replication requires an external configuration manager (typically a Paxos/Raft group) to detect failures and reconfigure the chain.
Head failure:
Original: [A (head)] → [B] → [C (tail)]
A fails.
Config manager detects failure.
New chain: [B (head)] → [C (tail)]
B becomes the new head. Any writes that A received but did not forward to B are lost. The client must retry. This is safe because those writes were never acknowledged (ACK comes from the tail, not the head).
Tail failure:
Original: [A (head)] → [B] → [C (tail)]
C fails.
Config manager detects failure.
New chain: [A (head)] → [B (tail)]
B becomes the new tail. B may have writes that C had not yet received. Those writes are now committed (B is the tail). Reads shift to B. No data loss because B has everything C had plus possibly more.
Wait -- is this correct? B may have writes that were never ACKed to the client (because C was supposed to ACK). The client may retry those writes. The system must handle idempotent retries (e.g., using write IDs to deduplicate).
Middle node failure:
Original: [A (head)] → [B] → [C (tail)]
B fails.
Config manager detects failure.
New chain: [A (head)] → [C (tail)]
The config manager must ensure A starts forwarding to C. Writes in transit at B may be lost. A must re-send any writes that B had not forwarded to C. This is determined by comparing the sequence numbers at A and C.
Consistency Guarantee
Chain replication provides linearizability when the chain is stable (no failures in progress):
- All writes are totally ordered by the chain traversal order.
- All reads see the most recent committed write.
- The tail is the serialization point for both reads and writes.
During reconfiguration, the config manager coordinates the chain change, and there is a brief period where the system may be unavailable (new tail must catch up). The config manager ensures no reads are served during this transition.
CRAQ: Chain Replication with Apportioned Queries
Standard chain replication restricts reads to the tail, wasting the read capacity of head and middle nodes. CRAQ allows any node in the chain to serve reads.
How CRAQ Works
Each node maintains two versions of each object:
- Clean: The value matches what the tail has committed.
- Dirty: The node has received a write that has not yet been committed (not yet ACKed by the tail).
Read at any node:
If the latest version is clean:
Return the value immediately. (It matches the tail.)
If the latest version is dirty:
Query the tail for the committed version of this key.
Return the tail's version.
Worked Example: CRAQ
Chain: [A (head)] → [B] → [C (tail)]
State: x = 1 (committed, clean on all nodes)
1. Client reads x from B.
B's latest version of x is clean (x = 1). Return immediately.
2. Client writes x = 2 to A.
A applies x = 2 (dirty). Forwards to B.
B applies x = 2 (dirty). Forwards to C.
C applies x = 2 (committed, clean). ACKs back through chain.
B marks x = 2 as clean. A marks x = 2 as clean.
3. During step 2, between B receiving the write and C committing:
Client reads x from B.
B's latest version is dirty (x = 2, not yet committed).
B queries C: "what is the committed version of x?"
C replies: x = 1 (C has not applied the write yet).
B returns x = 1 to client.
4. After C commits:
Client reads x from B.
B's latest version is clean (x = 2). Return immediately.
CRAQ provides the same linearizability guarantee as chain replication, but with much higher read throughput. In the common case (no in-flight writes to the queried key), reads are served locally without contacting the tail.
When CRAQ Falls Back to Tail
CRAQ only contacts the tail when a node has a dirty version. For read-heavy workloads with low write rates, this happens rarely. The read throughput scales linearly with the number of nodes in the chain.
For write-heavy workloads, many keys are frequently dirty, and CRAQ degrades to always querying the tail -- similar to plain chain replication.
Proof/Correctness Intuition
Why Chain Replication Is Linearizable
Linearizability requires that every operation appears to take effect at a single instant in time, and these instants are consistent with real time.
In chain replication, that instant is when the tail applies the write. Before that instant, reads return the old value. After that instant, reads return the new value. Since all reads go to the tail, there is a single serialization point.
CRAQ maintains this by having non-tail nodes check with the tail when they have dirty versions. The tail's committed state is always the linearization point.
Why Head Failures Do Not Lose Committed Data
A write is committed only when the tail applies it. If the head crashes, any writes it received but did not forward are uncommitted and unacknowledged. The client knows (via timeout and lack of ACK) that the write may not have succeeded and retries. No committed data is lost.
Real-World Usage
| System | Replication Style | Notes |
|---|---|---|
| Azure Storage | Chain replication | Within a storage stamp (single DC) |
| HDFS | Pipeline replication | DataNode pipeline for block writes |
| Ceph | Primary-copy (chain-like) | Writes to primary, forwarded to replicas in chain |
| FaRM (Microsoft) | Chain replication | RDMA-based, microsecond latencies |
| FAWN-KV | Chain replication | Energy-efficient key-value store |
HDFS Pipeline Replication
HDFS block writes use a protocol very similar to chain replication:
Client → DataNode1 → DataNode2 → DataNode3
1. Client sends block to DataNode1.
2. DataNode1 forwards to DataNode2 while writing locally.
3. DataNode2 forwards to DataNode3 while writing locally.
4. DataNode3 writes locally, ACKs to DataNode2.
5. DataNode2 ACKs to DataNode1.
6. DataNode1 ACKs to Client.
The key difference from textbook chain replication: HDFS pipelines the data (DataNode1 starts forwarding before it finishes receiving), reducing latency. And reads in HDFS go to any DataNode, not just the tail, because HDFS uses immutable blocks (once written, never modified).
Azure Storage
Microsoft Azure Storage uses chain replication within a storage stamp (a cluster of machines in a single datacenter). Each extent (a sequence of blocks) is replicated across 3 nodes in a chain. Writes go to the head, reads from the tail. The Stream Manager acts as the configuration manager, detecting failures and reconfiguring chains.
For geo-replication (across datacenters), Azure uses a different mechanism (asynchronous replication with eventual consistency), because chain replication's latency across datacenters would be unacceptable.
Interview Application
When to mention chain replication:
- "How would you replicate data with strong consistency AND high read throughput?" -- Chain replication: reads from tail, writes to head. Or CRAQ: reads from any node.
- "How does HDFS replicate blocks?" -- Pipeline replication (chain-like, with pipelining for lower latency).
- "Compare chain replication with leader-follower." -- Chain splits read/write paths, leader-follower funnels both through the leader.
Comparison with leader-follower replication:
| Aspect | Leader-Follower | Chain Replication |
|---|---|---|
| Write path | Leader → all followers | Head → node → ... → tail |
| Read path | Leader (for strong consistency) | Tail (or any node in CRAQ) |
| Write latency | 1 hop (leader to majority) | N-1 hops (length of chain) |
| Read throughput | Limited to leader | Tail (or all nodes in CRAQ) |
| Failure detection | Followers detect leader failure | Config manager detects any failure |
| Reconfig complexity | Leader election | Chain reconfiguration |
What interviewers want to hear:
- You understand the separation of read and write paths.
- You know that the tail is the commit point.
- You can explain why head failures do not lose committed data.
- You know CRAQ and why it improves read throughput.
- You recognize the write-latency cost (entire chain traversal).
Trade-offs

| Advantage | Disadvantage |
|---|---|
| Strong consistency without read bottleneck | Higher write latency (chain traversal) |
| Simple consistency argument (tail = truth) | Any node failure disrupts the chain |
| Tail handles all reads (no coordination) | Need external config manager (Paxos/Raft) |
| CRAQ scales reads linearly | CRAQ degrades under high write rates |
| Pipeline-able for lower latency | Chain length limits throughput |
The chain length problem: Longer chains provide more replicas (better durability) but higher write latency (more hops). In practice, chains of 3 are standard. Chains of 5 are rare. Beyond 5, the write latency becomes a problem.
Compared to quorum-based replication: Raft/Paxos commit a write when a majority has acknowledged. For 3 replicas, a Raft write waits for 2 ACKs (1 hop, fastest two responders). A chain write waits for all 3 in sequence (2 hops, including the slowest). Chain replication has higher write latency but separates the read path, which quorum-based systems do not.
Common Mistakes
Chain replication is strictly better than leader-follower
Chain replication has higher write latency because writes must traverse the entire chain. Leader-follower (with Raft-style quorum) can commit a write after hearing from a majority, ignoring slow replicas. For write-latency-sensitive workloads, leader-follower is better.
CRAQ eliminates the need for a tail
The tail is still the authority for committed state. CRAQ allows other nodes to serve reads only when their data is clean (matches the tail). When data is dirty, they must consult the tail. The tail is still essential.
Chain replication does not need consensus
Chain replication itself is not a consensus protocol. It requires an external configuration manager (which IS a consensus protocol like Raft or Paxos) to detect failures and reconfigure the chain. The chain handles data replication; the config manager handles membership.
If the head crashes, all in-flight writes are lost
In-flight writes that reached intermediate nodes but not the tail are indeed lost. But they were never committed (the tail never applied them), so they were never acknowledged to the client. The client knows the write did not succeed and can retry. No committed data is lost.
Chain replication works well across datacenters
The sequential nature of chain replication amplifies latency. In a chain spanning US-East, EU-West, and AP-Southeast, a write from the US head must traverse to EU (100ms) then to AP (150ms), totaling 250ms before the tail ACKs. Quorum-based replication (wait for 2 of 3) would take max(100ms, 250ms) for the second ACK = 100ms (if US and EU are the fastest two). Chain replication is designed for intra-datacenter use.
CRAQ is always better than plain chain replication
CRAQ adds version tracking and dirty-state management overhead. For write-heavy workloads where keys are frequently dirty, CRAQ nodes almost always fall back to querying the tail, adding network overhead without benefit. Plain chain replication is simpler and faster in these cases.