Persistence, Replication, and Cluster Mode

TL;DR

Redis keeps everything in RAM, but "in-memory" doesn't mean "disposable." Between RDB snapshots, AOF logs, leader-follower replication, Sentinel failover, and Cluster mode with 16,384 hash slots, Redis has a full durability and availability stack — each piece with sharp trade-offs you need to articulate in an interview.

The Persistence Problem

Here's the tension. Redis is fast because it's in-memory. But RAM is volatile. Power goes out, process crashes, kernel panic — your data is gone. Every Redis deployment has to answer one question: how much data loss can you tolerate?

If the answer is "none" — you probably shouldn't be using Redis as your primary store. Use PostgreSQL. But if you can tolerate a few seconds of loss (and most cache and session workloads can), Redis gives you two persistence mechanisms that cover different points on the durability-performance spectrum.

RDB Snapshots — The Point-in-Time Photograph

RDB (Redis Database) persistence takes a snapshot of your entire dataset and writes it to a binary file on disk. Think of it as a photograph of your data at a specific moment.

# Trigger a snapshot manually
BGSAVE
# Redis forks the process.
# The child writes the snapshot to dump.rdb.
# The parent keeps serving requests.

# Automatic snapshots (in redis.conf)
save 900 1      # snapshot if at least 1 key changed in 900 seconds
save 300 10     # snapshot if at least 10 keys changed in 300 seconds
save 60 10000   # snapshot if at least 10000 keys changed in 60 seconds
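
The save rules are OR-ed together: a snapshot starts as soon as any one rule has both its time window and its change count satisfied. A minimal Python sketch of that check follows; `should_snapshot` and `SAVE_RULES` are hypothetical names for illustration, not part of Redis.

```python
# Each rule is (seconds, min_changes), mirroring "save <seconds> <changes>".
SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]

def should_snapshot(seconds_since_last_save: int,
                    changes_since_last_save: int,
                    rules=SAVE_RULES) -> bool:
    """Return True if any save rule is satisfied (rules are OR-ed together)."""
    return any(
        seconds_since_last_save >= seconds
        and changes_since_last_save >= min_changes
        for seconds, min_changes in rules
    )

print(should_snapshot(61, 12000))  # True: the 60-second rule fires
print(should_snapshot(61, 50))     # False: too few changes for the 60-second rule
print(should_snapshot(301, 50))    # True: the 300-second rule fires
```

Notice the second case: a quiet workload waits for a slower rule, so the effective loss window depends on write volume, not just on the shortest interval.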

How the Fork Works

This is the part interviewers love to ask about. When Redis runs BGSAVE, it calls the fork() system call. The operating system creates a child process that is an exact copy of the parent — same memory, same data. But here's the trick: the OS doesn't actually copy the memory.

Before fork():
┌──────────────────────────────────┐
│ Redis process (parent)           │
│ ┌────────┬────────┬────────┐    │
│ │ Page 1 │ Page 2 │ Page 3 │    │
│ └────────┴────────┴────────┘    │
└──────────────────────────────────┘

After fork() — Copy-on-Write (COW):
┌──────────────────┐     ┌──────────────────┐
│ Parent process    │     │ Child process    │
│                   │     │ (writes RDB)     │
│ ┌───┐ ┌───┐ ┌───┐│     │ ┌───┐ ┌───┐ ┌───┐│
│ │ P1│ │ P2│ │ P3││     │ │ P1│ │ P2│ │ P3││
│ └─┬─┘ └─┬─┘ └─┬─┘│     │ └─┬─┘ └─┬─┘ └─┬─┘│
└───┼─────┼─────┼──┘     └───┼─────┼─────┼──┘
    │     │     │             │     │     │
    ▼     ▼     ▼             ▼     ▼     ▼
  ┌────┐┌────┐┌────┐  (same physical memory pages)
  │ P1 ││ P2 ││ P3 │
  └────┘└────┘└────┘

Parent writes to Page 2:
  The write triggers a page fault; the OS gives the parent a private copy of Page 2 and applies the write there.
  Child still sees the original Page 2.

This is copy-on-write (COW). The child reads the original data to write the RDB file. The parent continues serving writes. The OS only copies memory pages that the parent modifies during the snapshot. In practice, if your write rate is moderate, the memory overhead during BGSAVE is 10-30% — not a full doubling.

But here's the sharp edge. If your Redis instance uses 30 GB of RAM and you have a write-heavy workload during the snapshot, the COW overhead can spike. You might temporarily need 40-50 GB of physical RAM to complete the snapshot without the OOM killer stepping in.

Rule of thumb: if your Redis instance uses N GB of RAM, provision at least 1.5-2x that on the host to handle COW overhead during BGSAVE. Misconfigured persistence settings are a common cause of Redis data loss. If the OOM killer kills the child mid-snapshot, the save simply fails: Redis writes to a temporary file and renames it only on success, so the previous RDB file survives but keeps getting staler. If the OOM killer picks the parent instead, everything written since the last successful snapshot is gone.
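
To make the headroom rule concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model in which the extra memory equals the share of pages the parent dirties while the child is still writing; `estimate_peak_gb` and `dirty_fraction` are illustrative names, not Redis metrics.

```python
def estimate_peak_gb(dataset_gb: float, dirty_fraction: float) -> float:
    """Peak resident memory during BGSAVE under a simple COW model.

    dirty_fraction is the share of memory pages the parent modifies while
    the child is still writing the snapshot (0.0 = no writes, 1.0 = every page).
    """
    if not 0.0 <= dirty_fraction <= 1.0:
        raise ValueError("dirty_fraction must be between 0 and 1")
    return dataset_gb * (1.0 + dirty_fraction)

print(estimate_peak_gb(30, 0.25))  # 37.5 -> moderate write rate
print(estimate_peak_gb(30, 0.5))   # 45.0 -> write-heavy snapshot window
print(estimate_peak_gb(30, 1.0))   # 60.0 -> worst case, every page copied
```

Real overhead also depends on page size and on how writes cluster in memory, so treat the result as a planning number rather than a measurement.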

RDB Strengths and Weaknesses

Strengths:

Compact binary format — fast to load on restart
Perfect for backups (copy the .rdb file off-server)
Low runtime overhead when not snapshotting
Great for disaster recovery — ship .rdb files to S3 hourly

Weaknesses:

Data loss window: you lose everything since the last snapshot. With the save rules shown earlier, that's about 60 seconds under heavy writes (10,000+ changes per minute) and up to 15 minutes when writes are sparse.
Fork can be expensive on large datasets (the COW overhead)
Not suitable if you need "almost zero" data loss

AOF — The Write-Ahead Log

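AOF (Append Only File) takes the opposite approach. Instead of snapshotting everything, it logs every write command as it happens. (Strictly speaking, Redis appends a command after executing it, so the AOF plays the role of a write-ahead log without being one in the database sense.) Recovery means replaying the log from the beginning.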

# Enable AOF in redis.conf
appendonly yes
appendfilename "appendonly.aof"

# fsync policy — this is the critical knob
appendfsync always      # fsync after every write — safest, slowest
appendfsync everysec    # fsync once per second — good compromise
appendfsync no          # let the OS decide when to flush — fastest, riskiest

What the AOF File Looks Like

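The AOF file is human-readable. It's literally Redis commands in the RESP wire format, where *N is the number of arguments and $N is the byte length of the argument on the next line: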

*3
$3
SET
$9
user:1001
$5
Alice
*3
$3
SET
$9
user:1002
$3
Bob
*2
$4
INCR
$10
page:views

Every write that hits Redis gets appended to this file. On restart, Redis replays the commands to reconstruct the dataset.
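
If you want to see how a command turns into those AOF lines, the encoding is easy to reproduce. The sketch below is an illustrative encoder, not Redis source code; `encode_command` is a made-up helper name, and the key and value are arbitrary.

```python
def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode("utf-8")
        # The $ prefix carries the byte length, not the character count.
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)

print(encode_command("SET", "session:42", "active"))
# b'*3\r\n$3\r\nSET\r\n$10\r\nsession:42\r\n$6\r\nactive\r\n'
```

Each element ends with CRLF, which is why the file reads as one token per line in a text editor.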

The fsync Decision

This is where the trade-off lives. fsync is the system call that forces buffered data from the OS page cache to the physical disk. Without it, a crash could lose data that was "written" but still sitting in the OS buffer.

Policy	Data Loss Window	Performance Impact	When to Use
`always`	Zero (every command fsynced)	10-20x slower than no fsync	Financial data, billing — but honestly, use PostgreSQL instead
`everysec`	Up to 1 second	Minimal (~2-5% throughput hit)	Default choice. Use this.
`no`	Up to 30+ seconds (OS-dependent)	None	Ephemeral data you can afford to lose

I'd pick everysec for almost every production deployment. It's the sweet spot — you lose at most one second of data in a crash, and the performance hit is barely measurable. always sounds safe, but at that point you're paying in-memory prices for on-disk durability guarantees. Just use a real database.
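
The three policies differ only in when fsync gets called. Here is a toy append-only log that makes the difference visible. It is a sketch under simplifying assumptions: `AppendLog` is a hypothetical class, and it syncs inline on the writing thread, whereas a real server would typically push the once-per-second sync off the main thread.

```python
import os
import tempfile
import time

class AppendLog:
    """Toy append-only log illustrating the three fsync policies."""

    def __init__(self, path: str, policy: str = "everysec"):
        if policy not in ("always", "everysec", "no"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.file = open(path, "ab")
        self.last_sync = time.monotonic()
        self.fsync_calls = 0

    def append(self, record: bytes) -> None:
        self.file.write(record)
        self.file.flush()  # hand the bytes to the OS page cache
        now = time.monotonic()
        due = self.policy == "everysec" and now - self.last_sync >= 1.0
        if self.policy == "always" or due:
            os.fsync(self.file.fileno())  # force the page cache to disk
            self.last_sync = now
            self.fsync_calls += 1
        # policy "no": never fsync here; the OS flushes on its own schedule

    def close(self) -> None:
        self.file.close()

with tempfile.TemporaryDirectory() as tmp:
    log = AppendLog(os.path.join(tmp, "demo.aof"), policy="always")
    for i in range(3):
        log.append(f"SET key{i} value\n".encode())
    log.close()
    print(log.fsync_calls)  # 3: one fsync per write under "always"
```

Swap the policy to "no" and the counter stays at zero: the bytes are still written, but nothing forces them out of the page cache before a crash.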

AOF Rewrite — Compaction

The AOF file grows forever. If you SET the same key 10,000 times, the file has 10,000 entries for that key. Only the last one matters. AOF rewrite compacts the log by reading the current dataset and writing the minimal set of commands to recreate it.

# Trigger manually
BGREWRITEAOF

# Automatic rewrite (in redis.conf)
auto-aof-rewrite-percentage 100     # rewrite when AOF is 2x the size after last rewrite
auto-aof-rewrite-min-size 64mb      # don't bother rewriting if AOF is under 64 MB

The rewrite also uses fork() — same COW mechanics as BGSAVE. The child writes the new AOF file while the parent keeps appending new commands to a buffer. When the child finishes, the parent appends the buffer to the new file and atomically swaps it in.
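
Compaction is easier to reason about with a toy model. The sketch below replays a log into a dictionary and emits one SET per surviving key, then mirrors the two auto-rewrite thresholds. Both function names are hypothetical, only SET and INCR are handled, and the growth formula is an approximation of the documented behavior rather than a copy of Redis internals.

```python
def rewrite_log(commands):
    """Replay a command log and return the minimal log that rebuilds the same state."""
    state = {}
    for cmd, key, *args in commands:
        if cmd == "SET":
            state[key] = args[0]
        elif cmd == "INCR":
            state[key] = str(int(state.get(key, "0")) + 1)
        else:
            raise ValueError(f"unsupported command: {cmd}")
    # One SET per surviving key is enough to recreate the dataset.
    return [("SET", key, value) for key, value in state.items()]

def should_auto_rewrite(current_size: int, base_size: int,
                        percentage: int = 100,
                        min_size: int = 64 * 1024 * 1024) -> bool:
    """Approximate the auto-aof-rewrite-percentage / -min-size check."""
    if current_size < min_size:
        return False
    growth = (current_size - base_size) * 100 // max(base_size, 1)
    return growth >= percentage

log = [("SET", "a", "1"), ("SET", "a", "2"), ("INCR", "n"), ("INCR", "n")]
print(rewrite_log(log))  # [('SET', 'a', '2'), ('SET', 'n', '2')]

MB = 1024 * 1024
print(should_auto_rewrite(128 * MB, 64 * MB))  # True: file doubled since last rewrite
print(should_auto_rewrite(100 * MB, 64 * MB))  # False: only ~56% growth
```

Four log entries collapse to two, which is the whole point of the rewrite: the file size tracks the dataset, not the write history.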

RDB vs AOF — The Comparison

Dimension	RDB	AOF (everysec)
Data loss window	Minutes (depends on save interval)	~1 second
Recovery speed	Fast (load binary file)	Slow (replay every command)
File size	Compact (binary, compressed)	Larger (text commands, even after rewrite)
Runtime overhead	Fork spike during BGSAVE	Continuous small I/O
Disk I/O pattern	Burst (during snapshot)	Steady (append on every write)
Human-readable	No	Yes (useful for debugging)
Best for	Backups, disaster recovery	Minimizing data loss

Hybrid Persistence — Use Both

Since Redis 4.0, an AOF file can embed an RDB snapshot, and aof-use-rdb-preamble has been on by default since Redis 5.0 (AOF itself still has to be enabled with appendonly yes). The AOF rewrite process writes an RDB-format preamble followed by the AOF commands that arrived during the rewrite.

┌──────────────────────────────────────────────────┐
│                  AOF File (hybrid)                │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │ RDB preamble (binary snapshot)             │  │
│  │ Compact representation of dataset at       │  │
│  │ rewrite time                               │  │
│  └────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────┐  │
│  │ AOF tail (text commands)                   │  │
│  │ Commands that arrived during and after     │  │
│  │ the rewrite                                │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

# Enable hybrid AOF (on by default since Redis 5.0)
aof-use-rdb-preamble yes

This gives you the fast recovery of RDB (load the binary preamble) plus the low data-loss window of AOF (replay the tail). It's the best of both worlds, and I'd recommend it for every production deployment unless you have a specific reason not to.
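
One way to check whether an AOF payload carries the binary preamble is to look at its first bytes: RDB data begins with the ASCII signature REDIS, while a plain AOF begins with a RESP array marker. A minimal sketch, assuming you already have the first few bytes in hand (`starts_with_rdb_preamble` is a hypothetical helper, and newer Redis versions may split the base and the tail into separate files):

```python
RDB_MAGIC = b"REDIS"  # RDB payloads begin with this 5-byte signature

def starts_with_rdb_preamble(aof_bytes: bytes) -> bool:
    """Return True if the payload starts with an RDB snapshot."""
    return aof_bytes.startswith(RDB_MAGIC)

print(starts_with_rdb_preamble(b"REDIS0009\xfa..."))      # True: hybrid file
print(starts_with_rdb_preamble(b"*3\r\n$3\r\nSET\r\n"))   # False: plain command log
```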

Replication — Leader-Follower

Redis replication is straightforward: one leader (primary), one or more followers (replicas). The leader accepts all writes. Followers receive a stream of commands from the leader and replay them to stay in sync.

                    Writes
                      │
                      ▼
               ┌─────────────┐
               │   Leader     │
               │  (primary)   │
               └──────┬──────┘
                      │
              ┌───────┼───────┐
              │       │       │
              ▼       ▼       ▼
         ┌────────┐┌────────┐┌────────┐
         │Follower││Follower││Follower│
         │   1    ││   2    ││   3    │
         └────────┘└────────┘└────────┘
              │       │       │
              ▼       ▼       ▼
            Reads   Reads   Reads

# On the follower, point it to the leader
REPLICAOF 192.168.1.100 6379

# Check replication status on the leader
INFO replication
# role:master
# connected_slaves:3
# slave0:ip=192.168.1.101,port=6379,state=online,offset=1234567,lag=0

Async by Default — And That's a Feature

Redis replication is asynchronous. The leader sends a write to followers but doesn't wait for their acknowledgment before responding to the client. This means:

A write acknowledged by the leader might not have reached any follower yet.
If the leader crashes immediately after acknowledging a write, that write is lost.
Followers can lag behind the leader by milliseconds, seconds, or more under heavy load.

Why async? Performance. If the leader had to wait for all three followers to acknowledge every write, latency would spike from microseconds to milliseconds. For a cache, that's unacceptable.

But when you need stronger guarantees, Redis offers WAIT:

# Write a value
SET critical:data "important"

# Block until at least 2 replicas have acknowledged, up to 5000ms
WAIT 2 5000
# Returns the number of replicas that acknowledged
# If timeout expires, returns however many acknowledged so far

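WAIT gives you something close to semi-synchronous replication, as long as you check its return value: on timeout it returns the number of replicas that did acknowledge, which may be fewer than you asked for. It doesn't guarantee the write is persisted on the followers (they might have it in memory but not yet fsynced to AOF). But when enough replicas acknowledge, the data exists on multiple machines, which protects against single-node failure.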
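
In application code, the pattern is "write, wait, then check the count." The sketch below assumes a redis-py style client exposing `set(key, value)` and `wait(num_replicas, timeout)`; `replicated_set` is a hypothetical helper, and `FakeClient` is a stand-in so the example runs without a server.

```python
def replicated_set(client, key: str, value: str,
                   replicas: int = 2, timeout_ms: int = 5000) -> bool:
    """SET a key, then WAIT; return True only if enough replicas acknowledged."""
    client.set(key, value)
    acked = client.wait(replicas, timeout_ms)
    # WAIT does not fail on timeout; it just reports a smaller number.
    return acked >= replicas

class FakeClient:
    """Stand-in for a real client; always reports a fixed number of acks."""

    def __init__(self, acks: int):
        self.acks = acks
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return True

    def wait(self, num_replicas, timeout_ms):
        return self.acks

print(replicated_set(FakeClient(acks=2), "critical:data", "important"))  # True
print(replicated_set(FakeClient(acks=1), "critical:data", "important"))  # False
```

A False result doesn't mean the write was rolled back. The leader still has it; you just can't count on a replica having it yet, so the caller has to decide whether to retry, alert, or accept the risk.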

Full Resync vs Partial Resync

When a follower connects (or reconnects after a brief disconnection), Redis tries a partial resync first. The leader keeps a backlog buffer (default 1 MB) of recent commands. If the follower's offset is still within the backlog, it catches up by replaying just the missing commands.

If the follower is too far behind (offset has fallen off the backlog), Redis does a full resync: the leader runs BGSAVE, sends the RDB file to the follower, and then streams subsequent commands. Full resyncs are expensive — they spike memory and network usage on the leader.

# Increase the backlog to reduce full resyncs
repl-backlog-size 256mb
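
Sizing the backlog is simple arithmetic: write throughput times the longest disconnection you want to survive, with some safety margin. The sketch below uses hypothetical helper names and a simplified partial-resync check that looks only at offsets (a real leader also verifies that the follower was replicating the same history).

```python
def backlog_bytes(write_bytes_per_sec: int, max_disconnect_sec: int,
                  safety_factor: float = 2.0) -> int:
    """Backlog size needed to cover a disconnection of the given length."""
    return int(write_bytes_per_sec * max_disconnect_sec * safety_factor)

def can_partial_resync(leader_offset: int, replica_offset: int,
                       backlog_size: int) -> bool:
    """True if the commands the replica missed are still in the backlog."""
    missing = leader_offset - replica_offset
    return 0 <= missing <= backlog_size

MB = 1024 * 1024
print(backlog_bytes(2 * MB, 60) // MB)                 # 240 (MB) for 2 MB/s and a 60 s outage
print(can_partial_resync(5_000_000, 4_500_000, 1 * MB))  # True: 500,000 bytes behind
print(can_partial_resync(5_000_000, 1_000_000, 1 * MB))  # False: fell off a 1 MB backlog
```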

Sentinel — Automatic Failover

Running a leader with followers is great for read scaling. But what happens when the leader dies? Without automation, someone has to manually promote a follower. At 3 AM. On a weekend.

Redis Sentinel solves this. Sentinel is a separate process (you run 3+ of them for quorum) that monitors your Redis instances and performs automatic failover.

┌───────────┐  ┌───────────┐  ┌───────────┐
│ Sentinel 1│  │ Sentinel 2│  │ Sentinel 3│
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │
      │    Monitoring + Voting      │
      │              │              │
      ▼              ▼              ▼
┌─────────────┐ ┌────────┐ ┌────────┐
│   Leader    │ │Follower│ │Follower│
│  (primary)  │ │   1    │ │   2    │
└─────────────┘ └────────┘ └────────┘

Failover Flow

Subjective down (SDOWN): One Sentinel can't reach the leader. Maybe the network is flaky. Not enough evidence to act.
Objective down (ODOWN): A quorum of Sentinels (e.g., 2 out of 3) agree the leader is unreachable. Now it's real.
Leader election among Sentinels: One Sentinel is elected to perform the failover (using a Raft-like voting protocol).
Follower promotion: The elected Sentinel picks the best follower (lowest replica-priority value first, then the most up-to-date replication offset) and promotes it to leader.
Reconfiguration: The other followers are told to replicate from the new leader. Sentinel updates its configuration.
Client notification: Sentinel publishes the new leader address. Clients using Sentinel-aware libraries automatically reconnect.
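
Two decisions in that flow are easy to model: when SDOWN becomes ODOWN, and which follower gets promoted. The sketch below is a simplified illustration with hypothetical names. It ranks candidates by replica priority, then replication offset, then run ID as a tie-breaker, and it skips the health filtering a real Sentinel performs before ranking.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    run_id: str
    priority: int      # replica-priority; 0 means "never promote"
    repl_offset: int   # how much of the leader's stream it has processed

def is_objectively_down(sentinels_reporting_down: int, quorum: int) -> bool:
    """ODOWN is reached once at least `quorum` Sentinels report the leader down."""
    return sentinels_reporting_down >= quorum

def pick_replica(replicas):
    """Choose the promotion candidate, or None if nobody is eligible."""
    candidates = [r for r in replicas if r.priority > 0]
    if not candidates:
        return None
    # Lower priority value wins, then higher offset, then smaller run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))

replicas = [
    Replica("aaa", priority=100, repl_offset=1000),
    Replica("bbb", priority=100, repl_offset=1200),
    Replica("ccc", priority=0, repl_offset=9999),
]
print(is_objectively_down(1, quorum=2))  # False: still only SDOWN
print(is_objectively_down(2, quorum=2))  # True: ODOWN
print(pick_replica(replicas).run_id)     # bbb
```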

# sentinel.conf (each Sentinel instance)
sentinel monitor mymaster 192.168.1.100 6379 2
# "mymaster" = name of the deployment
# 2 = quorum (need 2 out of 3 Sentinels to agree on ODOWN)

sentinel down-after-milliseconds mymaster 5000
# Mark as down if no response for 5 seconds

sentinel failover-timeout mymaster 60000
# If failover doesn't complete in 60 seconds, abort and retry

The Split-Brain Problem

What if a network partition isolates the leader from the Sentinels and followers, but clients can still reach the old leader?

                Network Partition
                      ║
  Partition A         ║        Partition B
  ┌──────────┐        ║   ┌───────────┐  ┌────────┐
  │ Old      │        ║   │ Sentinel  │  │Follower│
  │ Leader   │        ║   │ 1, 2, 3   │  │ 1, 2   │
  │          │        ║   └───────────┘  └────────┘
  │ Still    │        ║         │
  │ accepting│        ║         ▼
  │ writes!  │        ║   Follower 1 promoted
  └──────────┘        ║   to NEW Leader
                      ║

The old leader keeps accepting writes. Sentinels promote a new leader. When the partition heals, the old leader discovers it's been demoted — it becomes a follower and discards all writes it accepted during the partition.

This is data loss. Redis documents it. It's a fundamental trade-off of async replication.

Mitigation:

# On the leader: stop accepting writes if too few replicas are reachable
min-replicas-to-write 1
min-replicas-max-lag 10
# Reject writes if fewer than 1 replica has acknowledged data within the last 10 seconds

This doesn't eliminate the split-brain window — it narrows it to min-replicas-max-lag seconds.
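
The leader's decision boils down to counting replicas whose last acknowledgment is recent enough. A minimal sketch of that rule, with `accepts_writes` as a hypothetical name and lag expressed in seconds since each replica's last acknowledgment:

```python
def accepts_writes(replica_lags_sec, min_replicas: int = 1,
                   max_lag_sec: int = 10) -> bool:
    """Return True if enough replicas have acknowledged recently."""
    if min_replicas == 0:  # feature disabled
        return True
    healthy = sum(1 for lag in replica_lags_sec if lag <= max_lag_sec)
    return healthy >= min_replicas

print(accepts_writes([0, 2]))    # True: both replicas are fresh
print(accepts_writes([15, 30]))  # False: every replica is past the 10 s limit
print(accepts_writes([]))        # False: an isolated leader stops taking writes
```

The third case is the split-brain scenario: once the isolated leader has gone max_lag_sec without hearing from a replica, it starts rejecting writes instead of accumulating ones that will be discarded.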

Redis Cluster — Horizontal Scaling

Sentinel gives you high availability for a single dataset. But what if your data doesn't fit in one machine's RAM? Or your write throughput exceeds what one leader can handle?

Redis Cluster shards data across multiple leader nodes, each responsible for a subset of the keyspace.

Hash Slots — Not Consistent Hashing

Here's a common interview mistake: candidates say Redis Cluster uses consistent hashing. It doesn't. Redis Cluster uses a fixed set of 16,384 hash slots.

Key → CRC16(key) mod 16384 → hash slot number → assigned node

Example:
  CRC16("user:1001") mod 16384 = 5649
  Hash slot 5649 is assigned to Node B
  → "user:1001" lives on Node B

┌─────────────────────────────────────────────────────┐
│              16,384 Hash Slots                      │
│                                                     │
│  Node A           Node B           Node C           │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│  │Slots     │    │Slots     │    │Slots     │      │
│  │0 - 5460  │    │5461-10922│    │10923-16383│     │
│  └──────────┘    └──────────┘    └──────────┘      │
│       │               │               │             │
│       ▼               ▼               ▼             │
│  ┌────────┐      ┌────────┐      ┌────────┐        │
│  │Replica │      │Replica │      │Replica │        │
│  │  A1    │      │  B1    │      │  C1    │        │
│  └────────┘      └────────┘      └────────┘        │
└─────────────────────────────────────────────────────┘

Why Hash Slots Instead of Consistent Hashing?

Consistent hashing (like what DynamoDB uses) maps keys to a ring and assigns ranges to nodes. When a node is added or removed, only the neighboring ranges are affected. Sounds better, right?

Redis chose hash slots for three reasons:

Deterministic slot assignment. Any client can compute CRC16(key) mod 16384 and know exactly which slot a key belongs to; the slot-to-node map is a small table the client fetches and caches. No ring metadata to synchronize.
Manual control. Operators can move specific slots between nodes. This is critical for rebalancing — you can move 1,000 slots at a time instead of rehashing a ring segment.
Multi-key operations. Redis needs to know, before it runs a command, whether all the keys involved live on the same node. With hash slots, you use hash tags: {user:1001}:profile and {user:1001}:settings both hash on user:1001, guaranteeing co-location.

# Hash tags force keys to the same slot
SET {order:5543}:items "laptop,mouse"
SET {order:5543}:status "shipped"
# Both live on the same node — safe for MGET or Lua scripts

# Without hash tags, these could land on different nodes
SET order:5543:items "laptop,mouse"   # slot = CRC16("order:5543:items") mod 16384
SET order:5543:status "shipped"       # slot = CRC16("order:5543:status") mod 16384
# Likely different slots! A multi-key MGET across them fails with a CROSSSLOT error.
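
You can compute slots yourself to check co-location before you commit to a key scheme. The sketch below relies on Python's `binascii.crc_hqx` with an initial value of 0, which is the CRC16 variant (XMODEM) that Redis Cluster specifies; `key_slot` is a hypothetical helper name.

```python
import binascii

def key_slot(key: str) -> int:
    """Return the cluster hash slot for a key, honoring hash tags."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {...} section counts as a hash tag.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384

print(key_slot("123456789"))  # 12739 (CRC16 test vector 0x31C3)
print(key_slot("foo"))        # 12182
print(key_slot("{order:5543}:items") == key_slot("{order:5543}:status"))  # True
```

With the tag in place, only order:5543 is hashed, so both keys land in the same slot regardless of their suffixes.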

Resharding — Moving Slots Between Nodes

Adding a node to a Redis Cluster doesn't automatically rebalance data. You have to explicitly migrate slots.

# Using redis-cli
redis-cli --cluster reshard 192.168.1.100:6379 \
  --cluster-from <source-node-id> \
  --cluster-to <new-node-id> \
  --cluster-slots 5461

During migration, Redis uses a MOVED / ASK redirect protocol:

MOVED: "That slot permanently lives on a different node now. Update your routing table." The client caches this.
ASK: "That slot is currently being migrated. Try the target node for this one request only." The client doesn't cache this.
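
The difference between the two redirects shows up in how a client treats its slot cache. The sketch below is an illustrative parser with hypothetical names; it assumes the error text has the shape "MOVED <slot> <host:port>" or "ASK <slot> <host:port>".

```python
def handle_redirect(error: str, slot_cache: dict):
    """Parse a MOVED/ASK error and return (address, send_asking_first)."""
    kind, slot, address = error.split()
    if kind == "MOVED":
        slot_cache[int(slot)] = address  # permanent: remember the new owner
        return address, False
    if kind == "ASK":
        # One-off: retry on the target after sending ASKING; cache untouched.
        return address, True
    raise ValueError(f"not a redirect: {error}")

cache = {3999: "127.0.0.1:6380"}
print(handle_redirect("ASK 3999 127.0.0.1:6381", cache), cache[3999])
# ('127.0.0.1:6381', True) 127.0.0.1:6380
print(handle_redirect("MOVED 3999 127.0.0.1:6381", cache), cache[3999])
# ('127.0.0.1:6381', False) 127.0.0.1:6381
```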

Cluster Gossip Protocol

Cluster nodes communicate via a gossip protocol on a dedicated bus port (data port + 10000). Every node periodically pings random other nodes, sharing its view of the cluster topology. This is how nodes detect failures and propagate slot assignments.

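Failure detection works similarly to Sentinel. A node that can't reach a peer flags it PFAIL (possible failure, the counterpart of SDOWN). Once a majority of leader nodes report the same peer as PFAIL, it is marked FAIL (the counterpart of ODOWN) and one of its replicas is promoted.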

Sentinel vs Cluster — When to Use Each

Dimension	Sentinel	Cluster
Data sharding	No — single dataset	Yes — data split across nodes
Max dataset size	Limited by single node's RAM	Sum of all nodes' RAM
Write scaling	Single leader bottleneck	Writes distributed across leaders
Complexity	Low — 3 Sentinels + leader/follower	High — minimum 6 nodes (3 leaders + 3 replicas)
Multi-key operations	Unrestricted	Restricted to same hash slot (hash tags)
Client library support	Mature, simple	Requires cluster-aware client
When to use	Dataset fits in one node, need HA	Dataset too large for one node, or need write scaling

I'd start with Sentinel for any new project. Cluster adds real operational complexity — slot migration, hash tag constraints, cross-slot limitations. Don't pay that tax until your dataset genuinely doesn't fit in a single 256 GB node. Most teams I've seen adopt Cluster prematurely because it sounds more "scalable."

Memory and Performance Implications

Understanding persistence overhead matters for capacity planning:

Scenario	Memory Overhead	Latency Impact
RDB BGSAVE (normal writes)	+10-30% (COW pages)	None during save, brief fork pause
RDB BGSAVE (heavy writes)	Up to +100%	Fork pause can reach 1+ second on large datasets
AOF fsync=everysec	None	~2-5% throughput reduction
AOF fsync=always	None	10-20x latency increase per write
AOF rewrite	+10-30% (fork, same as RDB)	Similar to BGSAVE
Replication (async)	~1 MB backlog buffer default	None (fire-and-forget)
Replication (WAIT)	Same	Adds round-trip to replica

Interview Gotchas

"What happens to in-flight writes during an RDB snapshot?" Nothing. The parent process keeps serving writes normally. Copy-on-write means the child sees a frozen point-in-time view. Modified pages get duplicated by the OS. This is why you need headroom — the fork temporarily increases memory usage.

"Can you run Redis with no persistence at all?" Yes. Set save "" and appendonly no. This is valid for pure caching where losing everything on restart is acceptable. Memcached operates this way by design. Just make sure your application handles cache-miss gracefully.

"What's the minimum Sentinel count?" Three. You need a majority to reach quorum. With two Sentinels, a single Sentinel failure means no quorum and no failover — worse than having no Sentinel at all.

"Can a Redis Cluster node be both a leader and a replica?" No. Each node is either a leader (serving a set of hash slots) or a replica (replicating one leader). A 6-node cluster typically has 3 leaders and 3 replicas.

"Why 16,384 hash slots specifically?" It's a practical limit. The cluster gossip protocol includes a bitmap of all slots (16,384 bits = 2 KB). This bitmap is exchanged in every heartbeat message between nodes. 65,536 slots would mean 8 KB per heartbeat — wasteful for clusters with dozens of nodes pinging each other constantly.

Key Takeaways

Concept	What to Remember
RDB	Point-in-time snapshot via fork + COW. Fast recovery, but minutes of data loss possible.
AOF	Append-only command log. `everysec` fsync is the sweet spot.
Hybrid (RDB + AOF)	On by default since Redis 5.0 when AOF is enabled. RDB preamble for fast load, AOF tail for low data loss. Use this.
Replication	Async by default. WAIT for semi-sync. Full resync is expensive — size the backlog buffer.
Sentinel	Automatic failover. Need 3+ instances. Doesn't shard data.
Cluster	16,384 hash slots, NOT consistent hashing. Minimum 6 nodes. Hash tags for co-location.
Split-brain	Async replication means the old leader can accept writes that get discarded. Use `min-replicas-to-write` to narrow the window.
Start simple	Sentinel before Cluster. Don't shard until one node's RAM isn't enough.

Interview Tip

When discussing Redis persistence, always state your fsync policy and explain the data loss window. Saying "I'd use AOF with appendfsync everysec" and then adding "which means we accept up to one second of data loss on crash" shows the interviewer you understand the trade-off — not just the configuration knob.