Tunable Consistency, Hinted Handoff, and Repair

TL;DR

Cassandra lets you dial consistency per query — from "fire and forget" to "all replicas must agree" — and the formula R + W > RF is the one-liner that determines whether you get strong consistency or not.

What It Is

Most databases give you one consistency model. PostgreSQL gives you strong consistency. Redis gives you eventual consistency. Take it or leave it.

Cassandra gives you a knob. For every read and every write, you choose how many replicas must participate before the operation succeeds. This is tunable consistency. The same cluster can serve strongly consistent reads for payment processing and eventually consistent reads for analytics dashboards. Same data, different guarantees.

This flexibility is what drew companies like Netflix, Apple, and Discord to Cassandra. Netflix runs hundreds of Cassandra clusters across three AWS regions. Some queries demand consistency. Others prioritize availability and speed. Tunable consistency lets them do both on the same infrastructure.

When to Use Tunable Consistency

Scenario	Consistency Level	Why
Shopping cart updates	QUORUM write + QUORUM read	Must not lose items, can tolerate small delay
Session tokens	ONE write + ONE read	Speed matters, losing a session isn't catastrophic
Financial ledger	ALL write + ALL read	Cannot tolerate any inconsistency
Analytics counters	ONE write + ONE read	Approximate is fine, speed is everything
User profile display	ONE read	Slightly stale data is acceptable
User profile update	QUORUM write	Update must be durable

Strong consistency in Cassandra is possible but fragile. If you truly need it for every operation, PostgreSQL or CockroachDB is a better fit. Cassandra's sweet spot is workloads where most operations can tolerate eventual consistency, with a few needing something stronger.

Internals — Replication

Replication Factor

The replication factor (RF) determines how many copies of each partition exist across the cluster. RF is set per keyspace (Cassandra's equivalent of a database).

CREATE KEYSPACE messaging
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};

With RF=3 in each datacenter, every partition has 3 copies per DC. If a node dies, two copies remain. If two nodes die simultaneously, one copy survives.

How replicas are placed:

Consistent Hashing Ring (RF=3):

Token Space: [-2^63, 2^63 - 1]

Node A (token 0)        → owns range [D's token + 1, A's token]
Node B (token 2^62)     → owns range [A's token + 1, B's token]
Node C (token 2^63 - 1) → owns range [B's token + 1, C's token]
Node D (token -2^62)    → owns range [C's token + 1, D's token] (wraps around the ring)

For a partition hashing to token 100:
  Primary replica:  Node B (100 falls in [A's token + 1, B's token])
  Second replica:   Node C (next node clockwise)
  Third replica:    Node D (next after C)

The primary replica is the node that owns the token range. The second and third replicas are the next nodes clockwise on the ring. With NetworkTopologyStrategy, Cassandra also tries to place replicas on different racks within each datacenter, so that a rack-level failure does not take out every copy.
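The clockwise walk can be sketched in a few lines of Python. This is a minimal illustration that ignores racks and datacenters, not Cassandra's actual placement code. The tokens follow the ring sketch, with Node C's token assumed to be 2^63 - 1 so it fits in a signed 64-bit token space; `replicas_for` is a hypothetical helper name.

```python
from bisect import bisect_left

# Tokens from the ring sketch (assumption: Node C sits at 2**63 - 1).
RING = sorted([
    (-2**62, "D"),
    (0, "A"),
    (2**62, "B"),
    (2**63 - 1, "C"),
])
TOKENS = [token for token, _ in RING]


def replicas_for(token: int, rf: int = 3) -> list[str]:
    """Return the owner of `token` plus the next rf - 1 nodes clockwise."""
    # The owner is the first node whose token is >= the partition token;
    # past the largest token the search wraps to the start of the ring.
    start = bisect_left(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]


print(replicas_for(-100))        # ['A', 'B', 'C']
print(replicas_for(2**62 + 5))   # ['C', 'D', 'A']
```

A node owns every token greater than its predecessor's token and up to its own, which is why a token equal to a node's token belongs to that node.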

Consistency Levels

Level	Writes	Reads	Meaning
ONE	1 replica acknowledges	1 replica responds	Fastest, least durable
TWO	2 replicas acknowledge	2 replicas respond	Moderate
QUORUM	⌊RF/2⌋ + 1 acknowledge	⌊RF/2⌋ + 1 respond	Strong with QUORUM read+write
LOCAL_QUORUM	Quorum within local DC	Quorum within local DC	Strong within a DC, no cross-DC latency
ALL	All replicas acknowledge	All replicas respond	Slowest, strongest
ANY	At least 1 node (including hints)	N/A (writes only)	Fire and forget

With RF=3, QUORUM means 2 replicas. The formula:

QUORUM = floor(RF / 2) + 1
       = floor(3 / 2) + 1
       = 1 + 1
       = 2
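
The same arithmetic for other replication factors, as a small Python sketch. The function names are illustrative and not part of any Cassandra API.

```python
def quorum(rf: int) -> int:
    """Number of replicas that make up a quorum for a given RF."""
    return rf // 2 + 1


def tolerated_failures(rf: int) -> int:
    """Replicas that can be down while QUORUM operations still succeed."""
    return rf - quorum(rf)


for rf in (1, 2, 3, 5, 6):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {tolerated_failures(rf)} down")
```

Note that RF=2 needs both replicas for a quorum, so it tolerates no failures at QUORUM, and that RF=6 tolerates no more failures than RF=5.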

The Strong Consistency Formula

R + W > RF → Strong Consistency

Where R = read consistency level, W = write consistency level, RF = replication factor.

With RF=3:

Read	Write	R + W	> RF?	Consistent?
QUORUM (2)	QUORUM (2)	4	> 3 ✓	Yes
ONE (1)	ALL (3)	4	> 3 ✓	Yes
ALL (3)	ONE (1)	4	> 3 ✓	Yes
ONE (1)	ONE (1)	2	> 3 ✗	No
ONE (1)	QUORUM (2)	3	> 3 ✗	No (equal, not greater)

The most common production configuration: QUORUM reads + QUORUM writes. It gives you strong consistency while tolerating one replica failure on both reads and writes.
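
Why the formula works: if R + W > RF, every possible read set shares at least one replica with every possible write set. The sketch below checks that claim by brute force for small clusters. It is a toy verification with hypothetical function names, not something you would run against a cluster.

```python
from itertools import combinations


def is_strong(r: int, w: int, rf: int) -> bool:
    """The rule of thumb from the table above."""
    return r + w > rf


def always_overlaps(r: int, w: int, rf: int) -> bool:
    """True if every r-replica read set intersects every w-replica write set."""
    replicas = range(rf)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(always_overlaps(2, 2, 3))  # True  (QUORUM + QUORUM)
print(always_overlaps(1, 2, 3))  # False (ONE + QUORUM can miss the write)
```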

Gotcha

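R + W > RF guarantees that a read sees the latest write that was acknowledged at its consistency level. The weak spot is a write that fails partway: if it reaches 1 replica but not the quorum, the client gets an error, yet that replica keeps the value. Later reads may or may not return it until hinted handoff, read repair, or anti-entropy repair brings the replicas back in line. This is why Cassandra's strong consistency is weaker than PostgreSQL's: a regular write has no rollback and no transactional isolation.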

Write Path — Where Data Actually Goes

Client
  │
  ▼
Coordinator Node (any node in the cluster)
  │
  ├──── Write to Replica 1 ────→ Commit Log → Memtable
  ├──── Write to Replica 2 ────→ Commit Log → Memtable
  └──── Write to Replica 3 ────→ Commit Log → Memtable
  │
  ▼
Wait for W replicas to acknowledge
  │
  ▼
Return success to client

On each replica, the write hits two places:

Commit log — append-only file for durability. If the node crashes before flushing to disk, the commit log replays on restart.
Memtable — in-memory sorted data structure. Serves reads until flushed to disk.

When the memtable reaches a size threshold, it's flushed to an SSTable (Sorted String Table) on disk. SSTables are immutable — once written, never modified. Sound familiar? Same principle as Lucene segments.
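
A toy model of that per-replica write path, assuming a flush threshold counted in keys rather than bytes. `ToyReplica` and its methods are invented for illustration and do not mirror Cassandra's classes.

```python
class ToyReplica:
    """Commit log + memtable + immutable SSTables, heavily simplified."""

    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []      # append-only record of unflushed writes
        self.memtable = {}        # in-memory, latest value per key
        self.sstables = []        # immutable sorted tuples, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs replay

    def recover(self):
        """Rebuild the memtable from the commit log after a crash."""
        self.memtable = dict(self.commit_log)
```

Wiping `memtable` and calling `recover()` restores every write that had not yet been flushed, which is the whole job of the commit log.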

LSM Tree Storage Engine

Cassandra uses a Log-Structured Merge (LSM) tree. The design prioritizes write speed over read speed.

Write Path:                          Read Path:

Write ──→ Commit Log (durability)    Read ──→ Memtable (in-memory)
      ──→ Memtable (in-memory)              ↓ miss
              │                        → Bloom Filter (skip SSTables that
              │ flush                     definitely don't have the key)
              ▼                              ↓ possible match
         SSTable L0 (small)            → Partition Index → Data Block
         SSTable L0
              │ compaction
              ▼
         SSTable L1 (medium)
              │ compaction
              ▼
         SSTable L2 (large)

Write amplification: the same data is written to disk multiple times: first to the commit log, then to an SSTable when the memtable flushes, then again every time compaction rewrites it. Individual writes are still fast because they're sequential appends.

Read amplification: a read might check the memtable plus multiple SSTables across compaction levels. Bloom filters help — they tell you with certainty when a key is NOT in an SSTable, avoiding wasted disk reads. But the worst case is still checking many files.
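
A minimal Bloom filter sketch shows where the "definitely not here" guarantee comes from. The bit-array size, hash count, and SHA-256-based hashing are arbitrary choices for illustration; Cassandra's implementation differs.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

Every added key sets all of its bits, so a lookup for a key that was added can never return False. A True result for a key that was never added is a false positive, and it costs one wasted SSTable read.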

This is the fundamental LSM trade-off: fast writes, slower reads. B-trees (used by PostgreSQL) have the opposite trade-off: slower writes, faster reads.

Hinted Handoff — Handling Temporary Failures

When a replica is down, the write still needs to reach it eventually. Cassandra uses hinted handoff to bridge the gap.

Normal Write (all replicas up):
Client → Coordinator → Replica A ✓
                     → Replica B ✓
                     → Replica C ✓

Write with Replica C down:
Client → Coordinator → Replica A ✓
                     → Replica B ✓
                     → Replica C ✗ (down)
                     → Store hint for C on Coordinator

When Replica C comes back:
Coordinator → Replay hint → Replica C ✓ → Delete hint

The coordinator stores the missed write as a "hint" — a small record containing the write payload and the target replica. When the dead replica rejoins the cluster, the coordinator replays all stored hints.

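Hints are only collected for a limited window, 3 hours by default (max_hint_window_in_ms in cassandra.yaml, renamed max_hint_window in newer releases). If the replica stays down longer than that, the coordinator stops storing hints for it, and anything written after the window is missing on that replica until the next full repair.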

Consistency level ANY counts a hint as a successful write. This means the write is acknowledged even if NO replicas actually have the data — only the coordinator's hint store. This is the least durable option. Use it only for truly disposable data.
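
The hint lifecycle can be modeled with a toy coordinator. This sketch assumes the 3-hour window behaves as a cutoff for storing new hints. Replicas are plain Python lists, and `ToyCoordinator` is a made-up name.

```python
HINT_WINDOW_SECONDS = 3 * 60 * 60  # 3-hour default from the text


class ToyCoordinator:
    def __init__(self):
        self.hints = {}  # replica name -> list of missed mutations

    def write(self, mutation, replicas, down_since, now):
        """Apply `mutation` to live replicas and hint the rest.

        replicas:   dict of name -> list acting as that replica's storage
        down_since: dict of name -> time the replica went down
        Returns the number of replicas that acknowledged the write.
        """
        acks = 0
        for name, store in replicas.items():
            if name not in down_since:
                store.append(mutation)
                acks += 1
            elif now - down_since[name] <= HINT_WINDOW_SECONDS:
                self.hints.setdefault(name, []).append(mutation)
            # Otherwise: outside the hint window, nothing is stored.
        return acks

    def replay(self, name, store):
        """Deliver and delete all hints for a replica that came back."""
        store.extend(self.hints.pop(name, []))
```

A write made one minute into an outage is replayed when the replica returns. A write made four hours in is not, and only repair can deliver it.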

Interview Tip

If an interviewer asks "what happens when a Cassandra node is down for a week?", the answer is that hinted handoff won't save it (hints only cover the first 3 hours of an outage by default). You need to run a full repair (nodetool repair) on that node when it comes back to ensure it has all the data it missed.

Read Repair — Fixing Staleness at Read Time

When you read at a consistency level that contacts multiple replicas, Cassandra compares the responses. If they disagree, the most recent value wins (last-write-wins based on timestamp), and stale replicas are updated.

Read at QUORUM (contacts 2 of 3 replicas):

Replica A: user.name = "Alice" (timestamp: 100)
Replica B: user.name = "Alicia" (timestamp: 150)  ← newer wins

Response to client: "Alicia"

Read repair: Replica A updated to "Alicia"

There are two types:

Blocking read repair (default for QUORUM reads): The stale replica is updated before returning to the client. Adds latency but guarantees the inconsistency is fixed immediately.

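Background read repair (configurable in older versions): The response returns immediately, and the repair happens asynchronously. Faster but the fix isn't guaranteed. It was controlled by the read_repair_chance table option, which Cassandra 4.0 removed.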

Read repair is optimistic — it only fixes data that's actually read. If nobody reads a stale row, it stays stale until anti-entropy repair catches it. This is why you can't rely on read repair alone for consistency.
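
The reconcile-and-fix step, sketched under the assumption that each replica returns a single (value, timestamp) pair. `read_with_repair` is a hypothetical helper that updates the stale replicas in place.

```python
def read_with_repair(responses):
    """Pick the newest value and bring stale replicas up to date.

    responses: dict of replica name -> (value, timestamp); mutated in place.
    Returns (winning value, list of replicas that were repaired).
    """
    winner = max(responses.values(), key=lambda pair: pair[1])
    stale = [name for name, pair in responses.items() if pair[1] < winner[1]]
    for name in stale:
        responses[name] = winner  # the "repair" write
    return winner[0], stale


responses = {"A": ("Alice", 100), "B": ("Alicia", 150)}
print(read_with_repair(responses))  # ('Alicia', ['A'])
```

Only replicas that took part in the read appear in `responses`, which is exactly why rows nobody reads never get fixed this way.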

Anti-Entropy Repair — The Full Sync

Anti-entropy repair is the nuclear option. It compares all data between replicas using Merkle trees and synchronizes any differences.

Node A                              Node B
┌────────────────┐                 ┌────────────────┐
│ Build Merkle   │                 │ Build Merkle   │
│ Tree over all  │                 │ Tree over all  │
│ local data     │                 │ local data     │
└───────┬────────┘                 └───────┬────────┘
        │                                  │
        └──── Compare root hashes ─────────┘
                       │
              Roots match? ──→ Done! Data is identical.
              Roots differ? ──→ Recurse into subtrees.
                       │
              Find differing leaves ──→ Stream missing data.

A Merkle tree hashes data hierarchically. If the root hashes of two replicas match, all data is identical — no need to compare individual rows. If they differ, you recurse into subtrees to find exactly which partitions are out of sync. This makes repair efficient even for terabytes of data.
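
A compact Merkle comparison in Python. It assumes each leaf is one string standing in for a token range's data and that the leaf count is a power of two; real repair hashes token ranges and streams the mismatching ones. The function names are illustrative.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(leaf.encode()) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its two children.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b, depth=None, index=0):
    """Walk down from the root, descending only into mismatching subtrees."""
    if depth is None:
        depth = len(a) - 1            # start at the root level
    if a[depth][index] == b[depth][index]:
        return []                     # whole subtree identical, skip it
    if depth == 0:
        return [index]                # reached a mismatching leaf
    return (differing_leaves(a, b, depth - 1, 2 * index)
            + differing_leaves(a, b, depth - 1, 2 * index + 1))


node_a = build_tree(["r0", "r1", "r2", "r3"])
node_b = build_tree(["r0", "r1", "r2-stale", "r3"])
print(differing_leaves(node_a, node_b))  # [2]
```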

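Full repair rebuilds the entire Merkle tree and compares everything. Every node should complete one at least once per gc_grace_seconds window (10 days by default), so a weekly schedule is a common choice.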

Incremental repair tracks which SSTables have been repaired and only checks new data since the last repair. Faster, but can accumulate errors if SSTables get corrupted.

Subrange repair repairs a specific token range. Useful for targeted fixes without running a full repair.

Spotify built a custom repair scheduler called Reaper (now open-source as cassandra-reaper) that distributes repair work across the cluster to avoid overwhelming any single node. Netflix also runs incremental repairs continuously across their Cassandra fleet using their own internal tooling.

Tombstones — The Hidden Performance Killer

In Cassandra, deletes don't remove data. They write a tombstone — a marker that says "this data was deleted at this timestamp." The actual data is removed later during compaction.

Before delete:
  user:1001 → {name: "Alice", email: "alice@example.com"}

After DELETE FROM users WHERE user_id = 1001:
  user:1001 → TOMBSTONE (deleted at timestamp 1713400000)

Original data STILL exists in the SSTable.
Tombstone exists in a newer SSTable.
During compaction: both are merged, data is finally removed.

Why tombstones exist: Remember, SSTables are immutable. You can't modify or remove a row from an existing SSTable. The only way to "delete" is to write a newer record (the tombstone) that supersedes the old data during reads and compaction.

Why tombstones are dangerous: Until compaction runs, reads must scan through tombstones. If a partition has thousands of tombstones, every read of that partition must process all of them to determine what's actually alive. This slows reads dramatically.
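
To make the cost concrete, here is a toy read that merges SSTables for one partition. Each SSTable is assumed to be a dict of row key to (value, timestamp), and a sentinel object stands in for the tombstone marker. None of these names come from Cassandra.

```python
TOMBSTONE = object()  # sentinel marking a deleted row


def read_partition(sstables):
    """Merge SSTables newest-timestamp-wins and count tombstones scanned.

    sstables: list of dicts mapping row key -> (value or TOMBSTONE, timestamp)
    Returns (live rows, number of tombstones the read had to process).
    """
    latest = {}
    for sstable in sstables:
        for key, (value, ts) in sstable.items():
            if key not in latest or ts > latest[key][1]:
                latest[key] = (value, ts)
    live = {k: v for k, (v, _) in latest.items() if v is not TOMBSTONE}
    scanned = sum(
        1 for sstable in sstables
        for value, _ in sstable.values() if value is TOMBSTONE
    )
    return live, scanned


old = {"msg1": ("hi", 1), "msg2": ("yo", 2)}
new = {"msg1": (TOMBSTONE, 5)}
print(read_partition([old, new]))  # ({'msg2': 'yo'}, 1)
```

The read returns one row but had to process three entries. With thousands of tombstones in a partition, that overhead dominates the read.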

The Discord nightmare: Discord's Cassandra deployment suffered from tombstone accumulation. Their message deletion pattern was:

User deletes a message → tombstone written
Channel has heavy read traffic → reads scan through tombstones
Tombstones pile up faster than compaction removes them
Read latency spikes from milliseconds to seconds

This was one of several factors that led Discord to migrate from Cassandra to ScyllaDB (a Cassandra-compatible database written in C++ instead of Java). ScyllaDB's compaction is more efficient, and its per-core threading model handles tombstone scanning better.

Tombstone best practices:

Practice	Why
Set `gc_grace_seconds` (default 10 days)	Tombstones can't be removed until gc_grace expires
Avoid deleting individual rows in large partitions	Use TTL instead — expired data is marked with tombstones automatically but more predictably
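Monitor with `nodetool tablestats`	Watch the "Average tombstones per slice" and "Maximum tombstones per slice" figures
Use time-bucketed partitions	Old buckets can be removed wholesale: TRUNCATE or DROP works on a time-bucketed table, and deleting a whole partition writes one tombstone instead of one per row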

TTL — Time-Based Tombstones

Instead of explicit deletes, set a TTL (time-to-live) on rows. After the TTL expires, the row becomes a tombstone automatically.

-- Session data that expires after 24 hours
INSERT INTO sessions (session_id, user_id, data)
VALUES ('abc123', 1001, '{"role": "admin"}')
USING TTL 86400;

The row disappears after 86,400 seconds. No explicit DELETE needed. The tombstone is created automatically at expiration time and cleaned up during compaction after gc_grace_seconds.

TTLs are cleaner than explicit deletes for temporal data — sessions, tokens, temporary locks, rate limiter windows. The tombstones are more predictable because they're spread out over time rather than created in bursts.
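
The timeline of a TTL'd row can be written as two checks. This is a sketch of the rules described above, using abstract seconds and hypothetical function names. The 864,000-second default is 10 days, matching the gc_grace_seconds default.

```python
GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # 864000, the 10-day default


def is_live(write_time: int, ttl: int, now: int) -> bool:
    """A TTL'd row is readable until write_time + ttl."""
    return now < write_time + ttl


def is_purgeable(write_time: int, ttl: int, now: int,
                 gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """After expiry the row acts as a tombstone; compaction may drop it
    only once gc_grace seconds have passed since the expiration time."""
    return now >= write_time + ttl + gc_grace


# A session written at t=0 with a 24-hour TTL:
print(is_live(0, 86400, now=3600))        # True
print(is_live(0, 86400, now=86400))       # False
print(is_purgeable(0, 86400, now=90000))  # False, still inside gc_grace
```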

Compaction Strategies

Compaction is the process of merging SSTables to reclaim space from overwritten and deleted data. Cassandra offers multiple strategies:

Strategy	How It Works	Best For
SizeTiered (STCS)	Merges SSTables of similar size	Write-heavy workloads
Leveled (LCS)	Organizes SSTables into levels (like LevelDB)	Read-heavy workloads
TimeWindow (TWCS)	Groups SSTables by time window	Time-series data

STCS is the default. It's simple: when you have N SSTables of similar size (N is the min_threshold option, 4 by default), merge them into one larger SSTable. Good for write-heavy workloads because it minimizes write amplification. Bad for reads because data for a single partition can be spread across many SSTables.

LCS organizes SSTables into levels (L0, L1, L2...). Each level is 10x the size of the previous. Data for a single partition is usually in at most one SSTable per level. This dramatically reduces read amplification — at most 1 SSTable per level needs to be checked. But it increases write amplification because merges happen more frequently.

TWCS groups SSTables by time window (e.g., 1 hour). Within a window, STCS is used. Old windows are never compacted with new windows. This is perfect for time-series data where old data is rarely modified. When the retention period expires, you drop the entire window — no tombstones needed.
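
How STCS decides that SSTables are "of similar size" can be approximated with a bucketing pass. The sketch below is a simplification: it takes plain sizes and ignores details such as minimum SSTable size. The 0.5/1.5 bounds and the threshold of 4 mirror the strategy's bucket_low, bucket_high, and min_threshold options.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group SSTable sizes into buckets of similar size.

    A size joins the first bucket whose current average it is within
    [bucket_low, bucket_high] of; otherwise it starts a new bucket.
    Only buckets with at least min_threshold members are compaction candidates.
    """
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]


# Sizes in MB: four small SSTables, two medium, one large.
print(size_tiered_buckets([10, 11, 9, 12, 100, 105, 400]))  # [[9, 10, 11, 12]]
```

Only the four small SSTables are merged. The medium and large ones wait until enough peers of their size accumulate, which is how data for one partition ends up spread across many files.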

Patterns for System Design Interviews

Pattern 1: "How Would You Handle a Cassandra Node Being Down for 2 Days?"

Walk through the recovery:

First 3 hours: Hinted handoff covers it. The coordinator stores hints and replays them when the node returns.
After 3 hours: The hint window closes (3 hours by default) and the coordinator stops storing hints. Data written after this point is missing from the downed replica.
Node comes back: Run nodetool repair to sync all data from healthy replicas. This uses Merkle trees to find and fix differences.
During the outage: Reads and writes still work at QUORUM (2 of 3 replicas available). At ONE, they work even with 2 nodes down.
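
The availability claims in that last step reduce to a comparison between replicas required and replicas alive. This sketch covers single-datacenter levels only, and the function names are made up for illustration.

```python
def required_replicas(level: str, rf: int) -> int:
    """Replicas that must respond for a given consistency level."""
    return {"ONE": 1, "TWO": 2, "QUORUM": rf // 2 + 1, "ALL": rf}[level]


def can_serve(level: str, rf: int, replicas_up: int) -> bool:
    return replicas_up >= required_replicas(level, rf)


# RF=3 with one replica down, then with two down:
for up in (2, 1):
    status = {lvl: can_serve(lvl, 3, up) for lvl in ("ONE", "QUORUM", "ALL")}
    print(f"{up} replicas up: {status}")
```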

Pattern 2: "Design for Multi-Region Active-Active"

US-East (RF=3)                    EU-West (RF=3)
┌─────────┐                      ┌─────────┐
│ Node A  │    ← replication →   │ Node D  │
│ Node B  │                      │ Node E  │
│ Node C  │                      │ Node F  │
└─────────┘                      └─────────┘

Write from US user:
  LOCAL_QUORUM (2 of 3 US nodes acknowledge)
  Then async replicate to EU

Write from EU user:
  LOCAL_QUORUM (2 of 3 EU nodes acknowledge)
  Then async replicate to US

LOCAL_QUORUM avoids cross-region latency on the write path. The trade-off: a US write and an EU write to the same row at the same time create a conflict. Cassandra resolves it with last-write-wins (LWW) — the write with the higher timestamp survives. This can lose data if clocks are skewed. NTP clock sync is not optional in multi-region Cassandra.
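
A few lines of Python show how clock skew turns last-write-wins into lost data. Timestamps are abstract integers here, and the 5-unit skew is an invented example value.

```python
def stamp(value, true_time, clock_offset=0):
    """Tag a write with the timestamp its (possibly skewed) clock reports."""
    return (value, true_time + clock_offset)


def last_write_wins(a, b):
    # Ties go to `a` here; this sketch does not model Cassandra's tie-breaking.
    return a if a[1] >= b[1] else b


# The EU write really happens later, but the EU node's clock runs 5 behind.
us_write = stamp("shipping=US", true_time=1000)
eu_write = stamp("shipping=EU", true_time=1001, clock_offset=-5)

print(last_write_wins(us_write, eu_write)[0])  # shipping=US
```

The EU write was the most recent in real time, yet it loses because its reported timestamp (996) is lower than the US write's (1000). Nothing signals the loss to either client.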

Pattern 3: "When Should You Use ALL Consistency Level?"

Almost never. ALL means every replica must respond. If any single replica is down, the operation fails. You've traded availability for consistency — at which point, you should probably be using PostgreSQL.

The one legitimate use case: initial data migration. When loading data for the first time, use ALL to guarantee every replica has the data. Then switch to QUORUM for normal operations.

Trade-offs Table

Decision	Trade-off
QUORUM reads + writes	Strong consistency but higher latency than ONE
ONE reads + writes	Fastest but risk reading stale data
ALL writes	Strongest durability but any node failure blocks writes
Higher RF (5 vs 3)	More fault tolerance but more storage and write amplification
Short gc_grace_seconds	Tombstones cleaned faster but risk resurrecting deleted data if repair hasn't run
Long gc_grace_seconds	Safer but tombstones accumulate longer
STCS compaction	Fast writes but read amplification
LCS compaction	Fast reads but write amplification
Hinted handoff (longer TTL)	Better coverage for short outages but more storage on coordinator

Interview Gotchas

"Does QUORUM read + QUORUM write guarantee strong consistency?"

Yes, as long as the write actually succeeds at QUORUM. R + W > RF means the read set and write set must overlap. With RF=3 and QUORUM (2), any 2 nodes you read from will include at least 1 that received the latest write. But if a write partially fails (reaches 1 of 2 needed replicas), you have a window of inconsistency until repair catches it.

"What's last-write-wins and why is it a problem?"

Cassandra resolves conflicts by timestamp, cell by cell. The write with the highest timestamp wins. If two clients write to the same column of the same row at the same time, the one with the slightly later timestamp survives. The other is silently discarded. There's no merge, no conflict notification. This can lose data in concurrent write scenarios. If you need merge semantics, use CRDTs (Conflict-free Replicated Data Types) or an application-level merge strategy.

"Why is DELETE dangerous in Cassandra?"

Deletes create tombstones. Tombstones slow down reads until compaction removes them. Heavy delete workloads can make partitions unreadable. Prefer TTLs for temporal data, or drop entire partitions (time-bucketed tables) when possible. Never run bulk DELETE operations on hot partitions.

"What happens if you never run repair?"

Replicas gradually diverge. Hinted handoff only covers short outages (3 hours default). Read repair only fixes data that's actually read. Unread data stays stale forever. Over weeks, replicas can have significantly different data. Running repair periodically (at minimum every gc_grace_seconds) is not optional — it's mandatory for correctness.

"How does Cassandra compare to DynamoDB for consistency?"

DynamoDB offers eventual consistency (default) and strong consistency (per-read option). It's simpler: two choices, not a spectrum. Cassandra's tunable consistency is more flexible but harder to reason about. DynamoDB also supports transactions (TransactWriteItems) across multiple items. Cassandra has lightweight transactions (LWT) using Paxos, but each one takes several round trips between the coordinator and the replicas (four in the classic implementation), so they are several times slower than regular writes and not recommended for high-throughput use cases.

"What is read amplification and why does it matter?"

Read amplification is the ratio of data read from disk to data returned to the client. In an LSM tree, a single-row read might check the memtable, then scan bloom filters for 5 SSTables, then read from 2 of them. That's a lot of I/O for one row. Leveled compaction reduces this by ensuring at most one SSTable per level contains a given partition's data.

Key Takeaways

Concept	What to Remember
R + W > RF	The strong consistency formula. QUORUM + QUORUM with RF=3 gives you 2 + 2 > 3.
Hinted handoff	Covers short outages (3h default). Not a replacement for repair.
Read repair	Fixes staleness at read time. Only works for data that's actually read.
Anti-entropy repair	Full Merkle-tree sync. Run at least every gc_grace_seconds.
Tombstones	Deletes aren't free. They slow reads until compaction. Prefer TTLs.
LSM tree	Write-optimized. Fast appends, read amplification on lookups.
LWW conflict resolution	Highest timestamp wins. Can silently lose data in concurrent writes.
LOCAL_QUORUM	Strong consistency within a DC without cross-DC latency.