Design a Distributed Cache

TL;DR

A distributed cache stores frequently accessed data in memory across multiple machines to reduce database load and slash read latency from milliseconds to microseconds. The interesting engineering is not "put data in RAM" -- it is handling thundering herds when a popular key expires, surviving cache server failures without melting your database, and making consistent hashing actually work when you need to add or remove nodes. Facebook's Memcached paper (2013) is the canonical reference here. They ran into every problem you will be asked about in an interview, and their solutions (leases, gutter servers, McRouter) are still the best answers.

The System

A distributed cache sits between your application servers and your database, holding hot data in memory so most reads never touch the database. At its simplest: you check the cache first, return the value if found (cache hit), and fall back to the database if not (cache miss), then populate the cache for next time.

Facebook operated the largest Memcached deployment in history -- around 2013, they had over 800 dedicated Memcached servers holding trillions of items, handling billions of requests per second across multiple data centers. Their cache hit rate was above 99%. Without caching, their MySQL fleet would need to be roughly 100x larger. Every major tech company runs some variant of this architecture. Amazon's ElastiCache, Google's Cloud Memorystore, and Redis Labs all exist because distributed caching is not optional at scale -- it is the difference between a $50K/month database bill and a $5M/month one.

Requirements

Functional

Get: Given a key, return the cached value or indicate a miss. Latency under 1ms for cache hits
Set: Store a key-value pair with an optional TTL (time-to-live)
Delete / Invalidate: Remove a key from the cache (used when the underlying data changes)
Multi-get: Fetch multiple keys in a single round trip (critical for feed rendering -- a single page load may need 100+ cached items)
Support arbitrary value sizes: Small values (user profile, 500 bytes) to medium values (serialized feed, 50 KB). Reject values over 1 MB

Non-Functional

Read latency: Sub-millisecond p99 for cache hits (0.1-0.5ms in-datacenter)
Throughput: 10M cache operations/sec across the cluster
Hit rate: 95%+ for steady-state workloads (this determines how much database load you offload)
Availability: Losing one cache server must not cause a database meltdown. Graceful degradation, not cascading failure
Memory efficiency: Maximize useful data stored per GB of RAM. Overhead per item should be under 100 bytes
Consistency: Eventual consistency with the database is acceptable. Stale reads for a few seconds are fine. Serving stale data for minutes or permanently is not

Back-of-Envelope Math

Workload assumptions:
  10M cache ops/sec (80% reads, 20% writes/invalidations)
  8M reads/sec, 2M writes/sec
  Average value size: 1 KB
  Cache working set: 500M items

Memory:
  500M items * (key ~50 bytes + value ~1 KB + overhead ~56 bytes) = ~530 GB
  At 64 GB per server: need ~9 cache servers minimum
  With replication or headroom: 12-15 servers

Throughput per server:
  10M ops / 12 servers = ~833K ops/sec per server
  A well-tuned Memcached instance handles 1M+ ops/sec
  A Redis instance handles ~200K ops/sec (single-threaded)
  Verdict: Memcached fits; Redis needs more shards

Network bandwidth:
  8M reads/sec * 1 KB = 8 GB/sec read bandwidth
  Across 12 servers: 667 MB/sec per server
  25 Gbps NIC = ~3 GB/sec. Headroom exists but network is not free.

Database fallback load:
  If cache hit rate drops from 99% to 95%: 
    miss traffic goes from 80K/sec to 400K/sec (5x increase)
  If cache hit rate drops to 80%:
    miss traffic jumps to 1.6M/sec (20x increase)
  This is why cache hit rate is the most important metric in the system.

The Naive Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  App Server   │────>│   Memcached  │────>│   Database   │
│  (cache-aside)│<────│  (1 server)  │<────│  (MySQL)     │
└──────────────┘     └──────────────┘     └──────────────┘

Read path:
  value = cache.get(key)
  if value is None:           # cache miss
      value = db.query(key)
      cache.set(key, value, ttl=300)
  return value

Write path:
  db.update(key, new_value)
  cache.delete(key)           # invalidate

This is cache-aside (also called look-aside). The application manages the cache explicitly. For a single-server setup, it works. The application is the only thing that reads and writes the cache, so there is no coordination problem.

Where Does This Break First?

When the single Memcached server runs out of memory (64 GB holds ~60M 1KB items). You add a second server, and now you need consistent hashing to decide which server holds which key. More critically: when a popular key expires, hundreds of app servers simultaneously get a cache miss and all query the database for the same key. That is a thundering herd, and it can take down your database in seconds.

Where It Breaks

Problem 1: Thundering herd on cache miss. A popular key (celebrity profile, trending post) expires. 500 app servers all get cache misses within the same millisecond. All 500 send the same query to the database. The database was handling 1K queries/sec comfortably; now it gets 500 identical queries at once on top of its normal load. If this happens for multiple popular keys simultaneously (common during a cache server restart), the database buckles.

Problem 2: Cache server failure redistributes load. With consistent hashing across 12 servers, each server holds ~8% of keys. When one server dies, its 8% of traffic (833K ops/sec) becomes cache misses. Those misses all hit the database. But the database was sized for a 99% hit rate (80K misses/sec). Suddenly it gets 913K misses/sec -- an 11x spike. The database's query queue fills up, latency spikes to seconds, and your entire application becomes unresponsive.

Problem 3: Stale data after invalidation failure. Your app updates the database, then deletes the cache key. If the delete fails (network blip, cache server temporarily unreachable), the cache serves stale data until TTL expiration. For a 5-minute TTL, a user might see outdated data for 5 minutes. For critical data (account balance, order status), this is unacceptable.

Problem 4: Race condition between concurrent writes. Two app servers update the same key simultaneously:

Server A: db.update(key, "v2"), cache.delete(key)
Server B: reads from DB (gets "v2"), cache.set(key, "v2")
But Server A's cache.delete() executes AFTER Server B's cache.set()
Result: cache is empty, next read gets correct value. OK, this case is fine.

Worse case:
Server A: db.update(key, "v2")
Server B: cache.get(key) -> miss -> db.query(key) -> gets "v1" (A's write not committed yet)
Server A: cache.delete(key)
Server B: cache.set(key, "v1")  -- STALE!

This is the classic "read-your-writes" consistency violation. The cache now holds "v1" until TTL expires, even though the database has "v2."

The Real Design

                    ┌──────────────────────────────────────┐
                    │            McRouter                   │
                    │  (connection pooling + routing)       │
                    └────────────┬─────────────────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
     ┌────────v──────┐  ┌───────v───────┐  ┌───────v───────┐
     │  Memcached 1  │  │  Memcached 2  │  │  Memcached N  │
     │  (64 GB)      │  │  (64 GB)      │  │  (64 GB)      │
     └───────────────┘  └───────────────┘  └───────────────┘

                    ┌──────────────────────────────────────┐
                    │         Gutter Pool (failover)        │
                    │  ┌────────┐  ┌────────┐  ┌────────┐  │
                    │  │Gutter 1│  │Gutter 2│  │Gutter 3│  │
                    │  └────────┘  └────────┘  └────────┘  │
                    └──────────────────────────────────────┘

     ┌──────────────┐                       ┌──────────────┐
     │  App Server   │──── cache miss ──────>│   Database   │
     │  (with lease) │<─── lease token ──────│  (MySQL)     │
     └──────────────┘                       └──────────────┘

Cache Patterns

You must know four patterns. The interviewer will ask which one you chose and why.

1. Cache-Aside (Look-Aside)

The application manages the cache. On read: check cache, miss -> read DB, populate cache. On write: update DB, invalidate cache.

Read:  app -> cache.get() -> miss -> DB -> cache.set()
Write: app -> DB.update() -> cache.delete()

Pros: Simple, application has full control, cache only holds data that is actually requested. Cons: Cache miss penalty (extra DB round trip), thundering herd on popular key expiration, potential staleness window between DB write and cache invalidation.

This is what Facebook uses. It is the most common pattern because it gives you the most control.

2. Read-Through

The cache itself fetches from the database on a miss. The application only talks to the cache.

Read:  app -> cache.get() -> miss -> cache fetches from DB -> returns to app
Write: app -> DB.update() -> cache.invalidate()

Pros: Application code is simpler (no cache-miss handling logic). Cache acts as a unified data access layer. Cons: Cache needs to know how to query the database (coupling). Harder to implement custom deserialization or join logic.

3. Write-Through

Every write goes through the cache to the database. The cache always has the latest data.

Write: app -> cache.set() -> cache writes to DB -> ack to app
Read:  app -> cache.get() -> always a hit (in theory)

Pros: Cache is always consistent with DB. No staleness window. Cons: Write latency increases (cache + DB, synchronous). Writes to data that is never read waste cache space. Not suitable for write-heavy workloads.

4. Write-Behind (Write-Back)

Writes go to the cache, which asynchronously flushes to the database in batches.

Write: app -> cache.set() -> ack immediately -> cache batches writes to DB
Read:  app -> cache.get() -> hit

Pros: Very fast writes (memory-speed). Batching reduces DB write load by 10-50x. Cons: Data loss risk -- if the cache server crashes before flushing, writes are lost. Complex to implement correctly. Not suitable for financial or transactional data.

Refresh-Ahead: A variant where the cache proactively refreshes keys that are about to expire, before any client requests them. Reduces cache-miss latency for hot keys. The cache predicts which keys will be requested soon (based on access patterns) and refreshes them in the background. Amazon DynamoDB Accelerator (DAX) uses this approach.

My recommendation: Cache-aside for most interview answers. It is the most widely used, gives you the most control, and the thundering herd problem creates a natural deep-dive opportunity. Mention write-behind if the interviewer asks about write-heavy workloads.

Consistent Hashing with Virtual Nodes

With N cache servers, you need to route each key to the right server. Naive modular hashing (server = hash(key) % N) breaks catastrophically when you add or remove a server -- every key remaps, causing a 100% cache miss rate temporarily.

Consistent hashing maps both servers and keys onto a ring (0 to 2^32). Each key goes to the next server clockwise on the ring. When a server dies, only its keys (1/N of all keys) remap to the next server. When you add a server, only 1/N of keys move.

Problem with basic consistent hashing: if servers are placed randomly on the ring, the distribution is uneven. One server might own 30% of the key space while another owns 5%.

Virtual nodes fix this. Each physical server gets 150-200 virtual nodes placed randomly on the ring. With 12 physical servers and 150 virtual nodes each, you have 1800 points on the ring. The key distribution becomes nearly uniform.

Clarification on "the client does consistent hashing": When we say "the client does consistent hashing," we mean the server-side cache client library (like redis-py, Jedis, or a custom wrapper), NOT the end user's browser. Your application servers contain this library, which hashes the cache key and routes to the correct cache node. The end user has no knowledge of the caching layer.

Physical servers: A, B, C (3 servers)
Virtual nodes: A1, A2, ..., A150, B1, B2, ..., B150, C1, C2, ..., C150
Ring positions: A23 at 0.12, B107 at 0.13, C44 at 0.15, A89 at 0.17, ...

Key "user:123" hashes to 0.14 -> next virtual node is C44 -> route to server C

Redis Cluster uses hash slots instead. Redis divides the key space into 16,384 fixed slots. slot = CRC16(key) % 16384. Each node owns a contiguous range of slots. When you add a node, you move slot ranges (not individual keys). This is simpler than virtual nodes and gives you fine-grained control over data placement.

Why 16,384 slots? It is a balance. Too few slots (e.g., 100) means each slot has too many keys, making rebalancing coarse-grained. Too many slots (e.g., 1M) means the slot-to-node mapping table becomes large and gossip protocol messages between nodes become expensive. 16,384 slots (2 bytes per slot in gossip messages) keeps the mapping table under 32 KB, which fits in a single packet.

Facebook's Lease Mechanism (Thundering Herd Fix)

This is the most important section of this lesson. The thundering herd problem is the #1 failure mode of distributed caches, and Facebook's lease mechanism is the canonical solution.

When a cache miss occurs, the Memcached server issues a lease token (a 64-bit integer) to the requesting client. The token is bound to the key.

Server A: cache.get("user:123") -> MISS, lease_token = 98765
Server B: cache.get("user:123") -> MISS, lease_token = NONE (lease already issued)
Server C: cache.get("user:123") -> MISS, lease_token = NONE

Server A: fetches from DB, cache.set("user:123", value, lease_token=98765) -> SUCCESS
Server B: sees no lease -> waits 10ms, retries cache.get("user:123") -> HIT
Server C: sees no lease -> waits 10ms, retries cache.get("user:123") -> HIT

Only the server holding the lease is allowed to populate the cache. All other servers that get a miss for the same key see that a lease is already outstanding and wait briefly before retrying. In practice, the DB query takes 5-10ms, so by the time servers B and C retry, the cache is already populated.

Result: Instead of 500 servers hammering the database with the same query, exactly 1 server queries the database. The other 499 wait 10ms and get a cache hit. Facebook reported that leases reduced the number of peak cache-miss DB queries by 95%.

Gutter Servers (Cache Failure Resilience)

When a Memcached server dies, consistent hashing moves its keys to the next server on the ring. But that server was already at capacity handling its own keys. Now it receives 2x the traffic and potentially runs out of memory, evicting its own keys. This cascades.

Facebook's solution: gutter servers. A small pool (e.g., 3 servers) that act as temporary replacements.

Normal:  app -> Memcached server 7 -> HIT
Server 7 dies:
         app -> Memcached server 7 -> CONNECTION REFUSED
         app -> Gutter pool (any server) -> MISS (first time)
         app -> DB -> cache.set() on gutter server with SHORT TTL (30 seconds)

Next request:
         app -> Memcached server 7 -> CONNECTION REFUSED
         app -> Gutter pool -> HIT (from gutter cache)

Gutter servers use short TTLs (30 seconds to 2 minutes) so they do not permanently replace the dead server. They are a buffer that absorbs the miss storm while you bring the dead server back or replace it. The short TTL means gutter servers do not grow unboundedly -- stale data auto-expires quickly.

Facebook reported that gutter servers reduced the database query rate during cache server failures from a potential 10-100x spike to less than a 2x spike.

McRouter (Connection Pooling and Routing)

With 500 app servers and 12 Memcached servers, each app server opens connections to all 12 cache servers. That is 6,000 connections -- and each Memcached server handles 500 connections. Connection overhead (memory per connection, context switching) becomes significant.

McRouter is Facebook's open-source proxy that sits between app servers and Memcached. Instead of 500 direct connections per cache server, McRouter pools connections:

500 app servers -> 5 McRouter instances -> 12 Memcached servers
Each McRouter: 100 incoming connections, 12 outgoing = 60 outgoing connections
Total connections per Memcached: 60 (instead of 500)

McRouter also handles routing logic (consistent hashing, gutter failover, multi-get splitting), so the application code is a simple get/set without routing awareness.

Deep Dives

Distributed Cache — Distributed Cache High-Level Design

Deep Dive 1: W-TinyLFU Eviction (What Caffeine Uses)

When cache memory is full, you must evict something. The choice of eviction policy directly impacts your hit rate.

LRU (Least Recently Used): Evict the item that was accessed least recently. Simple, O(1) with a doubly-linked list + hash map. Problem: a one-time scan of 1M items pollutes the entire cache, evicting frequently-accessed items. This is called "cache pollution."

LFU (Least Frequently Used): Evict the item with the fewest accesses. Better for frequency-based workloads, but items that were popular yesterday but are cold today ("stale popular" items) never get evicted because their historical count is high.

W-TinyLFU (Windowed Tiny Least Frequently Used) combines the best of both. It is the eviction policy used by Caffeine (Java's best caching library) and achieves near-optimal hit rates.

Architecture:
┌────────────────────────────────────────────────┐
│  Window Cache (1% of capacity) - LRU           │
│  → New items enter here first                  │
├────────────────────────────────────────────────┤
│  Main Cache (99% of capacity) - Segmented LRU  │
│  ┌──────────────┐  ┌────────────────────────┐  │
│  │  Probation   │──│  Protected (80%)       │  │
│  │  (20%)       │  │                        │  │
│  └──────────────┘  └────────────────────────┘  │
├────────────────────────────────────────────────┤
│  TinyLFU Admission Filter (Count-Min Sketch)   │
│  → Decides if new item should replace evicted  │
└────────────────────────────────────────────────┘

When the window cache is full and needs to evict an item, TinyLFU compares the evicted item's frequency (estimated by a Count-Min Sketch, a probabilistic data structure) against the least-recently-used item in the probation segment. If the incoming item has higher estimated frequency, it gets admitted to the main cache. Otherwise, it is discarded.

The Count-Min Sketch uses 8 bytes per counter and 4 hash functions. For 500M items, the sketch is ~20 MB -- tiny compared to the cache data itself. It is also periodically halved (all counters divided by 2) to decay old frequency data, solving the "stale popular" problem.

Hit rate comparison on real-world traces:

Policy	Wikipedia trace	YouTube trace	Search trace
LRU	37.2%	41.8%	52.1%
LFU	39.1%	45.3%	53.8%
ARC	40.8%	46.1%	55.2%
W-TinyLFU	42.3%	48.7%	57.4%

The difference between LRU and W-TinyLFU is 5-7 percentage points. At 10M ops/sec, that translates to 500K-700K fewer database queries per second. That is real money.

Deep Dive 2: Cold Start and Cache Warming

The scariest moment for a cache-dependent system is a cold start. You deploy a new cluster of 12 cache servers. Hit rate: 0%. All 10M ops/sec go directly to the database. Your database, sized for 100K ops/sec (1% miss rate), immediately dies.

Strategy 1: Gradual rollover

Do not switch all traffic to the new cache cluster at once. Route 1% of traffic to the new cluster, 99% to the old. Increase by 5% every 10 minutes. The new cluster warms up under light load while the old cluster handles the bulk.

T+0 min:  new=1%, old=99%   (new cluster miss rate: 100%)
T+10 min: new=5%, old=95%   (new cluster miss rate: ~60%)
T+30 min: new=20%, old=80%  (new cluster miss rate: ~15%)
T+60 min: new=50%, old=50%  (new cluster miss rate: ~5%)
T+90 min: new=100%, old=0%  (new cluster miss rate: ~2%)

Strategy 2: Pre-warming from database

Before sending any traffic to the new cache, scan the most-accessed keys from the database (or from access logs) and pre-populate them. If you know the top 10M keys (from the old cache's access frequency data), you can warm the cache to ~80% hit rate before it receives any live traffic.

-- Get top keys from access log
SELECT key, access_count FROM cache_access_log 
WHERE timestamp > NOW() - INTERVAL '1 hour'
ORDER BY access_count DESC
LIMIT 10000000;

Strategy 3: Snapshot and restore

Dump the old cache's data (Redis RDB dump, or Memcached lru_crawler metadump) and load it into the new cluster. This gives you a near-100% warm cache instantly. The data might be slightly stale (minutes old), but TTLs will expire stale entries naturally.

Redis supports this natively with BGSAVE (background snapshot) and loading the RDB file on the new instance. Memcached does not natively support dumping, so this strategy is Redis-only.

Deep Dive 3: Cache Consistency in Multi-Region Deployments

Facebook runs Memcached clusters in multiple regions. User writes go to the primary region (e.g., US-East), and MySQL replication propagates changes to secondary regions (EU, AP). The cache in each region must be invalidated when data changes.

The problem: MySQL replication lag between regions is 10-100ms. If a user writes in US-East and the cache in EU is not invalidated, a user reading from EU gets stale data.

Facebook's solution: McSqueal

US-East (primary):
  App writes to MySQL primary
  MySQL binlog captures the write
  McSqueal daemon reads the binlog
  McSqueal sends invalidation to US-East Memcached (local, <1ms)
  McSqueal sends invalidation to EU and AP Memcached (cross-region, 50-100ms)

EU (secondary):
  McSqueal invalidation arrives
  EU Memcached deletes the key
  Next read from EU cache -> MISS -> read from EU MySQL replica

McSqueal reads the MySQL binary log (the same log used for replication) and extracts which cache keys need invalidation. This is more reliable than having the application send invalidations -- if the app crashes after writing to MySQL but before sending the cache invalidation, the binlog-based approach still catches it.

Remaining consistency gap: Between the MySQL write (US-East) and the invalidation arriving at EU (50-100ms), EU serves stale data. Facebook considers this acceptable for most use cases (social media content). For critical data (privacy settings, access control), they route reads to the primary region, bypassing the local cache entirely.

Alternative Designs

Alternative 1: Redis Cluster (In-Process Sharding)

Instead of Memcached + McRouter, use Redis Cluster with built-in sharding (16,384 hash slots). Redis handles routing internally. Each node knows the slot map and redirects clients.

Alternative 2: Application-Embedded Cache (Caffeine/Guava)

Skip the distributed cache entirely. Each app server maintains a local in-memory cache (Caffeine in Java, LRU dict in Python). No network hop, no Redis dependency.

Alternative 3: CDN as Cache Layer

For read-heavy public content, use Cloudflare or CloudFront as the cache. The CDN edge servers cache responses and serve them globally. No application-level caching needed for public content.

Aspect	Memcached + McRouter	Redis Cluster	App-Local (Caffeine)	CDN Edge Cache
Latency (cache hit)	0.2-0.5ms	0.3-0.8ms	~1 microsecond	1-5ms (edge)
Throughput per node	1M+ ops/sec	200K ops/sec	Unlimited (local)	Managed
Consistency	Invalidation-based	Invalidation-based	Per-server only	TTL-based
Memory per server	64-128 GB	64-128 GB	1-4 GB (JVM heap)	Managed
Data structures	Key-value only	Rich (sets, lists, etc.)	Full language support	HTTP responses only
Thundering herd fix	Leases	Lua + SETNX	N/A (local)	N/A (CDN handles it)
Ops complexity	High (McRouter fleet)	Medium (cluster mgmt)	Low (no infra)	Low (managed)

Use Memcached + McRouter when you need extreme throughput (1M+ ops/sec per server) and your data model is pure key-value. Use Redis Cluster when you need data structures (sorted sets, lists, pub/sub) and can accept lower per-node throughput. Use app-local caching as L1 in front of a distributed L2 cache. Use CDN caching for public, read-heavy content.

Scaling Math Verification

10M ops/sec with 12 Memcached servers:

Per server: 833K ops/sec. Memcached handles 1M+ ops/sec on a single instance. 20% headroom.
Memory: 500M items * 1.1 KB = 550 GB. Across 12 servers (64 GB each) = 768 GB capacity. Utilization: 72%. Healthy.
Network: 8M reads * 1 KB = 8 GB/sec cluster-wide. Per server: 667 MB/sec. 25 Gbps NIC supports ~3 GB/sec. Fine.

Multi-get efficiency:

A feed page needs 100 cached items. Without multi-get: 100 sequential round trips * 0.3ms = 30ms. With multi-get: 1 round trip per server, items distributed across ~10 servers (with consistent hashing), so 10 parallel round trips * 0.3ms = 0.3ms (parallel). That is a 100x latency improvement for a single page load.

McRouter connection savings:

Without: 500 app servers * 12 cache servers = 6,000 connections per cache server
With 10 McRouter instances: 10 * 12 = 120 connections per cache server
Connection memory savings: ~5,880 connections * 4 KB per connection = 23 MB per cache server. Not huge in absolute terms, but the context-switching savings at 833K ops/sec are significant.

Failure Analysis

Component	Current capacity	At 10x (100M ops/sec)	Breaks?	Fix
Memcached fleet (12)	1M ops/server/sec	8.3M ops/server needed	Yes	Scale to 120 servers
Memory (768 GB total)	550 GB used	5.5 TB needed	Yes	More servers, or tiered caching (SSD + RAM)
Network per server	667 MB/sec	6.67 GB/sec	Yes	100 Gbps NICs, or more servers to distribute
McRouter (10 instances)	1M conns + routing	10M conns	Maybe	Scale McRouter horizontally
Gutter pool (3)	Handles 1 server failure	Still 1 server failure	No	Scale gutter pool proportionally
Database (on miss)	100K queries/sec	1M queries/sec	Maybe	More read replicas, improve hit rate to 99.5%+

The first bottleneck at 10x is the server fleet size. Going from 12 to 120 servers is a linear scale-out -- Memcached is designed for this. The harder problem is network: at 100M ops/sec with 1 KB average values, you are moving 100 GB/sec of data. At this scale, you need to reduce value sizes (compress, store only IDs and hydrate at the application level) or add an L1 app-local cache that absorbs 50-80% of reads before they hit the network.

Facebook's actual architecture at peak had 800+ Memcached servers. They also used an L1 local cache (in PHP's APC) for the absolute hottest keys, reducing network traffic to Memcached by 30-40%.

What's Expected at Each Level

Aspect	Mid-Level	Senior	Staff+
Cache pattern	Cache-aside, explains hit/miss flow	Compares aside/through/behind, chooses with reasoning	Refresh-ahead, discusses when write-behind is safe
Hashing	"Use consistent hashing"	Explains virtual nodes, why modular hashing breaks	Hash slots (Redis 16384), discusses rebalancing mechanics
Thundering herd	"Use a lock"	Describes the problem quantitatively	Facebook's lease mechanism, compares to SETNX-based locking
Failure handling	"Add replicas"	Discusses cascading failure on server loss	Gutter servers, gradual rollover, cold-start warming
Eviction	"Use LRU"	LRU vs LFU trade-offs	W-TinyLFU, Count-Min Sketch, cache pollution scenarios
Consistency	"Delete cache on write"	Race condition in delete-after-write	Binlog-based invalidation (McSqueal), cross-region consistency
Memory efficiency	Not discussed	Calculates memory per item	Slab allocation, jemalloc fragmentation, compression trade-offs
Real-world reference	"Facebook uses Memcached"	Mentions McRouter and connection pooling	Cites the Memcached paper, leases, gutter servers, McSqueal

The single most important thing at any level: understand that a distributed cache is not just "RAM in front of a database." The cache failure modes (thundering herd, cascading failures, consistency gaps) are the actual system design problems. The data structure holding the cached values is the easy part.

References from Our Courses

Redis Persistence and Replication — AOF/RDB persistence and replica failover
Redis Data Structures and Use Cases — data structure selection for cache entries
Consistency and Replication — tunable consistency for cache coherence

Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Distributed Cache →