Design a Large Scale Cache

Level: L3-L6 Topics: System Design, Caching, Distributed Systems, Reliability

Problem Statement

Design a large-scale, low-latency cache service that sits between an application layer and a backend database. The cache must handle high throughput and provide fast reads while maintaining reasonable consistency.

Background & Constraints

Scale parameters:

10 datacenters spread geographically
100 billion entries in the backend database
10% of entries are "hot" (account for 99% of traffic)
10 million peak QPS (queries per second) globally
Target latency: 25ms at the 99th percentile

Backend database:

Average read latency: 500ms
Maximum throughput: 2 million QPS

Data characteristics:

Key size: 100 bytes
Value size: 1 KB average
Total hot data: ~10 billion entries x 1.1 KB = ~11 TB

Available hardware (per cache server):

32 CPU cores
128 GB RAM
10 x 6 TiB HDD

Examples

Typical request flow:

Client → Cache Lookup
  ├── Cache Hit  → Return value (< 5ms)
  └── Cache Miss → Query DB (500ms) → Store in cache → Return value

Capacity estimation:

Hot data size:      ~10B entries x 1.1 KB ≈ 11 TB
RAM per server:     128 GB usable for cache ≈ 100 GB effective
Servers needed:     11 TB / 100 GB ≈ 110 servers (for one copy)
Per-datacenter:     110 / 10 = 11 servers minimum per datacenter

Hints & Common Pitfalls

Data Partitioning

Consistent hashing: Distribute keys across cache servers using consistent hashing to minimize redistribution when servers are added or removed.
Replication factor: Store each key on R servers for availability. R=2 or R=3 is typical.
Hot key handling: Some keys may be disproportionately hot. Consider replicating hot keys to more servers or using a client-side local cache.

Cache Policies

Eviction: LRU (Least Recently Used) is the standard starting point. Consider LFU (Least Frequently Used) or a hybrid for workloads with scan patterns.
TTL (Time to Live): Set a maximum age for cached entries to bound staleness.
Write policy: Write-through (update cache and DB together) vs. write-back (update cache, async write to DB) vs. cache-aside (application manages cache explicitly).

Latency Considerations

Intra-datacenter network round trip: ~1ms
Cross-datacenter network round trip: 50-200ms
Disk read: 5-10ms (SSD) or 10-20ms (HDD)
Implication: RAM-only cache is essential for the 25ms target. Disk can serve as a second-level cache.

Common Mistakes

Forgetting to account for metadata overhead (hash table, pointers, etc.) — real usable memory is ~60-80% of RAM.
Ignoring thundering herd: when a popular key expires, hundreds of requests simultaneously hit the database. Use request coalescing or short-lived locks.
Assuming network is reliable between datacenters.

Follow-Up Questions

Datacenter failure: One of the 10 datacenters goes completely offline. How does the system handle this? How do you redirect traffic without exceeding the capacity of remaining datacenters or the backend database?
Network partition: Two datacenters become isolated from each other but both remain operational. How do you handle reads and writes? What consistency guarantees can you provide? Discuss the CAP theorem tradeoffs.
Full outage and recovery: The entire cache layer goes down and needs to restart cold. The database can only handle 2M QPS but the application needs 10M QPS. How do you warm the cache without overwhelming the database? Consider graduated traffic shifting and priority-based cache warming.
Multi-level cache: Would adding a client-side (L1) cache in front of the distributed cache (L2) help? What are the benefits and risks? How do you handle invalidation across two cache levels?
Cache efficiency metrics: How do you measure and improve cache hit rate? What monitoring would you put in place? How do you detect and handle cache pollution (large scans evicting useful data)?
Consistency: If a value is updated in the database, how quickly must the cache reflect the change? Discuss eventual consistency vs. strong consistency approaches and their performance implications.