Redis Patterns for System Design Interviews
TL;DR
Knowing Redis data structures is step one. Knowing the seven patterns that show up repeatedly in interviews — distributed locks, rate limiters, session stores, cache-aside, pub/sub, leaderboards, and distributed counters — is what actually gets you the offer. This lesson covers each pattern with the exact Redis commands, the trade-offs, and the "gotcha" the interviewer is hoping you'll mention.

Pattern 1: Distributed Lock
You have multiple servers. Two of them try to process the same order at the same time. One of them needs to win. That's a distributed lock.
The Basic Pattern
# Acquire lock
SET lock:order:5543 "worker-7-uuid" NX EX 30
# NX = only set if the key doesn't exist (atomic test-and-set)
# EX = auto-expire after 30 seconds (lease timeout)
# Returns "OK" if acquired, nil if someone else holds it
# Do your work...
# Release lock (only if you still own it — use Lua for atomicity)
EVAL "if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end" 1 lock:order:5543 "worker-7-uuid"
Why the Lua script for release? Because GET + DEL is two operations. Between them, your lock could expire and another worker could acquire it. Then your DEL deletes their lock. The Lua script runs atomically inside Redis — no interleaving possible.
Why include the worker UUID? So you only release your own lock. Without it, Worker A could release Worker B's lock.
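To make the ownership check concrete, here is a minimal in-memory sketch of the acquire and release logic. FakeRedis, set_nx_ex, and release_if_owner are hypothetical names that stand in for SET NX EX and the Lua release script; this is a model of the behavior, not a real client.

```python
import uuid

class FakeRedis:
    """Tiny in-memory stand-in for the two operations above (not a real client)."""
    def __init__(self):
        self.data = {}   # key -> (value, expires_at)
        self.now = 0.0   # manually advanced clock, in seconds

    def _live(self, key):
        entry = self.data.get(key)
        if entry and entry[1] <= self.now:   # lease has expired
            del self.data[key]
            entry = None
        return entry

    def set_nx_ex(self, key, value, ttl):
        # Models: SET key value NX EX ttl
        if self._live(key):
            return False
        self.data[key] = (value, self.now + ttl)
        return True

    def release_if_owner(self, key, value):
        # Models the Lua script: delete only if the stored value is ours
        entry = self._live(key)
        if entry and entry[0] == value:
            del self.data[key]
            return 1
        return 0

r = FakeRedis()
key = "lock:order:5543"
worker_a, worker_b = str(uuid.uuid4()), str(uuid.uuid4())

got_a = r.set_nx_ex(key, worker_a, 30)         # A acquires
got_b = r.set_nx_ex(key, worker_b, 30)         # B is refused while A holds it
r.now = 31                                     # A's 30-second lease runs out
got_b_retry = r.set_nx_ex(key, worker_b, 30)   # B acquires legitimately
released_by_a = r.release_if_owner(key, worker_a)  # A's late release is a no-op
released_by_b = r.release_if_owner(key, worker_b)  # B releases its own lock
```

Without the value comparison, Worker A's late release would have deleted Worker B's lock.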
The Redlock Controversy
What if your single Redis node crashes while holding the lock? The lock vanishes. Two workers now think they have it.
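Antirez (Redis creator) proposed Redlock: try to acquire the lock on all N independent Redis nodes, using the same key and a unique value on each. If you get a majority (at least N/2+1) before the lock's validity time runs out, you hold the lock. If you don't, release it on every node and retry after a random delay.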
Redlock with 5 independent Redis nodes:
Worker tries to acquire lock on all 5 nodes:
Node 1: ✅ acquired
Node 2: ✅ acquired
Node 3: ✅ acquired ← 3/5 = majority, lock is held
Node 4: ❌ timeout
Node 5: ❌ failed
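The majority rule can be sketched in a few lines. This is a simplified model, assuming each node is represented by a hypothetical callable that returns True when it grants the lock; the real algorithm also subtracts a clock-drift allowance and releases the lock on every node when the attempt fails, both of which are left out here.

```python
import time

def try_redlock(nodes, ttl_ms, clock=time.monotonic):
    """nodes: list of callables, each returning True if that node granted the lock."""
    start = clock()
    granted = 0
    for acquire in nodes:
        try:
            if acquire():
                granted += 1
        except Exception:
            pass  # treat errors and timeouts as "not acquired"
    elapsed_ms = (clock() - start) * 1000
    quorum = len(nodes) // 2 + 1          # 3 out of 5
    validity_ms = ttl_ms - elapsed_ms     # time left on the lease after acquiring
    return granted >= quorum and validity_ms > 0

def ok():
    return True

def refused():
    return False

def timeout():
    raise TimeoutError("node unreachable")

held = try_redlock([ok, ok, ok, timeout, refused], ttl_ms=30_000)          # 3/5
not_held = try_redlock([ok, ok, refused, timeout, refused], ttl_ms=30_000)  # 2/5
```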
Then Martin Kleppmann published "How to do distributed locking" — one of the most cited distributed systems blog posts ever. His argument: Redlock doesn't actually work because of clock drift and process pauses.
Consider this scenario:
- Worker A acquires the Redlock.
- Worker A gets paused by a garbage collection event for 35 seconds.
- The lock expires (30-second TTL).
- Worker B acquires the lock legitimately.
- Worker A wakes up, doesn't realize the lock expired, and proceeds as if it still holds it.
- Both workers are now inside the critical section.
Kleppmann's fix: use fencing tokens. Every lock acquisition returns a monotonically increasing token. The storage system (database, queue) rejects operations with a token older than the one it last saw. This requires the downstream system to participate in the locking protocol — Redis alone isn't enough.
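Here is a small sketch of the fencing idea, replaying the paused-worker scenario. LockService and FencedStore are hypothetical stand-ins for the lock provider and the downstream storage system; the point is only that the store, not the worker, decides whether a write is stale.

```python
class LockService:
    """Hypothetical lock provider that hands out increasing tokens."""
    def __init__(self):
        self.counter = 0

    def acquire(self):
        self.counter += 1
        return self.counter

class FencedStore:
    """Hypothetical downstream store that remembers the highest token it has seen."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False          # older token: a stale lock holder, reject
        self.highest_token = token
        self.value = value
        return True

locks = LockService()
store = FencedStore()

token_a = locks.acquire()   # Worker A gets token 1, then pauses for 35 seconds
token_b = locks.acquire()   # lease expired, Worker B gets token 2
b_ok = store.write(token_b, "written by B")
a_ok = store.write(token_a, "written by A")   # A wakes up with a stale token
```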
My take? For most applications, the single-node SET NX EX pattern is fine. You accept that a Redis crash means a brief window of no lock enforcement. If you need ironclad mutual exclusion, use a coordination service like ZooKeeper or etcd — they're designed for exactly this. Redlock is an uncomfortable middle ground: more complexity than a single node, less safety than a proper consensus system.
When to Use Distributed Locks
- Order processing (prevent double-charging)
- Cron job coordination (only one server runs the daily report)
- Resource reservation (only one user can edit this document)
- Cache stampede prevention (only one worker repopulates the cache)
Pattern 2: Rate Limiter
"Design a rate limiter" is one of the top 5 system design interview questions. Redis gives you three approaches, each with different precision-complexity trade-offs.
Approach 1: Fixed Window with INCR
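# Allow 100 requests per minute per user
# Key format: rate:user:{user_id}:{window_start_timestamp}
INCR rate:user:1001:1713400020
# Returns the current count
# If the count is 1, this is the first request in the window, so set the TTL:
EXPIRE rate:user:1001:1713400020 60
# Auto-cleanup after the window
# INCR and EXPIRE are separate commands; wrap them in MULTI/EXEC or a Lua script
# so a crash between them cannot leave a counter without a TTL
# In application code:
# if count > 100: reject request (HTTP 429)
# else: allow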
Problem: The boundary burst. A user sends 100 requests at 0:59 and 100 more at 1:00. They've sent 200 requests in 2 seconds, but each window only sees 100. The fixed window says "allowed."
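The boundary burst is easy to reproduce with a pure-Python model of the counter. This sketch assumes the same 100-per-minute limit; the Counter keyed by window start plays the role of the per-window Redis keys.

```python
from collections import Counter

LIMIT = 100
WINDOW = 60  # seconds

def fixed_window_allowed(timestamps):
    """Return how many requests a fixed-window counter would let through."""
    counts = Counter()
    allowed = 0
    for ts in timestamps:
        window_start = int(ts // WINDOW) * WINDOW   # same role as the key suffix
        counts[window_start] += 1                   # INCR
        if counts[window_start] <= LIMIT:
            allowed += 1
    return allowed

# 100 requests during second 59, then 100 more during second 60 (the next window)
burst = [59.0 + i / 1000 for i in range(100)] + [60.0 + i / 1000 for i in range(100)]
allowed = fixed_window_allowed(burst)
span_seconds = max(burst) - min(burst)
```

All 200 requests pass even though they arrive within about one second of each other.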
Approach 2: Sliding Window with Sorted Sets
This is the pattern interviewers want to see.
# For each request (run steps 1-4 atomically, e.g. in MULTI/EXEC or a Lua script):
# 1. Add the request timestamp to a sorted set
ZADD rate:user:1001 1713400025.123 "req:uuid-abc"
# 2. Remove entries older than the window (60 seconds ago)
ZREMRANGEBYSCORE rate:user:1001 0 1713399965.123
# 3. Count remaining entries
ZCARD rate:user:1001
# If count > 100: reject (HTTP 429)
# Note: the request was recorded in step 1, so rejected requests also count toward the window
# 4. Set expiry on the key itself (cleanup if user goes idle)
EXPIRE rate:user:1001 60
No boundary burst problem. The window slides with each request. The cost is O(log N) per operation instead of O(1), and you store every request ID in the window — higher memory usage.
For a user limited to 100 requests/minute, you're storing at most 100 entries per user. That's fine. For a global rate limiter handling millions of entries per window, this gets expensive.
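The same three steps can be modelled in plain Python, with a sorted list per user standing in for the sorted set. SlidingWindowLimiter is a hypothetical name, and the sketch ignores atomicity because it runs in a single thread.

```python
import bisect

class SlidingWindowLimiter:
    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.log = {}  # user_id -> sorted list of request timestamps

    def allow(self, user_id, now):
        entries = self.log.setdefault(user_id, [])
        bisect.insort(entries, now)                               # ZADD
        cutoff = bisect.bisect_right(entries, now - self.window)
        del entries[:cutoff]                                      # ZREMRANGEBYSCORE 0 (now - window)
        return len(entries) <= self.limit                         # ZCARD check

limiter = SlidingWindowLimiter(limit=100, window=60.0)
# The burst that slips past a fixed window: 100 requests at 0:59, 100 more at 1:00
burst = [59.0 + i / 1000 for i in range(100)] + [60.0 + i / 1000 for i in range(100)]
allowed = sum(limiter.allow("user:1001", ts) for ts in burst)
allowed_later = limiter.allow("user:1001", 200.0)   # old entries have slid out
```

Only the first 100 requests pass; the second 100 fall inside the same sliding 60 seconds and are rejected.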
Approach 3: Token Bucket with Lua Script
The token bucket is the most flexible: it supports burst allowances and smooth refill rates. But it requires a Lua script to be atomic.
-- Lua script for token bucket
-- KEYS[1] = rate limit key
-- ARGV[1] = max tokens (bucket capacity)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp
-- ARGV[4] = tokens to consume (usually 1)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local consume = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or max_tokens
local last_refill = tonumber(bucket[2]) or now
-- Calculate tokens to add since last refill
-- (clamp at 0 in case a caller's clock is behind the stored timestamp)
local elapsed = math.max(0, now - last_refill)
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)
local allowed = 0
if new_tokens >= consume then
  new_tokens = new_tokens - consume
  allowed = 1
end
-- Persist the bucket state on both paths (HSET replaces the deprecated HMSET)
redis.call('HSET', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
return allowed -- 1 = allowed, 0 = rejected
# Usage: allow 10 requests/sec with burst up to 20
EVAL <script> 1 rate:user:1001 20 10 1713400025.5 1
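If the Lua is hard to follow, the same refill arithmetic reads more easily as a plain Python class. This is a single-process sketch of the logic only (TokenBucket is a hypothetical name); it has none of the atomicity the Lua script gets from running inside Redis.

```python
class TokenBucket:
    def __init__(self, max_tokens, refill_rate):
        self.max_tokens = max_tokens      # bucket capacity (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(max_tokens)   # a new bucket starts full
        self.last_refill = None

    def allow(self, now, consume=1):
        if self.last_refill is None:
            self.last_refill = now
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.max_tokens, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= consume:
            self.tokens -= consume
            return True
        return False

# 10 requests/sec with burst up to 20, as in the usage line above
bucket = TokenBucket(max_tokens=20, refill_rate=10)
burst = sum(bucket.allow(now=0.0) for _ in range(25))              # drains the bucket
after_half_second = sum(bucket.allow(now=0.5) for _ in range(25))  # 0.5 s * 10 = 5 new tokens
```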
Rate Limiter Comparison
| Approach | Precision | Memory per User | Complexity | Boundary Burst? |
|---|---|---|---|---|
| Fixed window (INCR) | Low | 1 key | Trivial | Yes |
| Sliding window (sorted set) | High | N entries per window | Medium | No |
| Token bucket (Lua) | High | 1 hash (2 fields) | High (Lua script) | No (built-in burst control) |
For interviews, I'd lead with the sliding window sorted set approach. It's the most intuitive to explain on a whiteboard, and interviewers can clearly see how the window slides. Mention the token bucket as a follow-up if they push on burst handling.
Pattern 3: Session Store
Why do so many production web apps use Redis for sessions? Because the alternative, database-backed sessions, is slower and needs more upkeep.
# Store session on login
SET session:abc123xyz '{
"user_id": 1001,
"email": "alice@example.com",
"role": "admin",
"login_at": 1713400000
}' EX 86400
# Expires in 24 hours
# On every request: check session
GET session:abc123xyz
# Returns the JSON blob or nil (expired/invalid)
# On logout: destroy session
DEL session:abc123xyz
# Extend session on activity (sliding expiration)
EXPIRE session:abc123xyz 86400
Why Redis beats a database for sessions:
| Dimension | Database Sessions | Redis Sessions |
|---|---|---|
| Latency | 2-10 ms (disk I/O, query parsing) | 0.1-0.5 ms (in-memory) |
| Cleanup | Cron job to delete expired rows | TTL handles it automatically |
| Scale | Adds load to your primary DB | Separate infrastructure, horizontally scalable |
| Schema | Need a sessions table, migrations | No schema, just SET/GET |
| Sharing across services | Need shared DB access | Any service can hit Redis |
When Heroku migrated their dashboard sessions from PostgreSQL to Redis, they cut session lookup latency by 10x and removed the cron job that was sweeping expired sessions every 5 minutes.
Gotcha
Don't store sensitive data directly in the session value. If your Redis instance is compromised (no authentication, exposed to the internet, which happens more often than you'd think), every active session is readable. Hand the client only an opaque, random session ID, keep the session value minimal (user ID, role, timestamps), and leave secrets such as passwords or payment details in your primary database.
Pattern 4: Cache-Aside (Lazy Loading)

This is the most common caching pattern and the one interviewers expect you to know cold.
Read Path
Client → App Server → Check Redis
                          │
                    ┌─────┴─────┐
                    │           │
                  Cache       Cache
                   HIT        MISS
                    │           │
                    │     Query Database
                    │           │
                    │     Write result to Redis
                    │     SET key value EX 300
                    │           │
                    ▼           ▼
                  Return data to client
# Pseudocode for cache-aside read
GET user:1001
# => nil (cache miss)
# Query database
# SELECT * FROM users WHERE id = 1001
# => {name: "Alice", plan: "premium"}
# Populate cache with TTL
SET user:1001 '{"name":"Alice","plan":"premium"}' EX 300
# Next request:
GET user:1001
# => '{"name":"Alice","plan":"premium"}' (cache hit!)
Write Path — Invalidation, Not Update
# User updates their profile in the database
# UPDATE users SET plan = 'free' WHERE id = 1001
# Invalidate the cache entry
DEL user:1001
# Do NOT try to update the cache here
# Next read will cache-miss, query DB, and repopulate with fresh data
Why invalidate instead of update? Because update introduces race conditions.
Consider two concurrent writes:
Timeline:
T1: Worker A updates DB → plan = "premium"
T2: Worker B updates DB → plan = "free"
T3: Worker B updates cache → plan = "free"
T4: Worker A updates cache → plan = "premium" ← STALE!
The database says "free" but the cache says "premium." With invalidation, the cache simply gets deleted, and the next read fetches the correct value from the database.
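Put together, the read and write paths look roughly like this. It's a sketch with plain dicts standing in for the users table and for Redis (TTL omitted); get_user and update_plan are hypothetical helper names.

```python
import json

db = {1001: {"name": "Alice", "plan": "premium"}}   # stand-in for the users table
cache = {}                                          # stand-in for Redis (TTL omitted)
db_reads = 0

def get_user(user_id):
    global db_reads
    key = f"user:{user_id}"
    cached = cache.get(key)                 # GET user:1001
    if cached is not None:
        return json.loads(cached)           # cache hit
    db_reads += 1
    row = dict(db[user_id])                 # SELECT * FROM users WHERE id = ...
    cache[key] = json.dumps(row)            # SET user:1001 ... EX 300
    return row

def update_plan(user_id, plan):
    db[user_id]["plan"] = plan              # UPDATE users SET plan = ...
    cache.pop(f"user:{user_id}", None)      # DEL user:1001 (invalidate, don't update)

first = get_user(1001)       # miss: reads the database and fills the cache
second = get_user(1001)      # hit: served from the cache
update_plan(1001, "free")    # write path: update DB, then delete the cache entry
third = get_user(1001)       # miss again: picks up the fresh value
```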
Cache Stampede (Thundering Herd)
A popular cache key expires. 1,000 concurrent requests all see a cache miss. All 1,000 hit the database simultaneously. The database buckles.
Solutions:
# 1. Lock-based repopulation
SET lock:cache:user:1001 "worker-3" NX EX 5
# If acquired: query DB, populate cache, release lock
# If not acquired: wait briefly, retry GET
# 2. Stale-while-revalidate (probabilistic early expiration)
# Set cache TTL to 350 seconds, but embed the "logical" expiry at 300 seconds in the value
SET user:1001 '{"data":{...},"expires_at":1713400300}' EX 350
# When a reader sees expires_at is in the past:
# - With 10% probability: refresh the cache in the background
# - With 90% probability: return stale data (still within the 350s hard TTL)
Facebook's "Scaling Memcache at Facebook" paper describes how they handle cache stampedes at scale: a lease mechanism where the first client to miss gets a "lease" (a token granting permission to populate the cache), and subsequent clients are told to wait briefly and retry instead of hitting the database.
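Lock-based repopulation can be modelled inside one process with threads. In this sketch a threading.Lock is an assumed stand-in for the SET ... NX EX lock key, and waiting on the lock replaces "wait briefly, retry GET"; a real deployment would coordinate through Redis across servers.

```python
import threading

cache = {}
db_hits = 0
repopulate_lock = threading.Lock()   # in-process stand-in for SET lock:cache:... NX EX 5

def load_from_db(key):
    global db_hits
    db_hits += 1                     # only ever called while holding the lock
    return f"value-for-{key}"

def get(key):
    value = cache.get(key)
    if value is not None:
        return value                 # cache hit
    with repopulate_lock:            # one worker repopulates, the rest wait here
        value = cache.get(key)       # re-check: another worker may have filled it
        if value is None:
            value = load_from_db(key)
            cache[key] = value
    return value

# 100 concurrent readers all arrive while the key is missing
threads = [threading.Thread(target=get, args=("user:1001",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the threads interleave, the re-check inside the lock means the database is queried once.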
Pattern 5: Pub/Sub — Fire and Forget
Redis Pub/Sub lets you broadcast messages to all subscribers of a channel. No persistence, no acknowledgment, no replay.
# Subscriber (in one Redis connection)
SUBSCRIBE notifications:user:1001
# Blocks and waits for messages
# Publisher (in another connection)
PUBLISH notifications:user:1001 '{"type":"new_message","from":"Bob"}'
# Returns the number of subscribers that received it
# Pattern-based subscription
PSUBSCRIBE notifications:*
# Receives messages on ANY notifications:* channel
When Pub/Sub Works
- Real-time notifications to connected clients (via WebSocket bridge)
- Cache invalidation across app servers ("key X was updated, invalidate your local copy")
- Chat messages in a small-scale system
- Configuration updates ("feature flag changed, reload")
When Pub/Sub Breaks
Here's the thing people miss: if nobody is listening, the message is gone. Redis Pub/Sub has no queue, no buffer, no replay. If your subscriber disconnects for 5 seconds and 3 messages are published during that time, those messages are lost forever.
If the subscriber is slow and can't keep up with the publish rate, Redis buffers messages in memory. If the buffer exceeds client-output-buffer-limit, Redis kills the subscriber connection. This happened at scale to several companies running Pub/Sub for real-time chat before they switched to Kafka or Redis Streams.
| Feature | Redis Pub/Sub | Redis Streams | Kafka |
|---|---|---|---|
| Persistence | No | Yes | Yes |
| Replay from history | No | Yes | Yes |
| Consumer groups | No | Yes | Yes |
| Delivery guarantee | At-most-once | At-least-once | At-least-once / Exactly-once |
| Backpressure handling | Kill slow subscriber | Consumer controls pace | Consumer controls pace |
| Throughput | High | High | Very high |
My recommendation: Use Pub/Sub only for ephemeral broadcasts where losing messages is acceptable. For anything that needs durability, use Redis Streams or Kafka.
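The fire-and-forget behavior is easy to see in a toy broker. This PubSub class is a hypothetical in-memory model, not the Redis client API: it delivers only to inboxes that are subscribed at publish time and keeps nothing for later.

```python
from collections import defaultdict

class PubSub:
    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> list of inboxes

    def subscribe(self, channel):
        inbox = []
        self.subscribers[channel].append(inbox)
        return inbox

    def unsubscribe(self, channel, inbox):
        # Compare by identity: two empty inboxes are equal but not the same subscriber
        self.subscribers[channel] = [i for i in self.subscribers[channel] if i is not inbox]

    def publish(self, channel, message):
        receivers = self.subscribers[channel]
        for inbox in receivers:
            inbox.append(message)
        return len(receivers)   # like PUBLISH: number of subscribers reached

bus = PubSub()
channel = "notifications:user:1001"
inbox = bus.subscribe(channel)
delivered = bus.publish(channel, "m1")   # one subscriber is listening
bus.unsubscribe(channel, inbox)          # subscriber disconnects
dropped = bus.publish(channel, "m2")     # nobody listening: the message is gone
inbox_after_reconnect = bus.subscribe(channel)   # no replay on reconnect
```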
Pattern 6: Leaderboard
Gaming leaderboards, sales rankings, top-N lists — sorted sets handle all of these.
# Add or update scores
ZADD game:leaderboard 1500 "player:alice"
ZADD game:leaderboard 2300 "player:bob"
ZADD game:leaderboard 1800 "player:charlie"
ZADD game:leaderboard 2100 "player:dave"
ZADD game:leaderboard 1950 "player:eve"
# Top 3 (highest scores)
ZREVRANGE game:leaderboard 0 2 WITHSCORES
# => [("player:bob", 2300), ("player:dave", 2100), ("player:eve", 1950)]
# "What's my rank?" (0-indexed)
ZREVRANK game:leaderboard "player:charlie"
# => 3 (4th place)
# Score update — player earns 200 more points
ZINCRBY game:leaderboard 200 "player:charlie"
# charlie is now at 2000, rank updates automatically
# Range query: all players with score between 1800 and 2200
ZRANGEBYSCORE game:leaderboard 1800 2200 WITHSCORES
# Total players on the leaderboard
ZCARD game:leaderboard
# => 5
Scaling Leaderboards
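A sorted set with 10 million members uses roughly 0.8-1 GB (about 78 bytes per member by the estimate later in this lesson, plus allocator overhead), and ZREVRANK runs in O(log N): about 23 comparisons. That's small and fast enough for a single-node deployment.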
But what if you need a leaderboard across shards? ZREVRANK only works within a single sorted set. If your players are distributed across 3 Redis Cluster nodes, you can't do a global rank query.
Options:
- Store the entire leaderboard on one node. If it fits in memory (and 10M entries at under 1 GB usually does), keep it simple.
- Periodic aggregation. Each shard maintains a local leaderboard. A background job merges them into a global leaderboard every few seconds. Ranks are slightly stale but close enough for most games.
- Hierarchy. Top 1,000 players live in a "global" sorted set. The remaining millions are in sharded sets. Rank within the top 1,000 is exact; rank below that is approximate.
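The periodic aggregation option boils down to a merge job. Here is a minimal sketch, assuming each shard's leaderboard has already been read into a dict and each player lives on exactly one shard; merge_shards and global_rank are hypothetical helper names.

```python
import heapq

def merge_shards(shards):
    """shards: list of dicts mapping player -> score (each player on one shard)."""
    merged = {}
    for shard in shards:
        merged.update(shard)
    return merged

def top_n(merged, n=3):
    return heapq.nlargest(n, merged.items(), key=lambda item: item[1])

def global_rank(merged, player):
    """0-indexed rank, highest score first (same convention as ZREVRANK)."""
    score = merged[player]
    return sum(1 for other in merged.values() if other > score)

shards = [
    {"player:alice": 1500, "player:bob": 2300},
    {"player:charlie": 1800, "player:dave": 2100},
    {"player:eve": 1950},
]
merged = merge_shards(shards)
top3 = top_n(merged)
charlie_rank = global_rank(merged, "player:charlie")
```

In practice the job would write the merged result back into a single global sorted set, so rank queries stay a single ZREVRANK.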
Riot Games (League of Legends) uses a Redis-based leaderboard system for ranked matchmaking. They keep the active ranked player set in Redis and archive historical data to a database.
Pattern 7: Distributed Counter
INCR is atomic. A single Redis node handles hundreds of thousands of INCR operations per second. For a single counter, it's the fastest option available.
# Simple atomic counter
INCR page:views:/pricing
INCR page:views:/pricing
INCR page:views:/pricing
GET page:views:/pricing
# => "3"
# Counter with expiration (resets daily)
INCR daily:signups:2024-04-18
EXPIRE daily:signups:2024-04-18 172800 # keep for 2 days
The Sharding Problem
What happens when you need to count across a Redis Cluster? INCR only works on a single key, which lives on a single node. Two options:
Option 1: Single key, route all increments there. Works until the single node can't keep up with the write rate. At ~300K INCR/sec per node, this is enough for most applications.
Option 2: Sharded counters. Split the counter across N sub-keys, read by summing them.
# Write: pick a random shard (0-7)
INCR counter:views:shard:3 # no shared hash tag, so the sub-keys hash to different slots and can land on different nodes
# Read: fetch every shard (one GET per key; in Cluster mode a single MGET across slots fails with a CROSSSLOT error)
GET counter:views:shard:0
GET counter:views:shard:1
...
GET counter:views:shard:7
# Sum the results in application code
This trades read complexity for write throughput. Instagram uses a variant of this for like counts — they batch increments and flush to the database periodically rather than incrementing Redis on every single like.
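Here is the write-random, read-sum idea as a runnable sketch. A dict stands in for the eight Redis sub-keys, and increment and read_total are hypothetical helper names.

```python
import random

NUM_SHARDS = 8
shards = {f"counter:views:shard:{i}": 0 for i in range(NUM_SHARDS)}

def increment(rng=random):
    key = f"counter:views:shard:{rng.randrange(NUM_SHARDS)}"
    shards[key] += 1              # INCR on one randomly chosen sub-key

def read_total():
    return sum(shards.values())   # sum of the per-shard GETs

for _ in range(10_000):
    increment()
total = read_total()
```

Every write touches one sub-key; every read touches all eight. That's the trade.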
When NOT to Use Redis
Knowing when to reach for Redis is important. Knowing when not to is what separates senior engineers from mid-levels.
| Situation | Why Not Redis | Use Instead |
|---|---|---|
| Data larger than RAM | Redis stores everything in memory. 100 GB dataset = 100 GB RAM. | PostgreSQL, MongoDB, DynamoDB |
| Strong consistency required | Async replication means followers can serve stale reads. | PostgreSQL, CockroachDB, Spanner |
| Complex queries | No JOINs, no WHERE clauses, no aggregation pipeline. | PostgreSQL, Elasticsearch |
| Multi-key transactions across slots | Cluster mode only supports transactions within a single hash slot. | PostgreSQL (full ACID) |
| Long-term storage | Redis is meant for hot data. Storing 5 years of logs in RAM is wasteful. | S3, BigQuery, ClickHouse |
| Data you can't afford to lose | Even with AOF, there's a data loss window. | PostgreSQL with synchronous replication |
Eviction Policies — What Happens When Memory Is Full
When Redis hits maxmemory, it needs to decide which keys to evict. The policy you choose defines the behavior.
| Policy | Evicts From | Strategy | Best For |
|---|---|---|---|
| noeviction | Nothing (returns errors on writes) | Refuse new data | When you'd rather fail loudly than lose data |
| allkeys-lru | All keys | Least Recently Used | General-purpose caching (default choice) |
| allkeys-lfu | All keys | Least Frequently Used | Cache where popularity matters (hot content) |
| volatile-lru | Keys with TTL set | LRU among expiring keys | Mix of cache (TTL) and persistent data (no TTL) |
| volatile-lfu | Keys with TTL set | LFU among expiring keys | Same, but frequency-based |
| allkeys-random | All keys | Random eviction | When access patterns are truly uniform |
| volatile-random | Keys with TTL set | Random among expiring keys | Rarely useful |
| volatile-ttl | Keys with TTL set | Shortest TTL first | When you want near-expiry keys evicted first |
I'd pick allkeys-lru for 90% of caching use cases. It handles the common pattern well: frequently accessed keys stay, cold keys get evicted. Switch to allkeys-lfu if you have a workload where a key is accessed in a burst (appearing "recent") but isn't actually popular — LFU handles that better.
The Twitter caching team switched from LRU to LFU for their timeline cache because "scan pollution" (background jobs touching every key) was evicting hot user timeline data. LFU's frequency tracking prevented rarely-accessed-but-recently-scanned keys from displacing popular ones.
Gotcha
volatile-* policies only evict keys that have a TTL set. If all your keys are persistent (no TTL), volatile policies behave like noeviction — Redis will return errors when memory is full. This catches people who set volatile-lru but forget to set TTLs on their cache keys.
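The gotcha shows up clearly in a toy victim picker. This sketch models exact LRU over a dict of key metadata (real Redis approximates LRU by sampling a few keys), and pick_victim is a hypothetical function covering only noeviction, allkeys-lru, and volatile-lru.

```python
def pick_victim(keys, policy):
    """keys: dict name -> {"last_access": float, "ttl": float or None}.
    Returns the key an exact LRU would evict, or None if nothing is eligible."""
    if policy == "noeviction":
        return None
    candidates = keys
    if policy.startswith("volatile-"):
        # Only keys with a TTL are eligible under volatile-* policies
        candidates = {k: v for k, v in keys.items() if v["ttl"] is not None}
    if not candidates:
        return None   # nothing to evict: Redis would reject writes with an error
    return min(candidates, key=lambda k: candidates[k]["last_access"])

no_ttl_keys = {
    "user:1": {"last_access": 1.0, "ttl": None},
    "user:2": {"last_access": 5.0, "ttl": None},
}
lru_victim = pick_victim(no_ttl_keys, "allkeys-lru")        # oldest access goes
volatile_victim = pick_victim(no_ttl_keys, "volatile-lru")  # no TTLs, nothing eligible

mixed_keys = dict(no_ttl_keys, **{"cache:9": {"last_access": 9.0, "ttl": 60.0}})
volatile_victim_mixed = pick_victim(mixed_keys, "volatile-lru")
```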
Memory Estimation — Back-of-Envelope Math
Interviewers love asking "how much Redis memory do you need for this?" Here's how to estimate.
Per-Key Overhead
Every key in Redis has overhead beyond the value itself:
| Component | Size |
|---|---|
| Key pointer (dictEntry) | ~64 bytes |
| Key string (SDS header + content) | ~50 bytes + key length |
| Value object (redisObject) | ~16 bytes |
| Expiry (if TTL set) | ~16 bytes |
| Total overhead per key | ~130-150 bytes |
Estimation Examples
1. Session store: 1 million active sessions
Key: "session:abc123xyz" ~30 bytes
Value: JSON blob ~200 bytes
Overhead per key: ~150 bytes
Total per entry: ~380 bytes
1,000,000 × 380 bytes = ~380 MB
2. Leaderboard: 10 million players in a sorted set
One sorted set key (overhead): ~150 bytes
Each member (skip list entry):
- Member string: ~30 bytes
- Score (float): 8 bytes
- Skip list node pointers: ~40 bytes
Total per member: ~78 bytes
10,000,000 × 78 bytes = ~780 MB
Plus overhead: ~150 bytes (negligible)
Total: ~780 MB
3. Rate limiter: 500K users, 100 entries per sorted set window
500,000 sorted set keys × (150 bytes overhead + 100 × 78 bytes)
= 500,000 × (150 + 7,800)
= 500,000 × 7,950 bytes
= ~3.98 GB
Interview Tip
Always round up by 30-50% for fragmentation and internal allocator overhead. If your estimate says 4 GB, tell the interviewer "I'd provision a 6 GB Redis instance to account for memory fragmentation and peak spikes."
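The three estimates and the headroom rule fit in a tiny calculator. The constants are the lesson's rough figures, not measured values, and the function names are hypothetical.

```python
KEY_OVERHEAD = 150           # bytes per top-level key (rough figure from the table above)
ZSET_MEMBER = 30 + 8 + 40    # member string + score + skip list pointers = 78 bytes

def sessions_bytes(n, key_len=30, value_len=200):
    return n * (key_len + value_len + KEY_OVERHEAD)

def leaderboard_bytes(members):
    return KEY_OVERHEAD + members * ZSET_MEMBER

def rate_limiter_bytes(users, entries_per_user):
    return users * (KEY_OVERHEAD + entries_per_user * ZSET_MEMBER)

def provision(estimate_bytes, headroom=0.5):
    """Add 30-50% for fragmentation and allocator overhead (50% by default)."""
    return estimate_bytes * (1 + headroom)

sessions = sessions_bytes(1_000_000)               # ~380 MB
leaderboard = leaderboard_bytes(10_000_000)        # ~780 MB
rate_limiter = rate_limiter_bytes(500_000, 100)    # ~3.98 GB
to_provision = provision(4_000_000_000)            # 4 GB estimate -> 6 GB instance
```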
The Complete Interview Playbook
When Redis comes up in a system design interview, walk through this mental checklist:
1. Is this data hot (frequently accessed)?
└── No → Use a database directly
└── Yes ↓
2. Can I tolerate losing this data?
└── No → Redis as cache (cache-aside), DB is source of truth
└── Yes → Redis as primary store (sessions, counters, queues)
3. Which data structure fits?
└── Lookup by key → String or Hash
└── Ranking/scoring → Sorted Set
└── Uniqueness → Set or HyperLogLog
└── Queue/stream → List, Stream, or Pub/Sub
└── Boolean matrix → Bitmap
4. How much memory?
└── Back-of-envelope calculation
└── Add 30-50% for overhead
5. Durability needs?
└── Cache (ephemeral) → No persistence, or RDB for warm restart
└── Sessions → AOF everysec
└── Primary store → Hybrid (RDB + AOF), consider alternatives
6. Availability needs?
└── Can't afford downtime → Sentinel (3+ nodes)
└── Data doesn't fit one node → Cluster (6+ nodes)
7. Eviction policy?
└── General cache → allkeys-lru
└── Popularity matters → allkeys-lfu
└── Mix of cache + persistent → volatile-lru with TTLs
Key Takeaways
| Pattern | Redis Command | Watch Out For |
|---|---|---|
| Distributed lock | SET key val NX EX 30 | Release with Lua (check ownership). Single-node lock is fine for most cases. |
| Rate limiter (sliding) | ZADD + ZREMRANGEBYSCORE + ZCARD | Memory grows with window size. Use fixed window (INCR) for simpler cases. |
| Session store | SET EX + GET | Set TTLs. Don't store secrets in the value. |
| Cache-aside | GET → miss → DB → SET EX | Invalidate on write, don't update. Watch for stampedes. |
| Pub/Sub | PUBLISH + SUBSCRIBE | No persistence. Messages lost if nobody listens. Use Streams instead for durability. |
| Leaderboard | ZADD + ZREVRANK | Single sorted set can't span cluster nodes. Keep it on one node if possible. |
| Distributed counter | INCR | Shard the counter if write throughput exceeds single-node limits. |
| Eviction | maxmemory-policy | allkeys-lru for most caches. volatile-* does nothing without TTLs. |
| Memory sizing | Back-of-envelope | ~150 bytes overhead per key. Add 30-50% for fragmentation. |
Interview Tip
Don't just name the pattern — explain the failure mode. "I'd use a distributed lock with SET NX EX, and to prevent stale locks I'd include a UUID in the value and release with a Lua script that checks ownership." That one sentence shows you understand atomic operations, lock leases, and the release race condition. Three concepts in one breath.