Redis Patterns for System Design Interviews

TL;DR

Knowing Redis data structures is step one. Knowing the seven patterns that show up repeatedly in interviews — distributed locks, rate limiters, session stores, cache-aside, pub/sub, leaderboards, and distributed counters — is what actually gets you the offer. This lesson covers each pattern with the exact Redis commands, the trade-offs, and the "gotcha" the interviewer is hoping you'll mention.

[Figure: Distributed lock acquisition and release flow]

Pattern 1: Distributed Lock

You have multiple servers. Two of them try to process the same order at the same time. One of them needs to win. That's a distributed lock.

The Basic Pattern

# Acquire lock
SET lock:order:5543 "worker-7-uuid" NX EX 30
# NX  = only set if the key doesn't exist (atomic test-and-set)
# EX  = auto-expire after 30 seconds (lease timeout)
# Returns "OK" if acquired, nil if someone else holds it

# Do your work...

# Release lock (only if you still own it — use Lua for atomicity)
EVAL "if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end" 1 lock:order:5543 "worker-7-uuid"

Why the Lua script for release? Because GET + DEL is two operations. Between them, your lock could expire and another worker could acquire it. Then your DEL deletes their lock. The Lua script runs atomically inside Redis — no interleaving possible.

Why include the worker UUID? So you only release your own lock. Without it, Worker A could release Worker B's lock.
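
To make the ownership check concrete, here is a minimal in-memory sketch of the acquire/release semantics in Python. It does not talk to Redis: `FakeLockStore`, its method names, and the injectable `clock` are hypothetical stand-ins chosen for illustration, so the expiry and "only release your own lock" rules can be exercised without a server.

```python
import time


class FakeLockStore:
    """In-memory stand-in for a single Redis node (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (owner_token, expires_at)

    def _live_owner(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        owner, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lease ran out, same effect as EX
            return None
        return owner

    def acquire(self, key, token, ttl_seconds):
        """Mimics SET key token NX EX ttl: succeeds only if the key is absent."""
        if self._live_owner(key) is not None:
            return False
        self._data[key] = (token, self._clock() + ttl_seconds)
        return True

    def release(self, key, token):
        """Mimics the Lua script: delete only if the stored value is ours."""
        if self._live_owner(key) == token:
            del self._data[key]
            return True
        return False
```

If Worker A's lease expires and Worker B acquires the key, A's later `release` returns False and leaves B's lock in place, which is the behavior the UUID check is there to guarantee.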

The Redlock Controversy

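What if your single Redis node crashes while a worker holds the lock? The key vanishes with it (or never reached the replica that gets promoted). A second worker can now acquire the lock while the first still believes it holds it.

Antirez (the creator of Redis) proposed Redlock: try to acquire the lock on N independent Redis nodes (typically 5), and treat it as held only if a majority (N/2+1) grant it before the lease runs out. If you fall short, release on every node and retry.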

Redlock with 5 independent Redis nodes:

Worker tries to acquire lock on all 5 nodes:
  Node 1: ✅ acquired
  Node 2: ✅ acquired
  Node 3: ✅ acquired    ← 3/5 = majority, lock is held
  Node 4: ❌ timeout
  Node 5: ❌ failed
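
The majority rule can be sketched as a small Python function. This is a simplified model, not a client implementation: `redlock_validity` and its parameters are hypothetical names, and the subtraction of elapsed time and a clock-drift allowance reflects my reading of the published Redlock description rather than anything stated in this lesson.

```python
def redlock_validity(acquired_flags, ttl_ms, elapsed_ms, clock_drift_ms=0):
    """Return the remaining lock validity in ms, or 0 if the lock was not obtained.

    acquired_flags: one boolean per Redis node (True = that node granted the lock).
    elapsed_ms: time spent contacting all nodes.
    """
    quorum = len(acquired_flags) // 2 + 1
    wins = sum(1 for ok in acquired_flags if ok)
    validity = ttl_ms - elapsed_ms - clock_drift_ms
    if wins >= quorum and validity > 0:
        return validity
    return 0
```

With the five-node example above, three grants meet the quorum of 3, so the worker holds the lock for whatever is left of the 30-second lease after the acquisition round trip.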

Then Martin Kleppmann published "How to do distributed locking" — one of the most cited distributed systems blog posts ever. His argument: Redlock doesn't actually work because of clock drift and process pauses.

Consider this scenario:

  1. Worker A acquires the Redlock.
  2. Worker A gets paused by a garbage collection event for 35 seconds.
  3. The lock expires (30-second TTL).
  4. Worker B acquires the lock legitimately.
  5. Worker A wakes up, doesn't realize the lock expired, and proceeds as if it still holds it.
  6. Both workers are now inside the critical section.

Kleppmann's fix: use fencing tokens. Every lock acquisition returns a monotonically increasing token. The storage system (database, queue) rejects operations with a token older than the one it last saw. This requires the downstream system to participate in the locking protocol — Redis alone isn't enough.
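
A minimal sketch of the fencing idea in Python, assuming a toy downstream store. `LockService` and `FencedStorage` are hypothetical names; in a real system the token would come from the lock service and the check would live inside the database or queue being protected.

```python
class LockService:
    """Hands out a strictly increasing token with each lock grant."""

    def __init__(self):
        self._counter = 0

    def grant(self):
        self._counter += 1
        return self._counter


class FencedStorage:
    """Toy downstream store that enforces fencing tokens (illustration only)."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # Reject anything older than the newest token already seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True
```

In the paused-worker scenario, Worker A holds token 1 and Worker B holds token 2. Once B has written, A's delayed write with token 1 is rejected, so the late wake-up cannot corrupt the data.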

My take? For most applications, the single-node SET NX EX pattern is fine. You accept that a Redis crash means a brief window of no lock enforcement. If you need ironclad mutual exclusion, use a coordination service like ZooKeeper or etcd — they're designed for exactly this. Redlock is an uncomfortable middle ground: more complexity than a single node, less safety than a proper consensus system.

When to Use Distributed Locks

  • Order processing (prevent double-charging)
  • Cron job coordination (only one server runs the daily report)
  • Resource reservation (only one user can edit this document)
  • Cache stampede prevention (only one worker repopulates the cache)

Pattern 2: Rate Limiter

"Design a rate limiter" is one of the top 5 system design interview questions. Redis gives you three approaches, each with different precision-complexity trade-offs.

Approach 1: Fixed Window with INCR

# Allow 100 requests per minute per user
# Key format: rate:{user_id}:{minute_timestamp}

INCR rate:user:1001:1713400020
# Returns the current count

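EXPIRE rate:user:1001:1713400020 60
# Auto-cleanup after the window. Set it when INCR returns 1, and run both
# commands in MULTI/EXEC or a Lua script so a crash between them can't
# leave a counter without a TTL.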

# In application code:
# if count > 100: reject request (HTTP 429)
# else: allow

Problem: The boundary burst. A user sends 100 requests at 0:59 and 100 more at 1:00. They've sent 200 requests in 2 seconds, but each window only sees 100. The fixed window says "allowed."
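
The boundary burst is easy to reproduce with an in-memory model of the fixed window. This is a sketch, not Redis client code: `FixedWindowLimiter` is a hypothetical class, and a Python dict plays the role of the per-window counter keys.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory model of the INCR-per-window counter (illustration only)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (user, window_start) -> count

    def allow(self, user, now):
        window_start = int(now // self.window) * self.window
        key = (user, window_start)
        self.counts[key] += 1  # same role as INCR
        return self.counts[key] <= self.limit
```

Sending 100 requests at t=59 and 100 more at t=60 with a limit of 100 per 60 seconds results in all 200 being allowed, because the two bursts land in different windows.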

Approach 2: Sliding Window with Sorted Sets

This is the pattern interviewers want to see.

# For each request:
# 1. Add the request timestamp to a sorted set
ZADD rate:user:1001 1713400025.123 "req:uuid-abc"

# 2. Remove entries older than the window (60 seconds ago)
ZREMRANGEBYSCORE rate:user:1001 0 1713399965.123

# 3. Count remaining entries
ZCARD rate:user:1001
# If count > 100: reject (HTTP 429)

# 4. Set expiry on the key itself (cleanup if user goes idle)
EXPIRE rate:user:1001 60

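No boundary burst problem. The window slides with each request. The cost is O(log N) per ZADD instead of O(1) for INCR, and you store every request ID in the window, so memory usage is higher. Two details worth mentioning: run the four commands in MULTI/EXEC or a Lua script so concurrent requests can't interleave, and note that as written, rejected requests are still added to the set and count against the limit.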

For a user limited to 100 requests/minute, you're storing at most 100 entries per user. That's fine. For a global rate limiter handling millions of entries per window, this gets expensive.
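
For comparison, here is the sliding window as an in-memory Python sketch. `SlidingWindowLimiter` is a hypothetical class, a deque stands in for the sorted set, and timestamps are assumed to arrive in non-decreasing order. One deliberate difference from the command sequence above: this version checks the count before recording the request, so rejected requests are not stored.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory model of the sorted-set sliding window (illustration only)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)  # user -> timestamps, oldest first

    def allow(self, user, now):
        q = self.events[user]
        # Same role as ZREMRANGEBYSCORE: drop entries at or before now - window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:  # same role as ZCARD
            return False
        q.append(now)  # same role as ZADD
        return True
```

Replaying the boundary-burst scenario (100 requests at t=59, 100 more at t=60) now allows only the first 100; capacity frees up again once the t=59 entries fall out of the 60-second window.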

Approach 3: Token Bucket with Lua Script

The token bucket is the most flexible: it supports burst allowances and smooth refill rates. But it requires a Lua script to be atomic.

-- Lua script for token bucket
-- KEYS[1] = rate limit key
-- ARGV[1] = max tokens (bucket capacity)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp
-- ARGV[4] = tokens to consume (usually 1)

local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local consume = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or max_tokens
local last_refill = tonumber(bucket[2]) or now

-- Calculate tokens to add since last refill
local elapsed = now - last_refill
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

local allowed = 0
if new_tokens >= consume then
    new_tokens = new_tokens - consume
    allowed = 1
end

-- Persist state on both paths so the next call refills from "now".
-- HSET replaces HMSET, which is deprecated since Redis 4.0.
redis.call('HSET', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
return allowed  -- 1 = allowed, 0 = rejected

# Usage: allow 10 requests/sec with burst up to 20
EVAL <script> 1 rate:user:1001 20 10 1713400025.5 1
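
To reason about the script without a Redis server, the same refill arithmetic can be ported to plain Python. This is a sketch under assumptions: `TokenBucket` is a hypothetical class, state lives in instance attributes rather than a Redis hash, and the caller supplies `now` just as the script receives it in ARGV[3].

```python
class TokenBucket:
    """Plain-Python port of the token bucket arithmetic (illustration only)."""

    def __init__(self, max_tokens, refill_rate):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate    # tokens per second
        self.tokens = float(max_tokens)   # a new bucket starts full
        self.last_refill = None

    def allow(self, now, consume=1):
        if self.last_refill is None:
            self.last_refill = now
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.max_tokens, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= consume:
            self.tokens -= consume
            return True
        return False
```

With a capacity of 20 and a refill rate of 10 per second, a burst of 25 requests at the same instant lets 20 through, and half a second later exactly 5 more tokens are available.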

Rate Limiter Comparison

| Approach | Precision | Memory per User | Complexity | Boundary Burst? |
|---|---|---|---|---|
| Fixed window (INCR) | Low | 1 key | Trivial | Yes |
| Sliding window (sorted set) | High | N entries per window | Medium | No |
| Token bucket (Lua) | High | 1 hash (2 fields) | High (Lua script) | No (built-in burst control) |

For interviews, I'd lead with the sliding window sorted set approach. It's the most intuitive to explain on a whiteboard, and interviewers can clearly see how the window slides. Mention the token bucket as a follow-up if they push on burst handling.

Pattern 3: Session Store

Why do so many production web apps keep sessions in Redis? Because the usual alternative, database-backed sessions, adds latency to every request and needs its own cleanup machinery.

# Store session on login
SET session:abc123xyz '{"user_id":1001,"email":"alice@example.com","role":"admin","login_at":1713400000}' EX 86400
# Expires in 24 hours

# On every request: check session
GET session:abc123xyz
# Returns the JSON blob or nil (expired/invalid)

# On logout: destroy session
DEL session:abc123xyz

# Extend session on activity (sliding expiration)
EXPIRE session:abc123xyz 86400

Why Redis beats a database for sessions:

| Dimension | Database Sessions | Redis Sessions |
|---|---|---|
| Latency | 2-10 ms (disk I/O, query parsing) | 0.1-0.5 ms (in-memory) |
| Cleanup | Cron job to delete expired rows | TTL handles it automatically |
| Scale | Adds load to your primary DB | Separate infrastructure, horizontally scalable |
| Schema | Need a sessions table, migrations | No schema, just SET/GET |
| Sharing across services | Need shared DB access | Any service can hit Redis |

When Heroku migrated their dashboard sessions from PostgreSQL to Redis, they cut session lookup latency by 10x and removed the cron job that was sweeping expired sessions every 5 minutes.

Gotcha

Don't store sensitive data directly in the session value. If your Redis instance is compromised (no authentication, exposed to the internet, which happens more often than you'd think), every active session is readable. Keep the value minimal: a user ID plus non-sensitive metadata, keyed by a long random session ID. Passwords, payment details, and API tokens stay in your primary datastore and are fetched only when needed.

Pattern 4: Cache-Aside (Lazy Loading)

[Figure: Cache-aside pattern showing read path with cache hit and miss]

This is the most common caching pattern and the one interviewers expect you to know cold.

Read Path

Client → App Server → Check Redis
                    ┌─────┴─────┐
                    │           │
                  Cache       Cache
                  HIT         MISS
                    │           │
                    │     Query Database
                    │           │
                    │     Write result to Redis
                    │     SET key value EX 300
                    │           │
                    ▼           ▼
              Return data to client

# Pseudocode for cache-aside read
GET user:1001
# => nil (cache miss)

# Query database
# SELECT * FROM users WHERE id = 1001
# => {name: "Alice", plan: "premium"}

# Populate cache with TTL
SET user:1001 '{"name":"Alice","plan":"premium"}' EX 300

# Next request:
GET user:1001
# => '{"name":"Alice","plan":"premium"}' (cache hit!)

Write Path — Invalidation, Not Update

# User updates their profile in the database
# UPDATE users SET plan = 'free' WHERE id = 1001

# Invalidate the cache entry
DEL user:1001
# Do NOT try to update the cache here

# Next read will cache-miss, query DB, and repopulate with fresh data

Why invalidate instead of update? Because update introduces race conditions.

Consider two concurrent writes:

Timeline:
  T1: Worker A updates DB → plan = "premium"
  T2: Worker B updates DB → plan = "free"
  T3: Worker B updates cache → plan = "free"
  T4: Worker A updates cache → plan = "premium"  ← STALE!

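The database says "free" but the cache says "premium." With invalidation, both workers simply delete the key, and the next read fetches the current value from the database. A narrower race remains (a reader that missed just before the write can repopulate the old value right after the DEL), which is why every cached entry should still carry a TTL as a backstop.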
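
The read and write paths fit in a few lines of Python. This is a minimal sketch with plain dicts standing in for both Redis and the database; `CacheAside` and its attribute names are hypothetical, and TTL handling is left out to keep the two paths visible.

```python
class CacheAside:
    """Cache-aside read and write paths over two dicts (illustration only)."""

    def __init__(self, db):
        self.db = db        # dict standing in for the database
        self.cache = {}     # dict standing in for Redis (TTL omitted)
        self.db_reads = 0

    def read(self, key):
        if key in self.cache:          # GET hit
            return self.cache[key]
        self.db_reads += 1             # miss: go to the source of truth
        value = self.db[key]
        self.cache[key] = value        # SET key value EX 300
        return value

    def write(self, key, value):
        self.db[key] = value           # update the database first
        self.cache.pop(key, None)      # then DEL the cache entry, never SET it
```

Two reads of the same key cost one database query; a write followed by a read costs one more, and the reader sees the new value.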

Cache Stampede (Thundering Herd)

A popular cache key expires. 1,000 concurrent requests all see a cache miss. All 1,000 hit the database simultaneously. The database buckles.

Solutions:

# 1. Lock-based repopulation
SET lock:cache:user:1001 "worker-3" NX EX 5
# If acquired: query DB, populate cache, release lock
# If not acquired: wait briefly, retry GET

# 2. Stale-while-revalidate with a logical expiry (related to probabilistic early expiration)
# Set cache TTL to 350 seconds, but embed the "logical" expiry at 300 seconds in the value
SET user:1001 '{"data":{...},"expires_at":1713400300}' EX 350
# When a reader sees expires_at is in the past:
#   - With 10% probability: refresh the cache in the background
#   - With 90% probability: return stale data (still within the 350s hard TTL)

Facebook's "Scaling Memcache at Facebook" paper describes how they handle cache stampedes at scale: a lease mechanism where the first client to miss gets a "lease" (a token granting permission to populate the cache), and subsequent clients are told to wait briefly and retry instead of hitting the database.
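
Lock-based repopulation (solution 1 above) can be sketched inside a single process with Python threads. This is an analogy, not a distributed implementation: `SingleFlightCache` is a hypothetical class, and a per-key `threading.Lock` plays the role of the `SET lock:... NX EX 5` key. Unlike the Redis version, waiting callers block on the lock rather than sleeping and retrying.

```python
import threading


class SingleFlightCache:
    """Only one caller per key loads from the backing store (illustration only)."""

    def __init__(self, loader):
        self.loader = loader          # function(key) -> value, e.g. a DB query
        self.cache = {}
        self.loads = 0
        self._guard = threading.Lock()
        self._key_locks = {}

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        with self._lock_for(key):
            if key in self.cache:     # someone else filled it while we waited
                return self.cache[key]
            self.loads += 1
            value = self.loader(key)
            self.cache[key] = value
            return value
```

The second check inside the lock is the important part: without it, every waiting caller would still run the loader once the lock became free.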

Pattern 5: Pub/Sub — Fire and Forget

Redis Pub/Sub lets you broadcast messages to all subscribers of a channel. No persistence, no acknowledgment, no replay.

# Subscriber (in one Redis connection)
SUBSCRIBE notifications:user:1001
# Blocks and waits for messages

# Publisher (in another connection)
PUBLISH notifications:user:1001 '{"type":"new_message","from":"Bob"}'
# Returns the number of subscribers that received it

# Pattern-based subscription
PSUBSCRIBE notifications:*
# Receives messages on ANY notifications:* channel

When Pub/Sub Works

  • Real-time notifications to connected clients (via WebSocket bridge)
  • Cache invalidation across app servers ("key X was updated, invalidate your local copy")
  • Chat messages in a small-scale system
  • Configuration updates ("feature flag changed, reload")

When Pub/Sub Breaks

Here's the thing people miss: if nobody is listening, the message is gone. Redis Pub/Sub has no queue, no buffer, no replay. If your subscriber disconnects for 5 seconds and 3 messages are published during that time, those messages are lost forever.

If the subscriber is slow and can't keep up with the publish rate, Redis buffers its pending messages in memory. Once that buffer exceeds client-output-buffer-limit for the pubsub class (by default a 32 MB hard limit, or 8 MB sustained for 60 seconds), Redis closes the subscriber's connection. This happened at scale to several companies running Pub/Sub for real-time chat before they switched to Kafka or Redis Streams.
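
The at-most-once behavior can be modeled in a few lines of Python. `FireAndForgetBroker` is a hypothetical in-memory class, not a Redis client; it only shows that a message published while nobody is subscribed is never delivered to anyone.

```python
from collections import defaultdict


class FireAndForgetBroker:
    """In-memory model of Pub/Sub delivery semantics (illustration only)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> list of inboxes

    def subscribe(self, channel):
        inbox = []
        self.subscribers[channel].append(inbox)
        return inbox

    def unsubscribe(self, channel, inbox):
        # Compare by identity so two empty inboxes are not confused.
        self.subscribers[channel] = [
            i for i in self.subscribers[channel] if i is not inbox
        ]

    def publish(self, channel, message):
        receivers = self.subscribers[channel]
        for inbox in receivers:
            inbox.append(message)
        return len(receivers)  # like PUBLISH: how many subscribers got it
```

A subscriber that disconnects and reconnects only sees messages published while it was connected; there is no backlog to catch up from.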

| Feature | Redis Pub/Sub | Redis Streams | Kafka |
|---|---|---|---|
| Persistence | No | Yes | Yes |
| Replay from history | No | Yes | Yes |
| Consumer groups | No | Yes | Yes |
| Delivery guarantee | At-most-once | At-least-once | At-least-once / Exactly-once |
| Backpressure handling | Kill slow subscriber | Consumer controls pace | Consumer controls pace |
| Throughput | High | High | Very high |

My recommendation: Use Pub/Sub only for ephemeral broadcasts where losing messages is acceptable. For anything that needs durability, use Redis Streams or Kafka.

Pattern 6: Leaderboard

Gaming leaderboards, sales rankings, top-N lists — sorted sets handle all of these.

# Add or update scores
ZADD game:leaderboard 1500 "player:alice"
ZADD game:leaderboard 2300 "player:bob"
ZADD game:leaderboard 1800 "player:charlie"
ZADD game:leaderboard 2100 "player:dave"
ZADD game:leaderboard 1950 "player:eve"

# Top 3 (highest scores)
ZREVRANGE game:leaderboard 0 2 WITHSCORES
# => [("player:bob", 2300), ("player:dave", 2100), ("player:eve", 1950)]
# Redis 6.2+ equivalent: ZRANGE game:leaderboard 0 2 REV WITHSCORES

# "What's my rank?" (0-indexed)
ZREVRANK game:leaderboard "player:charlie"
# => 3  (4th place)

# Score update — player earns 200 more points
ZINCRBY game:leaderboard 200 "player:charlie"
# charlie is now at 2000, rank updates automatically

# Range query: all players with score between 1800 and 2200
ZRANGEBYSCORE game:leaderboard 1800 2200 WITHSCORES

# Total players on the leaderboard
ZCARD game:leaderboard
# => 5

Scaling Leaderboards

A sorted set with 10 million members uses roughly 0.8-1 GB (see the memory estimate later in this lesson), and ZREVRANK runs in O(log N), about 23 comparisons. That still fits comfortably on a single node.

But what if you need a leaderboard across shards? ZREVRANK only works within a single sorted set. If your players are distributed across 3 Redis Cluster nodes, you can't do a global rank query.

Options:

  1. Store the entire leaderboard on one node. If it fits in memory (and 10M entries at under 1 GB usually does), keep it simple.
  2. Periodic aggregation. Each shard maintains a local leaderboard. A background job merges them into a global leaderboard every few seconds. Ranks are slightly stale but close enough for most games.
  3. Hierarchy. Top 1,000 players live in a "global" sorted set. The remaining millions are in sharded sets. Rank within the top 1,000 is exact; rank below that is approximate.
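
Option 2 (periodic aggregation) boils down to a k-way merge of per-shard rankings. A minimal Python sketch, assuming each shard's leaderboard has already been read into a dict and each player lives on exactly one shard; `merge_top_n` is a hypothetical helper name.

```python
import heapq
from itertools import islice


def merge_top_n(shard_boards, n):
    """Merge per-shard {player: score} dicts into a global top-n list."""
    per_shard_sorted = [
        sorted(board.items(), key=lambda item: item[1], reverse=True)
        for board in shard_boards
    ]
    # heapq.merge walks the already-sorted inputs lazily, highest score first.
    merged = heapq.merge(*per_shard_sorted, key=lambda item: item[1], reverse=True)
    return list(islice(merged, n))
```

In practice the background job would only pull the top slice from each shard rather than every member, since a player outside a shard's top n cannot appear in the global top n.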

Riot Games (League of Legends) uses a Redis-based leaderboard system for ranked matchmaking. They keep the active ranked player set in Redis and archive historical data to a database.

Pattern 7: Distributed Counter

INCR is atomic. A single Redis node handles hundreds of thousands of INCR operations per second. For a single counter, it's the fastest option available.

# Simple atomic counter
INCR page:views:/pricing
INCR page:views:/pricing
INCR page:views:/pricing
GET page:views:/pricing
# => "3"

# Counter with expiration (resets daily)
INCR daily:signups:2024-04-18
EXPIRE daily:signups:2024-04-18 172800   # keep for 2 days

The Sharding Problem

What happens when you need to count across a Redis Cluster? INCR only works on a single key, which lives on a single node. Two options:

Option 1: Single key, route all increments there. Works until the single node can't keep up with the write rate. At ~300K INCR/sec per node, this is enough for most applications.

Option 2: Sharded counters. Split the counter across N sub-keys, read by summing them.

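# Write: pick a random shard (0-7)
INCR counter:views:shard:3       # no shared hash tag, so shards hash to different slots (and usually different nodes)

# Read: fetch every shard and sum the results in application code
GET counter:views:shard:0
GET counter:views:shard:1
...
GET counter:views:shard:7
# A single MGET across slots fails with CROSSSLOT in cluster mode, so issue
# per-key GETs (pipelined per node). A shared hash tag would make MGET work,
# but it would also pin every shard to one node and defeat the sharding.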

This trades read complexity for write throughput. Instagram uses a variant of this for like counts — they batch increments and flush to the database periodically rather than incrementing Redis on every single like.
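
The write and read sides of a sharded counter can be sketched in Python with a dict standing in for the Redis keys. `ShardedCounter` is a hypothetical class, and the optional `rng` parameter exists only so the example can be seeded.

```python
import random


class ShardedCounter:
    """Counter split across N sub-keys, summed on read (illustration only)."""

    def __init__(self, name, shards=8, rng=None):
        self.name = name
        self.shards = shards
        self.rng = rng or random.Random()
        self.store = {}  # dict standing in for Redis keys

    def _key(self, shard):
        return f"{self.name}:shard:{shard}"

    def incr(self):
        # Write path: one INCR against a randomly chosen shard.
        key = self._key(self.rng.randrange(self.shards))
        self.store[key] = self.store.get(key, 0) + 1

    def total(self):
        # Read path: fetch every shard and sum.
        return sum(self.store.get(self._key(s), 0) for s in range(self.shards))
```

Writes touch one sub-key each, while every read has to visit all N sub-keys, which is the read-for-write trade described above.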

When NOT to Use Redis

Knowing when to reach for Redis is important. Knowing when not to is what separates senior engineers from mid-levels.

| Situation | Why Not Redis | Use Instead |
|---|---|---|
| Data larger than RAM | Redis stores everything in memory. 100 GB dataset = 100 GB RAM. | PostgreSQL, MongoDB, DynamoDB |
| Strong consistency required | Async replication means replicas can serve stale reads. | PostgreSQL, CockroachDB, Spanner |
| Complex queries | No JOINs, no WHERE clauses, no aggregation pipeline. | PostgreSQL, Elasticsearch |
| Multi-key transactions across slots | Cluster mode only supports transactions within a single hash slot. | PostgreSQL (full ACID) |
| Long-term storage | Redis is meant for hot data. Storing 5 years of logs in RAM is wasteful. | S3, BigQuery, ClickHouse |
| Data you can't afford to lose | Even with AOF, there's a data loss window. | PostgreSQL with synchronous replication |

Eviction Policies — What Happens When Memory Is Full

When Redis hits maxmemory, it needs to decide which keys to evict. The policy you choose defines the behavior.

# Set max memory
maxmemory 4gb

# Set eviction policy
maxmemory-policy allkeys-lru

| Policy | Evicts From | Strategy | Best For |
|---|---|---|---|
| noeviction | Nothing (returns errors on writes) | Refuse new data | When you'd rather fail loudly than lose data. This is what Redis ships with. |
| allkeys-lru | All keys | Least Recently Used | General-purpose caching (the usual recommendation) |
| allkeys-lfu | All keys | Least Frequently Used | Cache where popularity matters (hot content) |
| volatile-lru | Keys with TTL set | LRU among expiring keys | Mix of cache (TTL) and persistent data (no TTL) |
| volatile-lfu | Keys with TTL set | LFU among expiring keys | Same, but frequency-based |
| allkeys-random | All keys | Random eviction | When access patterns are truly uniform |
| volatile-random | Keys with TTL set | Random among expiring keys | Rarely useful |
| volatile-ttl | Keys with TTL set | Shortest TTL first | When you want near-expiry keys evicted first |

I'd pick allkeys-lru for 90% of caching use cases. It handles the common pattern well: frequently accessed keys stay, cold keys get evicted. Switch to allkeys-lfu if you have a workload where a key is accessed in a burst (appearing "recent") but isn't actually popular — LFU handles that better.

The Twitter caching team switched from LRU to LFU for their timeline cache because "scan pollution" (background jobs touching every key) was evicting hot user timeline data. LFU's frequency tracking prevented rarely-accessed-but-recently-scanned keys from displacing popular ones.

Gotcha

volatile-* policies only evict keys that have a TTL set. If all your keys are persistent (no TTL), volatile policies behave like noeviction — Redis will return errors when memory is full. This catches people who set volatile-lru but forget to set TTLs on their cache keys.
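
A toy victim-selection function in Python makes the difference visible. This is a simplified model: `pick_victim` is a hypothetical helper, it only covers the LRU and TTL policies, and it picks the exact least-recently-used key, whereas Redis approximates LRU by sampling.

```python
def pick_victim(policy, keys):
    """Choose which key to evict.

    keys: dict of name -> {"last_used": float, "ttl": float or None}
    Returns a key name, or None when nothing may be evicted.
    """
    if policy == "noeviction":
        return None
    if policy.startswith("volatile-"):
        candidates = {k: v for k, v in keys.items() if v["ttl"] is not None}
    else:
        candidates = dict(keys)
    if not candidates:
        return None  # nothing evictable: writes fail, as with noeviction
    if policy.endswith("-lru"):
        return min(candidates, key=lambda k: candidates[k]["last_used"])
    if policy == "volatile-ttl":
        return min(candidates, key=lambda k: candidates[k]["ttl"])
    raise ValueError(f"policy not modelled in this sketch: {policy}")
```

With `volatile-lru` and no TTLs anywhere, the candidate set is empty and the function returns None, which is the "behaves like noeviction" trap described above.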

Memory Estimation — Back-of-Envelope Math

Interviewers love asking "how much Redis memory do you need for this?" Here's how to estimate.

Per-Key Overhead

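Every key in Redis has overhead beyond the value itself. The figures below are deliberately conservative round numbers for interview math; actual overhead varies with Redis version, encoding, and allocator: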

| Component | Size |
|---|---|
| Key pointer (dictEntry) | ~64 bytes |
| Key string (SDS header + content) | ~50 bytes + key length |
| Value object (redisObject) | ~16 bytes |
| Expiry (if TTL set) | ~16 bytes |
| Total overhead per key | ~130-150 bytes |

Estimation Examples

1. Session store: 1 million active sessions

Key:   "session:abc123xyz"          ~30 bytes
Value: JSON blob                    ~200 bytes
Overhead per key:                   ~150 bytes
Total per entry:                    ~380 bytes

1,000,000 × 380 bytes = ~380 MB

2. Leaderboard: 10 million players in a sorted set

One sorted set key (overhead):      ~150 bytes
Each member (skip list entry):
  - Member string:                  ~30 bytes
  - Score (float):                  8 bytes
  - Skip list node pointers:       ~40 bytes
  Total per member:                 ~78 bytes

10,000,000 × 78 bytes = ~780 MB
Plus overhead:                      ~150 bytes (negligible)
Total:                              ~780 MB

3. Rate limiter: 500K users, 100 entries per sorted set window

500,000 sorted set keys × (150 bytes overhead + 100 × 78 bytes)
= 500,000 × (150 + 7,800)
= 500,000 × 7,950 bytes
= ~3.98 GB

Interview Tip

Always round up by 30-50% for fragmentation and internal allocator overhead. If your estimate says 4 GB, tell the interviewer "I'd provision a 6 GB Redis instance to account for memory fragmentation and peak spikes."
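
The three estimates and the headroom rule can be reproduced with a few Python helpers. The constants are the round numbers from this section, not measurements, and the function names are hypothetical.

```python
KEY_OVERHEAD = 150          # bytes per key, from the overhead table
ZSET_MEMBER = 30 + 8 + 40   # member string + score + skip list pointers


def session_store_bytes(sessions, key_len=30, value_len=200):
    """One string key per session: key + JSON value + per-key overhead."""
    return sessions * (key_len + value_len + KEY_OVERHEAD)


def sorted_set_bytes(members):
    """One sorted set key plus a fixed cost per member."""
    return KEY_OVERHEAD + members * ZSET_MEMBER


def rate_limiter_bytes(users, entries_per_user):
    """One sorted set per user, each holding a full window of entries."""
    return users * sorted_set_bytes(entries_per_user)


def with_headroom(estimate_bytes, headroom=0.5):
    """Pad an estimate for fragmentation and peak spikes."""
    return estimate_bytes * (1 + headroom)
```

Calling `rate_limiter_bytes(500_000, 100)` gives 3,975,000,000 bytes, the ~3.98 GB figure above, and `with_headroom(4e9)` gives the 6 GB provisioning number.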

The Complete Interview Playbook

When Redis comes up in a system design interview, walk through this mental checklist:

1. Is this data hot (frequently accessed)?
   └── No → Use a database directly
   └── Yes ↓

2. Can I tolerate losing this data?
   └── No → Redis as cache (cache-aside), DB is source of truth
   └── Yes → Redis as primary store (sessions, counters, queues)

3. Which data structure fits?
   └── Lookup by key → String or Hash
   └── Ranking/scoring → Sorted Set
   └── Uniqueness → Set or HyperLogLog
   └── Queue/stream → List, Stream, or Pub/Sub
   └── Boolean matrix → Bitmap

4. How much memory?
   └── Back-of-envelope calculation
   └── Add 30-50% for overhead

5. Durability needs?
   └── Cache (ephemeral) → No persistence, or RDB for warm restart
   └── Sessions → AOF everysec
   └── Primary store → Hybrid (RDB + AOF), consider alternatives

6. Availability needs?
   └── Can't afford downtime → Sentinel (3+ nodes)
   └── Data doesn't fit one node → Cluster (6+ nodes)

7. Eviction policy?
   └── General cache → allkeys-lru
   └── Popularity matters → allkeys-lfu
   └── Mix of cache + persistent → volatile-lru with TTLs

Key Takeaways

| Pattern | Redis Command | Watch Out For |
|---|---|---|
| Distributed lock | SET key val NX EX 30 | Release with Lua (check ownership). Single-node lock is fine for most cases. |
| Rate limiter (sliding) | ZADD + ZREMRANGEBYSCORE + ZCARD | Memory grows with window size. Use fixed window (INCR) for simpler cases. |
| Session store | SET EX + GET | Set TTLs. Don't store secrets in the value. |
| Cache-aside | GET → miss → DB → SET EX | Invalidate on write, don't update. Watch for stampedes. |
| Pub/Sub | PUBLISH + SUBSCRIBE | No persistence. Messages lost if nobody listens. Use Streams instead for durability. |
| Leaderboard | ZADD + ZREVRANK | Single sorted set can't span cluster nodes. Keep it on one node if possible. |
| Distributed counter | INCR | Shard the counter if write throughput exceeds single-node limits. |
| Eviction | maxmemory-policy | allkeys-lru for most caches. volatile-* does nothing without TTLs. |
| Memory sizing | Back-of-envelope | ~150 bytes overhead per key. Add 30-50% for fragmentation. |

Interview Tip

Don't just name the pattern — explain the failure mode. "I'd use a distributed lock with SET NX EX, and to prevent stale locks I'd include a UUID in the value and release with a Lua script that checks ownership." That one sentence shows you understand atomic operations, lock leases, and the release race condition. Three concepts in one breath.