Design a Distributed Rate Limiter
TL;DR
A distributed rate limiter counts requests per client across multiple server instances and rejects traffic that exceeds a configured threshold. The hard parts are not the counting algorithms (those are well understood) -- it is making the counting consistent when requests scatter across 50 servers, doing it in under 1ms of added latency, and handling the edge cases where Redis goes down but you still need to protect your backend. Stripe, Cloudflare, and every major API provider have battle-tested implementations, and their public engineering blogs tell you exactly where the naive version falls apart.
The System
A rate limiter sits in the request path (or sidecar) and decides, for each incoming request, whether to allow it or reject it with HTTP 429 Too Many Requests. It enforces rules like "user X can make 100 requests per minute to the /api/search endpoint" or "IP address Y can make 10 login attempts per hour."
Rate limiting is not optional for public-facing APIs. Stripe's API enforces per-key limits (100 reads/sec for live mode, 25 for test mode). Cloudflare processes over 50 million HTTP requests per second globally and uses rate limiting at multiple layers -- per-IP, per-path, per-API-key -- to protect origin servers from both abuse and accidental traffic spikes. GitHub's API returns X-RateLimit-Remaining headers so clients can self-throttle. Without rate limiting, a single misbehaving client can take down a service that serves millions. The 2016 Dyn DNS attack proved that even infrastructure providers need rate limiting at the network edge.
Requirements
Functional
- Rate check: For each incoming request, determine whether it should be allowed or rejected based on the client's recent request count
- Configurable rules: Support rules like "100 requests per minute per user for endpoint /api/search" with the ability to change thresholds without redeployment
- Multi-dimensional limiting: Rate limit by user ID, API key, IP address, or combinations (e.g., per-user AND per-IP)
- Response headers: Return
X-RateLimit-Limit,X-RateLimit-Remaining, andX-RateLimit-Reseton every response so clients can self-throttle - Graceful rejection: Return HTTP 429 with a
Retry-Afterheader indicating when the client can try again
Non-Functional
- Latency overhead: Adding rate limiting must cost less than 1ms p99 per request. Anything higher and product teams will refuse to adopt it
- Throughput: Handle 1M requests/sec across 100M distinct users (global API gateway scale)
- Accuracy: Slight over-counting (rejecting a few extra requests) is acceptable. Under-counting (letting 120 requests through when the limit is 100) is not -- it defeats the purpose
- Availability: If the rate limiter is down, default to allowing traffic (fail-open). A rate limiter that takes down the entire API is worse than no rate limiter at all
- Consistency: Across N servers, a client's count must be accurate within 5% of the true value. Perfect accuracy across distributed nodes is not worth the latency cost
Back-of-Envelope Math
Request volume:
1M requests/sec = 86.4B requests/day
Distributed across 50 API gateway instances: 20K req/sec per instance
Rate limit state per user:
Key: user_id + endpoint + window (e.g., "user:123:/api/search:1714000000")
Value: counter (8 bytes)
Overhead per key in Redis: ~80 bytes (key + value + hash table entry)
Active users in a window:
100M total users, ~5% active in any given minute = 5M active users
5M * 80 bytes = 400 MB
Fits in a single Redis instance with room to spare
Redis throughput needed:
1M rate checks/sec, each requiring 1 Redis roundtrip
Single Redis instance: ~200K ops/sec
Need 5-6 Redis instances (or Redis Cluster with 6 shards)
Latency budget:
Redis roundtrip (same datacenter): 0.1-0.5ms
Network hop to Redis: 0.05ms
Total rate limiting overhead: 0.2-0.6ms (well under 1ms target)
The Naive Design
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ Client │────>│ API Server │────>│ Redis │
│ │<────│ │<────│ (single) │
└──────────┘ └──────────────┘ └──────────────┘
On each request:
key = "rate:{user_id}:{endpoint}:{current_minute}"
count = redis.INCR(key)
if count == 1:
redis.EXPIRE(key, 60)
if count > limit:
return 429 Too Many Requests
This is a fixed-window counter with Redis. For a startup with one API server and 100 req/sec, it works. Ship it.
Where Does This Break First?
The INCR and EXPIRE are two separate commands. If the server crashes between them, the key lives forever and the counter never resets. But that is a fixable bug. The real problem is the fixed-window boundary effect: a user can send 100 requests at 12:00:59 and 100 more at 12:01:00, hitting you with 200 requests in 2 seconds while technically respecting the "100 per minute" limit.
Where It Breaks
Three things kill the naive design at scale.
Problem 1: Fixed-window boundary burst. The window resets on a hard boundary (the start of each minute). A client can pile 100 requests at the end of minute N and 100 at the start of minute N+1, delivering a 2x burst in a 2-second window. Your backend, sized for 100 req/min, suddenly gets 200 in 2 seconds. This is not theoretical -- every API team that has used fixed windows has seen this pattern from bots and scrapers that deliberately time their bursts.
Problem 2: Race condition between INCR and EXPIRE. Two commands, not atomic. If Redis or the server fails between them, you leak a key that never expires. Over time, your Redis memory grows unboundedly. The fix is a Lua script (discussed below), but most candidates miss this.
Problem 3: Centralized Redis bottleneck. At 1M req/sec, a single Redis instance tops out around 200K ops/sec. You need sharding. But sharding by user ID means a user's requests always go to the same shard, which is correct for counting. The failure mode is hot keys: when one API key generates 50K req/sec (a big enterprise customer), that single shard gets hammered.
There is a fourth problem that matters for global deployments: if your API servers span multiple regions but your Redis is in one region, every rate check adds cross-region latency (50-150ms). That blows through your 1ms budget entirely.
The Real Design
┌────────────────────────────────────┐
│ Rules Configuration │
│ (etcd / config service / DB) │
└────────────────┬───────────────────┘
│ push on change
v
┌──────────┐ ┌──────────────────────────────────────┐
│ │ │ API Gateway │
│ Client │────>│ ┌──────────────────────────────┐ │
│ │<────│ │ Rate Limiter Middleware │ │
│ │ │ │ ┌─────────┐ ┌────────────┐ │ │
└──────────┘ │ │ │ Local │ │ Sync to │ │ │
│ │ │ Counter │──│ Redis │ │ │
│ │ └─────────┘ └────────────┘ │ │
│ └──────────────────────────────┘ │
└───────────────┬──────────────────────┘
│
v
┌────────────────────────────────────┐
│ Redis Cluster (6 shards) │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Shard1│ │Shard2│ │Shard3│ ... │
│ └──────┘ └──────┘ └──────┘ │
└────────────────────────────────────┘
The Five Rate Limiting Algorithms
You need to know all five. The interviewer will ask you to pick one and justify it.
1. Fixed Window Counter
Divide time into fixed windows (e.g., 1-minute intervals). Count requests in the current window. Reject if over limit.
Window: [12:00, 12:01) -> counter = 78
Window: [12:01, 12:02) -> counter = 12
Pros: Simple, low memory (one counter per user per window)
Cons: Boundary burst problem (2x traffic at window edges)
Memory: 8 bytes per user
2. Sliding Window Log
Store the timestamp of every request. On each new request, remove timestamps older than the window, count remaining.
User 123 log: [11:59:32, 11:59:45, 12:00:01, 12:00:12, 12:00:33]
New request at 12:00:48 -> remove entries before 11:59:48 -> count = 4
Pros: Perfectly accurate, no boundary burst
Cons: O(N) memory per user where N is the rate limit.
For 1000 req/min limit, storing 1000 timestamps = 8KB per user.
5M active users * 8KB = 40 GB. Does not fit in one Redis instance.
3. Sliding Window Counter (the sweet spot)
Combine two adjacent fixed windows with a weighted average based on where you are in the current window.
Previous window (12:00-12:01): 84 requests
Current window (12:01-12:02): 36 requests
Current time: 12:01:15 (25% into current window)
Weighted count = 84 * (1 - 0.25) + 36 = 84 * 0.75 + 36 = 63 + 36 = 99
Pros: Near-perfect accuracy (within 0.003% error per Cloudflare's measurements)
Cons: Slight approximation (not exact like sliding log)
Memory: 16 bytes per user (two counters)
This is what Cloudflare uses in production. The accuracy is statistically excellent and the memory is tiny.
4. Token Bucket
A bucket holds tokens up to a maximum (burst size). Tokens refill at a constant rate. Each request consumes a token. If the bucket is empty, reject.
Bucket for user 123:
max_tokens = 100
refill_rate = 100 tokens/minute
current_tokens = 47
last_refill = 12:00:32
Request arrives at 12:00:45:
elapsed = 13 seconds
tokens_to_add = 100 * (13/60) = 21.67 -> 21
current_tokens = min(100, 47 + 21) = 68
Consume 1 token -> 67 remaining. Allow.
Pros: Natural burst handling (full bucket allows a burst up to max_tokens)
Cons: Two parameters to tune (bucket size, refill rate)
Memory: 16 bytes per user (current_tokens + last_refill_timestamp)
This is what Stripe uses. The token bucket lets Stripe allow short bursts (a batch API call) while enforcing a sustained rate. Amazon API Gateway also uses token bucket.
5. Leaky Bucket
Requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, reject the request.
Queue for user 123:
max_size = 100
drain_rate = 100 requests/minute (1 every 0.6 seconds)
current_size = 42
Pros: Smooths traffic perfectly (output is always at drain_rate)
Cons: Recent requests wait behind old ones. Old requests may become stale
while waiting. Bad for latency-sensitive APIs.
Memory: Proportional to queue size (actually stores requests)
Leaky bucket is better for traffic shaping (network routers) than API rate limiting. I would not recommend it for this use case.
My recommendation: Token bucket for most API rate limiting. It handles bursts gracefully, is easy to reason about, and maps directly to how you describe limits to users: "100 requests per minute, burst up to 20." Use sliding window counter if you want simpler implementation with no burst tolerance.
Stripe's Rate Limiting Hierarchy
Stripe does not apply a single rate limit. They layer multiple limits:
Layer 1: Global per-API-key → 100 req/sec (live mode), 25 (test)
Layer 2: Per-endpoint → /v1/charges: 50 req/sec
Layer 3: Per-resource → Creating charges on same customer: 5 req/sec
Layer 4: Concurrent requests → Max 25 in-flight requests per key
Each layer uses a separate token bucket. A request must pass ALL layers to be allowed. This prevents a single customer from monopolizing one endpoint while staying under the global limit.
The concurrent request limit (Layer 4) is particularly clever -- it catches clients that fire 100 slow requests in parallel, which would not trip a rate-per-second limit but would consume 100 backend threads.
Redis Lua Script for Atomicity
The INCR + EXPIRE race condition is solved with a Lua script that executes atomically:
-- Token bucket implementation in Redis Lua
-- KEYS[1] = rate limit key
-- ARGV[1] = max_tokens, ARGV[2] = refill_rate (per sec)
-- ARGV[3] = now (unix timestamp), ARGV[4] = tokens_requested
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now
-- Refill tokens based on elapsed time
local elapsed = math.max(0, now - last_refill)
tokens = math.min(max_tokens, tokens + elapsed * refill_rate)
local allowed = 0
if tokens >= requested then
tokens = tokens - requested
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
return { allowed, tokens }
This entire block executes atomically on Redis. No race conditions. No partial state. The EXPIRE at the end is a safety net -- if a user goes inactive, their key auto-deletes after 2x the refill period.
Distributed: Local + Sync vs. Centralized
At 1M req/sec, you have two architectural choices for the counting layer.
Option A: Centralized (every check hits Redis)
Every request makes a Redis call. Simple and accurate. At 1M req/sec, you need 5-6 Redis shards. Cross-datacenter latency is the killer -- if your API servers are in us-east but Redis is in us-west, you add 60-80ms per request.
Option B: Local counters + periodic sync
Each API server maintains a local counter in memory. Every N seconds (or every N requests), it syncs with Redis. Between syncs, the local counter is used for decisions.
API Server A: local_count = 23, last_sync = 12:00:42
API Server B: local_count = 31, last_sync = 12:00:43
Redis: synced_total = 89
Reality: user has made 23 + 31 + 89 = 143 requests
But each server only sees its own 23 or 31, so neither rejects.
The trade-off is clear. Local counters are fast (no network hop) but inaccurate. If you have 50 API servers and sync every 5 seconds, a user could get 50 * limit / sync_frequency extra requests through before the system catches up. For a 100 req/min limit, that could mean 150+ requests get through.
The hybrid approach (what I would build):
- For high-volume endpoints (> 1000 req/sec per user possible): centralized Redis check. The latency cost is worth the accuracy.
- For normal endpoints: local counter with sync every 1 second. Accuracy within 5% is acceptable.
- If Redis is unreachable: fall back to local-only counters with a conservative limit (50% of the configured limit per server). This prevents the fail-open path from being exploited.
Deep Dives

Deep Dive 1: Race Conditions in Distributed Rate Limiting
The "check-then-increment" pattern has a classic TOCTOU (time-of-check, time-of-use) race:
Thread A: GET counter -> 99 (under limit of 100)
Thread B: GET counter -> 99 (under limit of 100)
Thread A: SET counter = 100 -> allowed
Thread B: SET counter = 100 -> allowed (should have been 101, rejected)
Both threads see 99 and both allow, but 101 requests have now passed. At 1M req/sec with 50 servers, this race is not occasional -- it is constant.
Fix 1: Redis INCR (atomic increment)
INCR is atomic in Redis. Two concurrent calls will return 100 and 101, never both returning 100. This eliminates the race for counter-based algorithms.
Fix 2: Lua scripts for multi-step algorithms
Token bucket requires reading current tokens, computing refill, and writing back -- multiple operations. The Lua script shown above runs atomically. Redis guarantees single-threaded execution of Lua scripts, so no races.
Fix 3: Redis MULTI/EXEC (transactions)
This groups INCR and EXPIRE atomically. But MULTI/EXEC does not support conditional logic (you cannot branch on the INCR result inside the transaction). Lua scripts are strictly more powerful.
The subtle race nobody talks about: Even with atomic Redis operations, there is a race between the rate check and the actual request processing. You decrement a token, send 200 OK, and then the request fails downstream. You have consumed a rate limit token for a request that did not actually succeed. Should you refund the token? Most systems do not bother. The complexity is not worth it, and the error is in the conservative direction (you rate limit slightly more aggressively than necessary).
Deep Dive 2: Handling 100M Users Across Multiple Regions
At global scale, the single-region Redis design breaks. Here is what changes.
Sharding strategy: Hash the rate limit key (user_id:endpoint:window) and route to one of N Redis shards. Users are sticky to a shard. With consistent hashing (or Redis Cluster's hash slots), adding a shard does not require re-distributing all keys.
Multi-region deployment:
US-East: API servers → Redis Cluster (us-east)
EU-West: API servers → Redis Cluster (eu-west)
AP-South: API servers → Redis Cluster (ap-south)
If user requests only come from one region, this is fine -- each region has its own Redis. But if a user sends requests from both US and EU simultaneously (CDN, mobile roaming, VPN), their counts are split across two Redis clusters.
Option A: Region-local limits (divide the limit by region count)
Give each region 1/3 of the total limit. User gets 33 req/min in US, 33 in EU, 33 in AP. Simple, but wastes capacity -- a user in only one region gets 1/3 of their limit.
Option B: Async cross-region sync
Each region publishes its local counts to a Kafka topic. Other regions consume and update a global view with a lag of 1-5 seconds. Accurate to within ~5 seconds of request data. This is what Cloudflare uses -- they call it "sliding window with cross-PoP synchronization."
Option C: Single-region source of truth with caching
Route all rate limit checks to one primary region (us-east). Cache the result for 1 second locally. For a 100 req/min limit, 1 second of caching means at most 1.67 extra requests. Acceptable for most use cases. The latency hit (50-80ms cross-region) is the downside.
I would use Option B for global APIs (Cloudflare's approach) and Option C for services where slight latency is acceptable (internal platform services).
Deep Dive 3: What Happens When Redis Goes Down
This is the question that separates mid-level from senior answers. Your rate limiter depends on Redis. Redis goes down. What now?
Fail-open (allow all traffic)
The safest default for user experience. If you cannot check the rate, let the request through. The downside: during a Redis outage, you have zero rate limiting. An attacker who notices the outage can flood your backend.
Fail-closed (reject all traffic)
The safest for backend protection. But if Redis goes down for 2 minutes, every API request returns 429 for 2 minutes. Your users think the API is down. This is almost always wrong.
The right answer: fail-open with local fallback
try:
allowed = check_redis_rate_limit(user_id, endpoint)
except RedisConnectionError:
allowed = check_local_rate_limit(user_id, endpoint,
limit=configured_limit / num_servers)
When Redis is unreachable, fall back to local in-memory counters. Set the local limit to global_limit / num_servers -- if you have 50 servers and the global limit is 100 req/min, each server allows 2 req/min locally. The total across all servers is approximately the global limit.
This is imprecise (a user hitting only one server gets 2 req/min, not 100), but it provides some protection during the outage and never causes a full API blackout.
Circuit breaker on Redis calls: If Redis fails 3 consecutive times within 10 seconds, stop trying for 30 seconds (circuit open). Retry one request every 30 seconds to check if Redis is back (half-open). This prevents thousands of failed Redis calls from adding latency.
Alternative Designs
Alternative 1: API Gateway-Native Rate Limiting
Use your cloud provider's built-in rate limiting (AWS API Gateway, Kong, Envoy). These use token bucket internally and store state in their own distributed backend.
Pros: zero custom code, managed scaling, works out of the box. Cons: limited algorithm choice, harder to customize per-endpoint logic, vendor lock-in on rate limiting configuration.
Alternative 2: Application-Level with Middleware
Rate limiting as a library/middleware in each service (like Guava's RateLimiter in Java). No external dependency.
Pros: no Redis dependency, zero network latency, per-process rate limiting. Cons: no coordination between instances (each server independently limits), so actual throughput is limit * num_servers.
Alternative 3: Sidecar Proxy (Envoy/Istio)
Rate limiting in a sidecar proxy (Envoy) with a centralized rate limit service (Lyft's ratelimit service). Each request goes through the local Envoy sidecar, which calls the rate limit service.
Pros: language-agnostic, consistent across all services, Envoy handles the complexity. Cons: extra hop (sidecar -> rate limit service -> Redis), operational complexity of running Envoy fleet.
| Aspect | Centralized Redis | API Gateway Native | App Middleware | Sidecar Proxy |
|---|---|---|---|---|
| Accuracy | High (single source) | High (managed) | Low (per-instance only) | High (centralized) |
| Latency overhead | 0.2-0.5ms | Built into gateway | ~0ms | 0.5-1ms |
| Custom algorithms | Full control | Limited | Full control | Limited to Envoy config |
| Multi-region support | Build it yourself | Provider handles it | None | Envoy mesh handles it |
| Ops complexity | Medium (Redis Cluster) | Low (managed) | Low | High (Envoy + service) |
| Redis failure mode | Need fallback logic | Managed HA | N/A | Need fallback logic |
Scaling Math Verification
1M requests/sec distributed across 50 API servers:
- Per server: 20K req/sec
- Redis calls: 1M/sec (one per request for centralized approach)
- Redis Cluster with 6 shards: ~167K calls/shard/sec. Each shard handles 200K+ ops/sec. Headroom: 20%.
- Network bandwidth: 1M * 200 bytes (request + response) = 200 MB/sec. Well within 10Gbps NIC.
Memory at peak:
- 5M active users in any given window * 80 bytes per key = 400 MB
- With token bucket (2 fields per key): 5M * 120 bytes = 600 MB
- Redis Cluster overhead: ~1.5x data size = 900 MB total
- A 2 GB Redis instance per shard handles this easily
Local counter sync overhead:
- 50 servers syncing every 1 second: 50 sync operations/sec to Redis
- Each sync: batch HMSET of ~1000 keys = 50K key updates/sec
- This is 0.025% of Redis capacity. Negligible.
Failure Analysis
| Component | Current capacity | At 10x (10M req/sec) | Breaks? | Fix |
|---|---|---|---|---|
| Redis Cluster (6) | 200K ops/shard/sec | 1.67M ops/shard/sec | Yes | Scale to 60 shards or use local counters |
| Memory per shard | 2 GB | 6 GB (50M active users) | No | Larger instances |
| API gateway latency | 0.3ms per rate check | Same | No | -- |
| Network bandwidth | 200 MB/sec | 2 GB/sec | Maybe | Multiple NICs, reduce payload size |
| Cross-region sync | 50 syncs/sec | 500 syncs/sec | No | -- |
| Rule configuration | etcd with 1000 rules | 10K rules | No | -- |
The first thing to break at 10x is Redis throughput per shard. The fix is straightforward: shift to the hybrid local+sync approach for most traffic, using centralized Redis only for critical endpoints (authentication, payment). At 10M req/sec, you cannot afford a Redis call for every request anyway -- the local counter approach with 1-second sync gives you 95% accuracy with zero Redis latency on the hot path.
At 100x (100M req/sec), you are Cloudflare. At that point, rate limiting happens at the network edge using eBPF programs in the kernel, not in application space. The counting moves to shared memory segments that the kernel networking stack can access without context switching to userspace. This is how Cloudflare's "magic firewall" works.
What's Expected at Each Level
| Aspect | Mid-Level | Senior | Staff+ |
|---|---|---|---|
| Algorithms | Knows token bucket or fixed window | Compares 3+ algorithms with trade-offs | All 5 algorithms, explains why sliding window counter wins |
| Atomicity | Uses INCR | Identifies INCR+EXPIRE race, mentions Lua | Writes the Lua script, explains Redis single-threaded execution |
| Distribution | "Use Redis" | Sharding by user, discusses hot keys | Local+sync hybrid, cross-region sync strategies |
| Failure handling | Not mentioned | Fail-open vs fail-closed | Local fallback with divided limits, circuit breaker on Redis |
| Multi-dimensional limiting | Single dimension (user ID) | Mentions per-endpoint limits | Stripe's layered hierarchy, concurrent request limiting |
| Configuration | Hardcoded limits | Config file or database | Dynamic rule engine with etcd push, per-tenant overrides |
| Headers | Returns 429 | Mentions X-RateLimit headers | Retry-After with jitter to prevent thundering herd on retry |
| Real-world reference | None | "Stripe does X" | Cloudflare's sliding window, Envoy's rate limit service, why leaky bucket is wrong for APIs |
The single most important signal at any level: do you understand that the hard part is not the algorithm -- it is making the algorithm work correctly when the same user's requests are hitting 50 different servers simultaneously? Mid-level candidates spend 30 minutes on token bucket. Senior candidates spend 5 minutes on the algorithm and 25 minutes on distribution, failure modes, and configuration management.
References from Our Courses
- Redis Interview Patterns — sliding window and token bucket implementations in Redis
- Traffic Control — backpressure and load shedding strategies
- Kafka Partitions and Ordering — async counter synchronization across rate limiter nodes
Red Team This Design
Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.