Design a Distributed Rate Limiter

TL;DR

A distributed rate limiter counts requests per client across multiple server instances and rejects traffic that exceeds a configured threshold. The hard parts are not the counting algorithms (those are well understood) -- it is making the counting consistent when requests scatter across 50 servers, doing it in under 1ms of added latency, and handling the edge cases where Redis goes down but you still need to protect your backend. Stripe, Cloudflare, and every major API provider have battle-tested implementations, and their public engineering blogs tell you exactly where the naive version falls apart.

The System

A rate limiter sits in the request path (or sidecar) and decides, for each incoming request, whether to allow it or reject it with HTTP 429 Too Many Requests. It enforces rules like "user X can make 100 requests per minute to the /api/search endpoint" or "IP address Y can make 10 login attempts per hour."

Rate limiting is not optional for public-facing APIs. Stripe's API enforces per-key limits (100 reads/sec for live mode, 25 for test mode). Cloudflare processes over 50 million HTTP requests per second globally and uses rate limiting at multiple layers -- per-IP, per-path, per-API-key -- to protect origin servers from both abuse and accidental traffic spikes. GitHub's API returns X-RateLimit-Remaining headers so clients can self-throttle. Without rate limiting, a single misbehaving client can take down a service that serves millions. The 2016 Dyn DNS attack proved that even infrastructure providers need rate limiting at the network edge.

Requirements

Functional

Rate check: For each incoming request, determine whether it should be allowed or rejected based on the client's recent request count
Configurable rules: Support rules like "100 requests per minute per user for endpoint /api/search" with the ability to change thresholds without redeployment
Multi-dimensional limiting: Rate limit by user ID, API key, IP address, or combinations (e.g., per-user AND per-IP)
Response headers: Return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every response so clients can self-throttle
Graceful rejection: Return HTTP 429 with a Retry-After header indicating when the client can try again

Non-Functional

Latency overhead: Adding rate limiting must cost less than 1ms p99 per request. Anything higher and product teams will refuse to adopt it
Throughput: Handle 1M requests/sec across 100M distinct users (global API gateway scale)
Accuracy: Slight over-counting (rejecting a few extra requests) is acceptable. Under-counting (letting 120 requests through when the limit is 100) is not -- it defeats the purpose
Availability: If the rate limiter is down, default to allowing traffic (fail-open). A rate limiter that takes down the entire API is worse than no rate limiter at all
Consistency: Across N servers, a client's count must be accurate within 5% of the true value. Perfect accuracy across distributed nodes is not worth the latency cost

Back-of-Envelope Math

Request volume:
  1M requests/sec = 86.4B requests/day
  Distributed across 50 API gateway instances: 20K req/sec per instance

Rate limit state per user:
  Key: user_id + endpoint + window (e.g., "user:123:/api/search:1714000000")
  Value: counter (8 bytes)
  Overhead per key in Redis: ~80 bytes (key + value + hash table entry)

Active users in a window:
  100M total users, ~5% active in any given minute = 5M active users
  5M * 80 bytes = 400 MB
  Fits in a single Redis instance with room to spare

Redis throughput needed:
  1M rate checks/sec, each requiring 1 Redis roundtrip
  Single Redis instance: ~200K ops/sec
  Need 5-6 Redis instances (or Redis Cluster with 6 shards)

Latency budget:
  Redis roundtrip (same datacenter): 0.1-0.5ms
  Network hop to Redis: 0.05ms
  Total rate limiting overhead: 0.2-0.6ms (well under 1ms target)

The Naive Design

┌──────────┐     ┌──────────────┐     ┌──────────────┐
│  Client   │────>│  API Server  │────>│    Redis     │
│           │<────│              │<────│  (single)    │
└──────────┘     └──────────────┘     └──────────────┘

On each request:
  key = "rate:{user_id}:{endpoint}:{current_minute}"
  count = redis.INCR(key)
  if count == 1:
      redis.EXPIRE(key, 60)
  if count > limit:
      return 429 Too Many Requests

This is a fixed-window counter with Redis. For a startup with one API server and 100 req/sec, it works. Ship it.

Where Does This Break First?

The INCR and EXPIRE are two separate commands. If the server crashes between them, the key lives forever and the counter never resets. But that is a fixable bug. The real problem is the fixed-window boundary effect: a user can send 100 requests at 12:00:59 and 100 more at 12:01:00, hitting you with 200 requests in 2 seconds while technically respecting the "100 per minute" limit.

Where It Breaks

Three things kill the naive design at scale.

Problem 1: Fixed-window boundary burst. The window resets on a hard boundary (the start of each minute). A client can pile 100 requests at the end of minute N and 100 at the start of minute N+1, delivering a 2x burst in a 2-second window. Your backend, sized for 100 req/min, suddenly gets 200 in 2 seconds. This is not theoretical -- every API team that has used fixed windows has seen this pattern from bots and scrapers that deliberately time their bursts.

Problem 2: Race condition between INCR and EXPIRE. Two commands, not atomic. If Redis or the server fails between them, you leak a key that never expires. Over time, your Redis memory grows unboundedly. The fix is a Lua script (discussed below), but most candidates miss this.

Problem 3: Centralized Redis bottleneck. At 1M req/sec, a single Redis instance tops out around 200K ops/sec. You need sharding. But sharding by user ID means a user's requests always go to the same shard, which is correct for counting. The failure mode is hot keys: when one API key generates 50K req/sec (a big enterprise customer), that single shard gets hammered.

There is a fourth problem that matters for global deployments: if your API servers span multiple regions but your Redis is in one region, every rate check adds cross-region latency (50-150ms). That blows through your 1ms budget entirely.

The Real Design

                        ┌────────────────────────────────────┐
                        │         Rules Configuration        │
                        │  (etcd / config service / DB)      │
                        └────────────────┬───────────────────┘
                                         │ push on change
                                         v
┌──────────┐     ┌──────────────────────────────────────┐
│           │     │            API Gateway               │
│  Client   │────>│  ┌──────────────────────────────┐   │
│           │<────│  │  Rate Limiter Middleware       │   │
│           │     │  │  ┌─────────┐  ┌────────────┐  │   │
└──────────┘     │  │  │ Local   │  │ Sync to    │  │   │
                  │  │  │ Counter │──│ Redis      │  │   │
                  │  │  └─────────┘  └────────────┘  │   │
                  │  └──────────────────────────────┘   │
                  └───────────────┬──────────────────────┘
                                  │
                                  v
                  ┌────────────────────────────────────┐
                  │         Redis Cluster (6 shards)    │
                  │  ┌──────┐ ┌──────┐ ┌──────┐       │
                  │  │Shard1│ │Shard2│ │Shard3│ ...   │
                  │  └──────┘ └──────┘ └──────┘       │
                  └────────────────────────────────────┘

The Five Rate Limiting Algorithms

You need to know all five. The interviewer will ask you to pick one and justify it.

1. Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Count requests in the current window. Reject if over limit.

Window: [12:00, 12:01) -> counter = 78
Window: [12:01, 12:02) -> counter = 12

Pros: Simple, low memory (one counter per user per window)
Cons: Boundary burst problem (2x traffic at window edges)
Memory: 8 bytes per user

2. Sliding Window Log

Store the timestamp of every request. On each new request, remove timestamps older than the window, count remaining.

User 123 log: [11:59:32, 11:59:45, 12:00:01, 12:00:12, 12:00:33]
New request at 12:00:48 -> remove entries before 11:59:48 -> count = 4

Pros: Perfectly accurate, no boundary burst
Cons: O(N) memory per user where N is the rate limit. 
      For 1000 req/min limit, storing 1000 timestamps = 8KB per user.
      5M active users * 8KB = 40 GB. Does not fit in one Redis instance.

3. Sliding Window Counter (the sweet spot)

Combine two adjacent fixed windows with a weighted average based on where you are in the current window.

Previous window (12:00-12:01): 84 requests
Current window  (12:01-12:02): 36 requests
Current time: 12:01:15 (25% into current window)

Weighted count = 84 * (1 - 0.25) + 36 = 84 * 0.75 + 36 = 63 + 36 = 99

Pros: Near-perfect accuracy (within 0.003% error per Cloudflare's measurements)
Cons: Slight approximation (not exact like sliding log)
Memory: 16 bytes per user (two counters)

This is what Cloudflare uses in production. The accuracy is statistically excellent and the memory is tiny.

4. Token Bucket

A bucket holds tokens up to a maximum (burst size). Tokens refill at a constant rate. Each request consumes a token. If the bucket is empty, reject.

Bucket for user 123:
  max_tokens = 100
  refill_rate = 100 tokens/minute
  current_tokens = 47
  last_refill = 12:00:32

Request arrives at 12:00:45:
  elapsed = 13 seconds
  tokens_to_add = 100 * (13/60) = 21.67 -> 21
  current_tokens = min(100, 47 + 21) = 68
  Consume 1 token -> 67 remaining. Allow.

Pros: Natural burst handling (full bucket allows a burst up to max_tokens)
Cons: Two parameters to tune (bucket size, refill rate)
Memory: 16 bytes per user (current_tokens + last_refill_timestamp)

This is what Stripe uses. The token bucket lets Stripe allow short bursts (a batch API call) while enforcing a sustained rate. Amazon API Gateway also uses token bucket.

5. Leaky Bucket

Requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, reject the request.

Queue for user 123:
  max_size = 100
  drain_rate = 100 requests/minute (1 every 0.6 seconds)
  current_size = 42

Pros: Smooths traffic perfectly (output is always at drain_rate)
Cons: Recent requests wait behind old ones. Old requests may become stale
      while waiting. Bad for latency-sensitive APIs.
Memory: Proportional to queue size (actually stores requests)

Leaky bucket is better for traffic shaping (network routers) than API rate limiting. I would not recommend it for this use case.

My recommendation: Token bucket for most API rate limiting. It handles bursts gracefully, is easy to reason about, and maps directly to how you describe limits to users: "100 requests per minute, burst up to 20." Use sliding window counter if you want simpler implementation with no burst tolerance.

Stripe's Rate Limiting Hierarchy

Stripe does not apply a single rate limit. They layer multiple limits:

Layer 1: Global per-API-key     → 100 req/sec (live mode), 25 (test)
Layer 2: Per-endpoint           → /v1/charges: 50 req/sec
Layer 3: Per-resource           → Creating charges on same customer: 5 req/sec
Layer 4: Concurrent requests    → Max 25 in-flight requests per key

Each layer uses a separate token bucket. A request must pass ALL layers to be allowed. This prevents a single customer from monopolizing one endpoint while staying under the global limit.

The concurrent request limit (Layer 4) is particularly clever -- it catches clients that fire 100 slow requests in parallel, which would not trip a rate-per-second limit but would consume 100 backend threads.

Redis Lua Script for Atomicity

The INCR + EXPIRE race condition is solved with a Lua script that executes atomically:

-- Token bucket implementation in Redis Lua
-- KEYS[1] = rate limit key
-- ARGV[1] = max_tokens, ARGV[2] = refill_rate (per sec)
-- ARGV[3] = now (unix timestamp), ARGV[4] = tokens_requested

local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Refill tokens based on elapsed time
local elapsed = math.max(0, now - last_refill)
tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)

return { allowed, tokens }

This entire block executes atomically on Redis. No race conditions. No partial state. The EXPIRE at the end is a safety net -- if a user goes inactive, their key auto-deletes after 2x the refill period.

Distributed: Local + Sync vs. Centralized

At 1M req/sec, you have two architectural choices for the counting layer.

Option A: Centralized (every check hits Redis)

Every request makes a Redis call. Simple and accurate. At 1M req/sec, you need 5-6 Redis shards. Cross-datacenter latency is the killer -- if your API servers are in us-east but Redis is in us-west, you add 60-80ms per request.

Option B: Local counters + periodic sync

Each API server maintains a local counter in memory. Every N seconds (or every N requests), it syncs with Redis. Between syncs, the local counter is used for decisions.

API Server A: local_count = 23, last_sync = 12:00:42
API Server B: local_count = 31, last_sync = 12:00:43
Redis:        synced_total = 89

Reality: user has made 23 + 31 + 89 = 143 requests
But each server only sees its own 23 or 31, so neither rejects.

The trade-off is clear. Local counters are fast (no network hop) but inaccurate. If you have 50 API servers and sync every 5 seconds, a user could get 50 * limit / sync_frequency extra requests through before the system catches up. For a 100 req/min limit, that could mean 150+ requests get through.

The hybrid approach (what I would build):

For high-volume endpoints (> 1000 req/sec per user possible): centralized Redis check. The latency cost is worth the accuracy.
For normal endpoints: local counter with sync every 1 second. Accuracy within 5% is acceptable.
If Redis is unreachable: fall back to local-only counters with a conservative limit (50% of the configured limit per server). This prevents the fail-open path from being exploited.

Deep Dives

Distributed Rate Limiter — Rate Limiter High-Level Design

Deep Dive 1: Race Conditions in Distributed Rate Limiting

The "check-then-increment" pattern has a classic TOCTOU (time-of-check, time-of-use) race:

Thread A: GET counter -> 99 (under limit of 100)
Thread B: GET counter -> 99 (under limit of 100)
Thread A: SET counter = 100 -> allowed
Thread B: SET counter = 100 -> allowed (should have been 101, rejected)

Both threads see 99 and both allow, but 101 requests have now passed. At 1M req/sec with 50 servers, this race is not occasional -- it is constant.

Fix 1: Redis INCR (atomic increment)

count = redis.INCR(key)
if count > limit:
    return 429

INCR is atomic in Redis. Two concurrent calls will return 100 and 101, never both returning 100. This eliminates the race for counter-based algorithms.

Fix 2: Lua scripts for multi-step algorithms

Token bucket requires reading current tokens, computing refill, and writing back -- multiple operations. The Lua script shown above runs atomically. Redis guarantees single-threaded execution of Lua scripts, so no races.

Fix 3: Redis MULTI/EXEC (transactions)

MULTI
INCR rate:user:123:12000
EXPIRE rate:user:123:12000 60
EXEC

This groups INCR and EXPIRE atomically. But MULTI/EXEC does not support conditional logic (you cannot branch on the INCR result inside the transaction). Lua scripts are strictly more powerful.

The subtle race nobody talks about: Even with atomic Redis operations, there is a race between the rate check and the actual request processing. You decrement a token, send 200 OK, and then the request fails downstream. You have consumed a rate limit token for a request that did not actually succeed. Should you refund the token? Most systems do not bother. The complexity is not worth it, and the error is in the conservative direction (you rate limit slightly more aggressively than necessary).

Deep Dive 2: Handling 100M Users Across Multiple Regions

At global scale, the single-region Redis design breaks. Here is what changes.

Sharding strategy: Hash the rate limit key (user_id:endpoint:window) and route to one of N Redis shards. Users are sticky to a shard. With consistent hashing (or Redis Cluster's hash slots), adding a shard does not require re-distributing all keys.

Multi-region deployment:

US-East:  API servers → Redis Cluster (us-east)
EU-West:  API servers → Redis Cluster (eu-west)  
AP-South: API servers → Redis Cluster (ap-south)

If user requests only come from one region, this is fine -- each region has its own Redis. But if a user sends requests from both US and EU simultaneously (CDN, mobile roaming, VPN), their counts are split across two Redis clusters.

Option A: Region-local limits (divide the limit by region count)

Give each region 1/3 of the total limit. User gets 33 req/min in US, 33 in EU, 33 in AP. Simple, but wastes capacity -- a user in only one region gets 1/3 of their limit.

Option B: Async cross-region sync

Each region publishes its local counts to a Kafka topic. Other regions consume and update a global view with a lag of 1-5 seconds. Accurate to within ~5 seconds of request data. This is what Cloudflare uses -- they call it "sliding window with cross-PoP synchronization."

Option C: Single-region source of truth with caching

Route all rate limit checks to one primary region (us-east). Cache the result for 1 second locally. For a 100 req/min limit, 1 second of caching means at most 1.67 extra requests. Acceptable for most use cases. The latency hit (50-80ms cross-region) is the downside.

I would use Option B for global APIs (Cloudflare's approach) and Option C for services where slight latency is acceptable (internal platform services).

Deep Dive 3: What Happens When Redis Goes Down

This is the question that separates mid-level from senior answers. Your rate limiter depends on Redis. Redis goes down. What now?

Fail-open (allow all traffic)

The safest default for user experience. If you cannot check the rate, let the request through. The downside: during a Redis outage, you have zero rate limiting. An attacker who notices the outage can flood your backend.

Fail-closed (reject all traffic)

The safest for backend protection. But if Redis goes down for 2 minutes, every API request returns 429 for 2 minutes. Your users think the API is down. This is almost always wrong.

The right answer: fail-open with local fallback

try:
    allowed = check_redis_rate_limit(user_id, endpoint)
except RedisConnectionError:
    allowed = check_local_rate_limit(user_id, endpoint, 
                                      limit=configured_limit / num_servers)

When Redis is unreachable, fall back to local in-memory counters. Set the local limit to global_limit / num_servers -- if you have 50 servers and the global limit is 100 req/min, each server allows 2 req/min locally. The total across all servers is approximately the global limit.

This is imprecise (a user hitting only one server gets 2 req/min, not 100), but it provides some protection during the outage and never causes a full API blackout.

Circuit breaker on Redis calls: If Redis fails 3 consecutive times within 10 seconds, stop trying for 30 seconds (circuit open). Retry one request every 30 seconds to check if Redis is back (half-open). This prevents thousands of failed Redis calls from adding latency.

Alternative Designs

Alternative 1: API Gateway-Native Rate Limiting

Use your cloud provider's built-in rate limiting (AWS API Gateway, Kong, Envoy). These use token bucket internally and store state in their own distributed backend.

Pros: zero custom code, managed scaling, works out of the box. Cons: limited algorithm choice, harder to customize per-endpoint logic, vendor lock-in on rate limiting configuration.

Alternative 2: Application-Level with Middleware

Rate limiting as a library/middleware in each service (like Guava's RateLimiter in Java). No external dependency.

Pros: no Redis dependency, zero network latency, per-process rate limiting. Cons: no coordination between instances (each server independently limits), so actual throughput is limit * num_servers.

Alternative 3: Sidecar Proxy (Envoy/Istio)

Rate limiting in a sidecar proxy (Envoy) with a centralized rate limit service (Lyft's ratelimit service). Each request goes through the local Envoy sidecar, which calls the rate limit service.

Pros: language-agnostic, consistent across all services, Envoy handles the complexity. Cons: extra hop (sidecar -> rate limit service -> Redis), operational complexity of running Envoy fleet.

Aspect	Centralized Redis	API Gateway Native	App Middleware	Sidecar Proxy
Accuracy	High (single source)	High (managed)	Low (per-instance only)	High (centralized)
Latency overhead	0.2-0.5ms	Built into gateway	~0ms	0.5-1ms
Custom algorithms	Full control	Limited	Full control	Limited to Envoy config
Multi-region support	Build it yourself	Provider handles it	None	Envoy mesh handles it
Ops complexity	Medium (Redis Cluster)	Low (managed)	Low	High (Envoy + service)
Redis failure mode	Need fallback logic	Managed HA	N/A	Need fallback logic

Scaling Math Verification

1M requests/sec distributed across 50 API servers:

Per server: 20K req/sec
Redis calls: 1M/sec (one per request for centralized approach)
Redis Cluster with 6 shards: ~167K calls/shard/sec. Each shard handles 200K+ ops/sec. Headroom: 20%.
Network bandwidth: 1M * 200 bytes (request + response) = 200 MB/sec. Well within 10Gbps NIC.

Memory at peak:

5M active users in any given window * 80 bytes per key = 400 MB
With token bucket (2 fields per key): 5M * 120 bytes = 600 MB
Redis Cluster overhead: ~1.5x data size = 900 MB total
A 2 GB Redis instance per shard handles this easily

Local counter sync overhead:

50 servers syncing every 1 second: 50 sync operations/sec to Redis
Each sync: batch HMSET of ~1000 keys = 50K key updates/sec
This is 0.025% of Redis capacity. Negligible.

Failure Analysis

Component	Current capacity	At 10x (10M req/sec)	Breaks?	Fix
Redis Cluster (6)	200K ops/shard/sec	1.67M ops/shard/sec	Yes	Scale to 60 shards or use local counters
Memory per shard	2 GB	6 GB (50M active users)	No	Larger instances
API gateway latency	0.3ms per rate check	Same	No	--
Network bandwidth	200 MB/sec	2 GB/sec	Maybe	Multiple NICs, reduce payload size
Cross-region sync	50 syncs/sec	500 syncs/sec	No	--
Rule configuration	etcd with 1000 rules	10K rules	No	--

The first thing to break at 10x is Redis throughput per shard. The fix is straightforward: shift to the hybrid local+sync approach for most traffic, using centralized Redis only for critical endpoints (authentication, payment). At 10M req/sec, you cannot afford a Redis call for every request anyway -- the local counter approach with 1-second sync gives you 95% accuracy with zero Redis latency on the hot path.

At 100x (100M req/sec), you are Cloudflare. At that point, rate limiting happens at the network edge using eBPF programs in the kernel, not in application space. The counting moves to shared memory segments that the kernel networking stack can access without context switching to userspace. This is how Cloudflare's "magic firewall" works.

What's Expected at Each Level

Aspect	Mid-Level	Senior	Staff+
Algorithms	Knows token bucket or fixed window	Compares 3+ algorithms with trade-offs	All 5 algorithms, explains why sliding window counter wins
Atomicity	Uses INCR	Identifies INCR+EXPIRE race, mentions Lua	Writes the Lua script, explains Redis single-threaded execution
Distribution	"Use Redis"	Sharding by user, discusses hot keys	Local+sync hybrid, cross-region sync strategies
Failure handling	Not mentioned	Fail-open vs fail-closed	Local fallback with divided limits, circuit breaker on Redis
Multi-dimensional limiting	Single dimension (user ID)	Mentions per-endpoint limits	Stripe's layered hierarchy, concurrent request limiting
Configuration	Hardcoded limits	Config file or database	Dynamic rule engine with etcd push, per-tenant overrides
Headers	Returns 429	Mentions X-RateLimit headers	Retry-After with jitter to prevent thundering herd on retry
Real-world reference	None	"Stripe does X"	Cloudflare's sliding window, Envoy's rate limit service, why leaky bucket is wrong for APIs

The single most important signal at any level: do you understand that the hard part is not the algorithm -- it is making the algorithm work correctly when the same user's requests are hitting 50 different servers simultaneously? Mid-level candidates spend 30 minutes on token bucket. Senior candidates spend 5 minutes on the algorithm and 25 minutes on distribution, failure modes, and configuration management.

References from Our Courses

Redis Interview Patterns — sliding window and token bucket implementations in Redis
Traffic Control — backpressure and load shedding strategies
Kafka Partitions and Ordering — async counter synchronization across rate limiter nodes

Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Distributed Rate Limiter →