Traffic Control — Token Bucket and Leaky Bucket at Scale
TL;DR
Token bucket and leaky bucket are the two fundamental algorithms for rate limiting. Token bucket: tokens are added at rate r, bucket holds max b tokens, each request consumes one token, bursts up to b are allowed. Leaky bucket: requests enter a queue, processed at a fixed rate, smooths traffic, no bursts. At scale, distributed rate limiting uses Redis + Lua for atomicity, with a local-plus-global strategy for low latency. Stripe uses hierarchical rate limiting (global, per-service, per-user). The choice between the two depends on whether you want to allow bursts (token bucket) or enforce smooth traffic (leaky bucket).
The Problem
You run an API that handles 10,000 requests per second. A single client starts sending 50,000 requests per second. Without rate limiting, that client consumes 5x its fair share, degrading performance for everyone else. Worse, a DDoS attack could send millions of requests per second, overwhelming your servers entirely.
Rate limiting is not optional for any public API. It protects backend services from overload, ensures fair usage across clients, prevents abuse, and maintains SLA guarantees. Every major API (Stripe, GitHub, Twitter, AWS) has rate limits.
The question is not whether to rate-limit. It is which algorithm to use, how to enforce it in a distributed system, and where to draw the limits.
The Algorithms
Token Bucket
Imagine a bucket that holds tokens. Tokens are added at a steady rate r (tokens per second). The bucket can hold at most b tokens (the burst capacity). Each request consumes one token. If no tokens are available, the request is rejected (or queued).
State:
tokens: current number of tokens (float, 0 to b)
last_refill_time: timestamp of last token addition
On request:
now = current_time()
elapsed = now - last_refill_time
tokens = min(b, tokens + elapsed * r) # add tokens for elapsed time
last_refill_time = now
if tokens >= 1:
tokens -= 1
ALLOW request
else:
REJECT request (HTTP 429 Too Many Requests)
Parameters:
| Parameter | Meaning | Example |
|---|---|---|
| r | Refill rate (tokens/sec) | 10 (10 requests per second) |
| b | Bucket size (max tokens) | 20 (burst of 20 allowed) |
Worked Example: Token Bucket
r = 10 tokens/sec, b = 20
Time 0: tokens = 20 (bucket starts full)
Time 0.00: Burst of 20 requests → 20 tokens consumed. tokens = 0. All allowed.
Time 0.05: 1 request → 0.5 tokens added (0.05 * 10). 0.5 < 1. REJECTED.
Time 0.10: 1 request → 0.5 more added. Total 1.0. tokens = 0. ALLOWED.
Time 0.20: 1 request → 1.0 added. tokens = 0. ALLOWED.
Time 1.00: No requests for 0.8s → 8.0 tokens added. tokens = 8.
Time 1.00: Burst of 8 requests → all allowed. tokens = 0.
Time 2.00: 10 tokens added. tokens = 10.
Key behavior: the token bucket allows bursts (up to b requests at once) but averages to r requests per second over time.
Leaky Bucket
Imagine a bucket with a hole at the bottom. Requests flow in at the top (at any rate). They flow out (are processed) at a fixed rate r. If the bucket is full (queue capacity b), new requests overflow (are rejected).
State:
queue: FIFO queue of requests, max size b
processing_rate: r requests per second
On request:
if queue.size < b:
queue.enqueue(request)
else:
REJECT request (HTTP 429)
Background processor (runs continuously):
while true:
sleep(1 / r)
if queue is not empty:
request = queue.dequeue()
process(request)
Key difference from token bucket: The leaky bucket does not allow bursts. Even if 100 requests arrive simultaneously, they are processed at a steady rate of r per second. This smooths the output traffic.
Comparison
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst behavior | Allows bursts up to b | No bursts (constant output rate) |
| Output rate | Variable (up to b instantly) | Fixed (always r per second) |
| Implementation | Counter-based (lightweight) | Queue-based (heavier) |
| Memory | O(1) per rate limit | O(b) per rate limit (queue) |
| Latency impact | None (instant accept/reject) | Adds queueing delay |
| Use case | APIs (allow occasional bursts) | Network shaping (smooth traffic) |
In practice, token bucket is far more common for API rate limiting. It is simpler (no queue management), uses less memory, and the burst allowance is actually desirable: a legitimate client might make a batch of API calls, wait, then make another batch. The token bucket handles this naturally.
Leaky bucket is more common in network traffic shaping (e.g., network switches smoothing packet rates) and in systems where downstream services cannot tolerate any burst (e.g., a payment processor with a strict TPS limit from its bank).
Fixed Window and Sliding Window
Two additional rate limiting strategies commonly used alongside or instead of bucket algorithms:
Fixed Window:
Divide time into fixed windows (e.g., 1-minute windows).
Count requests in the current window.
If count > limit: reject.
Problem: burst at window boundary.
Window 1 (0:00-0:59): 100 requests at 0:59. Allowed.
Window 2 (1:00-1:59): 100 requests at 1:00. Allowed.
200 requests in 2 seconds! Limit is 100/minute but boundary burst is 2x.
Sliding Window Log:
Keep a log of timestamps of all requests.
On new request:
Remove timestamps older than (now - window_size).
Count remaining timestamps.
If count > limit: reject.
Accurate but expensive: O(n) memory per client, where n = limit.
Sliding Window Counter (hybrid):
Weighted count from previous window + count from current window.
weight = 1 - (elapsed time in current window / window size)
estimate = previous_count * weight + current_count
Less accurate than sliding log but O(1) memory.
Distributed Rate Limiting
The Problem
A single-node rate limiter is straightforward. But your API runs on 50 servers behind a load balancer. If each server has its own token bucket, a client sending to different servers can exceed the global limit (each server allows its share).
Global limit: 100 req/sec
50 servers, each with local limit: 2 req/sec
Client sends 100 req/sec, evenly distributed:
Each server sees 2 req/sec → each allows it.
Client gets 100 req/sec. Correct.
Client sends 100 req/sec, all to one server:
One server sees 100 req/sec → rejects 98. Client gets 2 req/sec.
Other 49 servers see 0. Incorrect! Should get 100 total.
Solution: Centralized Counter (Redis + Lua)
Use Redis as a shared counter. Each API server checks Redis before allowing a request.
-- Redis Lua script for token bucket (atomic execution)
local key = KEYS[1]
local rate = tonumber(ARGV[1]) -- tokens per second
local capacity = tonumber(ARGV[2]) -- bucket size
local now = tonumber(ARGV[3]) -- current timestamp
local requested = tonumber(ARGV[4]) -- tokens requested (usually 1)
local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill tokens
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * rate)
-- Check and consume
if tokens >= requested then
tokens = tokens - requested
redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
redis.call('expire', key, math.ceil(capacity / rate) * 2)
return 1 -- allowed
else
redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
redis.call('expire', key, math.ceil(capacity / rate) * 2)
return 0 -- rejected
end
Why Lua? Redis executes Lua scripts atomically. Without Lua, a race condition exists: server A reads tokens=1, server B reads tokens=1, both decrement, tokens goes to -1. The Lua script makes the read-check-decrement atomic.
Local + Global Strategy
The Redis round-trip adds ~1ms to every request. For latency-sensitive APIs, this is too much. The local-plus-global strategy reduces the overhead:
Each API server:
1. Maintains a local token bucket (fast, no network).
2. Periodically synchronizes with the global counter in Redis.
Approach 1: Local quota allocation
Global limit: 1000 req/sec. 50 servers.
Each server gets a local quota of 20 req/sec (1000/50).
Periodically rebalance quotas based on actual traffic distribution.
Approach 2: Local counter with periodic sync
Each server counts locally for 1 second.
Every second, reports its count to Redis and gets back "allowed/denied" for the next second.
Approximate but avoids per-request Redis calls.
The local approach is less precise (a server might temporarily allow more than its quota before the next sync) but eliminates the Redis round-trip from the hot path.
Stripe's Hierarchical Rate Limiting
Stripe's approach is worth studying because it represents production-grade rate limiting for a financial API:
Level 1: Global rate limit
Entire API: 10,000 req/sec across all clients.
Protects Stripe's infrastructure from total overload.
Level 2: Per-merchant rate limit
Each Stripe merchant: 100 req/sec default.
Can be increased for high-volume merchants.
Level 3: Per-endpoint rate limit
POST /v1/charges: 50 req/sec per merchant.
GET /v1/customers: 200 req/sec per merchant.
Write endpoints have lower limits than read endpoints.
Level 4: Per-IP rate limit
Unauthenticated requests: 20 req/sec per IP.
Prevents brute-force attacks on the auth endpoint.
Each level uses a token bucket. A request must pass ALL levels. This creates defense in depth: even if a per-merchant limit is misconfigured, the global limit provides a backstop.
Proof/Correctness Intuition
Token Bucket: Why It Enforces Average Rate
Over any time period T, the maximum number of tokens generated is r * T + b (rate * time + initial burst). Therefore, the maximum number of requests is r * T + b. As T grows large, this approaches r * T, meaning the long-run average rate is r.
The burst parameter b controls the short-term deviation. A small b makes the rate limiter strict (nearly constant rate). A large b allows significant bursts but still enforces the long-term average.
Leaky Bucket: Why It Smooths Traffic
The output rate is exactly r per second, regardless of input rate. Even if 1000 requests arrive in 1 millisecond, they are dequeued at rate r. The queue absorbs the burst. If the queue overflows, excess requests are dropped. The output is always smooth.
Real-World Usage
| Company | Strategy | Implementation |
|---|---|---|
| Stripe | Hierarchical (global/merchant/endpoint/IP) | Token bucket, Redis |
| GitHub | Per-user, per-endpoint, per-IP | Sliding window |
| Twitter/X | Per-user, per-app, per-endpoint | Token bucket |
| AWS API GW | Per-API, per-stage, per-method | Token bucket |
| Cloudflare | Per-IP, per-path, per-region | Leaky bucket variant |
| Kong | Plugin-based, configurable per route | Redis-backed token bucket |
HTTP rate limit headers (RFC 6585 + draft-ietf-httpapi-ratelimit-headers):
HTTP/1.1 429 Too Many Requests
Retry-After: 1
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1618884480
Good APIs return these headers on EVERY response (not just 429s) so clients can self-throttle.
Interview Application
When to mention rate limiting:
- "Design an API gateway." -- Rate limiting is a core feature. Token bucket per user, per endpoint.
- "How do you protect a service from a DDoS attack?" -- Rate limiting per IP at the edge (CDN/load balancer level).
- "How would you design a notification system that sends at most 5 emails per hour?" -- Token bucket with r=5/3600, b=5.
- "How do you handle API abuse?" -- Hierarchical rate limiting. Global + per-user + per-IP.
What interviewers want to hear:
- You know the difference between token bucket (allows bursts) and leaky bucket (smooth output).
- You understand the distributed rate limiting problem (single Redis + Lua for atomicity).
- You can discuss the local-plus-global strategy for latency reduction.
- You know the HTTP 429 response code and Retry-After header.
- You can design a multi-level rate limiting system (Stripe's approach).
Trade-offs

| Aspect | Token Bucket | Leaky Bucket | Sliding Window |
|---|---|---|---|
| Burst handling | Allows (configurable) | Prevents | Depends on variant |
| Memory per key | O(1) | O(b) (queue) | O(1) counter / O(n) log |
| Precision | Approximate | Exact output rate | Exact (log) / Approximate (counter) |
| Latency | None (instant decision) | Queueing delay | None |
| Implementation | Simple | Moderate | Simple to moderate |
| Distributed | Redis counter | Harder (shared queue) | Redis counter |
Where to Rate-Limit
Edge (CDN / Load Balancer):
Per-IP rate limiting. Stops DDoS before traffic reaches your servers.
Tools: Cloudflare, AWS WAF, nginx limit_req module.
API Gateway:
Per-user, per-endpoint rate limiting. Enforces business limits.
Tools: Kong, AWS API Gateway, Envoy.
Service Level:
Per-resource rate limiting. Protects specific backends.
Tools: Application-level middleware, Resilience4j, Guava RateLimiter.
Rate limiting at multiple layers provides defense in depth. Edge rate limiting is coarse (IP-based). API gateway rate limiting is fine-grained (user, endpoint, plan). Service-level rate limiting protects individual components.
Common Mistakes
Rate limiting only the API endpoint is sufficient
A client can bypass API rate limits by using multiple IPs, multiple API keys, or targeting unauthenticated endpoints. You need rate limiting at multiple layers: IP-level at the edge, user-level at the API gateway, and resource-level at the service.
Token bucket and leaky bucket are the same thing
They have fundamentally different burst behavior. Token bucket allows a burst of b requests instantly. Leaky bucket queues requests and processes them at a constant rate. For an API that should tolerate occasional bursts (batch operations), use token bucket. For a downstream system that cannot handle any burst (payment gateway with strict TPS), use leaky bucket.
Distributed rate limiting with Redis is too slow
A Redis round-trip is ~1ms from the same datacenter. For an API with 100ms response time, 1ms is 1% overhead. For most APIs, this is acceptable. For latency-critical paths, use the local-plus-global strategy (local counters with periodic Redis sync).
Set the rate limit to the server's maximum capacity
The rate limit should be lower than your maximum capacity to leave headroom for retries, traffic spikes, and graceful degradation. If your server can handle 1000 req/sec, set the per-user limit to 100 and accept at most 10 heavy users simultaneously.
Rate limiting returns 429 with no information
Always return Retry-After header (how many seconds to wait) and RateLimit-Remaining (how many requests remain). Good clients use these headers to self-throttle, reducing load on your rate limiter. Bad error responses cause clients to retry aggressively, making the overload worse.
Fixed window rate limiting is good enough
Fixed window has the boundary burst problem: a client can send 2x the limit in a short period by timing requests at the boundary of two windows. Sliding window or token bucket avoids this. Fixed window is only acceptable if your system can tolerate occasional 2x bursts.