Traffic Control — Token Bucket and Leaky Bucket at Scale

TL;DR

Token bucket and leaky bucket are the two fundamental algorithms for rate limiting. Token bucket: tokens are added at rate r, bucket holds max b tokens, each request consumes one token, bursts up to b are allowed. Leaky bucket: requests enter a queue, processed at a fixed rate, smooths traffic, no bursts. At scale, distributed rate limiting uses Redis + Lua for atomicity, with a local-plus-global strategy for low latency. Stripe uses hierarchical rate limiting (global, per-service, per-user). The choice between the two depends on whether you want to allow bursts (token bucket) or enforce smooth traffic (leaky bucket).

The Problem

You run an API that handles 10,000 requests per second. A single client starts sending 50,000 requests per second. Without rate limiting, that client consumes 5x its fair share, degrading performance for everyone else. Worse, a DDoS attack could send millions of requests per second, overwhelming your servers entirely.

Rate limiting is not optional for any public API. It protects backend services from overload, ensures fair usage across clients, prevents abuse, and maintains SLA guarantees. Every major API (Stripe, GitHub, Twitter, AWS) has rate limits.

The question is not whether to rate-limit. It is which algorithm to use, how to enforce it in a distributed system, and where to draw the limits.

The Algorithms

Token Bucket

Imagine a bucket that holds tokens. Tokens are added at a steady rate r (tokens per second). The bucket can hold at most b tokens (the burst capacity). Each request consumes one token. If no tokens are available, the request is rejected (or queued).

State:
  tokens: current number of tokens (float, 0 to b)
  last_refill_time: timestamp of last token addition

On request:
  now = current_time()
  elapsed = now - last_refill_time
  tokens = min(b, tokens + elapsed * r)   # add tokens for elapsed time
  last_refill_time = now

  if tokens >= 1:
    tokens -= 1
    ALLOW request
  else:
    REJECT request (HTTP 429 Too Many Requests)

Parameters:

Parameter	Meaning	Example
r	Refill rate (tokens/sec)	10 (10 requests per second)
b	Bucket size (max tokens)	20 (burst of 20 allowed)

Worked Example: Token Bucket

r = 10 tokens/sec, b = 20
Time 0: tokens = 20 (bucket starts full)

Time 0.00: Burst of 20 requests → 20 tokens consumed. tokens = 0. All allowed.
Time 0.05: 1 request → 0.5 tokens added (0.05 * 10). 0.5 < 1. REJECTED.
Time 0.10: 1 request → 0.5 more added. Total 1.0. tokens = 0. ALLOWED.
Time 0.20: 1 request → 1.0 added. tokens = 0. ALLOWED.
Time 1.00: No requests for 0.8s → 8.0 tokens added. tokens = 8.
Time 1.00: Burst of 8 requests → all allowed. tokens = 0.
Time 2.00: 10 tokens added. tokens = 10.

Key behavior: the token bucket allows bursts (up to b requests at once) but averages to r requests per second over time.

Leaky Bucket

Imagine a bucket with a hole at the bottom. Requests flow in at the top (at any rate). They flow out (are processed) at a fixed rate r. If the bucket is full (queue capacity b), new requests overflow (are rejected).

State:
  queue: FIFO queue of requests, max size b
  processing_rate: r requests per second

On request:
  if queue.size < b:
    queue.enqueue(request)
  else:
    REJECT request (HTTP 429)

Background processor (runs continuously):
  while true:
    sleep(1 / r)
    if queue is not empty:
      request = queue.dequeue()
      process(request)

Key difference from token bucket: The leaky bucket does not allow bursts. Even if 100 requests arrive simultaneously, they are processed at a steady rate of r per second. This smooths the output traffic.

Comparison

Aspect	Token Bucket	Leaky Bucket
Burst behavior	Allows bursts up to b	No bursts (constant output rate)
Output rate	Variable (up to b instantly)	Fixed (always r per second)
Implementation	Counter-based (lightweight)	Queue-based (heavier)
Memory	O(1) per rate limit	O(b) per rate limit (queue)
Latency impact	None (instant accept/reject)	Adds queueing delay
Use case	APIs (allow occasional bursts)	Network shaping (smooth traffic)

In practice, token bucket is far more common for API rate limiting. It is simpler (no queue management), uses less memory, and the burst allowance is actually desirable: a legitimate client might make a batch of API calls, wait, then make another batch. The token bucket handles this naturally.

Leaky bucket is more common in network traffic shaping (e.g., network switches smoothing packet rates) and in systems where downstream services cannot tolerate any burst (e.g., a payment processor with a strict TPS limit from its bank).

Fixed Window and Sliding Window

Two additional rate limiting strategies commonly used alongside or instead of bucket algorithms:

Fixed Window:

Divide time into fixed windows (e.g., 1-minute windows).
Count requests in the current window.
If count > limit: reject.

Problem: burst at window boundary.
  Window 1 (0:00-0:59): 100 requests at 0:59. Allowed.
  Window 2 (1:00-1:59): 100 requests at 1:00. Allowed.
  200 requests in 2 seconds! Limit is 100/minute but boundary burst is 2x.

Sliding Window Log:

Keep a log of timestamps of all requests.
On new request:
  Remove timestamps older than (now - window_size).
  Count remaining timestamps.
  If count > limit: reject.

Accurate but expensive: O(n) memory per client, where n = limit.

Sliding Window Counter (hybrid):

Weighted count from previous window + count from current window.
  weight = 1 - (elapsed time in current window / window size)
  estimate = previous_count * weight + current_count

Less accurate than sliding log but O(1) memory.

Distributed Rate Limiting

The Problem

A single-node rate limiter is straightforward. But your API runs on 50 servers behind a load balancer. If each server has its own token bucket, a client sending to different servers can exceed the global limit (each server allows its share).

Global limit: 100 req/sec
50 servers, each with local limit: 2 req/sec

Client sends 100 req/sec, evenly distributed:
  Each server sees 2 req/sec → each allows it.
  Client gets 100 req/sec. Correct.

Client sends 100 req/sec, all to one server:
  One server sees 100 req/sec → rejects 98. Client gets 2 req/sec.
  Other 49 servers see 0. Incorrect! Should get 100 total.

Solution: Centralized Counter (Redis + Lua)

Use Redis as a shared counter. Each API server checks Redis before allowing a request.

-- Redis Lua script for token bucket (atomic execution)
local key = KEYS[1]
local rate = tonumber(ARGV[1])       -- tokens per second
local capacity = tonumber(ARGV[2])   -- bucket size
local now = tonumber(ARGV[3])        -- current timestamp
local requested = tonumber(ARGV[4])  -- tokens requested (usually 1)

local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * rate)

-- Check and consume
if tokens >= requested then
    tokens = tokens - requested
    redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
    redis.call('expire', key, math.ceil(capacity / rate) * 2)
    return 1  -- allowed
else
    redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
    redis.call('expire', key, math.ceil(capacity / rate) * 2)
    return 0  -- rejected
end

Why Lua? Redis executes Lua scripts atomically. Without Lua, a race condition exists: server A reads tokens=1, server B reads tokens=1, both decrement, tokens goes to -1. The Lua script makes the read-check-decrement atomic.

Local + Global Strategy

The Redis round-trip adds ~1ms to every request. For latency-sensitive APIs, this is too much. The local-plus-global strategy reduces the overhead:

Each API server:
  1. Maintains a local token bucket (fast, no network).
  2. Periodically synchronizes with the global counter in Redis.

Approach 1: Local quota allocation
  Global limit: 1000 req/sec. 50 servers.
  Each server gets a local quota of 20 req/sec (1000/50).
  Periodically rebalance quotas based on actual traffic distribution.

Approach 2: Local counter with periodic sync
  Each server counts locally for 1 second.
  Every second, reports its count to Redis and gets back "allowed/denied" for the next second.
  Approximate but avoids per-request Redis calls.

The local approach is less precise (a server might temporarily allow more than its quota before the next sync) but eliminates the Redis round-trip from the hot path.

Stripe's Hierarchical Rate Limiting

Stripe's approach is worth studying because it represents production-grade rate limiting for a financial API:

Level 1: Global rate limit
  Entire API: 10,000 req/sec across all clients.
  Protects Stripe's infrastructure from total overload.

Level 2: Per-merchant rate limit
  Each Stripe merchant: 100 req/sec default.
  Can be increased for high-volume merchants.

Level 3: Per-endpoint rate limit
  POST /v1/charges: 50 req/sec per merchant.
  GET /v1/customers: 200 req/sec per merchant.
  Write endpoints have lower limits than read endpoints.

Level 4: Per-IP rate limit
  Unauthenticated requests: 20 req/sec per IP.
  Prevents brute-force attacks on the auth endpoint.

Each level uses a token bucket. A request must pass ALL levels. This creates defense in depth: even if a per-merchant limit is misconfigured, the global limit provides a backstop.

Proof/Correctness Intuition

Token Bucket: Why It Enforces Average Rate

Over any time period T, the maximum number of tokens generated is r * T + b (rate * time + initial burst). Therefore, the maximum number of requests is r * T + b. As T grows large, this approaches r * T, meaning the long-run average rate is r.

The burst parameter b controls the short-term deviation. A small b makes the rate limiter strict (nearly constant rate). A large b allows significant bursts but still enforces the long-term average.

Leaky Bucket: Why It Smooths Traffic

The output rate is exactly r per second, regardless of input rate. Even if 1000 requests arrive in 1 millisecond, they are dequeued at rate r. The queue absorbs the burst. If the queue overflows, excess requests are dropped. The output is always smooth.

Real-World Usage

Company	Strategy	Implementation
Stripe	Hierarchical (global/merchant/endpoint/IP)	Token bucket, Redis
GitHub	Per-user, per-endpoint, per-IP	Sliding window
Twitter/X	Per-user, per-app, per-endpoint	Token bucket
AWS API GW	Per-API, per-stage, per-method	Token bucket
Cloudflare	Per-IP, per-path, per-region	Leaky bucket variant
Kong	Plugin-based, configurable per route	Redis-backed token bucket

HTTP rate limit headers (RFC 6585 + draft-ietf-httpapi-ratelimit-headers):

HTTP/1.1 429 Too Many Requests
Retry-After: 1
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1618884480

Good APIs return these headers on EVERY response (not just 429s) so clients can self-throttle.

Interview Application

When to mention rate limiting:

"Design an API gateway." -- Rate limiting is a core feature. Token bucket per user, per endpoint.
"How do you protect a service from a DDoS attack?" -- Rate limiting per IP at the edge (CDN/load balancer level).
"How would you design a notification system that sends at most 5 emails per hour?" -- Token bucket with r=5/3600, b=5.
"How do you handle API abuse?" -- Hierarchical rate limiting. Global + per-user + per-IP.

What interviewers want to hear:

You know the difference between token bucket (allows bursts) and leaky bucket (smooth output).
You understand the distributed rate limiting problem (single Redis + Lua for atomicity).
You can discuss the local-plus-global strategy for latency reduction.
You know the HTTP 429 response code and Retry-After header.
You can design a multi-level rate limiting system (Stripe's approach).

Trade-offs

Leaky Bucket

Aspect	Token Bucket	Leaky Bucket	Sliding Window
Burst handling	Allows (configurable)	Prevents	Depends on variant
Memory per key	O(1)	O(b) (queue)	O(1) counter / O(n) log
Precision	Approximate	Exact output rate	Exact (log) / Approximate (counter)
Latency	None (instant decision)	Queueing delay	None
Implementation	Simple	Moderate	Simple to moderate
Distributed	Redis counter	Harder (shared queue)	Redis counter

Where to Rate-Limit

Edge (CDN / Load Balancer):
  Per-IP rate limiting. Stops DDoS before traffic reaches your servers.
  Tools: Cloudflare, AWS WAF, nginx limit_req module.

API Gateway:
  Per-user, per-endpoint rate limiting. Enforces business limits.
  Tools: Kong, AWS API Gateway, Envoy.

Service Level:
  Per-resource rate limiting. Protects specific backends.
  Tools: Application-level middleware, Resilience4j, Guava RateLimiter.

Rate limiting at multiple layers provides defense in depth. Edge rate limiting is coarse (IP-based). API gateway rate limiting is fine-grained (user, endpoint, plan). Service-level rate limiting protects individual components.

Common Mistakes

Rate limiting only the API endpoint is sufficient

A client can bypass API rate limits by using multiple IPs, multiple API keys, or targeting unauthenticated endpoints. You need rate limiting at multiple layers: IP-level at the edge, user-level at the API gateway, and resource-level at the service.

Token bucket and leaky bucket are the same thing

They have fundamentally different burst behavior. Token bucket allows a burst of b requests instantly. Leaky bucket queues requests and processes them at a constant rate. For an API that should tolerate occasional bursts (batch operations), use token bucket. For a downstream system that cannot handle any burst (payment gateway with strict TPS), use leaky bucket.

Distributed rate limiting with Redis is too slow

A Redis round-trip is ~1ms from the same datacenter. For an API with 100ms response time, 1ms is 1% overhead. For most APIs, this is acceptable. For latency-critical paths, use the local-plus-global strategy (local counters with periodic Redis sync).

Set the rate limit to the server's maximum capacity

The rate limit should be lower than your maximum capacity to leave headroom for retries, traffic spikes, and graceful degradation. If your server can handle 1000 req/sec, set the per-user limit to 100 and accept at most 10 heavy users simultaneously.

Rate limiting returns 429 with no information

Always return Retry-After header (how many seconds to wait) and RateLimit-Remaining (how many requests remain). Good clients use these headers to self-throttle, reducing load on your rate limiter. Bad error responses cause clients to retry aggressively, making the overload worse.

Fixed window rate limiting is good enough

Fixed window has the boundary burst problem: a client can send 2x the limit in a short period by timing requests at the boundary of two windows. Sliding window or token bucket avoids this. Fixed window is only acceptable if your system can tolerate occasional 2x bursts.