API Gateway — Rate Limiting, Auth, and Routing

TL;DR

An API gateway is the front door to your backend. It handles rate limiting, authentication, routing, and SSL termination so your services don't have to. Without one, every service reinvents the same cross-cutting concerns badly.

What It Is

Gateway Responsibilities

An API gateway sits between clients and your backend services. Every request flows through it. It acts as a single entry point that handles the boring-but-critical stuff: who are you, are you allowed, how fast can you go, and which service gets your request.

This is different from a load balancer. A load balancer distributes traffic across instances of the same service. An API gateway routes traffic to different services based on the request path, adds authentication context, enforces rate limits, and transforms requests. A load balancer is a traffic cop. An API gateway is a concierge.

Netflix routes all API traffic through their Zuul gateway. It handles authentication, rate limiting, request routing, canary deployments, and logging. Before Zuul, each backend team implemented these concerns independently — inconsistently, with bugs, and with no central visibility. The gateway consolidated all of it.

Why Not Just a Load Balancer

This question comes up in every interview where you mention a gateway. Be ready.

A load balancer operates at Layer 4 (TCP) or Layer 7 (HTTP). It can route based on IP, port, or URL path. It can do health checks and distribute traffic. But it doesn't understand your application.

What a load balancer CAN do:
  ✓ Route /api/* to backend pool A
  ✓ Health check: remove unhealthy instances
  ✓ SSL termination
  ✓ Round-robin or least-connections distribution

What a plain load balancer typically CANNOT do:
  ✗ Validate a JWT and extract user claims
  ✗ Rate limit per-user (it doesn't know who the user is)
  ✗ Transform request/response bodies
  ✗ Aggregate responses from multiple backends
  ✗ Route 5% of user-123's traffic to a canary deployment
  ✗ Add request tracing headers (correlation IDs)

In practice, you need both. The gateway handles application-level concerns. The load balancer handles instance-level distribution. They sit at different points in the request path.

Request Flow — Where the Gateway Sits

Understanding the full request path is critical for interviews. Here's how a request travels from browser to backend.

Client (browser/mobile)
  ↓
DNS resolution (Route 53, Cloudflare DNS)
  ↓
CDN (CloudFront, Cloudflare) — serves static assets,
  caches GET responses
  ↓
API Gateway — authenticates, rate limits, routes
  ↓
Load Balancer — distributes to healthy instances
  ↓
Service instance — processes the request
  ↓
Response travels back up the same path

Some architectures merge the gateway and load balancer. AWS API Gateway does both. Kong can do both. In a Kubernetes cluster, an Ingress controller often serves as both gateway and load balancer. The logical responsibilities remain the same even when the physical components merge.

Simplified Kubernetes flow:

Client → Ingress Controller (gateway + LB) → Service → Pod

Gotcha

Don't expect a CDN to help with POST/PUT requests. CDNs cache GET responses; mutation requests are either passed through uncached or routed around the CDN, so they always reach the gateway. Interviewers will test whether you understand which requests benefit from a CDN.

Rate Limiting at the Gateway

Rate limiting is the gateway's most visible job. Without it, one misbehaving client can overwhelm your entire backend.

Why Rate Limit at the Gateway

If each service implements its own rate limiting, you get inconsistency. The user service limits at 100 req/s. The order service limits at 50 req/s. The payment service has no limit at all (oops). A malicious actor discovers the unprotected payment service and hammers it.

Central rate limiting at the gateway means one policy, one place, one enforcement point. Services behind the gateway trust that traffic has already been throttled.

Rate Limiting Strategies

Per-user: each authenticated user gets a quota. The free tier gets 100 requests per minute. The premium tier gets 10,000.

Per-IP: unauthenticated requests are limited by source IP. Prevents brute-force login attempts and DDoS from single sources.

Per-endpoint: sensitive endpoints get tighter limits. /api/login allows 5 attempts per minute. /api/search allows 100.

Global: total requests across all users. Protects against coordinated attacks or sudden viral traffic.
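
These strategies are usually combined, with several counters checked per request. The sketch below derives the counter keys and limits that apply to a request. The `RequestInfo` type, the key format, and the per-IP and global values are assumptions for illustration; only the tier and endpoint numbers come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Requests per minute. The per-IP and global values are assumed.
TIER_LIMITS = {"free": 100, "premium": 10_000}
ENDPOINT_LIMITS = {"/api/login": 5, "/api/search": 100}
PER_IP_LIMIT = 60
GLOBAL_LIMIT = 1_000_000


@dataclass
class RequestInfo:
    path: str
    ip: str
    user_id: Optional[str] = None  # None for unauthenticated requests
    tier: str = "free"


def limits_for(req: RequestInfo) -> List[Tuple[str, int]]:
    """Return (counter_key, limit) pairs that must all pass for this request."""
    checks = [("rl:global", GLOBAL_LIMIT)]
    if req.user_id is not None:
        checks.append((f"rl:user:{req.user_id}", TIER_LIMITS[req.tier]))
        who = f"user:{req.user_id}"
    else:
        checks.append((f"rl:ip:{req.ip}", PER_IP_LIMIT))
        who = f"ip:{req.ip}"
    if req.path in ENDPOINT_LIMITS:
        # Endpoint limits are tracked per caller, not shared across callers.
        checks.append((f"rl:endpoint:{req.path}:{who}", ENDPOINT_LIMITS[req.path]))
    return checks
```

Each returned key would then be checked against whichever algorithm the gateway uses; the request is allowed only if every check passes.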

Token Bucket Algorithm

The most common rate limiting algorithm. Simple to understand, simple to implement, handles bursts gracefully.

Token Bucket:

Bucket capacity: 10 tokens
Refill rate: 1 token per second

t=0:  bucket has 10 tokens
t=0:  request arrives → consume 1 token → 9 remaining → ALLOW
t=0:  request arrives → consume 1 token → 8 remaining → ALLOW
...
t=0:  10 requests → 0 tokens remaining
t=0:  11th request → no tokens → REJECT (429 Too Many Requests)
t=1:  1 token added → 1 remaining
t=1:  request arrives → consume 1 token → 0 remaining → ALLOW
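
The timeline above can be reproduced with a small in-memory bucket. This is a single-process sketch for building intuition, not a distributed implementation; the `TokenBucket` class and its explicit `now` parameter are illustrative choices that make the trace deterministic.

```python
class TokenBucket:
    """In-memory token bucket; `now` is passed in so the timeline is reproducible."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity          # bucket starts full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds with 429


bucket = TokenBucket(capacity=10, refill_rate=1)
burst = [bucket.allow(now=0) for _ in range(11)]  # 10 allowed, 11th rejected
after_one_second = bucket.allow(now=1)            # 1 token refilled, allowed
```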

# Token bucket implementation (Redis-based for distributed systems)
import time
import redis

r = redis.Redis()

def is_allowed(user_id, capacity=10, refill_rate=1):
    key = f"ratelimit:{user_id}"

    with r.pipeline() as pipe:
        while True:
            try:
                # WATCH guards the read-modify-write below: if another gateway
                # instance changes the key first, EXEC fails and we retry.
                pipe.watch(key)
                result = pipe.hgetall(key)
                now = time.time()

                if result:
                    tokens = float(result[b'tokens'])
                    last_refill = float(result[b'last_refill'])
                    # Refill tokens based on elapsed time
                    elapsed = now - last_refill
                    tokens = min(capacity, tokens + elapsed * refill_rate)
                else:
                    # First request: start with a full bucket
                    tokens = capacity

                allowed = tokens >= 1
                if allowed:
                    tokens -= 1

                pipe.multi()
                pipe.hset(key, mapping={'tokens': tokens, 'last_refill': now})
                pipe.expire(key, 3600)  # drop idle buckets after an hour
                pipe.execute()
                return allowed  # False means respond with 429
            except redis.WatchError:
                continue  # key changed under us; retry with fresh state

Sliding Window Algorithm

More precise than token bucket for strict per-minute limits. Counts exact requests in a sliding time window.

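# Sliding window with Redis sorted sets
from uuid import uuid4

def is_allowed_sliding(user_id, limit=100, window_seconds=60):
    key = f"ratelimit:sliding:{user_id}"
    now = time.time()
    window_start = now - window_seconds

    # A redis-py pipeline is transactional by default (MULTI/EXEC),
    # so these four commands run atomically.
    pipe = r.pipeline()
    # Remove expired entries
    pipe.zremrangebyscore(key, 0, window_start)
    # Count remaining entries (before the current request is added)
    pipe.zcard(key)
    # Add current request; the uuid keeps same-timestamp members unique
    pipe.zadd(key, {f"{now}:{uuid4()}": now})
    # Set expiry on the key itself
    pipe.expire(key, window_seconds)
    results = pipe.execute()

    request_count = results[1]

    # Note: rejected requests are still recorded, so a client that keeps
    # hammering stays blocked until it actually backs off.
    if request_count >= limit:
        return False  # 429
    return True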

The 429 Response

When rate limiting kicks in, return HTTP 429 with a Retry-After header. Clients that respect this header back off automatically. For clients that ignore it, the gateway can escalate, for example by lengthening the block after each repeated violation.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1681234590

{
    "error": "Rate limit exceeded",
    "message": "Try again in 30 seconds"
}

GitHub's API returns rate limit headers of this kind on every response, so developers can check their remaining quota before hitting the limit. Good API design exposes rate limit state, not just the rejection.
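
A minimal sketch of how a gateway might build these headers for every response. The function name and parameters are hypothetical; the header names are the ones shown in the example response above.

```python
import math
import time
from typing import Dict, Optional


def rate_limit_headers(limit: int, remaining: int, reset_epoch: float,
                       now: Optional[float] = None) -> Dict[str, str]:
    """Build rate limit headers; adds Retry-After only when the quota is spent."""
    now = time.time() if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Round up so the client never retries before the window resets.
        headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return headers
```

Attaching the first three headers to successful responses as well lets clients slow down before they ever see a 429.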

Authentication at the Gateway

The gateway validates identity so backend services don't have to. Each service receives pre-authenticated requests with user context injected as headers.

JWT Validation

The most common pattern. Client sends a JWT in the Authorization header. Gateway verifies the signature, checks expiration, and extracts claims.

Request flow:

1. Client sends: Authorization: Bearer eyJhbGciOiJSUz...
2. Gateway:
   - Decode the JWT header → find the signing key ID
   - Fetch the public key from the auth service (cached)
   - Verify signature → proves token wasn't tampered with
   - Check exp claim → reject if expired
   - Check iss claim → reject if wrong issuer
   - Extract claims: user_id, email, role
3. Gateway forwards to backend with:
   X-User-Id: 12345
   X-User-Email: alice@example.com
   X-User-Role: admin
4. Backend trusts these headers — no re-verification needed

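# Gateway JWT validation (simplified)
# `public_key` (the issuer's RSA public key) and `forward_to_backend`
# are assumed to be provided elsewhere in the gateway.
import jwt

def validate_request(request):
    auth_header = request.headers.get('Authorization', '')
    scheme, _, token = auth_header.partition(' ')
    token = token.strip()

    if scheme.lower() != 'bearer' or not token:
        return 401, {"error": "Missing authorization token"}

    try:
        claims = jwt.decode(
            token,
            public_key,
            algorithms=['RS256'],  # pin the algorithm; don't trust the token's alg header
            audience='my-api',
            issuer='auth.example.com',
            options={"require": ["exp", "sub"]},
        )
    except jwt.ExpiredSignatureError:
        return 401, {"error": "Token expired"}
    except jwt.InvalidTokenError:
        return 401, {"error": "Invalid token"}

    # Inject user context for downstream services. Assigning here also
    # overwrites any value the client tried to send for these headers.
    request.headers['X-User-Id'] = claims['sub']
    request.headers['X-User-Email'] = claims.get('email', '')
    request.headers['X-User-Role'] = claims.get('role', '')

    return forward_to_backend(request)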

Gotcha

Backend services must only accept requests from the gateway, not directly from the internet. Otherwise, anyone can send a forged X-User-Role: admin header and bypass auth. Network policies (security groups, Kubernetes NetworkPolicies) must block direct access to services.
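
Network isolation is the primary defence. As a second layer, the gateway can drop client-supplied identity headers before it injects its own. The sketch below assumes identity headers share the `X-User-` and `X-Client-` prefixes used in this section; the function name is hypothetical.

```python
from typing import Dict

# Header prefixes reserved for gateway-injected context (assumed naming).
RESERVED_PREFIXES = ("x-user-", "x-client-")


def strip_untrusted_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without client-supplied identity headers."""
    return {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(RESERVED_PREFIXES)
    }
```

Running this before validation means a forged header can never reach a backend, even on routes where the gateway does not set that header itself.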

API Key Validation

For server-to-server communication. The gateway looks up the API key in a store (Redis, database), maps it to a client identity, and applies per-client rate limits.

# API key validation
def validate_api_key(request):
    api_key = request.headers.get('X-API-Key')

    if not api_key:
        return 401, {"error": "Missing API key"}

    # Lookup in Redis (fast) or database (authoritative).
    # `r` is the redis.Redis() client created earlier; it returns bytes,
    # so decode keys and values before comparing against str.
    raw = r.hgetall(f"apikey:{api_key}")
    client = {k.decode(): v.decode() for k, v in raw.items()}

    if not client:
        return 401, {"error": "Invalid API key"}

    if client['status'] != 'active':
        return 403, {"error": "API key revoked"}

    # Apply client-specific rate limit
    if not is_allowed(client['client_id'],
                      capacity=int(client['rate_limit'])):
        return 429, {"error": "Rate limit exceeded"}

    request.headers['X-Client-Id'] = client['client_id']
    return forward_to_backend(request)

Request Transformation

The gateway can modify requests and responses. This keeps services focused on business logic instead of client compatibility.

Path Rewriting

External URLs don't need to match internal service routes.

External:  GET /api/v2/users/123/orders
Internal:  GET /orders?user_id=123

Gateway rule:
  /api/v2/users/:id/orders → order-service:/orders?user_id=:id

This lets you restructure your internal services without breaking client URLs. The gateway absorbs the mapping.
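
The rule above could be expressed as a regex-based rewrite table. This is a sketch, not any particular gateway's rule syntax; `REWRITE_RULES` and `rewrite` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# (pattern, upstream service, internal path template), mirroring the rule above.
REWRITE_RULES = [
    (re.compile(r"^/api/v2/users/(?P<id>[A-Za-z0-9_-]+)/orders$"),
     "order-service", "/orders?user_id={id}"),
]


def rewrite(path: str) -> Optional[Tuple[str, str]]:
    """Return (service, internal_path) for the first matching rule, else None."""
    for pattern, service, template in REWRITE_RULES:
        match = pattern.match(path)
        if match:
            return service, template.format(**match.groupdict())
    return None
```

Restricting the captured segment to a safe character set keeps a crafted path from injecting extra query parameters into the internal URL.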

Header Injection

Add tracing headers, correlation IDs, and auth context.

Incoming request:
  GET /api/products/456
  Authorization: Bearer eyJ...

Gateway adds:
  X-Request-Id: 7f3a-4b2c-9d1e
  X-User-Id: 12345
  X-Trace-Id: abc123def456
  X-Forwarded-For: 203.0.113.50

Backend receives all original headers PLUS the injected ones
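
A minimal sketch of that injection step, assuming headers are held in a plain dict. The function name is hypothetical, and reusing an existing X-Request-Id is a design choice rather than a requirement.

```python
import uuid
from typing import Dict, Optional


def inject_headers(headers: Dict[str, str], client_ip: str,
                   user_id: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the request headers with gateway context added."""
    out = dict(headers)
    # Reuse an upstream request ID if one exists so the trace stays connected.
    out.setdefault("X-Request-Id", str(uuid.uuid4()))
    # Append to X-Forwarded-For rather than overwrite, preserving earlier hops.
    prior = out.get("X-Forwarded-For")
    out["X-Forwarded-For"] = f"{prior}, {client_ip}" if prior else client_ip
    if user_id is not None:
        out["X-User-Id"] = user_id
    return out
```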

BFF Pattern — Backend for Frontend

The gateway aggregates multiple backend calls into a single response. Mobile clients make one request. The gateway fans out to multiple services and merges results.

Mobile client: GET /api/homepage

Gateway calls (in parallel):
  → user-service:  GET /users/123        → {name, avatar}
  → feed-service:  GET /feed?user=123    → [posts...]
  → notif-service: GET /notifications/123 → {unread: 5}

Gateway merges:
{
  "user": {"name": "Alice", "avatar": "..."},
  "feed": [...],
  "notifications": {"unread": 5}
}

Mobile client receives ONE response instead of THREE round trips
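
The fan-out and merge can be sketched with `asyncio.gather`. The three service functions are stubs standing in for real HTTP calls; their names and return values are illustrative.

```python
import asyncio
from typing import Any, Dict


# Stub service calls; a real gateway would issue HTTP requests here.
async def get_user(user_id: int) -> Dict[str, Any]:
    await asyncio.sleep(0.01)
    return {"name": "Alice", "avatar": "..."}


async def get_feed(user_id: int) -> list:
    await asyncio.sleep(0.01)
    return []


async def get_notifications(user_id: int) -> Dict[str, Any]:
    await asyncio.sleep(0.01)
    return {"unread": 5}


async def homepage(user_id: int) -> Dict[str, Any]:
    """Fan out to the three services concurrently and merge the results."""
    user, feed, notifications = await asyncio.gather(
        get_user(user_id), get_feed(user_id), get_notifications(user_id)
    )
    return {"user": user, "feed": feed, "notifications": notifications}


result = asyncio.run(homepage(123))
```

Because the calls run concurrently, total latency is roughly that of the slowest backend rather than the sum of all three.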

Airbnb uses this pattern heavily. Their mobile apps make a single API call per screen. The gateway (or BFF layer) handles the fan-out to dozens of backend services. This cuts mobile latency substantially, because every sequential HTTP call over a cellular network pays a full round trip.

Here's the spicy opinion: most teams implement BFF too early. If you have three services, just let the client make three calls. The BFF pattern adds a layer of indirection that needs maintenance. Wait until you have 10+ services or severe mobile latency problems before adding BFF.

Canary Deployments via Gateway

The gateway can route a percentage of traffic to a new version of a service. This is how you deploy safely without affecting all users.

Normal state:
  100% of traffic → service v1.0

Canary deployment:
  95% of traffic → service v1.0
   5% of traffic → service v1.1 (canary)

Monitor error rates, latency, business metrics for v1.1:
  All good  → gradually increase to 25%, 50%, 100%
  Problems  → instantly route 100% back to v1.0

# Kong-style weighted upstream targets (simplified, not exact Kong syntax)
- upstream: order-service
  targets:
    - host: order-v1.svc.cluster.local
      weight: 95
    - host: order-v1-1.svc.cluster.local
      weight: 5

The gateway makes this possible because it controls routing. Without a gateway, you'd need to update DNS, modify load balancer rules, or deploy a separate routing layer. The gateway already sits in the request path — canary routing is just a configuration change.
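
One way a gateway might implement the weighted split is to hash a routing key into 100 buckets and walk the cumulative weights. This is a sketch under that assumption, not how any specific gateway does it; `pick_target` is a hypothetical name.

```python
import hashlib
from typing import List, Tuple

# (upstream host, weight) pairs; weights must sum to 100.
TARGETS: List[Tuple[str, int]] = [
    ("order-v1.svc.cluster.local", 95),
    ("order-v1-1.svc.cluster.local", 5),
]


def pick_target(routing_key: str, targets: List[Tuple[str, int]] = TARGETS) -> str:
    """Map a routing key to a bucket in [0, 100) and walk the cumulative weights."""
    digest = hashlib.sha256(routing_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    cumulative = 0
    for host, weight in targets:
        cumulative += weight
        if bucket < cumulative:
            return host
    raise ValueError("weights must sum to 100")
```

Hashing a request ID gives an approximately random 95/5 split, while hashing a user ID keeps each user on the same version for the whole rollout. Rolling back is a matter of setting the stable target's weight to 100.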

Sticky Canary

Route specific users (not random traffic) to the canary. Useful for testing with internal users first.

# Route employees to canary, everyone else to stable
def route_request(request):
    user_email = request.headers.get('X-User-Email', '')

    if user_email.endswith('@company.com'):
        return 'order-service-canary'
    else:
        return 'order-service-stable'

Patterns for System Design Interviews

Pattern 1: Multi-Tenant API with Tiered Rate Limiting

[Client] → [API Gateway]
               ↓
         Authenticate (JWT or API key)
         Identify tenant + tier
               ↓
         Apply tier-specific rate limit:
           Free:       100 req/min
           Pro:        10,000 req/min
           Enterprise: 100,000 req/min
               ↓
         Route to backend service
               ↓
         [Service] → response → Gateway → Client

The gateway holds the tier-to-limit mapping in Redis. Each request costs one token from the tenant's bucket. Tiered limits of this kind are common among public API providers such as Twilio, Stripe, and OpenAI.
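
A per-minute tier quota has to be translated into token bucket parameters somewhere. The sketch below shows one possible mapping; the tenant directory, the default tier for unknown tenants, and the choice to allow a full minute's quota as a burst are all assumptions.

```python
from typing import Dict

# Requests per minute, taken from the diagram above.
TIER_LIMITS = {"free": 100, "pro": 10_000, "enterprise": 100_000}

# Hypothetical tenant directory; in the pattern above this mapping lives in Redis.
TENANT_TIERS: Dict[str, str] = {"acme": "enterprise", "hobbyist": "free"}


def bucket_config(tenant_id: str) -> Dict[str, float]:
    """Translate a tenant's per-minute quota into token bucket parameters."""
    tier = TENANT_TIERS.get(tenant_id, "free")  # unknown tenants get the lowest tier
    per_minute = TIER_LIMITS[tier]
    return {
        "capacity": per_minute,            # allow the full minute's quota as a burst
        "refill_rate": per_minute / 60.0,  # tokens per second
    }
```

A smaller capacity with the same refill rate would keep the long-run quota but tighten the allowed burst.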

Pattern 2: API Versioning at the Gateway

Client sends: GET /api/v2/users/123

Gateway routing rules:
  /api/v1/* → user-service-legacy (old codebase)
  /api/v2/* → user-service (current codebase)
  /api/v3/* → user-service-next (beta, internal only)

The gateway strips the version prefix before forwarding:
  /api/v2/users/123 → user-service: GET /users/123

Services don't know about versions. The gateway maps external versions to internal services. When v1 is deprecated, you remove the routing rule. When v3 graduates, you rename the route. The services themselves never change URLs.
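
The prefix routing and stripping can be sketched as a small lookup. The service names come from the rules above; `route_versioned` is a hypothetical helper.

```python
from typing import Optional, Tuple

# External version prefix -> internal service, as in the routing rules above.
VERSION_ROUTES = {
    "/api/v1": "user-service-legacy",
    "/api/v2": "user-service",
    "/api/v3": "user-service-next",
}


def route_versioned(path: str) -> Optional[Tuple[str, str]]:
    """Return (service, path_without_version_prefix), or None if unrouted."""
    for prefix, service in VERSION_ROUTES.items():
        # Match whole path segments so /api/v10 is not mistaken for /api/v1.
        if path == prefix or path.startswith(prefix + "/"):
            return service, path[len(prefix):] or "/"
    return None
```

Deprecating v1 then amounts to deleting one dictionary entry; no service code changes.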

Pattern 3: Circuit Breaker at the Gateway

[Client] → [Gateway] → [Service A]
                          ↓ (5 consecutive failures)
                    Gateway opens circuit breaker
                          ↓
                    Next requests get instant 503
                    "Service temporarily unavailable"
                          ↓ (after 30 seconds)
                    Gateway tries one request (half-open)
                          ↓
                    Success → close circuit → resume normal
                    Failure → keep circuit open → wait longer

The gateway protects overwhelmed services by short-circuiting requests. Without this, a failing service gets hammered by retries from every client, making recovery harder. Netflix applied this pattern to backend calls made through Zuul, using its Hystrix circuit breaker library.
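
The state machine in the diagram can be sketched as a small class. This is a simplified illustration: it uses a fixed cooldown rather than waiting longer after each failed probe, and it does not limit the half-open state to a single in-flight request. The class and method names are hypothetical.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Open: reject until the cooldown passes, then let a probe through.
        return now - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or restart the cooldown after a failed probe
```

The gateway would call `allow_request` before forwarding, return 503 immediately when it is False, and report each backend outcome through `record_success` or `record_failure`.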

Trade-offs Table

Trade-off	Choose A	Choose B
Centralized vs Distributed auth	Gateway auth (single point, consistent)	Per-service auth (no SPOF, but inconsistent)
Lenient vs Strict rate limiting	Token bucket (allows bursts up to bucket size)	Sliding window (strict per-window enforcement)
Gateway BFF vs Client fan-out	BFF (one round trip, complex gateway)	Direct calls (simpler gateway, more latency)
Canary at gateway vs Blue/Green	Canary (gradual rollout, complex routing)	Blue/Green (instant switch, simpler but riskier)
Managed vs Self-hosted gateway	AWS API Gateway (zero ops, limited customization)	Kong/NGINX (full control, ops overhead)
Fat gateway vs Thin gateway	Fat (auth + rate limit + transform + BFF)	Thin (routing + SSL only, logic in services)

Interview Gotchas

Gotcha 1: Gateway Is Not Optional for Microservices

If you design a microservices architecture without a gateway, the interviewer will ask: "Where does rate limiting happen? Where does authentication happen? How do clients discover services?" Without a gateway, every service must implement auth, rate limiting, CORS, and logging independently. That's a maintenance nightmare.

Gotcha 2: The Gateway Is a Single Point of Failure

Yes, the gateway is a SPOF. That's why production gateways run as a cluster behind a load balancer or DNS failover. Multiple gateway instances, health checks, auto-scaling. If you mention a gateway, mention its HA strategy too.

Gotcha 3: Don't Put Business Logic in the Gateway

The gateway handles cross-cutting concerns: auth, rate limiting, routing, SSL. It should NOT contain business logic like "if order total > $500, require manager approval." That belongs in the service. A gateway with business logic becomes a monolith in disguise.

Gotcha 4: Rate Limiting Needs Distributed State

If your gateway runs as 5 instances, each instance needs to see the same rate limit counters. A user who hits instance 1 and then instance 2 shouldn't get double the quota. Use Redis as the shared counter store. In-memory counters per instance will give inconsistent limits.

Gotcha 5: Auth at the Gateway Doesn't Mean No Auth Downstream

The gateway authenticates — it verifies the token is valid and injects user context. But authorization — "can this user access this resource?" — often belongs in the service. The gateway can't know that user 123 owns document 456. Service-level authorization checks are still needed.
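
A minimal sketch of such a service-level check, using the identity the gateway injected. The `Document` type and the rule that admins can read everything are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Document:
    id: int
    owner_id: str


def can_read(user_id: str, role: str, doc: Document) -> bool:
    """Service-level check: owners and admins may read; everyone else may not."""
    return role == "admin" or doc.owner_id == user_id
```

The service would read `user_id` and `role` from the X-User-Id and X-User-Role headers and return 403 when the check fails, since only the service knows who owns the document.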