Skip to content

Sharding

TL;DR

Sharding distributes data across multiple independent database instances so that no single machine becomes a bottleneck for storage, writes, or read latency.

Ten Warehouses Instead of One Giant One

Imagine you run a shipping company. You started with one warehouse, and it worked great. But now you're shipping to 50 states, and that single warehouse is maxed out -- no more shelf space, the loading docks are jammed, and every shipment routes through the same bottleneck.

The solution isn't a bigger warehouse (there's a physical limit). It's ten smaller warehouses, each handling a region. West Coast orders go to the Portland warehouse. Southeast orders go to Atlanta. Each warehouse operates independently, has its own staff, and can be scaled on its own.

That's sharding. Instead of one massive database, you split the data across multiple independent database instances. Each instance -- called a shard -- holds a subset of the total data and handles reads and writes for that subset.

Why Shard at All?

Partitioning (from the previous lesson) splits data on a single machine. Sharding splits it across machines. You reach for sharding when a single machine hits hard limits:

Limit Typical Threshold What Happens
Storage ~256 TB (cloud instances) Can't fit more data on one node
Write throughput ~50K-100K writes/sec (PostgreSQL) Single writer bottleneck, WAL becomes the constraint
Read latency Depends on dataset/memory ratio Working set exceeds RAM, queries hit disk constantly
Connection count ~5,000-10,000 connections App servers competing for limited connection slots

If your traffic is read-heavy, you can often delay sharding by adding read replicas. But write-heavy workloads eventually force the issue -- there's only one primary, and it can only do so much.

Companies that shard: Instagram sharded PostgreSQL early because user growth outpaced single-instance capacity. Notion sharded when their single PostgreSQL instance hit performance walls. Figma, Slack, Pinterest -- the pattern repeats.

Three Sharding Strategies

1. Hash-Based Sharding (Most Common)

Apply a hash function to the shard key and use modulo to pick the shard:

shard_number = hash(shard_key) % number_of_shards
import hashlib

def get_shard(user_id: int, num_shards: int) -> int:
    key_bytes = str(user_id).encode('utf-8')
    hash_value = int(hashlib.md5(key_bytes).hexdigest(), 16)
    return hash_value % num_shards

# User 12345 -> shard 2, User 67890 -> shard 0, etc.
print(get_shard(12345, 4))  # Deterministic: same input always gives same shard

A shard-aware query router is just a few extra lines:

# Shard-aware query router
SHARD_CONNECTIONS = {
    0: db_connect("shard-0.internal"),
    1: db_connect("shard-1.internal"),
    2: db_connect("shard-2.internal"),
    3: db_connect("shard-3.internal"),
}

def query_user(user_id):
    shard = get_shard(user_id, num_shards=4)       # deterministic routing
    conn = SHARD_CONNECTIONS[shard]
    return conn.execute("SELECT * FROM users WHERE id = %s", user_id)

That's the entire sharding proxy concept — hash the key, pick the connection, forward the query.

Pros: Distributes data evenly regardless of key patterns. Simple to implement.

Cons: Adding or removing shards changes the modulo, causing massive data migration. Range queries across shards are expensive.

2. Range-Based Sharding

Assign contiguous ranges of the shard key to each shard:

Shard 0: user_id     1 - 1,000,000
Shard 1: user_id 1,000,001 - 2,000,000
Shard 2: user_id 2,000,001 - 3,000,000
Shard 3: user_id 3,000,001 - 4,000,000

Pros: Range queries on the shard key are efficient (only hit relevant shards). Easy to understand and debug.

Cons: Uneven distribution if the key isn't uniformly distributed. New users all land on the newest shard, creating a hot spot.

3. Directory-Based Sharding

Maintain a lookup table that maps each shard key to its shard:

┌────────────┬───────────┐
│ tenant_id  │ shard     │
├────────────┼───────────┤
│ acme-corp  │ shard-2   │
│ globex     │ shard-1   │
│ initech    │ shard-3   │
│ umbrella   │ shard-2   │
└────────────┴───────────┘
# Pseudocode for directory-based routing
def get_shard(tenant_id: str) -> str:
    # Look up in the directory service (Redis, separate DB, etc.)
    return directory.get(tenant_id)

Pros: Maximum flexibility. You can move individual tenants between shards without any algorithmic change. Great for multi-tenant SaaS.

Cons: The directory is a single point of failure and a potential bottleneck. Every query requires a lookup before it can be routed.

Strategy Comparison

Strategy Distribution Range Queries Adding Shards Flexibility SPOF Risk
Hash-based Excellent Poor (scatter) Expensive (rehash) Low None
Range-based Risky (skew) Excellent Moderate Medium None
Directory-based Controllable Depends on mapping Easy (update table) High Directory itself

Interview Tip

In a system design interview, default to hash-based sharding unless the problem specifically involves range queries or multi-tenant isolation. Say something like: "I'll use hash-based sharding on user_id because it gives even distribution and our primary access pattern is per-user lookups." Then mention the resharding challenge and that you'd use consistent hashing (next lesson) to mitigate it.

Choosing the Right Shard Key

The shard key is the most important decision in a sharding architecture. Get it wrong, and you'll spend months fixing it. Here's what makes a good shard key:

High cardinality. The key must have enough distinct values to distribute across all shards. A boolean column (is_active) has cardinality 2 -- useless for 16 shards.

Even distribution. Values should spread roughly equally. If 80% of your data has country = 'US', sharding by country puts 80% of data on one shard.

Query alignment. Your most common queries should include the shard key, so they can be routed to a single shard. If you shard by user_id but most queries filter by order_date, every query becomes a cross-shard scatter.

Shard Key Cardinality Distribution Query Fit Verdict
user_id High Even (with hash) Great for per-user queries Good default
tenant_id Medium-High Depends on tenant sizes Perfect for multi-tenant SaaS Good for B2B
order_id High Even Only if queries are per-order Situational
is_premium 2 Terrible No Never
created_at High Skewed to recent Good for time-range reads Risky for writes
country ~200 Highly skewed Good if queries are per-country Risky without sub-sharding

The Time-Range Sharding Anti-Pattern

Don't Shard by Timestamp

Sharding by time range makes the newest shard a write hot spot — 100% of new writes hit one machine while historical shards sit idle. This turns your distributed system back into a single-writer bottleneck.

This one catches people. It seems logical: shard by month or year, since time-series data is often queried by time range.

The problem: all writes go to the newest shard.

January shard:  0 new writes, 30M existing rows (cold)
February shard: 0 new writes, 28M existing rows (cold)
March shard:    ALL new writes (hot!)

The March shard gets hammered while January and February sit idle. You've turned a distributed system into a single-writer bottleneck -- exactly what sharding was supposed to fix.

If you need time-based access patterns, shard by something else (user_id) and partition within each shard by time. You get the write distribution from sharding and the time-range pruning from partitioning.

Interview Tip

If an interviewer suggests sharding by timestamp, gently point out the hot-shard problem. Then propose the composite approach: "I'd shard by user_id for write distribution and partition within each shard by date range for efficient time-based queries." This shows you understand both tools and when to combine them.

Quick Recap

Concept What It Does Why It Matters
Sharding Distributes data across independent database instances Overcomes single-machine limits
Hash-based hash(key) % N determines the shard Even distribution, simple
Range-based Key ranges map to shards Efficient range queries, but skew risk
Directory-based Lookup table maps keys to shards Maximum flexibility, but adds a SPOF
Shard key The column used to determine which shard holds a row Most important decision in the architecture
High cardinality Key has many distinct values Required for even distribution
Time-range anti-pattern Sharding by timestamp creates write hot spots Shard by entity, partition by time instead