Sharding

TL;DR

Sharding distributes data across multiple independent database instances so that no single machine becomes a bottleneck for storage, writes, or read latency.

Ten Warehouses Instead of One Giant One

Imagine you run a shipping company. You started with one warehouse, and it worked great. But now you're shipping to 50 states, and that single warehouse is maxed out -- no more shelf space, the loading docks are jammed, and every shipment routes through the same bottleneck.

The solution isn't a bigger warehouse (there's a physical limit). It's ten smaller warehouses, each handling a region. West Coast orders go to the Portland warehouse. Southeast orders go to Atlanta. Each warehouse operates independently, has its own staff, and can be scaled on its own.

That's sharding. Instead of one massive database, you split the data across multiple independent database instances. Each instance -- called a shard -- holds a subset of the total data and handles reads and writes for that subset.

Why Shard at All?

Partitioning (from the previous lesson) splits data on a single machine. Sharding splits it across machines. You reach for sharding when a single machine hits hard limits:

Limit	Typical Threshold	What Happens
Storage	~256 TB (cloud instances)	Can't fit more data on one node
Write throughput	~50K-100K writes/sec (PostgreSQL)	Single writer bottleneck, WAL becomes the constraint
Read latency	Depends on dataset/memory ratio	Working set exceeds RAM, queries hit disk constantly
Connection count	~5,000-10,000 connections	App servers competing for limited connection slots

If your traffic is read-heavy, you can often delay sharding by adding read replicas. But write-heavy workloads eventually force the issue -- there's only one primary, and it can only do so much.

Companies that shard: Instagram sharded PostgreSQL early because user growth outpaced single-instance capacity. Notion sharded when their single PostgreSQL instance hit performance walls. Figma, Slack, Pinterest -- the pattern repeats.

Three Sharding Strategies

1. Hash-Based Sharding (Most Common)

Apply a hash function to the shard key and use modulo to pick the shard:

shard_number = hash(shard_key) % number_of_shards

import hashlib

def get_shard(user_id: int, num_shards: int) -> int:
    key_bytes = str(user_id).encode('utf-8')
    hash_value = int(hashlib.md5(key_bytes).hexdigest(), 16)
    return hash_value % num_shards

# User 12345 -> shard 2, User 67890 -> shard 0, etc.
print(get_shard(12345, 4))  # Deterministic: same input always gives same shard

A shard-aware query router is just a few extra lines:

# Shard-aware query router
SHARD_CONNECTIONS = {
    0: db_connect("shard-0.internal"),
    1: db_connect("shard-1.internal"),
    2: db_connect("shard-2.internal"),
    3: db_connect("shard-3.internal"),
}

def query_user(user_id):
    shard = get_shard(user_id, num_shards=4)       # deterministic routing
    conn = SHARD_CONNECTIONS[shard]
    return conn.execute("SELECT * FROM users WHERE id = %s", user_id)

That's the entire sharding proxy concept — hash the key, pick the connection, forward the query.

Pros: Distributes data evenly regardless of key patterns. Simple to implement.

Cons: Adding or removing shards changes the modulo, causing massive data migration. Range queries across shards are expensive.

2. Range-Based Sharding

Assign contiguous ranges of the shard key to each shard:

Shard 0: user_id     1 - 1,000,000
Shard 1: user_id 1,000,001 - 2,000,000
Shard 2: user_id 2,000,001 - 3,000,000
Shard 3: user_id 3,000,001 - 4,000,000

Pros: Range queries on the shard key are efficient (only hit relevant shards). Easy to understand and debug.

Cons: Uneven distribution if the key isn't uniformly distributed. New users all land on the newest shard, creating a hot spot.

3. Directory-Based Sharding

Maintain a lookup table that maps each shard key to its shard:

┌────────────┬───────────┐
│ tenant_id  │ shard     │
├────────────┼───────────┤
│ acme-corp  │ shard-2   │
│ globex     │ shard-1   │
│ initech    │ shard-3   │
│ umbrella   │ shard-2   │
└────────────┴───────────┘

# Pseudocode for directory-based routing
def get_shard(tenant_id: str) -> str:
    # Look up in the directory service (Redis, separate DB, etc.)
    return directory.get(tenant_id)

Pros: Maximum flexibility. You can move individual tenants between shards without any algorithmic change. Great for multi-tenant SaaS.

Cons: The directory is a single point of failure and a potential bottleneck. Every query requires a lookup before it can be routed.

Strategy Comparison

Strategy	Distribution	Range Queries	Adding Shards	Flexibility	SPOF Risk
Hash-based	Excellent	Poor (scatter)	Expensive (rehash)	Low	None
Range-based	Risky (skew)	Excellent	Moderate	Medium	None
Directory-based	Controllable	Depends on mapping	Easy (update table)	High	Directory itself

Interview Tip

In a system design interview, default to hash-based sharding unless the problem specifically involves range queries or multi-tenant isolation. Say something like: "I'll use hash-based sharding on user_id because it gives even distribution and our primary access pattern is per-user lookups." Then mention the resharding challenge and that you'd use consistent hashing (next lesson) to mitigate it.

Choosing the Right Shard Key

The shard key is the most important decision in a sharding architecture. Get it wrong, and you'll spend months fixing it. Here's what makes a good shard key:

High cardinality. The key must have enough distinct values to distribute across all shards. A boolean column (is_active) has cardinality 2 -- useless for 16 shards.

Even distribution. Values should spread roughly equally. If 80% of your data has country = 'US', sharding by country puts 80% of data on one shard.

Query alignment. Your most common queries should include the shard key, so they can be routed to a single shard. If you shard by user_id but most queries filter by order_date, every query becomes a cross-shard scatter.

Shard Key	Cardinality	Distribution	Query Fit	Verdict
`user_id`	High	Even (with hash)	Great for per-user queries	Good default
`tenant_id`	Medium-High	Depends on tenant sizes	Perfect for multi-tenant SaaS	Good for B2B
`order_id`	High	Even	Only if queries are per-order	Situational
`is_premium`	2	Terrible	No	Never
`created_at`	High	Skewed to recent	Good for time-range reads	Risky for writes
`country`	~200	Highly skewed	Good if queries are per-country	Risky without sub-sharding

The Time-Range Sharding Anti-Pattern

Don't Shard by Timestamp

Sharding by time range makes the newest shard a write hot spot — 100% of new writes hit one machine while historical shards sit idle. This turns your distributed system back into a single-writer bottleneck.

This one catches people. It seems logical: shard by month or year, since time-series data is often queried by time range.

The problem: all writes go to the newest shard.

January shard:  0 new writes, 30M existing rows (cold)
February shard: 0 new writes, 28M existing rows (cold)
March shard:    ALL new writes (hot!)

The March shard gets hammered while January and February sit idle. You've turned a distributed system into a single-writer bottleneck -- exactly what sharding was supposed to fix.

If you need time-based access patterns, shard by something else (user_id) and partition within each shard by time. You get the write distribution from sharding and the time-range pruning from partitioning.

Interview Tip

If an interviewer suggests sharding by timestamp, gently point out the hot-shard problem. Then propose the composite approach: "I'd shard by user_id for write distribution and partition within each shard by date range for efficient time-based queries." This shows you understand both tools and when to combine them.

Quick Recap

Concept	What It Does	Why It Matters
Sharding	Distributes data across independent database instances	Overcomes single-machine limits
Hash-based	`hash(key) % N` determines the shard	Even distribution, simple
Range-based	Key ranges map to shards	Efficient range queries, but skew risk
Directory-based	Lookup table maps keys to shards	Maximum flexibility, but adds a SPOF
Shard key	The column used to determine which shard holds a row	Most important decision in the architecture
High cardinality	Key has many distinct values	Required for even distribution
Time-range anti-pattern	Sharding by timestamp creates write hot spots	Shard by entity, partition by time instead