Sharding, Replica Sets, and Consistency
TL;DR
Replica sets give you high availability and read scaling; sharding gives you write scaling and storage scaling -- but pick the wrong shard key and you'll spend months recovering from a decision that took five minutes.
What It Is

MongoDB has two independent scaling mechanisms, and conflating them is a common interview mistake.
Replica sets copy the same data to multiple nodes. You get fault tolerance and read scaling. Every node has the full dataset. If the primary dies, a secondary gets elected as the new primary. Your application barely notices.
Sharding splits data across multiple nodes. Each node holds a subset of the data. You get write scaling and storage beyond what a single machine can handle. But the data is partitioned, and queries that span partitions are expensive.
These are orthogonal. In production, each shard is itself a replica set. You get both horizontal scaling and high availability.
Replica Sets: The Foundation
A replica set has one primary and one or more secondaries. The minimum for production is three members -- primary plus two secondaries.
How Replication Works
- Client writes to the primary.
- The primary records the write in its oplog (operations log) -- a capped collection in
local.oplog.rs. - Secondaries tail the oplog and apply operations in order.
- Each secondary maintains its own copy of the data.
The oplog is a capped collection with a fixed size. If a secondary falls too far behind (the gap exceeds the oplog window), it needs a full resync. MongoDB Atlas auto-sizes the oplog. Self-hosted deployments need manual sizing.
Elections and Failover
When the primary goes down, the remaining members hold an election. MongoDB uses a Raft-like consensus protocol (not exactly Raft, but close enough for interview purposes).
Election rules:
- A member must receive votes from a majority to become primary.
- Members with the most recent oplog entry are preferred.
- Priority settings let you influence which member gets elected (higher priority = more likely to become primary).
- A three-member set tolerates one failure. A five-member set tolerates two.
- Elections typically complete in 10-12 seconds. During this window, the replica set is read-only.
Spicy take: If your application can't tolerate 12 seconds of write unavailability during failover, MongoDB isn't your primary database. It's your read replica behind a relational system.
Arbiter Nodes
An arbiter votes in elections but holds no data. It exists to break ties in two-member sets. Use them sparingly -- a three-member set with two data nodes and one arbiter gives you only one copy of your data if a node dies. That's risky.
Read Preferences
Where you read from is a trade-off between consistency and performance.
| Read Preference | Reads From | Consistency | Latency | Use Case |
|---|---|---|---|---|
primary |
Primary only | Strong | Highest | Default. Financial data. |
primaryPreferred |
Primary, fallback to secondary | Strong (usually) | Medium | When primary is overloaded |
secondary |
Secondaries only | Eventual | Lower | Analytics, reporting |
secondaryPreferred |
Secondary, fallback to primary | Eventual (usually) | Lower | Read-heavy workloads |
nearest |
Lowest latency member | Varies | Lowest | Geo-distributed clusters |
The Replication Lag Problem
Secondaries are always slightly behind the primary. The lag is usually milliseconds, but under heavy write load it can grow to seconds. This means:
- User writes a comment, reads the page -- comment is missing. Classic stale read.
secondaryreads may return outdated data.nearestreads might hit a secondary that's 500ms behind.
Fix: use primary reads for anything the user just wrote. Use secondary reads for dashboard analytics and reports that tolerate staleness.
Netflix does this for their viewing history. Writes go to primary. The "Continue Watching" row reads from secondaries because a 2-second lag on your watch history doesn't matter.
Write Concern
Write concern controls how many acknowledgments you wait for before the write is considered successful.
db.orders.insertOne(
{ item: "widget", qty: 100 },
{ writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)
| Write Concern | What It Means | Durability | Latency |
|---|---|---|---|
w: 0 |
Fire and forget. Don't wait. | None | Lowest |
w: 1 |
Primary acknowledged | Single node | Low |
w: "majority" |
Majority of replica set acknowledged | Survives primary failure | Medium |
w: <number> |
That many members acknowledged | Custom | Varies |
j: true |
Wait for journal flush on acknowledging nodes | Survives process crash | Higher |
The w: 1 Trap
The default is w: 1 -- the primary acknowledges the write. If the primary dies before replicating to a secondary, that write is lost. Gone. The new primary doesn't have it.
For anything important, use w: "majority". It waits for the write to replicate to a majority of nodes before returning. Slower, but your data survives a primary failure.
wtimeout
If a secondary is down and you're using w: "majority" on a three-member set, only two of three are reachable. Majority of three is two, so writes still succeed. But if two secondaries are down, writes block forever unless you set wtimeout.
Always set wtimeout in production. 5000ms is a reasonable default.
Read Concern
Read concern controls the consistency level of data returned by queries.
| Read Concern | What You Get | Use Case |
|---|---|---|
local |
Latest data on the queried node (may be rolled back) | Default. Fast reads. |
available |
Same as local for non-sharded. For sharded: may return orphaned docs during migration | Avoid this one |
majority |
Data acknowledged by majority (won't be rolled back) | Consistent reads after majority writes |
linearizable |
Read reflects all majority writes completed before the read started | Strongest. Slowest. |
snapshot |
Read from a consistent snapshot (for multi-doc transactions) | Transactions |
The Read-Write Concern Matrix
This is where interviews get tricky. The interaction between read and write concern determines your actual consistency guarantee.
| Write Concern | Read Concern | Guarantee |
|---|---|---|
w: 1 |
local |
Fast but can lose data on failover |
w: "majority" |
local |
Durable writes, but reads might see rolled-back data |
w: "majority" |
majority |
Durable writes + consistent reads. This is the safe default. |
w: "majority" |
linearizable |
Causal consistency. Slowest but strongest. |
Interview shortcut: "We use w: majority with read concern majority for critical paths and w: 1 with local reads for best-effort data like logs."
Sharding: Horizontal Scaling
When your dataset exceeds what one replica set can hold, or your write throughput exceeds what one primary can handle, you shard.
Sharding Architecture
Client
|
mongos (router) ---- Config Servers (3-member replica set)
| Stores: chunk ranges, shard metadata
|
+---- Shard 1 (replica set) [chunks A-M]
+---- Shard 2 (replica set) [chunks N-Z]
+---- Shard 3 (replica set) [chunks overflow]
mongos: Stateless query router. Your application connects to mongos, not to individual shards. Run multiple mongos instances behind a load balancer for high availability.
Config servers: A three-member replica set that stores cluster metadata -- which chunks live on which shards, chunk ranges, and the shard key definition. If config servers go down, the cluster is read-only.
Shards: Each shard is a replica set holding a subset of the data.
How Data Is Distributed
MongoDB splits each collection into chunks (default 128MB). Each chunk covers a range of shard key values. Chunks are distributed across shards.
When a chunk exceeds the size limit, MongoDB splits it into two smaller chunks. When one shard has significantly more chunks than another, the balancer migrates chunks between shards.
Shard Key Selection
This is the single most consequential decision you'll make with MongoDB. A bad shard key cannot be changed without migrating the entire collection. Choose wrong, and you'll spend months fixing it.
What Makes a Good Shard Key
A good shard key has three properties:
- High cardinality -- many distinct values so data distributes evenly.
- Even distribution -- values are roughly equally frequent.
- Query isolation -- common queries include the shard key, so mongos routes to one shard instead of broadcasting.
Shard Key Anti-Patterns
Monotonically Increasing Key (ObjectId, timestamp)
Shard 1: [min, 2024-01-01) -- old data, barely touched
Shard 2: [2024-01-01, 2024-06-01) -- recent data, some traffic
Shard 3: [2024-06-01, max) -- ALL current writes go here (HOT SHARD)
All new inserts go to the shard owning the maximum range. You get one overloaded shard and N-1 idle shards. This defeats the entire purpose of sharding.
Low Cardinality Key (country, status)
If you shard by country and 60% of your users are in the US, 60% of your data sits on one shard. You can't split a chunk where every document has the same shard key value.
Completely Random Key (random UUID)
Even distribution, but zero query isolation. Every query hits every shard. Your mongos broadcasts all reads. Throughput drops.
Good Shard Key Examples
| Shard Key | Cardinality | Distribution | Query Isolation | Best For |
|---|---|---|---|---|
hashed _id |
High | Even | None (scatter-gather) | Write-heavy, no range queries |
{ tenant_id, _id } |
High | Even per tenant | Yes -- queries scoped to tenant | Multi-tenant SaaS |
{ user_id, timestamp } |
High | Even | Yes -- queries scoped to user | User activity logs |
{ region, order_id } |
Medium-High | Depends on region mix | Regional queries | Geo-distributed orders |
The compound shard key trick: { tenant_id: 1, _id: 1 } gives you query isolation (all queries include tenant_id) AND high cardinality (tenant_id + _id is always unique).
Stripe uses a similar pattern. All data for a merchant lives on the same shard, so queries scoped to a single merchant are fast. Cross-merchant analytics queries scatter-gather, but those run offline anyway.
The Balancer
The balancer is a background process that migrates chunks between shards to keep the distribution even.
How It Works
- Balancer checks chunk counts per shard every few seconds.
- If the difference exceeds a threshold (default: 2 chunks for < 20 chunks, 8 for 20-79 chunks), it starts migrating.
- Migration: source shard copies chunk data to destination shard, then updates config servers.
Balancer Problems
Latency spikes during migration. Moving a 128MB chunk means reading from source, writing to destination, and updating metadata. During migration, queries for the migrating chunk may be slower.
Jumbo chunks. If every document in a chunk has the same shard key value, the chunk can't be split. It sits on one shard and grows without bound. This is the consequence of a low-cardinality shard key.
Balancer windows. In production, restrict the balancer to off-peak hours:
db.settings.updateOne(
{ _id: "balancer" },
{
$set: {
activeWindow: {
start: "02:00",
stop: "06:00"
}
}
},
{ upsert: true }
)
Patterns for System Design Interviews
Pattern 1: Multi-Tenant SaaS
"We shard by { tenant_id: 1, _id: 1 }. Each tenant's data is co-located on the same shard. API queries always include tenant_id, so mongos routes to a single shard. Large tenants may dominate a shard, but we can move their data to a dedicated shard using zone sharding."
Pattern 2: Time-Series with Sharding
"We shard by { sensor_id: 1, timestamp: 1 }. Queries always include sensor_id (which sensor's data?) and a timestamp range (last 24 hours?). This avoids hot shards because different sensors hash to different shards."
Pattern 3: Global User Store
"Three-member replica set in each region using Atlas Global Clusters. Users in EU read from EU secondaries. Writes go to the primary in US-East. Read preference nearest for latency. Write concern majority for durability."
Anti-Pattern to Call Out
"We would NOT shard by timestamp alone because all current writes would go to a single shard. We'd also avoid sharding prematurely -- a single replica set handles most workloads up to a few TB and 50K writes/sec."

Trade-Offs Table
| Factor | Replica Sets Only | Sharded Cluster |
|---|---|---|
| Write scaling | Limited to one primary | Writes distributed across shards |
| Read scaling | Read from secondaries | Read from any shard + secondaries |
| Storage limit | Single machine disk | Sum of all shard disks |
| Operational complexity | Low (3 nodes) | High (mongos + config + N shards) |
| Query latency | Predictable | Depends on shard key (targeted vs scatter) |
| Failover | Automatic (10-12 sec) | Per-shard failover (each shard is a replica set) |
| Cost | 3 nodes minimum | 3 shards x 3 replicas = 9 nodes minimum |
| Shard key change | N/A | Requires full collection migration |
| Cross-shard queries | N/A | Scatter-gather, slow |
| Transactions | Full support | Limited -- multi-shard transactions are slower |
Interview Gotchas
"When should you shard?"
Not until you need to. A single replica set handles most workloads. Shard when: (1) data exceeds single-node disk, (2) write throughput exceeds single-primary capacity, or (3) you need to isolate tenants for compliance. Premature sharding adds complexity for zero benefit.
"Can you change the shard key?"
Since MongoDB 5.0, you can refine a shard key by adding a suffix field. You cannot change the existing fields. Full shard key changes require creating a new collection and migrating data. This takes hours or days for large collections.
"What happens during a failover?"
The replica set holds an election. During the ~12-second election window, writes fail. Reads from secondaries continue if read preference allows it. Applications using w: majority will get write errors until the new primary is elected. Your retry logic needs to handle this.
"How does MongoDB compare to Cassandra for write scaling?"
Both shard writes across nodes. Cassandra uses consistent hashing (no balancer, no jumbo chunks). MongoDB uses range-based sharding with a balancer. Cassandra is leaderless (any node accepts writes). MongoDB is leader-based (writes go to shard primary). Cassandra gives you better write throughput at the cost of weaker consistency.
"What's the minimum cluster size for sharding?"
Two shards (each a 3-member replica set) + 3 config servers + at least 1 mongos = 10 nodes minimum. In practice, you'd run mongos on application servers, but that's still 9 dedicated database nodes. Don't shard a collection with 10GB of data.
Summary
Replica sets are table stakes -- every production MongoDB deployment uses them. They give you high availability, automatic failover, and read scaling. The decisions are straightforward: three members, w: majority, read from secondaries for analytics.
Sharding is a one-way door. The shard key decision is irreversible (practically speaking). Get it right by optimizing for your most common query pattern, ensuring high cardinality, and avoiding monotonically increasing values. When an interviewer asks about MongoDB scaling, lead with the shard key discussion. It's where the real engineering happens.