Sharding, Replica Sets, and Consistency

TL;DR

Replica sets give you high availability and read scaling; sharding gives you write scaling and storage scaling -- but pick the wrong shard key and you'll spend months recovering from a decision that took five minutes.

What It Is

Replica Set

MongoDB has two independent scaling mechanisms, and conflating them is a common interview mistake.

Replica sets copy the same data to multiple nodes. You get fault tolerance and read scaling. Every node has the full dataset. If the primary dies, a secondary gets elected as the new primary. Your application barely notices.

Sharding splits data across multiple nodes. Each node holds a subset of the data. You get write scaling and storage beyond what a single machine can handle. But the data is partitioned, and queries that span partitions are expensive.

These are orthogonal. In production, each shard is itself a replica set. You get both horizontal scaling and high availability.

Replica Sets: The Foundation

A replica set has one primary and one or more secondaries. The minimum for production is three members -- primary plus two secondaries.

How Replication Works

Client writes to the primary.
The primary records the write in its oplog (operations log) -- a capped collection in local.oplog.rs.
Secondaries tail the oplog and apply operations in order.
Each secondary maintains its own copy of the data.

The oplog is a capped collection with a fixed size. If a secondary falls too far behind (the gap exceeds the oplog window), it needs a full resync. MongoDB Atlas auto-sizes the oplog. Self-hosted deployments need manual sizing.

Elections and Failover

When the primary goes down, the remaining members hold an election. MongoDB uses a Raft-like consensus protocol (not exactly Raft, but close enough for interview purposes).

Election rules:

A member must receive votes from a majority to become primary.
Members with the most recent oplog entry are preferred.
Priority settings let you influence which member gets elected (higher priority = more likely to become primary).
A three-member set tolerates one failure. A five-member set tolerates two.
Elections typically complete in 10-12 seconds. During this window, the replica set is read-only.

Spicy take: If your application can't tolerate 12 seconds of write unavailability during failover, MongoDB isn't your primary database. It's your read replica behind a relational system.

Arbiter Nodes

An arbiter votes in elections but holds no data. It exists to break ties in two-member sets. Use them sparingly -- a three-member set with two data nodes and one arbiter gives you only one copy of your data if a node dies. That's risky.

Read Preferences

Where you read from is a trade-off between consistency and performance.

Read Preference	Reads From	Consistency	Latency	Use Case
`primary`	Primary only	Strong	Highest	Default. Financial data.
`primaryPreferred`	Primary, fallback to secondary	Strong (usually)	Medium	When primary is overloaded
`secondary`	Secondaries only	Eventual	Lower	Analytics, reporting
`secondaryPreferred`	Secondary, fallback to primary	Eventual (usually)	Lower	Read-heavy workloads
`nearest`	Lowest latency member	Varies	Lowest	Geo-distributed clusters

The Replication Lag Problem

Secondaries are always slightly behind the primary. The lag is usually milliseconds, but under heavy write load it can grow to seconds. This means:

User writes a comment, reads the page -- comment is missing. Classic stale read.
secondary reads may return outdated data.
nearest reads might hit a secondary that's 500ms behind.

Fix: use primary reads for anything the user just wrote. Use secondary reads for dashboard analytics and reports that tolerate staleness.

Netflix does this for their viewing history. Writes go to primary. The "Continue Watching" row reads from secondaries because a 2-second lag on your watch history doesn't matter.

Write Concern

Write concern controls how many acknowledgments you wait for before the write is considered successful.

db.orders.insertOne(
  { item: "widget", qty: 100 },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)

Write Concern	What It Means	Durability	Latency
`w: 0`	Fire and forget. Don't wait.	None	Lowest
`w: 1`	Primary acknowledged	Single node	Low
`w: "majority"`	Majority of replica set acknowledged	Survives primary failure	Medium
`w: <number>`	That many members acknowledged	Custom	Varies
`j: true`	Wait for journal flush on acknowledging nodes	Survives process crash	Higher

The `w: 1` Trap

The default is w: 1 -- the primary acknowledges the write. If the primary dies before replicating to a secondary, that write is lost. Gone. The new primary doesn't have it.

For anything important, use w: "majority". It waits for the write to replicate to a majority of nodes before returning. Slower, but your data survives a primary failure.

wtimeout

If a secondary is down and you're using w: "majority" on a three-member set, only two of three are reachable. Majority of three is two, so writes still succeed. But if two secondaries are down, writes block forever unless you set wtimeout.

Always set wtimeout in production. 5000ms is a reasonable default.

Read Concern

Read concern controls the consistency level of data returned by queries.

Read Concern	What You Get	Use Case
`local`	Latest data on the queried node (may be rolled back)	Default. Fast reads.
`available`	Same as local for non-sharded. For sharded: may return orphaned docs during migration	Avoid this one
`majority`	Data acknowledged by majority (won't be rolled back)	Consistent reads after majority writes
`linearizable`	Read reflects all majority writes completed before the read started	Strongest. Slowest.
`snapshot`	Read from a consistent snapshot (for multi-doc transactions)	Transactions

The Read-Write Concern Matrix

This is where interviews get tricky. The interaction between read and write concern determines your actual consistency guarantee.

Write Concern	Read Concern	Guarantee
`w: 1`	`local`	Fast but can lose data on failover
`w: "majority"`	`local`	Durable writes, but reads might see rolled-back data
`w: "majority"`	`majority`	Durable writes + consistent reads. This is the safe default.
`w: "majority"`	`linearizable`	Causal consistency. Slowest but strongest.

Interview shortcut: "We use w: majority with read concern majority for critical paths and w: 1 with local reads for best-effort data like logs."

Sharding: Horizontal Scaling

When your dataset exceeds what one replica set can hold, or your write throughput exceeds what one primary can handle, you shard.

Sharding Architecture

Client
  |
mongos (router) ---- Config Servers (3-member replica set)
  |                    Stores: chunk ranges, shard metadata
  |
  +---- Shard 1 (replica set)  [chunks A-M]
  +---- Shard 2 (replica set)  [chunks N-Z]
  +---- Shard 3 (replica set)  [chunks overflow]

mongos: Stateless query router. Your application connects to mongos, not to individual shards. Run multiple mongos instances behind a load balancer for high availability.

Config servers: A three-member replica set that stores cluster metadata -- which chunks live on which shards, chunk ranges, and the shard key definition. If config servers go down, the cluster is read-only.

Shards: Each shard is a replica set holding a subset of the data.

How Data Is Distributed

MongoDB splits each collection into chunks (default 128MB). Each chunk covers a range of shard key values. Chunks are distributed across shards.

When a chunk exceeds the size limit, MongoDB splits it into two smaller chunks. When one shard has significantly more chunks than another, the balancer migrates chunks between shards.

Shard Key Selection

This is the single most consequential decision you'll make with MongoDB. A bad shard key cannot be changed without migrating the entire collection. Choose wrong, and you'll spend months fixing it.

What Makes a Good Shard Key

A good shard key has three properties:

High cardinality -- many distinct values so data distributes evenly.
Even distribution -- values are roughly equally frequent.
Query isolation -- common queries include the shard key, so mongos routes to one shard instead of broadcasting.

Shard Key Anti-Patterns

Monotonically Increasing Key (ObjectId, timestamp)

Shard 1: [min, 2024-01-01)    -- old data, barely touched
Shard 2: [2024-01-01, 2024-06-01)  -- recent data, some traffic
Shard 3: [2024-06-01, max)    -- ALL current writes go here (HOT SHARD)

All new inserts go to the shard owning the maximum range. You get one overloaded shard and N-1 idle shards. This defeats the entire purpose of sharding.

Low Cardinality Key (country, status)

If you shard by country and 60% of your users are in the US, 60% of your data sits on one shard. You can't split a chunk where every document has the same shard key value.

Completely Random Key (random UUID)

Even distribution, but zero query isolation. Every query hits every shard. Your mongos broadcasts all reads. Throughput drops.

Good Shard Key Examples

Shard Key	Cardinality	Distribution	Query Isolation	Best For
`hashed _id`	High	Even	None (scatter-gather)	Write-heavy, no range queries
`{ tenant_id, _id }`	High	Even per tenant	Yes -- queries scoped to tenant	Multi-tenant SaaS
`{ user_id, timestamp }`	High	Even	Yes -- queries scoped to user	User activity logs
`{ region, order_id }`	Medium-High	Depends on region mix	Regional queries	Geo-distributed orders

The compound shard key trick: { tenant_id: 1, _id: 1 } gives you query isolation (all queries include tenant_id) AND high cardinality (tenant_id + _id is always unique).

Stripe uses a similar pattern. All data for a merchant lives on the same shard, so queries scoped to a single merchant are fast. Cross-merchant analytics queries scatter-gather, but those run offline anyway.

The Balancer

The balancer is a background process that migrates chunks between shards to keep the distribution even.

How It Works

Balancer checks chunk counts per shard every few seconds.
If the difference exceeds a threshold (default: 2 chunks for < 20 chunks, 8 for 20-79 chunks), it starts migrating.
Migration: source shard copies chunk data to destination shard, then updates config servers.

Balancer Problems

Latency spikes during migration. Moving a 128MB chunk means reading from source, writing to destination, and updating metadata. During migration, queries for the migrating chunk may be slower.

Jumbo chunks. If every document in a chunk has the same shard key value, the chunk can't be split. It sits on one shard and grows without bound. This is the consequence of a low-cardinality shard key.

Balancer windows. In production, restrict the balancer to off-peak hours:

db.settings.updateOne(
  { _id: "balancer" },
  {
    $set: {
      activeWindow: {
        start: "02:00",
        stop: "06:00"
      }
    }
  },
  { upsert: true }
)

Patterns for System Design Interviews

Pattern 1: Multi-Tenant SaaS

"We shard by { tenant_id: 1, _id: 1 }. Each tenant's data is co-located on the same shard. API queries always include tenant_id, so mongos routes to a single shard. Large tenants may dominate a shard, but we can move their data to a dedicated shard using zone sharding."

Pattern 2: Time-Series with Sharding

"We shard by { sensor_id: 1, timestamp: 1 }. Queries always include sensor_id (which sensor's data?) and a timestamp range (last 24 hours?). This avoids hot shards because different sensors hash to different shards."

Pattern 3: Global User Store

"Three-member replica set in each region using Atlas Global Clusters. Users in EU read from EU secondaries. Writes go to the primary in US-East. Read preference nearest for latency. Write concern majority for durability."

Anti-Pattern to Call Out

"We would NOT shard by timestamp alone because all current writes would go to a single shard. We'd also avoid sharding prematurely -- a single replica set handles most workloads up to a few TB and 50K writes/sec."

Shard Key Selection

Trade-Offs Table

Factor	Replica Sets Only	Sharded Cluster
Write scaling	Limited to one primary	Writes distributed across shards
Read scaling	Read from secondaries	Read from any shard + secondaries
Storage limit	Single machine disk	Sum of all shard disks
Operational complexity	Low (3 nodes)	High (mongos + config + N shards)
Query latency	Predictable	Depends on shard key (targeted vs scatter)
Failover	Automatic (10-12 sec)	Per-shard failover (each shard is a replica set)
Cost	3 nodes minimum	3 shards x 3 replicas = 9 nodes minimum
Shard key change	N/A	Requires full collection migration
Cross-shard queries	N/A	Scatter-gather, slow
Transactions	Full support	Limited -- multi-shard transactions are slower

Interview Gotchas

"When should you shard?"

Not until you need to. A single replica set handles most workloads. Shard when: (1) data exceeds single-node disk, (2) write throughput exceeds single-primary capacity, or (3) you need to isolate tenants for compliance. Premature sharding adds complexity for zero benefit.

"Can you change the shard key?"

Since MongoDB 5.0, you can refine a shard key by adding a suffix field. You cannot change the existing fields. Full shard key changes require creating a new collection and migrating data. This takes hours or days for large collections.

"What happens during a failover?"

The replica set holds an election. During the ~12-second election window, writes fail. Reads from secondaries continue if read preference allows it. Applications using w: majority will get write errors until the new primary is elected. Your retry logic needs to handle this.

"How does MongoDB compare to Cassandra for write scaling?"

Both shard writes across nodes. Cassandra uses consistent hashing (no balancer, no jumbo chunks). MongoDB uses range-based sharding with a balancer. Cassandra is leaderless (any node accepts writes). MongoDB is leader-based (writes go to shard primary). Cassandra gives you better write throughput at the cost of weaker consistency.

"What's the minimum cluster size for sharding?"

Two shards (each a 3-member replica set) + 3 config servers + at least 1 mongos = 10 nodes minimum. In practice, you'd run mongos on application servers, but that's still 9 dedicated database nodes. Don't shard a collection with 10GB of data.

Summary

Replica sets are table stakes -- every production MongoDB deployment uses them. They give you high availability, automatic failover, and read scaling. The decisions are straightforward: three members, w: majority, read from secondaries for analytics.

Sharding is a one-way door. The shard key decision is irreversible (practically speaking). Get it right by optimizing for your most common query pattern, ensuring high cardinality, and avoiding monotonically increasing values. When an interviewer asks about MongoDB scaling, lead with the shard key discussion. It's where the real engineering happens.

Sharding, Replica Sets, and Consistency

What It Is

Replica Sets: The Foundation

How Replication Works

Elections and Failover

Arbiter Nodes

Read Preferences

The Replication Lag Problem

Write Concern

The w: 1 Trap

wtimeout

Read Concern

The Read-Write Concern Matrix

Sharding: Horizontal Scaling

Sharding Architecture

How Data Is Distributed

Shard Key Selection

What Makes a Good Shard Key

Shard Key Anti-Patterns

Monotonically Increasing Key (ObjectId, timestamp)

Low Cardinality Key (country, status)

Completely Random Key (random UUID)

Good Shard Key Examples

The Balancer

How It Works

Balancer Problems

Patterns for System Design Interviews

Pattern 1: Multi-Tenant SaaS

Pattern 2: Time-Series with Sharding

Pattern 3: Global User Store

Anti-Pattern to Call Out

Trade-Offs Table

Interview Gotchas

"When should you shard?"

"Can you change the shard key?"

"What happens during a failover?"

"How does MongoDB compare to Cassandra for write scaling?"

"What's the minimum cluster size for sharding?"

Summary

The `w: 1` Trap