Sharding, Replicas, and Near-Real-Time Search

TL;DR

An Elasticsearch shard is just a Lucene index on a single node — scaling means distributing those shards across nodes, and the number one operational mistake teams make is choosing the wrong shard count at index creation because you can't change it later without a full reindex.

What It Is

ILM Tiers

A single Lucene index has limits. It runs on one machine. One CPU, one disk, one set of memory. When your data exceeds what one machine can hold — or your query load exceeds what one machine can serve — you need to split.

Elasticsearch solves this with shards. Each shard is an independent Lucene index living on one node. An ES "index" (like products or logs-2024-04) is really a collection of shards spread across your cluster.

Wikipedia runs its search on Elasticsearch with hundreds of shards across dozens of nodes. GitHub originally used ES for code search across 100+ million repositories, before building their custom Blackbird engine in 2023. Neither could work on a single Lucene instance. Sharding makes it possible.

This lesson covers how sharding works, how replicas provide redundancy, and how the query execution model creates trade-offs you need to know for interviews.

When to Use Sharding vs Alternatives

Scenario	Approach	Why
< 5GB of data	Single shard (1P, 1R)	Overhead of multiple shards not worth it
5GB - 50GB	1-5 primary shards	Each shard 10-50GB
50GB - 500GB	5-20 primary shards	Distribute across 3-10 nodes
500GB+	Time-based indexes + ILM	Hot-warm-cold architecture
Read-heavy, write-light	More replicas	Replicas serve read traffic
Write-heavy (logs, metrics)	More primary shards	Parallelizes writes

Here's a strong opinion: most teams over-shard. They create 20 primary shards for an index that holds 5GB. Each shard carries memory overhead — segment metadata, field data caches, thread pools. Twenty shards at 250MB each is worse than two shards at 2.5GB each. The overhead eats more resources than the data.

Internals — How Sharding Works

Primary and Replica Shards

Index: "products" — 3 primary shards, 1 replica each

Node 1              Node 2              Node 3
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  P0          │    │  P1          │    │  P2          │
│  R2          │    │  R0          │    │  R1          │
└─────────────┘    └─────────────┘    └─────────────┘

P0, P1, P2 = primary shards (handle writes)
R0, R1, R2 = replica shards (copies on different nodes)

Primary shards handle all write operations. When you index a document, ES routes it to a primary shard based on hash(document_id) % num_primary_shards. The primary shard indexes the document and then forwards it to its replicas.

Replica shards are exact copies of their primary. They serve two purposes: fault tolerance (if a node dies, the replica gets promoted to primary) and read scaling (queries can hit replicas, spreading load across nodes).

The number of primary shards is set at index creation. You cannot change it later. This is the most important operational decision you make when creating an index. Replicas, on the other hand, can be adjusted at any time.

PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Gotcha

Changing the number of primary shards requires creating a new index with the desired shard count, then reindexing all data. For a 500GB index, this can take hours. Plan your shard count carefully from the start.

Document Routing

How does ES decide which shard holds a document?

shard_number = hash(routing_key) % number_of_primary_shards

By default, routing_key is the document's _id. You can override it with a custom routing value.

Custom routing is powerful but dangerous. If you route all documents for tenant_id: acme to the same shard, queries filtered by that tenant only hit one shard instead of all shards. Faster queries. But if one tenant has 10x the data of others, you get a hot shard.

# Custom routing — all docs for tenant "acme" go to the same shard
PUT /products/_doc/123?routing=acme
{
  "tenant_id": "acme",
  "name": "Widget"
}

# Query only hits the shard with "acme" data
GET /products/_search?routing=acme
{
  "query": { "match": { "name": "widget" } }
}

Slack uses custom routing by workspace. Each workspace's messages live on the same shard. This makes per-workspace searches fast because only one shard is queried. The downside? Large workspaces like Salesforce's or IBM's create hot shards.

Scatter-Gather Query Execution

When a query arrives, it goes through two phases:

Client
  │
  ▼
Coordinating Node
  │
  ├──── Query Phase ────┐
  │                     │
  ▼         ▼         ▼
Shard 0   Shard 1   Shard 2
  │         │         │
  │  top N  │  top N  │  top N
  │  docIDs │  docIDs │  docIDs
  │         │         │
  ├─── Merge & Sort ───┤
  │                     │
  ├──── Fetch Phase ────┤
  │                     │
  ▼         ▼         ▼
Shard 0   Shard 1   Shard 2
  │         │         │
  │  docs   │  docs   │  docs
  │         │         │
  ├──── Return to Client ──┤

Query phase: The coordinating node sends the query to every shard (primary or replica). Each shard runs the query locally against its Lucene index, scores and sorts the results, and returns the top N document IDs with their scores. No full documents are returned — just IDs and scores.

Fetch phase: The coordinating node merges the sorted ID lists from all shards, picks the global top N, then fetches the actual document bodies from the shards that hold them.

This is the scatter-gather pattern. It works, but it has a cost.

The deep pagination problem. If a user asks for page 100 with 10 results per page (from: 990, size: 10), every shard must return the top 1,000 documents. With 10 shards, the coordinating node merges 10,000 documents to find the right 10. At page 1,000, that's 100,000 documents from each shard. This gets expensive fast.

The fix is search_after — a cursor-based approach that avoids the deep pagination tax entirely. You pass the sort values of the last result, and each shard seeks directly to that point.

# Page 1
GET /products/_search
{
  "size": 10,
  "sort": [{ "created_at": "desc" }, { "_id": "asc" }]
}
# Returns last result with sort values: [1713400000, "doc_xyz"]

# Page 2 — no "from" needed
GET /products/_search
{
  "size": 10,
  "sort": [{ "created_at": "desc" }, { "_id": "asc" }],
  "search_after": [1713400000, "doc_xyz"]
}

Shard Sizing — The 10-50GB Rule

Shard size matters more than shard count. The guidelines from Elastic's own documentation:

Shard Size	Verdict	Why
< 1GB	Too small	Excessive overhead per shard. Merge pressure.
1-10GB	Small but OK	For high-throughput writes or time-based indexes
10-50GB	Sweet spot	Good balance of parallelism and efficiency
50-100GB	Getting large	Recovery after node failure takes longer
> 100GB	Too large	Slow recovery, risky rebalancing

Each shard has overhead: segment metadata in memory, thread pool allocations, cluster state entries. A cluster with 10,000 tiny shards can grind to a halt just from the overhead — even if the total data is small.

The formula for estimating shard count:

primary_shards = ceil(total_data_size / target_shard_size)

Example:
- Expected data size: 200GB
- Target shard size: 40GB
- Primary shards: ceil(200 / 40) = 5

But don't forget growth. If your data will reach 500GB in a year, plan for that. Since you can't change primary shard count later, under-provisioning now means a painful reindex later.

Replicas — Read Scaling and Fault Tolerance

How Replicas Work

A replica is a full copy of a primary shard stored on a different node. Reads can go to either primary or replica. Writes go to the primary first, then get forwarded to replicas.

Write Path:
Client → Coordinating Node → Primary Shard P0 → Replica Shard R0
                                                  (on different node)

Read Path:
Client → Coordinating Node → P0 or R0 (round-robin)

More replicas = more read throughput. With 1 replica, you have 2 copies of each shard to serve reads. With 2 replicas, you have 3 copies. But more replicas also means more disk space and more write amplification (every write goes to primary + all replicas).

# Increase replicas on the fly — no reindex needed
PUT /products/_settings
{
  "number_of_replicas": 2
}

Replica Trade-offs

Replicas	Read Throughput	Write Latency	Disk Usage	Fault Tolerance
0	1x (primary only)	Fastest	1x	None — data loss on node failure
1	2x	Moderate	2x	Survives 1 node failure
2	3x	Higher	3x	Survives 2 node failures
3+	4x+	Highest	4x+	Rarely needed

For most production systems, 1 replica is the standard. Two replicas make sense for high-read workloads or when operating in unstable environments. Zero replicas are acceptable only for ephemeral data you can re-create (like a search index backed by a primary database).

Index Lifecycle Management (ILM) — Hot-Warm-Cold

For time-series data (logs, metrics, events), you don't want to keep last week's logs on expensive SSDs. ILM automates the migration of indexes through hardware tiers.

Hot Tier (NVMe SSDs)          Warm Tier (SSDs)           Cold Tier (HDDs)
┌─────────────────┐          ┌─────────────────┐         ┌─────────────────┐
│  logs-2024-04   │ ──7d──→  │  logs-2024-03   │ ──30d─→ │  logs-2024-01   │
│  (actively      │          │  (read-only,    │         │  (compressed,   │
│   written)      │          │   fewer replicas)│         │   rarely read)  │
└─────────────────┘          └─────────────────┘         └─────────────────┘
                                                                 │
                                                              90d │
                                                                 ▼
                                                              Delete

Hot phase: Active writes and queries. Fast hardware. Full replicas.

Warm phase: Read-only. Force merge segments (fewer, larger segments = faster reads). Shrink replica count. Move to cheaper SSDs.

Cold phase: Rarely queried. Compressed. Stored on cheap HDDs. Consider frozen indexes that unmount from the cluster and reload on demand.

ILM policy example:

PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": { "number_of_replicas": 0 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": { "data": "cold" }
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

Every major company running ELK for logging uses some form of ILM. Netflix stores petabytes of log data with aggressive tiering. Hot data on NVMe, cold data on S3-backed snapshots. Without ILM, their storage costs would be 10x higher.

Cross-Cluster Search

For global applications, you might have an ES cluster in each region. US users search the US cluster, EU users search the EU cluster. But sometimes you need to query across both.

US Cluster                         EU Cluster
┌──────────────┐                  ┌──────────────┐
│  products-us │                  │  products-eu │
│  3 shards    │   ←── CCS ──→   │  3 shards    │
└──────────────┘                  └──────────────┘

Cross-cluster search (CCS) lets a coordinating node in one cluster send queries to remote clusters. The query fans out to both local and remote shards, and results are merged.

The latency cost is real. Querying a remote cluster adds network round-trip time (50-200ms cross-region). For global search, many teams replicate the full index to each region instead of using CCS. More storage, but consistent latency.

Elasticsearch vs PostgreSQL GIN

This comparison comes up in interviews constantly. The answer depends on scale and query complexity.

Factor	PostgreSQL GIN + tsvector	Elasticsearch
Data size	< 5M docs	> 5M docs
Query latency	Low for simple queries	Low at any scale
Query complexity	Basic keyword matching	Full BM25, fuzzy, multi-field, aggregations
Operational cost	None (same database)	Separate cluster to manage
Data consistency	ACID — search is always consistent with writes	Eventually consistent (refresh interval)
Schema changes	ALTER TABLE	Reindex
Analytics	Basic COUNT/GROUP BY	Full aggregation framework
Setup time	One CREATE INDEX statement	Cluster provisioning + mapping + analyzers

The honest take: PostgreSQL full-text search is underrated. Hacker News search uses PostgreSQL. It's a 5-million-post corpus with decent query complexity, and PostgreSQL handles it fine. If your only reason for adding Elasticsearch is "we need search," try PostgreSQL's tsvector first. Add ES when you outgrow it.

The signal to switch: when you need custom analyzers, multi-field boosting, function scoring, or your data exceeds what PostgreSQL can index efficiently. That's usually somewhere between 5M and 50M documents, depending on query patterns.

Elasticsearch vs Solr

Both use Lucene. Both support sharding and replication. The choice between them has more to do with ecosystem than technology.

Aspect	Elasticsearch	Solr
Clustering	Built-in, automatic	Requires ZooKeeper (SolrCloud)
API	REST / JSON native	REST, but XML-first heritage
Real-time search	Near-real-time by default	Near-real-time with soft commits
Analytics	Powerful aggregation framework	Facets, pivot facets
Ecosystem	Kibana, Beats, Logstash (ELK)	Banana, Hue (less mature)
Market share	Won decisively	Declining
Community	Massive, active	Smaller, mostly legacy

Elasticsearch won the market war. The ELK stack (Elasticsearch + Logstash + Kibana) became the default observability platform for an entire generation of infrastructure. Solr is still used — Apple's Siri uses Solr for backend search — but new deployments almost always choose ES.

Both Solr and ES support SQL — ES added SQL support in version 6.3 (X-Pack). However, Solr's JDBC integration is more mature for BI tools and analytics teams who think in SQL. For engineering teams building applications, ES's JSON API is more natural.

Patterns for System Design Interviews

Pattern 1: "Design a Logging Pipeline"

App Servers → Filebeat → Logstash → Elasticsearch → Kibana
                (ship)    (parse)     (store/index)   (visualize)

Key design decisions: - Index per day (logs-2024-04-18), not one giant index. Use ILM to tier and delete. - Hot-warm-cold architecture. Today's logs on NVMe, last week's on SSD, last month's on HDD. - Shard count per daily index: 1-3 primaries (daily volume is usually 10-50GB). - Retention: 7 days hot, 30 days warm, 90 days cold, then delete.

Pattern 2: "Design Product Search That Handles Black Friday Traffic"

Normal traffic: 1,000 searches/sec. Black Friday: 50,000 searches/sec.

Approach: Scale replicas before Black Friday. With 3 primary shards and normally 1 replica, you have 6 shards serving reads. Scale to 4 replicas: 15 shards serving reads. 2.5x read capacity without touching primary shards.

Pre-warm caches by running common queries before the traffic spike. ES caches filter results — warm those up.

Pattern 3: "Explain Why Adding More Shards Made Search Slower"

Classic interview trap. More shards means more fan-out during scatter-gather. With 100 shards, a single query hits 100 Lucene indexes and merges 100 result sets. The coordination overhead dominates.

The fix: fewer, larger shards. Or use routing to limit the query to a subset of shards.

Trade-offs Table

Decision	Trade-off
More primary shards	Better write parallelism, but more scatter-gather overhead on reads
Fewer primary shards	Less coordination overhead, but limits write throughput
More replicas	Better read throughput and fault tolerance, but more disk and write latency
Fewer replicas	Less storage and faster writes, but less read capacity and resilience
Smaller shards (< 5GB)	Fast recovery, but excessive memory overhead per shard
Larger shards (> 50GB)	Efficient resource usage, but slow recovery after failure
Custom routing	Faster targeted queries, but risk of hot shards
Default routing	Even distribution, but queries always hit all shards
Lower refresh interval	Documents searchable faster, but more segments and merge pressure
Higher refresh interval	Fewer segments, but longer visibility delay

Scatter Gather

Interview Gotchas

"Can you change the number of primary shards after index creation?"

No. Primary shard count is fixed at creation. To change it, you create a new index with the desired shard count and reindex all data. This can take hours for large indexes. Replicas can be changed at any time.

"A query is slow. The cluster has 200 shards across 5 nodes. What's wrong?"

Likely over-sharding. Each query fans out to all 200 shards. The coordinating node must merge 200 result sets. Solution: consolidate to fewer, larger shards. Or use index aliases with filtered routing.

"How does Elasticsearch handle a node failure?"

If a node with primary shard P0 dies, a replica R0 on another node is promoted to primary. ES then creates a new replica on a surviving node to restore the replication factor. If number_of_replicas: 0, the data is lost. Always run at least 1 replica in production.

"What's the difference between a refresh and a flush?"

A refresh makes documents searchable — the in-memory buffer becomes a new Lucene segment in the filesystem cache. A flush is a Lucene commit — segments are fsync'd to disk and the transaction log is cleared. Refresh = visibility. Flush = durability. They're independent operations.

"How would you estimate cluster sizing for 1TB of search data?"

Back-of-envelope: 1TB data with 1 replica = 2TB raw storage. Add 15% for overhead (segment metadata, doc values, transaction logs) = 2.3TB. Target 40GB per shard = 25 primary shards + 25 replicas = 50 total shards. At 3 shards per node = ~17 nodes. Each node needs 32GB RAM (half for JVM heap, half for OS page cache). This is a medium-sized cluster.

"Why does Elasticsearch use the filesystem cache instead of managing its own cache?"

Because the OS page cache already does this well for immutable files. Lucene segments are immutable once flushed. The OS caches them automatically. Managing a custom cache would duplicate this work and waste memory. This is why ES recommends giving half the RAM to the JVM and leaving the other half for the OS — the OS half is doing real work caching segments.

Key Takeaways

Concept	What to Remember
Primary shards are immutable decisions	Set at index creation, cannot change without reindex
10-50GB per shard	The sweet spot. Over-sharding is the most common mistake.
Scatter-gather	Every query hits every shard. More shards = more overhead.
Replicas are flexible	Add/remove at any time. Serve reads and provide fault tolerance.
ILM for time-series	Hot-warm-cold tiers reduce storage costs by 5-10x.
search_after for pagination	Avoid deep pagination with `from/size`. Use cursors.
Custom routing	Concentrates data for faster queries but risks hot shards.
Refresh != Flush	Refresh = searchable. Flush = durable. Different operations.