Inverted Index Internals — How Lucene Works

TL;DR

An inverted index flips the relationship between documents and words — instead of asking "what words are in this document," it answers "which documents contain this word" near-instantly via the terms dictionary, and Elasticsearch is just Lucene with a distributed shell bolted on top.

What It Is

Inverted Index

Every search engine you've ever used runs on an inverted index. Google, Bing, Elasticsearch, Solr, the search bar on your favorite e-commerce site. All of them.

The idea is dead simple. A forward index maps documents to terms. An inverted index maps terms to documents. That's it. That flip changes everything.

Forward index — what you'd build naively:

Document 1 → ["the", "quick", "brown", "fox"]
Document 2 → ["the", "lazy", "brown", "dog"]
Document 3 → ["quick", "fox", "jumps"]

Searching for "fox" means scanning every document. O(N) where N is the number of documents. With 100 million documents, that's not going to work.

Inverted index — what Lucene actually builds:

"brown" → [Doc 1, Doc 2]
"dog"   → [Doc 2]
"fox"   → [Doc 1, Doc 3]
"jumps" → [Doc 3]
"lazy"  → [Doc 2]
"quick" → [Doc 1, Doc 3]
"the"   → [Doc 1, Doc 2]

Searching for "fox" is now a lookup in Lucene's sorted terms dictionary (an FST — finite state transducer). O(log T) where T is the number of unique terms — effectively near-instant. Searching for "quick fox" means intersecting two posting lists: [Doc 1, Doc 3] and [Doc 1, Doc 3] = [Doc 1, Doc 3]. Still fast.

Elasticsearch wraps Apache Lucene, which does all the real indexing work. ES adds distributed coordination, REST APIs, and cluster management. But the core search engine is Lucene. When your interviewer says "build a search system from scratch," they're really asking you to describe what Lucene does.

When to Use vs Alternatives

Use Case	Best Choice	Why
Full-text search over > 5M docs	Elasticsearch / Lucene	Purpose-built for this
Simple LIKE queries, < 1M docs	PostgreSQL `tsvector` / GIN index	No extra infrastructure
Autocomplete / typeahead	Elasticsearch with `completion` suggester or prefix queries	Optimized for prefix matching
Exact key lookups only	Redis or DynamoDB	Inverted index is overkill
Log search and analytics	Elasticsearch + Kibana (ELK stack)	The industry standard
Semantic / vector search	Dedicated vector DB or ES with `dense_vector`	Inverted indexes don't do similarity

Here's a spicy take: most teams adopt Elasticsearch too early. If you have under 2 million documents and your queries are simple keyword matches, PostgreSQL's built-in full-text search with GIN indexes will do the job. Fewer moving parts. One less cluster to babysit at 3 AM.

Uber learned this the hard way. They ran Elasticsearch for some internal search workloads that could have been served by PostgreSQL, then spent engineering cycles managing cluster splits and shard rebalancing. The right tool depends on scale, not resume-driven development.

Internals — How Lucene Actually Works

The Tokenization Pipeline

Before a document enters the index, Lucene puts its text through an analysis pipeline. This is where raw text becomes searchable tokens.

Raw Text: "The Quick Brown Fox's Tail"
              │
              ▼
   ┌─────────────────┐
   │ Character Filter │  Strip HTML, normalize unicode
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │    Tokenizer     │  Split on whitespace / punctuation
   └────────┬────────┘
            ▼
   Tokens: ["The", "Quick", "Brown", "Fox's", "Tail"]
            │
            ▼
   ┌─────────────────┐
   │  Token Filters   │  Lowercase, stemming, stop words
   └────────┬────────┘
            ▼
   Final:  ["quick", "brown", "fox", "tail"]

Three stages, each doing one job:

Character filters run first. They modify the raw character stream before tokenization. The most common one strips HTML tags. You can also use them for unicode normalization — turning cafe and café into the same bytes.

Tokenizers split the stream into tokens. The standard tokenizer splits on whitespace and punctuation, handles URLs and emails as single tokens, and strips most punctuation. The whitespace tokenizer splits only on whitespace — keeping punctuation attached. CJK languages need special tokenizers because Chinese and Japanese don't use spaces between words.

Token filters transform individual tokens. This is where the magic happens:

Lowercase — "Quick" becomes "quick". Almost always enabled.
Stop words — removes "the", "a", "is", "and". Controversial. ES disabled this by default because stop words sometimes matter ("to be or not to be").
Stemming — "running", "runs", "ran" all become "run". English uses the Snowball stemmer. This lets a search for "running" match documents containing "ran."
Synonyms — "laptop" also matches "notebook computer." Configured via a synonyms file.

The analyzer you choose determines what matches what. A search that fails isn't always a relevance problem — sometimes it's an analyzer mismatch.

Gotcha

The same analyzer must be used at index time and query time. If you index with stemming but query without it, "running" in the query won't match "run" in the index. This is the number one debugging headache in Elasticsearch.

Posting Lists — The Heart of the Index

The inverted index stores a posting list for each term. A posting list is more than just document IDs:

Term: "elasticsearch"
Posting List: [
    (DocId: 42,  Freq: 3,  Positions: [5, 18, 102]),
    (DocId: 107, Freq: 1,  Positions: [22]),
    (DocId: 215, Freq: 7,  Positions: [1, 9, 15, 33, 44, 67, 89]),
]

Each entry contains:

Field	Purpose	Used For
DocId	Which document	Finding matching documents
Frequency	How many times the term appears	Scoring (more occurrences = more relevant)
Positions	Where in the document the term appears	Phrase queries ("quick brown fox" must appear in that order)

DocIds within a posting list are sorted. This is not arbitrary — it enables fast intersection.

Intersecting two posting lists (AND query) uses a merge-join on sorted lists:

"quick" → [1, 3, 7, 12, 45]
"fox"   → [1, 7, 23, 45, 88]

Pointer A starts at 1, Pointer B starts at 1.
1 == 1 → match! Add to results. Advance both.
3 < 7  → advance A.
7 == 7 → match! Add to results. Advance both.
12 < 23 → advance A.
45 > 23 → advance B.
45 == 45 → match! Add to results. Advance both.

Result: [1, 7, 45]

This is O(N + M) where N and M are the lengths of the two posting lists. For common terms with millions of postings, Lucene uses skip lists — pointers that let you jump ahead, skipping large chunks. Think of it like an express lane on the highway.

Lucene also compresses posting lists aggressively. DocIds are stored as deltas (differences between consecutive IDs), then variable-byte encoded. A posting list with 1 million entries might compress to just a few megabytes.

Segments — Immutable, Append-Only

Here's the part most people miss. Lucene doesn't maintain one big index. It maintains many small segments, each one an independent, immutable mini-index.

                Write Path
                    │
                    ▼
            ┌──────────────┐
            │  In-Memory    │
            │  Buffer       │  New documents land here
            └──────┬───────┘
                   │ flush (every 1 second by default)
                   ▼
    ┌──────────────────────────────┐
    │  Segment 0  │  Segment 1    │
    │  (1000 docs)│  (500 docs)   │
    ├─────────────┼───────────────┤
    │  Segment 2  │  Segment 3    │
    │  (2000 docs)│  (100 docs)   │  ← immutable on disk
    └──────────────────────────────┘
                   │
                   │ periodic merge
                   ▼
    ┌──────────────────────────────┐
    │  Merged Segment              │
    │  (3600 docs)                 │  ← old segments deleted
    └──────────────────────────────┘

The lifecycle:

Buffer — New documents go into an in-memory buffer. Not searchable yet.
Flush — The buffer is flushed to disk as a new segment. Now searchable. This is the "refresh" operation.
Search — A query fans out across all segments, merges results. More segments = more overhead.
Merge — Background process combines small segments into larger ones. Removes deleted documents.

Why immutable? Three reasons.

First, no locking. Readers never block writers, writers never block readers. A search reads from existing segments while new segments are being built in the background. No mutexes, no coordination.

Second, cacheability. An immutable segment can be memory-mapped and cached by the OS page cache. The OS is very good at caching things that don't change.

Third, compression. You can't compress data that's being modified. Immutable segments are compressed once and stay compressed.

This design is nearly identical to LSM trees in Cassandra and LevelDB. If you understand one, you understand the other. The merge process works like LSM compaction — combine smaller files into larger ones, discarding obsolete data.

Deletes and Updates — Tombstone Mechanics

You can't delete from an immutable segment. Instead, Lucene marks the document as deleted in a separate .del file (a bitset). The document still occupies space in the segment but gets filtered out at query time.

Updates are delete-then-insert. Update document 42? Mark the old version as deleted in its segment, write the new version to the in-memory buffer.

Deleted documents are physically removed only during segment merges. Until then, they waste disk space and slightly slow down queries (the filter step). If you do heavy updates, deleted docs can pile up. This is why merge policies matter.

Near-Real-Time Search

The refresh interval — the time between a document being indexed and becoming searchable — defaults to 1 second. This is the "near" in near-real-time.

Timeline:
t=0.0s  Document indexed → sits in memory buffer
t=0.5s  Still not searchable
t=1.0s  Refresh fires → new segment created → NOW searchable

You can lower the refresh interval to 200ms for faster visibility. But every refresh creates a new segment, and more segments mean more merge pressure and more file handles. It's a trade-off.

For bulk indexing (loading millions of documents), set refresh_interval: -1 to disable refreshes entirely. Run the bulk load, then trigger a single manual refresh. Netflix does this when rebuilding their search index — they disable refresh during the bulk load, then flip it back on.

Interview Tip

If an interviewer asks "why isn't my document showing up immediately after indexing?" — the answer is the refresh interval. This is the number one source of confusion for teams new to Elasticsearch. Documents are not searchable until the next refresh.

Patterns for System Design Interviews

Pattern 1: "Build a Search Engine Without Elasticsearch"

The interviewer is testing whether you understand what ES does under the hood.

Your answer: "I'd build an inverted index. Each term maps to a sorted list of document IDs. I'd batch writes into immutable segments, search across all segments in parallel, and merge segments in the background. For relevance, I'd use BM25 scoring on term frequency and document length."

That's literally what Lucene does. You just described 80% of Elasticsearch in four sentences.

Pattern 2: "Design Typeahead / Autocomplete"

Two approaches, both using inverted index concepts:

Prefix trie in the index. ES has a completion field type that builds an FST (finite state transducer) in memory. Fast, but uses RAM.
Edge n-gram tokenizer. Index "elasticsearch" as ["e", "el", "ela", "elas", ...]. A search for "ela" hits the posting list for the "ela" token. Works with standard inverted index machinery.

Pattern 3: "Search Across Multiple Fields"

E-commerce product search: query "wireless headphones" should match title, description, brand, and tags — with different weights.

{
  "multi_match": {
    "query": "wireless headphones",
    "fields": ["title^3", "brand^2", "description", "tags"],
    "type": "best_fields"
  }
}

The ^3 is a boost factor. Title matches are 3x more important than description matches. This is field boosting — covered in detail in the next lesson.

Pattern 4: "Why Not Just Use PostgreSQL Full-Text Search?"

PostgreSQL's tsvector + GIN index is a solid inverted index implementation. Use it when:

Under 5 million documents
Queries are keyword-based (not fuzzy, not multi-field with boosting)
You don't want to operate a separate search cluster
You need transactional consistency between search and relational data

Use Elasticsearch when:

Above 5 million documents
Need advanced analysis (custom tokenizers, language-specific stemming)
Need distributed search across shards
Need near-real-time indexing at high throughput

Trade-offs Table

Aspect	Inverted Index (Lucene/ES)	B-Tree (PostgreSQL)	Hash Index
Full-text search	Excellent	Mediocre (LIKE is O(N))	Not supported
Exact lookups	Good (but overkill)	Excellent	Excellent
Range queries	Moderate	Excellent	Not supported
Write speed	Fast (append-only)	Moderate (in-place updates)	Fast
Space overhead	High (posting lists + positions)	Moderate	Low
Update cost	Expensive (delete + re-insert)	Moderate (in-place)	Cheap
Phrase queries	Native (position data)	Not supported	Not supported
Fuzzy matching	Native (edit distance)	Not supported	Not supported

NRT Refresh

Interview Gotchas

"What happens when you index a document?"

It goes into an in-memory buffer. It's NOT searchable yet. On the next refresh (default 1 second), the buffer is flushed to a new immutable segment on disk. Now it's searchable. Background merge processes periodically combine small segments into larger ones.

"Why are Lucene segments immutable?"

Three reasons: no locking needed (readers and writers don't interfere), OS page cache works optimally on immutable files, and the data can be aggressively compressed. The trade-off is that deletes leave behind tombstones until the next merge.

"How does phrase search work?"

Posting lists store term positions within documents. For the phrase "quick brown fox," Lucene finds documents containing all three terms, then checks that the positions are consecutive: position(quick) + 1 == position(brown) and position(brown) + 1 == position(fox). This is why the positions field in the posting list matters.

"What's the difference between an analyzer and a tokenizer?"

A tokenizer is one component of an analyzer. An analyzer is the full pipeline: character filters, then a tokenizer, then token filters. When people say "the standard analyzer," they mean the standard tokenizer plus lowercase and stop word token filters.

"Can you update a single field in a Lucene document?"

No. Lucene documents are immutable. Updating one field requires deleting the entire document and re-indexing a new version with the updated field. This is why heavy partial updates are expensive in Elasticsearch. If you need fast field-level updates, that's a database problem, not a search engine problem.

"How does Elasticsearch differ from raw Lucene?"

Lucene is a library — a single-node search engine in a JAR file. Elasticsearch adds distribution (sharding, replication), a REST API, cluster coordination, index lifecycle management, and aggregations. Think of it like the difference between SQLite and PostgreSQL. Same core concept, different operational layer.

Key Takeaways

Concept	What to Remember
Inverted index	Maps terms to documents, not documents to terms. O(log T) lookup per term via FST — effectively near-instant.
Posting lists	Store docId, frequency, and positions. Sorted by docId for fast intersection.
Segments	Immutable, append-only. New segments created on refresh, merged in background.
Analysis pipeline	Character filters → tokenizer → token filters. Must match at index and query time.
Refresh interval	Default 1 second. Documents not searchable until refresh.
Deletes	Tombstoned, not removed. Cleaned up during segment merge.
Near-real-time	"Near" because of the refresh interval. Not "real" because of the 1-second gap.