Relevance Scoring — BM25, Boosting, and Tuning

TL;DR

Search is useless if results come back in the wrong order — BM25 replaced TF-IDF as the default scoring algorithm in Elasticsearch 5.0 because it handles term frequency saturation and document length normalization, and knowing how to tune it separates "I used Elasticsearch" from "I understand search."

What It Is

Relevance scoring is the algorithm that decides which matching documents appear first. Finding documents that contain your search terms is the easy part. Ranking them so the best result sits at position 1 is the hard part.

Two users search for "python tutorial." One wants a beginner guide. The other wants advanced metaprogramming. Both queries contain the same terms, so scoring alone decides which tutorial appears first. Ranking quality is also a large part of why Google displaced earlier search engines in the late '90s: its link-based PageRank scoring surfaced better results.

Elasticsearch has used BM25 as its default scoring function since version 5.0. Before that, it used TF-IDF. Understanding both matters because BM25 is TF-IDF's direct successor, and interviewers love asking why the switch happened.

When to Use Scoring vs When to Skip It

Not every query needs scoring. This distinction trips up a lot of teams.

| Query Type | Needs Scoring? | Example |
|---|---|---|
| Full-text search | Yes | "best wireless headphones for running" |
| Filtering | No | "show all products where category = electronics" |
| Exact match | No | "find user with email = alice@example.com" |
| Aggregations | No | "count orders by status" |
| Log search | Sometimes | "find ERROR logs" (filter) vs "most relevant error" (score) |

When scoring doesn't matter, use a filter context instead of a query context in Elasticsearch. Filter clauses skip scoring entirely, and Elasticsearch can cache the results of frequently used filters, so moving non-scoring clauses from query to filter usually lowers latency.

{
  "query": {
    "bool": {
      "must": {
        "match": { "description": "ocean view" }
      },
      "filter": [
        { "term": { "city": "San Francisco" } },
        { "range": { "price": { "lte": 200 } } }
      ]
    }
  }
}

The must clause scores. The filter clauses don't. This is the correct pattern for most e-commerce and marketplace search — score on text relevance, filter on structured attributes.
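
As a minimal sketch of that pattern in application code, the helper below assembles the same request body from a text query plus structured constraints. The function name and its parameters are hypothetical, not part of any Elasticsearch client; it only builds a plain dictionary that could be sent as the search body.

```python
def build_search_body(text_field, text, term_filters=None, range_filters=None):
    """Put the full-text clause in `must` (scored) and structured clauses in `filter` (unscored)."""
    filters = []
    for field, value in (term_filters or {}).items():
        filters.append({"term": {field: value}})
    for field, bounds in (range_filters or {}).items():
        filters.append({"range": {field: bounds}})
    return {
        "query": {
            "bool": {
                "must": {"match": {text_field: text}},
                "filter": filters,
            }
        }
    }


# Rebuilds the "ocean view" example from the text.
body = build_search_body(
    "description",
    "ocean view",
    term_filters={"city": "San Francisco"},
    range_filters={"price": {"lte": 200}},
)
```

Keeping the split in one helper makes it harder for a structured constraint to slip into the scored part of the query by accident.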

Internals — TF-IDF and Its Problems

Term Frequency (TF)

How many times does the search term appear in the document? More occurrences should mean higher relevance.

Document A: "Redis is a data structure server. Redis stores data in memory."
Document B: "Redis is popular."

Search: "Redis"

TF in Document A = 2
TF in Document B = 1

Document A scores higher for "Redis" — it mentions Redis twice.

Makes sense so far. But there's a problem. What if a document mentions "Redis" 500 times? Is it really 500x more relevant than a document that mentions it once? No. After a certain point, additional occurrences add almost nothing. A document about Redis that says the word 50 times isn't meaningfully more relevant than one that says it 10 times. TF-IDF uses raw term frequency, so it doesn't handle this well.

Inverse Document Frequency (IDF)

Terms that appear in many documents are less useful for distinguishing relevant results. "The" appears in every English document — it has zero discriminating power. "Elasticsearch" appears in fewer documents — it's more informative.

Corpus: 1,000,000 documents (log = natural log)

"the"           appears in 950,000 docs → IDF = log(1M / 950K) = 0.05 (low)
"database"      appears in  50,000 docs → IDF = log(1M / 50K)  = 3.0  (medium)
"elasticsearch" appears in   2,000 docs → IDF = log(1M / 2K)   = 6.2  (high)

IDF boosts rare terms and dampens common ones. This is the "inverse" part — the more documents a term appears in, the lower its IDF score.
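
The arithmetic can be checked with a few lines of Python. This sketch assumes the simplified textbook form used here, the natural log of total documents over documents containing the term. Lucene's production formula adds smoothing terms, so real scores will differ slightly.

```python
import math


def idf(total_docs, docs_with_term):
    """Simplified IDF: natural log of (corpus size / document frequency)."""
    return math.log(total_docs / docs_with_term)


N = 1_000_000
doc_frequencies = {"the": 950_000, "database": 50_000, "elasticsearch": 2_000}
idf_values = {term: idf(N, df) for term, df in doc_frequencies.items()}

for term, value in idf_values.items():
    print(f"{term:15s} {value:.2f}")
# the             0.05
# database        3.00
# elasticsearch   6.21
```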

TF-IDF Formula

score(term, document) = TF(term, document) × IDF(term, corpus)

Simple multiplication. For a multi-term query, sum the scores:

score("redis database", doc) = TF("redis", doc) × IDF("redis")
                              + TF("database", doc) × IDF("database")
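
A small, self-contained sketch of this formula, using the two Redis documents from the TF section plus a third, made-up document so that "redis" does not appear everywhere (which would make its IDF zero). The tokenizer is a deliberately naive assumption, not how Elasticsearch analyzes text.

```python
import math
import re


def tokenize(text):
    """Naive tokenizer: lowercase, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_score(query, doc, corpus):
    """Sum of raw TF x simplified IDF over the query terms."""
    doc_tokens = tokenize(doc)
    score = 0.0
    for term in tokenize(query):
        tf = doc_tokens.count(term)
        df = sum(1 for other in corpus if term in tokenize(other))
        if tf == 0 or df == 0:
            continue  # term contributes nothing to this document
        score += tf * math.log(len(corpus) / df)
    return score


corpus = [
    "Redis is a data structure server. Redis stores data in memory.",  # Document A
    "Redis is popular.",                                               # Document B
    "PostgreSQL is a relational database.",                            # hypothetical third doc
]

score_a = tf_idf_score("redis", corpus[0], corpus)
score_b = tf_idf_score("redis", corpus[1], corpus)
print(f"A: {score_a:.3f}  B: {score_b:.3f}")
```

Document A scores exactly twice Document B because raw TF is the only thing that differs, which is the behavior the next section criticizes.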

Why TF-IDF Falls Short

Two big problems:

No term frequency saturation. If "Redis" appears 100 times in a document, TF-IDF gives it 100x the weight of a document where "Redis" appears once. In practice, the 100-occurrence document might be a spammy SEO page that stuffed the keyword. You want diminishing returns after a certain frequency.

No document length normalization. A 10,000-word document will naturally contain more term occurrences than a 100-word document. TF-IDF doesn't account for this, so longer documents have an unfair advantage.

These flaws are why Elasticsearch switched to BM25.

BM25 — The Better Algorithm

BM25 (Best Matching 25 — the 25th iteration of a series of ranking functions from the 1990s) fixes both TF-IDF problems with two parameters.

The Formula

score(term, doc) = IDF(term) × [ TF(term, doc) × (k1 + 1) ]
                                 ─────────────────────────────
                                 TF(term, doc) + k1 × (1 - b + b × (docLength / avgDocLength))

Don't memorize the formula. Understand what the two knobs do.
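
For readers who prefer code to notation, here is the same per-term formula as a Python function. It is a direct transcription of the expression above, not Lucene's implementation. The IDF value and the document lengths in the example are made-up inputs.

```python
def bm25_term_score(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Per-term BM25 score as written in the formula above."""
    # Length normalization: 1.0 for an average-length doc, larger for longer docs.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Same term frequency, different lengths (hypothetical numbers).
avg_len = 500
focused = bm25_term_score(tf=3, idf=2.0, doc_len=250, avg_doc_len=avg_len)
rambling = bm25_term_score(tf=3, idf=2.0, doc_len=5_000, avg_doc_len=avg_len)
print(f"focused: {focused:.2f}  rambling: {rambling:.2f}")
```

With identical term counts, the shorter document scores higher because the same three occurrences make up a larger share of its text.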

Parameter k1 — Term Frequency Saturation

k1 controls how quickly the score saturates as term frequency increases. Default: 1.2.

(k1 = 1.2, average-length document, IDF factored out)
TF=1:   score contribution = 1.00  (baseline)
TF=2:   score contribution = 1.38  (38% boost)
TF=5:   score contribution = 1.77  (about 80% of the 2.2 maximum)
TF=10:  score contribution = 1.96  (nearly flat)
TF=100: score contribution = 2.17  (essentially capped at k1 + 1 = 2.2)

Compare this to TF-IDF, where TF=100 gives 100x the score. With k1=1.2 the contribution can never exceed k1 + 1 = 2.2, and most of that ceiling is reached by TF=5-10. A document mentioning "Redis" 5 times scores roughly 80% as high on that term as one mentioning it 100 times.

  • Lower k1 (e.g., 0.5): saturates faster, less sensitive to term frequency
  • Higher k1 (e.g., 3.0): slower saturation, term frequency matters more
  • k1 = 0: term frequency is completely ignored — only presence/absence matters
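
The effect of k1 is easy to see by sweeping it. The sketch below isolates the term-frequency part of BM25 for an average-length document (so the length factor is 1) and leaves IDF out. The function name is ours, chosen for illustration.

```python
def tf_component(tf, k1):
    """TF part of BM25 when docLength == avgDocLength; bounded above by k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)


for k1 in (0.5, 1.2, 3.0):
    row = ", ".join(f"tf={tf}: {tf_component(tf, k1):.2f}" for tf in (1, 2, 5, 10, 100))
    print(f"k1={k1} (cap {k1 + 1:.1f}) -> {row}")
```

Every row starts at 1.0 for a single occurrence. What changes with k1 is how high the ceiling is and how many occurrences it takes to get close to it.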

Parameter b — Length Normalization

b controls how much document length affects scoring. Default: 0.75.

b = 0:    No length normalization — long and short documents treated equally
b = 1:    Full normalization — score fully adjusted for document length
b = 0.75: Default — moderate normalization

A 10,000-word document naturally has more term occurrences than a 200-word document. With b=0.75, BM25 compensates for this. The long document doesn't get an unfair advantage just because it's long.

  • Lower b: favors longer documents (useful when longer documents genuinely contain more info)
  • Higher b: penalizes longer documents more aggressively (useful when short, focused documents are better)
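
A quick sweep over b shows the same thing numerically. This sketch compares a 200-word and a 10,000-word document with the same term count. The average length of 1,000 words is an assumed value for illustration.

```python
def bm25_tf_part(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 without the IDF factor, so only TF and length effects remain."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


avg_len = 1_000  # assumed corpus average
for b in (0.0, 0.75, 1.0):
    short = bm25_tf_part(tf=3, doc_len=200, avg_doc_len=avg_len, b=b)
    long_ = bm25_tf_part(tf=3, doc_len=10_000, avg_doc_len=avg_len, b=b)
    print(f"b={b}: short doc {short:.2f}, long doc {long_:.2f}")
```

At b=0 the two documents tie. As b rises, the gap in favor of the short document widens.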

When to Tune BM25

Most teams never touch k1 and b. The defaults work well for general-purpose search. But there are exceptions.

Short documents (tweets, product titles): Set b lower (0.3-0.5). Short documents have little variation in length, so length normalization adds noise.

Keyword-stuffed content (SEO-heavy sites): Set k1 lower (0.5-0.8). You want faster saturation so that repeating a term 50 times doesn't help.

Academic papers or legal documents: Set k1 higher (1.5-2.0). In these domains, term frequency genuinely signals relevance — a paper that discusses "transformer" 20 times is likely more focused on transformers than one that mentions it twice.

PUT /my_index
{
  "settings": {
    "similarity": {
      "custom_bm25": {
        "type": "BM25",
        "k1": 0.8,
        "b": 0.4
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "custom_bm25"
      }
    }
  }
}

Field Boosting — Not All Fields Are Equal

In real applications, a match in the title is more important than a match in the body. Field boosting lets you express this.

{
  "query": {
    "multi_match": {
      "query": "distributed systems",
      "fields": ["title^5", "abstract^2", "body"],
      "type": "best_fields"
    }
  }
}

A match in title has its score multiplied by 5 relative to the same match in body. A match in abstract is multiplied by 2. This simple pattern covers most multi-field search needs.

Multi-match types:

| Type | Behavior | Use When |
|---|---|---|
| best_fields | Score = highest scoring field | Terms should appear in the same field |
| most_fields | Score = sum of all field scores | Each field adds signal |
| cross_fields | Treats all fields as one big field | Person name split across first_name and last_name |
| phrase | Runs phrase match on each field | Exact phrase must appear |

best_fields is the default and works for most cases. cross_fields is the one that surprises people — it's the right choice when a single concept spans multiple fields. Searching for "John Smith" should match first_name: "John" and last_name: "Smith", not require both terms in the same field.
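
To make the difference between best_fields and most_fields concrete, here is a rough sketch of how per-field scores are combined. It is a simplification under stated assumptions: the per-field scores are invented, and the tie_breaker handling mirrors how a dis_max-style combination is usually described rather than Elasticsearch's exact internals.

```python
def combine_field_scores(field_scores, boosts, mode="best_fields", tie_breaker=0.0):
    """Combine boosted per-field scores the way the two multi_match types are described."""
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    if mode == "best_fields":
        best = max(boosted)
        # Other fields only count through the tie breaker (0.0 = ignore them).
        return best + tie_breaker * (sum(boosted) - best)
    if mode == "most_fields":
        return sum(boosted)
    raise ValueError(f"unsupported mode: {mode}")


# Hypothetical raw scores for one document, with the boosts from the example above.
field_scores = {"title": 1.0, "abstract": 2.0, "body": 3.0}
boosts = {"title": 5, "abstract": 2}

best = combine_field_scores(field_scores, boosts, "best_fields")
most = combine_field_scores(field_scores, boosts, "most_fields")
print(best, most)  # 5.0 12.0
```

Note that the title wins under best_fields even though its raw score was the lowest, which is exactly what the ^5 boost is for.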

Function Scoring — Beyond Text Relevance

Text relevance alone isn't enough. An e-commerce site needs to balance relevance with popularity, recency, and profit margin. A news site needs to factor in freshness. A social platform needs personalization.

Function scoring lets you combine BM25 with business signals:

{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "wireless headphones" }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log1p",
            "factor": 0.5
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "7d",
              "decay": 0.5
            }
          }
        },
        {
          "filter": { "term": { "sponsored": true } },
          "weight": 1.5
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}

What this does:

  1. BM25 scores the text relevance of "wireless headphones"
  2. Popularity factor boosts popular products (log scale to prevent one viral product from dominating)
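  3. Recency decay boosts newer products: the multiplier falls to 0.5 at 7 days old and drops off faster beyond that, because the curve is Gaussian rather than a fixed 50% per week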
  4. Sponsored boost gives sponsored products a 1.5x multiplier

The final score = BM25 x popularity x recency x sponsored.
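
The multiplication can be sketched in Python to get a feel for the numbers. The formulas below follow our reading of the Elasticsearch documentation (log1p as a base-10 log of 1 + factor × value, and a Gaussian decay that equals `decay` at a distance of `scale`), so treat them as an approximation rather than the engine's exact code. The function names and input values are made up.

```python
import math


def field_value_factor_log1p(value, factor):
    """log1p modifier: common log of (1 + factor * field value)."""
    return math.log10(1 + factor * value)


def gauss_decay(age_days, scale_days, decay, offset_days=0.0):
    """Gaussian decay: 1.0 at the origin, exactly `decay` at distance `scale_days`."""
    sigma_sq = -(scale_days ** 2) / (2 * math.log(decay))
    distance = max(0.0, age_days - offset_days)
    return math.exp(-(distance ** 2) / (2 * sigma_sq))


def final_score(bm25, popularity, age_days, sponsored):
    functions = [
        field_value_factor_log1p(popularity, factor=0.5),
        gauss_decay(age_days, scale_days=7, decay=0.5),
    ]
    if sponsored:
        functions.append(1.5)  # weight applies only when the filter matches
    return bm25 * math.prod(functions)  # score_mode and boost_mode: multiply


fresh = final_score(bm25=4.0, popularity=200, age_days=1, sponsored=False)
stale = final_score(bm25=4.0, popularity=200, age_days=14, sponsored=False)
print(f"fresh: {fresh:.2f}  stale: {stale:.2f}")
```

One consequence of multiplying everything: under these formulas a product with popularity 0 gets a factor of log10(1) = 0 and therefore a final score of 0, however well its text matches. If that is not intended, a missing-value default or a non-multiplicative boost_mode is worth considering.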

This is how Amazon's product search works conceptually. Text relevance is just one signal. Sales velocity, profit margin, Prime eligibility, and seller rating all feed into the ranking function. Amazon reportedly uses hundreds of signals — but the architecture is the same function_score concept.

Gotcha

Don't boost by raw popularity without a logarithmic modifier. If product A has 10,000 reviews and product B has 10, a linear boost makes B invisible regardless of text relevance. Use log1p or sqrt to compress the scale.
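
The compression is easy to quantify for the two products in this example. A short sketch, assuming the boost is simply the transformed review count:

```python
import math

reviews_a, reviews_b = 10_000, 10

linear_ratio = reviews_a / reviews_b
sqrt_ratio = math.sqrt(reviews_a) / math.sqrt(reviews_b)
log_ratio = math.log10(1 + reviews_a) / math.log10(1 + reviews_b)

print(f"linear: {linear_ratio:.0f}x  sqrt: {sqrt_ratio:.1f}x  log1p: {log_ratio:.1f}x")
# linear: 1000x  sqrt: 31.6x  log1p: 3.8x
```

A 1000x advantage is impossible for text relevance to overcome. A 3.8x advantage still rewards popularity but leaves room for a better text match to win.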

Analyzers for Different Use Cases

The analyzer determines what text matches what. Choosing the wrong one silently breaks search.

Standard Analyzer (Default)

Input:  "The Quick-Brown Fox's 2024 Strategy"
Output: ["the", "quick", "brown", "fox's", "2024", "strategy"]

Lowercases, splits on word boundaries, keeps numbers. Good enough for English in most cases.

Custom Analyzer for E-Commerce

Product search needs special handling. "iPhone 15 Pro Max" should match "iphone15", "iPhone 15", and "iphone pro."

{
  "settings": {
    "analysis": {
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "product_synonyms", "edge_ngram_filter"]
        }
      },
      "filter": {
        "product_synonyms": {
          "type": "synonym",
          "synonyms": ["iphone, apple phone", "laptop, notebook"]
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  }
}
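
To see what the edge_ngram filter above actually puts in the index, here is a small Python imitation of it. It assumes whitespace tokenization and skips the synonym step, so it illustrates the idea rather than reproducing the analyzer.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """All prefixes of `token` with length between min_gram and max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text):
    """Lowercase, split on whitespace, then expand each token into edge n-grams."""
    terms = []
    for token in text.lower().split():
        terms.extend(edge_ngrams(token))
    return terms


print(edge_ngrams("iphone"))
# ['ip', 'iph', 'ipho', 'iphon', 'iphone']
print(index_terms("iPhone 15 Pro Max"))
# ['ip', 'iph', 'ipho', 'iphon', 'iphone', '15', 'pr', 'pro', 'ma', 'max']
```

Because every prefix becomes its own indexed term, a partial query such as "iph" matches directly. A common companion choice is a separate search-time analyzer without the edge_ngram filter, so the query itself is not also exploded into prefixes.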

Language-Specific Analyzers

English stemming reduces "running" to "run." But every language has different rules.

| Language | Analyzer | Stemming Example / Notes |
|---|---|---|
| English | english | "running" → "run" |
| German | german | "Freundlichkeit" → "freundlich" |
| French | french | "mangeaient" → "mang" |
| Chinese | icu_analyzer + CJK tokenizer | Character-based, no word boundaries |
| Japanese | kuromoji | Morphological analysis required |

Chinese and Japanese are hard. There are no spaces between words, so the tokenizer must understand the language to split correctly. This is why Baidu and Yahoo Japan built custom tokenizers. If your system design involves CJK search, mention this complexity — it shows depth.

Patterns for System Design Interviews

Pattern 1: "Design Product Search for an E-Commerce Site"

The scoring layer matters here. Don't just say "use Elasticsearch."

Step 1: Define the ranking signals:

  • BM25 on title (5x boost), description, tags
  • Popularity (sales count, log-scaled)
  • Recency (newer listings preferred for fashion, less relevant for electronics)
  • Review score (4.5 stars beats 3.0 stars)
  • Conversion rate (products people actually buy after viewing)

Step 2: Use function_score to combine them. BM25 handles text. Function scores handle business logic. Multiply mode keeps both signals relevant.

Step 3: Filters don't score. Category, price range, brand — these are filters, not scoring signals. Move them to filter context.

Pattern 2: "Why Are Users Getting Bad Search Results?"

Debugging relevance is a real interview topic. Walk through the checklist:

  1. Analyzer mismatch — Is the query analyzed the same way as the indexed text?
  2. Missing field boosts — Is title weighted higher than body?
  3. TF gaming — Are spammy documents stuffing keywords? Lower k1.
  4. Length bias — Are long documents dominating? Increase b.
  5. No function scoring — Is the system ignoring popularity, recency, or quality signals?

Use the Explain API to debug:

GET /products/_explain/42
{
  "query": {
    "match": { "title": "wireless headphones" }
  }
}

This returns a breakdown of exactly how the score was calculated for document 42. Every factor, every multiplication. It's the single most useful debugging tool in Elasticsearch.
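
The response is a nested tree of `value` / `description` / `details` nodes, which is hard to read as raw JSON. A small helper like the one below flattens it into an indented outline. The sample response is hypothetical and heavily trimmed: real descriptions are longer and the numbers here are invented.

```python
def flatten_explanation(node, depth=0):
    """Turn a nested explanation node into indented 'value  description' lines."""
    lines = [f"{'  ' * depth}{node['value']:.3f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Hypothetical, trimmed explain response for illustration only.
sample_response = {
    "matched": True,
    "explanation": {
        "value": 3.2,
        "description": "sum of:",
        "details": [
            {"value": 1.9, "description": "weight(title:wireless)", "details": [
                {"value": 2.4, "description": "idf", "details": []},
                {"value": 0.79, "description": "tf", "details": []},
            ]},
            {"value": 1.3, "description": "weight(title:headphones)", "details": []},
        ],
    },
}

lines = flatten_explanation(sample_response["explanation"])
print("\n".join(lines))
```

Reading the outline top down shows which term, and which factor within that term, contributes most to a document's score.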

Pattern 3: "Design a News Feed with Personalized Ranking"

Combine text relevance with user-specific signals:

  • BM25 on the article text
  • Topic affinity: boost categories the user reads frequently
  • Social signal: boost articles shared by people the user follows
  • Freshness: exponential decay — a 1-hour-old article beats a 24-hour-old one

Facebook's News Feed ranking (before the algorithm changes) used a similar multi-signal scoring approach. They called it EdgeRank originally — affinity x weight x decay.
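
As a sketch of how those four signals might be combined, the snippet below multiplies a text score by affinity, a social boost, and an exponential freshness term. Everything in it is an assumption made for illustration: the function names, the 6-hour half-life, and the example values. It is not a description of any production feed.

```python
def freshness(age_hours, half_life_hours=6.0):
    """Exponential decay: the multiplier halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def feed_score(text_score, topic_affinity, social_boost, age_hours, half_life_hours=6.0):
    """Multiply text relevance by user-specific and time-based signals."""
    return text_score * topic_affinity * social_boost * freshness(age_hours, half_life_hours)


# Same article quality and user signals, different ages.
one_hour_old = feed_score(text_score=2.0, topic_affinity=1.4, social_boost=1.2, age_hours=1)
day_old = feed_score(text_score=2.0, topic_affinity=1.4, social_boost=1.2, age_hours=24)
print(f"1h old: {one_hour_old:.3f}  24h old: {day_old:.3f}")
```

The half-life is the main tuning knob: a breaking-news feed would want it short, while a feed of evergreen articles would want it much longer.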

Trade-offs Table

| Scoring Approach | Pros | Cons | Best For |
|---|---|---|---|
| BM25 (default) | No config needed, handles most cases | Ignores business context | General-purpose search |
| Custom BM25 (tuned k1/b) | Better for domain-specific content | Requires experimentation | Short docs, academic, legal |
| Field boosting | Simple, effective | Static weights (can't personalize) | Multi-field search |
| Function scoring | Combines text + business signals | Complex to debug, easy to over-engineer | E-commerce, marketplaces |
| No scoring (filters only) | Fastest, cacheable | No ranking | Faceted navigation, exact match |
| Learning to Rank (LTR) | Best relevance possible | Needs training data, ML pipeline | Large-scale search (Google-level) |

Interview Gotchas

"What's the difference between TF-IDF and BM25?"

BM25 adds two things: term frequency saturation (controlled by k1) and document length normalization (controlled by b). In TF-IDF, a document with 100 occurrences of a term gets 100x the score of a document with 1 occurrence. In BM25, the benefit plateaus after around 5-10 occurrences. BM25 also penalizes documents that are longer than average, so a 10,000-word document doesn't dominate just because it's long.

"When would you NOT use BM25?"

When you don't care about ranking. Filtering queries (find all documents where status = active), exact-match lookups, and aggregations don't need scoring. Use filter context to skip BM25 entirely — it's faster and cacheable.

"How do you handle typos in search?"

Fuzzy matching. ES supports edit-distance-based fuzzy queries ("query": {"fuzzy": {"title": {"value": "headphnes", "fuzziness": 1}}}). This finds "headphones" even though the user typed "headphnes." For autocomplete, edge n-grams work better — index "headphones" as ["he", "hea", "head", ...] and match on prefix.
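
Edit distance itself is a short dynamic program. The sketch below implements plain Levenshtein distance (insertions, deletions, substitutions) to show why "headphnes" is within fuzziness 1 of "headphones". It is illustrative only: Elasticsearch does not scan terms this way, and by default it also counts a swap of two adjacent characters as a single edit, which this version does not.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("headphnes", "headphones"))  # 1
print(levenshtein("kitten", "sitting"))        # 3
```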

"How would you A/B test a ranking change?"

Index the data once. Route a percentage of traffic to query-A (old ranking) and the rest to query-B (new ranking). Measure click-through rate, time to first click, and conversion rate. This is what Etsy and Yelp do for search experiments — same index, different queries.
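
The routing step is usually a deterministic hash of the user ID, so the same user always sees the same ranking. A minimal sketch follows. The experiment name, the 10% split, and the function name are all assumptions.

```python
import hashlib


def assign_variant(user_id, experiment="ranking-v2", treatment_percent=10):
    """Deterministically map a user to 'A' (old ranking) or 'B' (new ranking)."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) % 100  # 0..99
    return "B" if bucket < treatment_percent else "A"


variant = assign_variant("user-12345")
# The application then runs query-A or query-B against the same index
# and logs `variant` alongside clicks and conversions.
print(variant)
```

Including the experiment name in the hash keeps bucket assignments independent across experiments, so a user who lands in treatment for one test is not automatically in treatment for the next.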

"Can you combine Elasticsearch scoring with a machine learning model?"

Yes. Elasticsearch supports Learning to Rank (LTR) via plugins. You train a model (usually gradient-boosted trees) on click-through data, export it as a model definition, and ES evaluates it at query time. Large search teams such as LinkedIn's rely on learned ranking models of this kind, though not necessarily through this plugin. But you need significant click-through data to train the model, so don't suggest it for a greenfield project.

"What happens if you boost too aggressively?"

The boosted field dominates scoring. If you set title^100, a mediocre title match will outrank an excellent body match. This is called "score explosion." Keep boosts relative (title^3, abstract^2, body^1) and test with real queries.

Key Takeaways

| Concept | What to Remember |
|---|---|
| BM25 replaced TF-IDF | Term frequency saturation (k1) and length normalization (b) are the improvements |
| k1 default 1.2 | Controls how fast term frequency benefit plateaus |
| b default 0.75 | Controls how much longer documents are penalized |
| Filter vs query context | Filters skip scoring, run faster, get cached |
| Field boosting | Title ^3, abstract ^2, body ^1; keep it relative |
| Function scoring | Multiply BM25 with business signals (popularity, recency, quality) |
| Explain API | The debugging tool for relevance: shows exact score breakdown per document |