HyperLogLog and Count-Min Sketch

TL;DR

HyperLogLog estimates the number of distinct elements (cardinality) using 12 KB of memory for billions of items with 0.81% standard error. Count-Min Sketch estimates the frequency of elements using a few hundred KB instead of gigabytes. Both are probabilistic: they trade exact answers for dramatically lower memory. Redis supports both natively (PFADD/PFCOUNT for HLL). If someone asks you "how many unique users visited this page today" or "what are the top-K most accessed keys," these are the structures that make the answer feasible at scale.

The Problem

Cardinality Problem

You run a website with 10 billion page views per day. You want to know how many unique visitors you had. Exact answer: store every user ID in a hash set. 10 billion 64-bit IDs = 80 GB. Per metric. Per day. Per page. You cannot afford this.

Frequency Problem

You serve 50 billion API requests per day. You want to know the top 100 most-called endpoints. Exact answer: maintain a counter for every distinct endpoint. If you have millions of distinct endpoints, the hash map grows to gigabytes. For a real-time system, this is too much.

Both problems share the same shape: you have a massive stream of elements, you need aggregate statistics, and you cannot afford to store every element.

The Algorithm: HyperLogLog

Core Intuition

Flip a fair coin repeatedly. If you see a run of 5 heads in a row, you probably flipped the coin about 32 times (2^5). If you see a run of 10 heads, you probably flipped about 1024 times (2^10).

HyperLogLog applies this idea to hash values. Hash each element. Look at the binary representation. Count the number of leading zeros. More leading zeros in the longest run implies more distinct elements.

The Algorithm in Detail

HyperLogLog Register-Based Cardinality Estimation

1. Hash each element to a uniform random bit string.
2. Split the hash into two parts:
   - First p bits: determine which register (bucket) to use.
   - Remaining bits: count the position of the first 1-bit (leading zeros + 1).
3. For each register, keep the MAXIMUM observed count.
4. Estimate cardinality using the harmonic mean of 2^(-register_value) across all registers.

Worked Example

Number of registers: m = 2^p = 2^4 = 16 (p = 4 bits for register index)

Element "alice":
  hash("alice") = 0110 | 00010110...
                  ^^^^   ^^^^^^^^
                  register=6  first 1-bit at position 3 (two leading zeros + 1)
  registers[6] = max(registers[6], 3)

Element "bob":
  hash("bob") = 1010 | 00000011...
                ^^^^   ^^^^^^^^
                register=10  first 1-bit at position 6 (five leading zeros + 1)
  registers[10] = max(registers[10], 6)

Element "alice" again:
  Same hash, same register, same value. No change. (Duplicates are ignored.)

Estimate: harmonic mean across all registers, with correction factors.

The Math

The raw estimate:

E = alpha_m * m^2 * (sum of 2^(-M[j]) for j in 0..m-1)^(-1)

Where alpha_m is a bias correction constant (approximately 0.7213 / (1 + 1.079/m) for large m).

With m = 16384 registers (the standard), each register storing a 6-bit value (max leading zeros = 64 for 64-bit hashes):

Memory: 16384 * 6 bits = 12 KB
Standard error: 1.04 / sqrt(m) = 1.04 / 128 = 0.81%
Can count up to ~2^64 distinct elements

12 KB to estimate the cardinality of billions of items with less than 1% error. This is why HyperLogLog is one of the most impactful data structures in practice.

Corrections for Small and Large Counts

Small range correction: When the estimate is below 5/2 * m and some registers are zero, use LinearCounting (count zero registers) instead.
Large range correction: When the estimate exceeds 2^32 / 30, apply a correction for hash collision effects. (Less relevant with 64-bit hashes.)

Redis Implementation

PFADD visitors "user123"       # Add element
PFADD visitors "user456"
PFCOUNT visitors               # Get estimated cardinality → 2
PFMERGE total day1 day2 day3   # Merge multiple HLLs (union)

Redis uses 12 KB per HyperLogLog. The PFMERGE operation is what makes HyperLogLog powerful for aggregation: you can compute "unique visitors across all pages" by merging per-page HLLs. The merge is just a register-wise max operation.

The Algorithm: Count-Min Sketch

Core Intuition

Count-Min Sketch Multi-Row Hash Structure

Instead of one hash function mapping to one counter, use d hash functions mapping to d arrays of w counters. When you insert, increment the counter at position hash_i(x) % w for each of the d arrays. When you query, return the minimum across all d counters.

The minimum is the key: hash collisions can only increase a counter, never decrease it. Taking the minimum gives the least-inflated estimate.

Structure

d arrays (rows), each with w counters (columns)

        col 0   col 1   col 2   ... col w-1
row 0:  [  0  ] [  0  ] [  0  ] ... [  0  ]
row 1:  [  0  ] [  0  ] [  0  ] ... [  0  ]
...
row d-1:[  0  ] [  0  ] [  0  ] ... [  0  ]

Insert

def insert(x):
    for i in 0..d-1:
        j = hash_i(x) % w
        table[i][j] += 1

Query

def query(x):
    return min(table[i][hash_i(x) % w] for i in 0..d-1)

Sizing

For error tolerance epsilon and failure probability delta:

w = ceil(e / epsilon)       # width (number of counters per row)
d = ceil(ln(1 / delta))    # depth (number of rows)

The estimate is at most epsilon * N higher than the true count (where N is the total number of insertions), with probability at least 1 - delta.

Worked Example

w = 5, d = 3
Total stream: "a", "b", "a", "c", "a", "b", "d", "a"

After all insertions:

        col0  col1  col2  col3  col4
row 0:  [ 1 ] [ 4 ] [ 0 ] [ 2 ] [ 1 ]
row 1:  [ 2 ] [ 1 ] [ 4 ] [ 0 ] [ 1 ]
row 2:  [ 0 ] [ 1 ] [ 2 ] [ 4 ] [ 1 ]

Query "a" (true count = 4):
  row 0: hash_0("a") % 5 = 1 → counter = 4
  row 1: hash_1("a") % 5 = 2 → counter = 4
  row 2: hash_2("a") % 5 = 3 → counter = 4
  Estimate: min(4, 4, 4) = 4. Exact.

Query "d" (true count = 1):
  row 0: hash_0("d") % 5 = 0 → counter = 1
  row 1: hash_1("d") % 5 = 0 → counter = 2  (collision with another element)
  row 2: hash_2("d") % 5 = 4 → counter = 1
  Estimate: min(1, 2, 1) = 1. Exact (the min rescued us from the collision).

Real Sizing: YouTube Top-K

YouTube wants the top 100 most-watched videos. 5 billion views per day across 800 million distinct videos.

Exact approach: HashMap of 800M entries. At ~100 bytes per entry (64-byte key + counter + overhead) = 80 GB RAM.

Count-Min Sketch: epsilon = 0.0001, delta = 0.01.

w = ceil(e / 0.0001) = 27,183
d = ceil(ln(100)) = 5
Memory: 27,183 * 5 * 8 bytes (64-bit counters) = ~1.09 MB

Pair the CMS with a min-heap of size 100. For each view, increment in CMS, check if estimated count exceeds the heap minimum, and insert if so. Total memory: ~1.1 MB instead of 80 GB.

In practice, you might use wider parameters for safety: w = 100,000 and d = 7, using ~5.6 MB. Still trivial compared to 80 GB.

Proof/Correctness Intuition

HyperLogLog: Why Leading Zeros Estimate Cardinality

For a good hash function, each bit is independently 0 or 1 with probability 1/2. The probability of seeing k leading zeros is 1/2^k. If you observe k leading zeros, the expected number of distinct elements hashed is 2^k.

Using multiple registers (stochastic averaging) reduces variance. The harmonic mean is used instead of the arithmetic mean because it is more robust to outliers: a single register with an unusually high value does not dominate the estimate.

Count-Min Sketch: Why Min Works

Each counter tracks the true count of its target element plus the counts of all other elements that hash to the same position (collisions). Collisions only increase the counter. Therefore, every counter is an overestimate. Taking the minimum across d independent hash functions gives the tightest overestimate. The probability that all d rows have a large collision is exponentially small in d.

Real-World Usage

Structure	System	Use Case
HLL	Redis	`PFADD`/`PFCOUNT`: unique visitor counting
HLL	BigQuery	`APPROX_COUNT_DISTINCT`: approximate distinct count
HLL	Presto/Trino	`approx_distinct()` function
HLL	Druid	Pre-computed HLL columns for fast cardinality
CMS	Apache Spark	Frequency estimation in streaming pipelines
CMS	Flink	Approximate frequency in event streams
CMS	Networking	Heavy hitter detection, DDoS mitigation

BigQuery's APPROX_COUNT_DISTINCT is a common practical encounter. It runs 2-10x faster than exact COUNT(DISTINCT ...) and uses far less memory. For dashboards where 1% error is acceptable, it is the right choice.

Interview Application

When to mention HyperLogLog:

"How would you count unique users per page at scale?" -- HLL per page, 12 KB each, merge for aggregates.
"How would you design an analytics dashboard?" -- Pre-compute HLLs per dimension, merge on query.
"How does Redis handle PFCOUNT?" -- HyperLogLog with 16384 registers.

When to mention Count-Min Sketch:

"How would you find the top-K most popular items in a stream?" -- CMS + min-heap.
"How would you detect heavy hitters in network traffic?" -- CMS with threshold alerting.
"How would you rate-limit per-user API calls approximately?" -- CMS as an approximate counter per user ID.

What interviewers want to hear:

You know HLL uses 12 KB for billions of items.
You know CMS always overestimates (never underestimates).
You can size both structures with the formulas.
You know when to use exact counting vs. approximate (dashboards: approximate; billing: exact).

Trade-offs

Aspect	HyperLogLog	Count-Min Sketch
Problem	Cardinality (how many unique?)	Frequency (how often?)
Error type	Relative (~0.81%)	Additive (epsilon * N)
Memory	Fixed 12 KB	Configurable (epsilon, delta)
Merge-able	Yes (register-wise max)	Yes (cell-wise sum)
Element removal	No	No (unless count-min-log)
Redis support	PFADD/PFCOUNT/PFMERGE	Via RedisBloom module

When NOT to Use

HLL: When you need exact counts (billing, compliance). When cardinality is small enough for exact counting (under 100K).
CMS: When you need exact frequency (financial auditing). When the element space is small enough for a HashMap.

Combining With Other Structures

A common pattern: use a Bloom filter to check existence, HLL to estimate cardinality, and CMS to estimate frequency. Together they cover the three fundamental questions about a data stream: "Is X in the stream?", "How many distinct elements?", and "How often does X appear?" -- all in sub-linear memory.

Common Mistakes

HyperLogLog stores the elements

HLL stores only the register values (maximum leading-zero counts). You cannot enumerate which elements were added. You cannot query whether a specific element was added. It only answers "how many distinct elements total."

Count-Min Sketch can underestimate

CMS always overestimates or returns the exact count. It never underestimates. Every collision adds to a counter, never subtracts. This is a guarantee, not a probabilistic statement.

HLL is only useful for very large datasets

Even for millions of items, HLL is valuable when you need to aggregate across many dimensions. If you have 1 million pages and want unique visitors per page, exact counting requires 1M hash sets. HLL requires 1M * 12 KB = 12 GB of fixed-size structures that can be merged, stored, and transmitted efficiently.

CMS with more rows is always better

More rows (d) reduces the probability of all rows having collisions, but each additional row requires computing another hash. Beyond d = 7-10, the marginal improvement is negligible. Width (w) matters more for accuracy; depth (d) matters more for confidence.

You can use HLL for intersection

HLL supports union (merge via register-wise max) but not intersection. You can estimate |A ∩ B| using the inclusion-exclusion formula |A| + |B| - |A ∪ B|, but this estimate has high relative error when the intersection is small compared to the union. Do not rely on it.

These structures are too inaccurate for production

Google, Facebook, and Netflix run these in production for analytics, monitoring, and recommendations. The 0.81% error of HLL is invisible on a dashboard showing "2.3 million unique visitors." The difference between 2,300,000 and 2,318,630 does not change any decision.