Vector Databases — HNSW, IVF, and Similarity Search

TL;DR

Vector databases store high-dimensional vectors (embeddings) and find the nearest neighbors efficiently. Exact KNN is O(n) -- useless at scale. HNSW (Hierarchical Navigable Small World) builds a navigable graph for O(log n) approximate search with the best recall. IVF (Inverted File Index) clusters vectors and searches within relevant clusters for faster build times. Product Quantization compresses vectors to reduce memory. Pinecone, Weaviate, Milvus, and pgvector are the main players. The use cases are semantic search, recommendation engines, and retrieval-augmented generation (RAG) for LLMs. This is one of the most rapidly evolving areas in infrastructure.

The Problem

You have 100 million product descriptions, each converted to a 768-dimensional vector by an embedding model (like sentence-transformers). A user searches "comfortable running shoes for flat feet." You convert the query to a vector and need to find the 10 most similar product vectors.

Exact brute-force: compute the distance from the query vector to all 100 million product vectors. Each distance computation is 768 multiply-and-add operations. Total: 76.8 billion operations per query. At 10 GFLOPS on a single core, that is 7.7 seconds. Unacceptable for a search engine.

Storage: 100 million vectors * 768 dimensions * 4 bytes (float32) = 307 GB. This fits in RAM on a large machine, but the computation time is the bottleneck.

You need an index that trades some accuracy (finding the approximate nearest neighbors instead of exact) for dramatically lower search time.

Background: Embeddings and Similarity

What Are Embeddings?

An embedding model converts unstructured data (text, images, audio) into a fixed-size vector in a high-dimensional space. Similar items are placed close together in this space.

"running shoes"       → [0.12, -0.45, 0.78, ..., 0.33]  (768 dims)
"jogging sneakers"    → [0.11, -0.44, 0.77, ..., 0.34]  (very close)
"medieval armor"      → [-0.88, 0.23, -0.15, ..., 0.67] (far away)

Common embedding models: - Text: OpenAI text-embedding-3-small (1536 dims), sentence-transformers (384-768 dims) - Images: CLIP (512 dims), ResNet (2048 dims) - Multimodal: CLIP encodes both text and images into the same vector space

Distance Metrics

Metric	Formula	When to Use
Cosine	1 - (A . B) / (
Dot Product	A . B	Recommendation, when magnitude matters
L2 (Euclidean)	sqrt(sum((Ai - Bi)^2))	Image similarity, general purpose

For normalized vectors (unit length), cosine similarity and dot product are equivalent. Most text embedding models produce normalized vectors, so cosine and dot product give the same ranking.

The Algorithms

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph where each node is a vector. The top layers are sparse (few nodes, long-range connections). The bottom layer is dense (all nodes, short-range connections). Search starts at the top and greedily navigates down.

Construction:

For each new vector v:
  1. Randomly assign a maximum layer L (geometric distribution, like skip list).
  2. Starting from the entry point at the top layer:
     a. Greedily find the nearest node at this layer.
     b. Drop down to the next layer.
  3. At each layer from L down to 0:
     a. Find the M nearest neighbors of v at this layer.
     b. Connect v to these neighbors (bidirectional edges).

Search:

For query vector q, find K nearest neighbors:
  1. Start at the entry point (top layer).
  2. At each layer (top to bottom):
     a. Greedily move to the node closest to q.
     b. This becomes the entry point for the next layer.
  3. At the bottom layer (layer 0):
     a. Maintain a dynamic candidate list.
     b. Explore neighbors of candidates, expanding the search.
     c. Return the K closest vectors found.

Why it works: The top layers provide long-range "highways" that quickly navigate to the right region of the space. The bottom layer provides fine-grained exploration within that region. This is analogous to a skip list but in high-dimensional space.

Parameters:

Parameter	Meaning	Typical Value
M	Max connections per node per layer	16-64
efConstruction	Candidate list size during build	200-500
efSearch	Candidate list size during query	50-200

Higher M = better recall, more memory, slower build. Higher efSearch = better recall, slower query.

Worked Example: HNSW Search

5 vectors in 2D (simplified):
  A=(1,1), B=(3,1), C=(5,5), D=(7,3), E=(9,9)

Layer 2: [A] ——— [C] ——— [E]
Layer 1: [A] — [B] — [C] — [D] — [E]
Layer 0: [A] — [B] — [C] — [D] — [E]  (fully connected within neighborhoods)

Query: q = (6, 4), find 1 nearest neighbor.

Layer 2: Start at A(1,1). Distance to q: 5.83.
         Neighbor C(5,5). Distance to q: 1.4. Closer. Move to C.
         Neighbor E(9,9). Distance to q: 5.8. Not closer. Stay at C.
         Drop to Layer 1 at C.

Layer 1: At C(5,5). Check neighbors:
         B(3,1): distance 4.2. Not closer.
         D(7,3): distance 1.4. Same as C. Check D's neighbors.
         Move to D(7,3). Distance = 1.4.

Layer 0: At D(7,3). Expand search around D.
         Nearest neighbor is D(7,3) at distance 1.4.
         (In reality, C and D are equidistant; return either.)

Total nodes examined: 5 out of 5. For n = 1 million, you'd examine ~100-200.

IVF (Inverted File Index)

IVF pre-clusters the vectors, then only searches within relevant clusters.

Construction:

1. Run K-means clustering on all vectors → K centroids (typically K = sqrt(n)).
2. Assign each vector to its nearest centroid.
3. Build an inverted list: centroid → list of vectors in that cluster.

Search:

For query vector q:
  1. Find the nprobe nearest centroids to q.
  2. Search all vectors in those nprobe clusters.
  3. Return the K nearest from the searched vectors.

Parameters:

Parameter	Meaning	Typical Value
nlist	Number of clusters (K)	sqrt(n) to 4*sqrt(n)
nprobe	Clusters to search at query time	1-100

Trade-off: Higher nprobe = better recall, slower query. nprobe = 1 is fast but may miss neighbors in adjacent clusters. nprobe = nlist is exact brute-force search.

Product Quantization (PQ)

PQ compresses vectors to reduce memory and speed up distance computation.

Original vector: 768 dimensions, 4 bytes each = 3072 bytes

Product Quantization:
  1. Split 768 dims into 96 sub-vectors of 8 dims each.
  2. For each sub-vector space, learn 256 centroids (codebook) via K-means.
  3. Replace each sub-vector with its nearest centroid's index (1 byte).

Compressed vector: 96 bytes (96 sub-vectors * 1 byte per index)
Compression ratio: 3072 / 96 = 32x

Distance computation uses precomputed distance tables: compute the distance from the query's sub-vectors to all 256 centroids in each codebook (96 * 256 lookups), then sum the table values for each database vector (96 additions per vector).

IVF-PQ: Combine IVF (for coarse search) with PQ (for memory efficiency). This is the workhorse of large-scale vector search. Faiss (Facebook AI Similarity Search) provides highly optimized implementations.

100 million vectors * 768 dims * 4 bytes = 307 GB (raw)
100 million vectors * 96 bytes (PQ)       = 9.6 GB (compressed)

9.6 GB fits comfortably in RAM on a single machine.

Proof/Correctness Intuition

Why HNSW Achieves O(log n) Search

HNSW's layer structure is analogous to a skip list. The top layer has O(log n) nodes, each subsequent layer has geometrically more. The greedy search at the top layer takes O(1) steps (few nodes, long jumps). Each subsequent layer refines the search within a smaller region. The total number of steps across all layers is O(log n).

The "navigable small world" property means that any node can be reached from any other node in O(log n) hops via the graph structure. This is the same property that underlies social networks (six degrees of separation) and is what makes greedy search effective.

Why IVF Has a Recall-Speed Trade-off

K-means clustering partitions the space into Voronoi cells. A vector's nearest neighbor in the database might be in an adjacent cluster, not the same cluster. Searching only nprobe = 1 cluster misses these cross-boundary neighbors. Increasing nprobe recovers them at the cost of searching more vectors.

For uniformly distributed data, the probability of a nearest neighbor being in the correct cluster is approximately nprobe / nlist. For clustered data (which real embeddings tend to be), the probability is higher because similar items cluster together.

Real-World Usage

System	Index Types	Key Feature
Pinecone	Proprietary (likely HNSW)	Fully managed, serverless option
Weaviate	HNSW	Object storage + vector search
Milvus	IVF, HNSW, DiskANN	Open source, GPU support
Qdrant	HNSW	Rust, filtering during search
pgvector	IVF, HNSW	PostgreSQL extension
Faiss	IVF-PQ, HNSW, flat	Library (not a database)
Elasticsearch	HNSW (dense_vector)	Combined text + vector search

pgvector deserves attention for pragmatic reasons. If you already run PostgreSQL, pgvector lets you add vector search without a new database. You can JOIN vectors with relational data, use transactions, and leverage existing PostgreSQL tooling. The trade-off: it is slower than purpose-built vector databases at large scale.

Use Cases

Use Case	How Vectors Are Used
Semantic search	Query and documents as vectors; nearest = most relevant
Recommendation	User and item embeddings; nearest items to user = recommendations
RAG (LLMs)	Chunk documents, embed, retrieve relevant chunks for LLM context
Duplicate detection	Near-duplicate images/text have nearby vectors
Anomaly detection	Normal data forms clusters; anomalies are far from all clusters

Interview Application

When to mention vector databases:

"Design a semantic search engine." -- Embed documents, store in vector DB, query with embedded search terms.
"Design a recommendation system." -- User and item embeddings, approximate nearest neighbor search.
"How does RAG work?" -- Chunk documents, embed chunks, store in vector DB, retrieve relevant chunks for LLM prompt.
"How would you detect duplicate images?" -- Embed images with CLIP, find near neighbors, threshold distance.

What interviewers want to hear:

You know exact KNN is O(n) and impractical at scale.
You can explain HNSW (multi-layer navigable graph) or IVF (cluster then search).
You understand the recall vs. speed trade-off.
You know Product Quantization reduces memory from 300 GB to 10 GB.
You can make the pgvector vs. dedicated vector DB decision.

Decision framework:

< 1 million vectors?              → pgvector (keep it simple)
1-100 million vectors?            → Dedicated vector DB (Qdrant, Weaviate)
> 100 million vectors?            → Milvus or Faiss with sharding
Need relational JOINs with vectors? → pgvector
Need sub-10ms latency?            → HNSW with enough RAM
Memory-constrained?               → IVF-PQ (32x compression)

Trade-offs

HNSW

Aspect	HNSW	IVF	IVF-PQ
Build time	Slow (graph construction)	Medium (K-means)	Medium (K-means + PQ)
Query time	Fast (O(log n))	Depends on nprobe	Fast (compressed distances)
Memory	High (graph + vectors)	High (all vectors in RAM)	Low (compressed vectors)
Recall@10	95-99% (tunable)	80-95% (nprobe dependent)	70-90% (PQ adds error)
Update	Supports insert	Requires rebuild for clusters	Requires rebuild
Best for	Highest recall needed	Balanced build/query	Memory-constrained at scale

The Recall-Latency Trade-off

Every ANN (Approximate Nearest Neighbor) index has a knob that trades recall for latency:

HNSW: efSearch (larger = more nodes explored = higher recall = slower)
IVF: nprobe (more clusters searched = higher recall = slower)
PQ: nbits and number of subspaces (more precise = higher recall = more memory)

There is no free lunch. If someone claims 99% recall with sub-millisecond latency on 1 billion vectors, check their benchmarks carefully.

Common Mistakes

Vector databases replace traditional databases

Vector databases handle similarity search. They do not support transactions, JOINs, constraints, or SQL. You still need a relational database for structured data. The typical architecture has both: PostgreSQL for user data, vector DB for embeddings, linked by ID.

More dimensions are always better

Higher-dimensional embeddings capture more nuance but suffer from the curse of dimensionality: distances between points become more uniform as dimensions increase, making nearest-neighbor search less meaningful. Most practical systems use 256-768 dimensions. Going to 4096 rarely improves quality and significantly increases storage and computation.

Exact nearest neighbor is always better than approximate

In the context of embeddings, the vectors themselves are approximations (the embedding model is not perfect). The difference between the true nearest neighbor and the approximate nearest neighbor is usually smaller than the noise in the embeddings. Spending 1000x more compute for exact KNN rarely changes the end result.

Just use Faiss

Faiss is a library, not a database. It has no persistence, no replication, no access control, no API server. You need to build all of that yourself. For production systems, use a vector database that wraps Faiss (Milvus) or provides similar functionality (Qdrant, Weaviate).

Vector search handles filtering automatically

"Find the 10 nearest restaurants that are open now and have > 4 stars" is not a pure vector search. It requires combining vector similarity with metadata filtering. Naive approaches (filter then search, or search then filter) both have problems. Pre-filtering may exclude the nearest vectors. Post-filtering may not return K results. Modern vector databases support hybrid search (filtered ANN), but the performance characteristics vary.

Embeddings are static

When you retrain or update your embedding model, all existing vectors must be re-embedded. The old vectors and new vectors are in different spaces and cannot be compared. This re-indexing cost is significant at scale and must be planned for.