LSM Trees vs B-Trees — Write-Optimized vs Read-Optimized

TL;DR

B-trees update data in place. LSM trees append data and sort it later. B-trees are read-optimized: one disk seek to find any key. LSM trees are write-optimized: sequential writes are 100x faster than random writes on SSDs and 1000x faster on HDDs. PostgreSQL and MySQL use B-trees. Cassandra, RocksDB, and LevelDB use LSM trees. The choice comes down to your read/write ratio: if writes dominate, use LSM; if reads dominate, use B-trees. Every database you will ever discuss in a system design interview uses one of these two.

The Problem

Disks -- even SSDs -- have a fundamental asymmetry: sequential writes are vastly faster than random writes. An SSD can do ~500K random writes per second, but ~2M sequential writes per second. On an HDD, the gap is even larger: ~200 random writes/sec vs ~100K sequential writes/sec.

A B-tree does random writes. To update a key, it reads the page containing the key, modifies it, and writes the page back. Each update touches a different page at a different location on disk. This is fine for read-heavy workloads: the B-tree structure ensures any key can be found in O(log n) page reads. But for write-heavy workloads, the random I/O becomes a bottleneck.

LSM trees flip the trade-off: all writes go to an in-memory buffer and are flushed to disk as large sequential writes. Reads become more expensive because data might be spread across multiple files on disk. The design question is which trade-off matches your workload.

The Algorithm: B-Trees

Structure

A B-tree is a self-balancing tree where each node contains multiple keys and child pointers, sized to fit a single disk page (typically 4-16 KB).

                        [30 | 60]
                       /    |    \
              [10 | 20]  [40 | 50]  [70 | 80 | 90]
              / | \      / | \       /  |  |   \
            leaf nodes (actual data or pointers to data)

Properties: - Each node has at most B keys (B = branching factor, typically 100-500 for 4KB pages). - Each node has at least B/2 keys (except root). - All leaves are at the same depth. - A B-tree with n keys has depth O(log_B(n)).

Practical Depth

With B = 500 and n = 1 billion:

depth = log_500(1,000,000,000) ≈ 3.3 → 4 levels

Four levels means any key can be found with at most 4 disk reads. In practice, the top 2-3 levels are cached in memory, so most lookups require 1-2 disk reads. This is why B-trees have dominated database indexing for 50 years.

Operations

Search: Start at root, binary search within node, follow child pointer, repeat. O(log_B(n)) page reads.

Insert: Search for the correct leaf, insert key. If the leaf is full, split it and push the middle key up. Splits can cascade to the root. O(log_B(n)) page reads + O(log_B(n)) page writes in worst case.

Update: Search for the key, modify the value in the page, write the page back. O(log_B(n)) page reads + 1 page write.

Write-Ahead Log (WAL)

B-tree updates are in-place. If the system crashes during a page write, the page could be corrupted. To prevent this, B-tree databases use a write-ahead log (WAL): before modifying a page, write the intended change to a sequential log file. If the system crashes, replay the WAL to recover.

This means every update results in at least two writes: one to the WAL (sequential) and one to the B-tree page (random). This is the write amplification cost of B-trees.

The Algorithm: LSM Trees

Structure

An LSM tree has two components:

MemTable: An in-memory sorted structure (typically a skip list or red-black tree). All writes go here first.
SSTables (Sorted String Tables): Immutable, sorted files on disk. When the MemTable is full, it is flushed to disk as a new SSTable.

Write Path

1. Client writes key-value pair.
2. Append to WAL (for crash recovery).
3. Insert into MemTable (in-memory, fast).
4. When MemTable reaches threshold (e.g., 64 MB):
   a. Convert MemTable to immutable.
   b. Start a new empty MemTable for incoming writes.
   c. Flush immutable MemTable to disk as an SSTable (one large sequential write).

The key insight: writes never touch existing files on disk. They always create new files. This makes writes purely sequential.

Read Path

1. Check MemTable (most recent data).
2. Check each SSTable from newest to oldest.
3. Return the first match found.

Without optimization, reads are expensive: you might check dozens of SSTables. Three optimizations make reads practical:

Bloom filter per SSTable: Skip SSTables that definitely do not contain the key. (Covered in the Bloom filter lesson.)
Sparse index: An in-memory index sampling every Nth key in each SSTable. Binary search the index, then scan a small range in the SSTable.
Compaction: Merge SSTables to reduce the number of files.

Compaction

SSTables accumulate over time. Old values and deleted keys (marked with tombstones) waste space and slow reads. Compaction merges multiple SSTables into one, discarding old values and tombstones.

Size-Tiered Compaction (Cassandra default):

Group SSTables by size tier.
When a tier has enough SSTables (e.g., 4), merge them into one larger SSTable.

Tier 0: [4MB] [4MB] [4MB] [4MB]  → merge → [16MB] in Tier 1
Tier 1: [16MB] [16MB] [16MB] [16MB] → merge → [64MB] in Tier 2

Pros: High write throughput, simple. Cons: Temporary space amplification (up to 2x during compaction), reads may check many SSTables.

Leveled Compaction (RocksDB default):

Level 0: Recent SSTables (unordered, overlapping key ranges).
Level 1: Non-overlapping SSTables, total size ~10 MB.
Level 2: Non-overlapping SSTables, total size ~100 MB.
Level 3: Non-overlapping SSTables, total size ~1 GB.
...each level is 10x the previous.

When Level L exceeds its size limit, one SSTable from Level L is merged with overlapping SSTables in Level L+1.

Pros: Reads check at most one SSTable per level (non-overlapping ranges). Bounded space amplification. Cons: Higher write amplification (each byte may be rewritten 10-30x across levels).

Worked Example: Write Then Read

LSM tree with 64 MB MemTable threshold.

Write operations (fast, all in-memory):
  PUT("user:1", "Alice")  → MemTable
  PUT("user:2", "Bob")    → MemTable
  PUT("user:1", "Alicia") → MemTable (overwrites previous value IN MEMORY)
  ...64 MB of writes...

MemTable flush:
  MemTable → SSTable-001 on disk (one sequential write, ~64 MB)
  New empty MemTable created.

More writes:
  PUT("user:3", "Charlie") → MemTable
  PUT("user:1", "Alex")    → MemTable (overwrites again)
  ...64 MB more...

MemTable flush:
  MemTable → SSTable-002 on disk.

Read("user:1"):
  1. Check MemTable. Not found (already flushed).
  2. Check SSTable-002 (newest). Found "user:1" = "Alex". Return.
  (SSTable-001 is never checked because we found it in 002.)

Read("user:2"):
  1. Check MemTable. Not found.
  2. Check SSTable-002. Bloom filter says NO. Skip.
  3. Check SSTable-001. Bloom filter says MAYBE. Read. Found "Bob". Return.

Proof/Correctness Intuition

Why LSM Writes Are Faster

All LSM writes are sequential: WAL append + eventual MemTable flush as a contiguous SSTable. No random I/O. On an SSD, sequential writes are 3-5x faster than random writes. On an HDD, 100-1000x faster. The MemTable absorbs the cost of sorting: inserts into a skip list or red-black tree are O(log n) in memory, which is nanoseconds compared to the milliseconds of a disk seek.

Why B-Tree Reads Are Faster

A B-tree stores each key in exactly one place. One lookup, one result. An LSM tree might have the same key in the MemTable, three SSTables, and a compaction buffer. The read path must check each one (newest first) and return the first match. Even with Bloom filters and sparse indices, an LSM read does more work than a B-tree read.

Real-World Usage

Engine	Structure	Used By
InnoDB	B-tree	MySQL, MariaDB, Percona Server
PostgreSQL	B-tree	PostgreSQL (default index type)
LevelDB	LSM	Bitcoin Core (block index), Chrome
RocksDB	LSM	CockroachDB, TiKV, YugabyteDB, MySQL (MyRocks)
Cassandra	LSM	Cassandra (SSTable engine)
HBase	LSM	Hadoop ecosystem, time-series workloads
WiredTiger	Both	MongoDB (default since 3.2, uses B-tree)
SQLite	B-tree	Embedded databases, mobile apps

RocksDB deserves special attention. Facebook built it as an optimized fork of LevelDB, and it has become the de facto embedded LSM engine. CockroachDB, TiKV, and YugabyteDB all use RocksDB as their storage layer. Understanding RocksDB's tuning knobs (write buffer size, compaction strategy, bloom filter bits per key) is directly applicable to these databases.

Interview Application

When to mention LSM vs B-tree:

"Design a time-series database." -- LSM tree. Writes are append-only, time-ordered. Perfect match.
"Design a chat messaging system." -- LSM tree for message storage (write-heavy), B-tree index for user lookups (read-heavy).
"Why does Cassandra perform well for writes but slower for reads?" -- LSM tree. Writes go to MemTable (fast). Reads may check multiple SSTables (slower).
"How does PostgreSQL index data?" -- B-tree. One lookup per read.

The decision framework:

Write-heavy workload (>80% writes)?  → LSM tree (Cassandra, RocksDB)
Read-heavy workload (>80% reads)?    → B-tree (PostgreSQL, MySQL)
Mixed workload?                       → B-tree (simpler, more predictable)
Time-series / append-only?            → LSM tree (natural fit)
Need strong transactional guarantees? → B-tree (simpler crash recovery)

What interviewers want to hear:

You understand the write amplification trade-off.
You know why sequential writes are faster than random writes.
You can name specific databases and their storage engines.
You can reason about when each is the right choice.

Trade-offs

Btree Page Split

The Three Amplifications

Every storage engine is a trade-off between three types of amplification:

Amplification	B-Tree	LSM Tree
Write	Low (1 WAL + 1 page write)	High (data rewritten during compaction, 10-30x)
Read	Low (1 page read per level)	Medium-High (check multiple SSTables)
Space	Low (1 copy per key)	Medium (temporary duplicates during compaction)

You cannot minimize all three simultaneously. This is a fundamental law of storage engine design.

Detailed Comparison

Aspect	B-Tree	LSM Tree
Write pattern	Random (in-place update)	Sequential (append + flush)
Read pattern	One lookup path	Multiple files to check
Write latency	Higher (disk seek)	Lower (memory + WAL)
Read latency	Lower (single path)	Higher (multiple SSTables)
Space efficiency	Better (no duplicates)	Worse (temporary duplicates)
Crash recovery	WAL replay	WAL replay + SSTable integrity
Compaction CPU cost	None	Significant (background merges)
Predictable latency	Yes	No (compaction stalls)

The Compaction Problem

LSM trees have a dirty secret: compaction. When the system is under heavy write load, compaction may fall behind. Uncompacted SSTables accumulate, reads slow down, and space usage grows. When compaction finally runs, it competes with foreground operations for disk I/O and CPU.

This manifests as latency spikes. The p99 latency of an LSM-based database can be 10-100x higher than the p50 during compaction. B-trees have more predictable latency because there is no background process competing for I/O.

Common Mistakes

LSM trees are always faster than B-trees

LSM trees have faster writes. B-trees have faster reads. The overall performance depends on your read/write ratio. For a read-heavy OLTP workload (like a banking system), a B-tree will outperform an LSM tree.

Write amplification only matters for SSDs

Write amplification matters for SSDs (limited write endurance) AND for throughput (every extra write consumes I/O bandwidth). On SSDs, excessive write amplification shortens the device's lifespan. On HDDs, it wastes the already-limited random I/O budget.

Compaction is free

Compaction consumes disk I/O, CPU, and temporary disk space. During compaction, the system reads multiple SSTables, merge-sorts them, and writes the result. This can saturate disk bandwidth and cause latency spikes for foreground operations. RocksDB has extensive tuning options for compaction precisely because it is a major operational concern.

LSM trees cannot support range queries

They can, but it is more expensive. A B-tree range query scans leaf pages sequentially (adjacent on disk). An LSM range query must merge results from the MemTable and all SSTables. With leveled compaction (non-overlapping key ranges per level), this is manageable. With size-tiered compaction, range queries can be slow.

Just pick RocksDB for everything

RocksDB is an embedded storage engine, not a database. It provides no network protocol, no query language, no replication. CockroachDB, TiKV, and YugabyteDB add those layers on top. Picking RocksDB means you are also committing to building or using all the layers above it.

Tombstones are deleted immediately

In an LSM tree, a delete writes a tombstone marker. The actual data is only removed during compaction when the tombstone and the original entry are in the same merge. Until then, both exist on disk. If compaction is slow, deleted data persists for a long time, consuming space and potentially appearing in range scans (though the read path filters it out).