Log-Structured Storage — Why Kafka Is Fast

TL;DR

Kafka is fast because it does less: append-only writes, sequential disk I/O, zero-copy transfers, the OS page cache instead of the JVM heap, and batching at every layer. There is no single magic trick.

What It Is


Kafka is a distributed commit log. Not a message queue (though people use it as one). Not a database (though people store data in it). A log.

A log is an ordered, append-only sequence of records. New records go at the end. Old records stay where they are. You never update in place. You never delete individual records. You read by position (offset).
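The definition above fits in a few lines of Python. This is a minimal sketch, assuming an in-memory list stands in for the on-disk file; the class and method names are illustrative and are not Kafka's API.

```python
class AppendOnlyLog:
    """Toy in-memory log: records are only appended and are read by offset."""

    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Append a record at the end and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._records[offset:offset + max_records]

    @property
    def end_offset(self) -> int:
        """Offset that the next appended record will receive."""
        return len(self._records)
```

There is deliberately no update or delete method: the only ways in and out are append and read-by-offset.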

That's it. This simplicity is what makes Kafka absurdly fast.

LinkedIn built Kafka in 2010 because their existing message queue couldn't handle the volume. They needed to process hundreds of thousands of messages per second with sub-second latency. Every commercial message broker they tried collapsed under the load. So they built one that treats messages as entries in a log file. It worked.

Today, a single Kafka broker routinely handles 100K+ messages per second. Large clusters at LinkedIn, Uber, and Netflix push millions per second. The architecture hasn't fundamentally changed since 2010. That tells you the original design was sound.

The Log: Append-Only, Immutable

Every Kafka topic is divided into partitions. Each partition is a log. Each log is a sequence of segments on disk.

Partition 0:
  Segment 0:  [offset 0] [offset 1] [offset 2] ... [offset 999]
  Segment 1:  [offset 1000] [offset 1001] ... [offset 1999]
  Segment 2:  [offset 2000] [offset 2001] ...  (active segment)

Segment Files

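Each segment is a small group of files that share a base name. Two of them matter here (a third, .timeindex, maps timestamps to offsets):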

.log file: Contains the actual message data, appended sequentially.
.index file: Maps offsets to physical positions in the .log file.
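Kafka's offset index is sparse: it does not hold an entry for every record, so a lookup finds the nearest indexed offset at or below the target and scans forward from there. A rough sketch of that lookup, assuming the index is already loaded as a sorted list of (offset, byte_position) pairs (the function name and in-memory layout are made up for illustration):

```python
import bisect

def lookup_start_position(index_entries, target_offset):
    """Find where to start scanning the .log file for target_offset.

    index_entries is a sorted list of (offset, byte_position) pairs.
    Returns the byte position of the closest indexed offset that is
    <= target_offset; a reader would scan forward from that position.
    """
    offsets = [offset for offset, _ in index_entries]
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        return 0  # target precedes the first index entry: scan from the start
    return index_entries[i][1]
```

The binary search keeps the lookup cheap even for a large segment, and the short forward scan is a sequential read.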

The active segment receives new writes. When it reaches a size threshold (default: 1GB) or time threshold (default: 7 days), Kafka rolls to a new segment. Old segments are immutable.
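The roll decision can be sketched as a simple predicate. The thresholds are the defaults quoted above (in Kafka they correspond to log.segment.bytes and log.roll.hours); the function itself is a simplified assumption, not Kafka's implementation.

```python
import time

SEGMENT_BYTES = 1 * 1024**3        # size threshold from the text: 1 GB
ROLL_SECONDS = 7 * 24 * 60 * 60    # time threshold from the text: 7 days

def should_roll(segment_size, segment_created_at, incoming_size, now=None):
    """Return True if the active segment should be closed before this append.

    segment_size and incoming_size are in bytes; segment_created_at and now
    are Unix timestamps in seconds.
    """
    now = time.time() if now is None else now
    too_big = segment_size + incoming_size > SEGMENT_BYTES
    too_old = now - segment_created_at >= ROLL_SECONDS
    return too_big or too_old
```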

Why Append-Only Is Fast

Traditional databases do random writes. Update row 47, then row 1,000,003, then row 12. The disk head jumps around. Even on SSDs, random writes are slower than sequential writes because of write amplification and garbage collection.

Kafka only appends. Every write goes to the end of the file. On spinning disks, sequential writes can be on the order of 100x faster than random writes because the disk head barely moves. On SSDs the gap is smaller, often quoted as 2-4x, because sequential writes minimize write amplification.

Spicy take: Kafka's performance comes from not being clever. No B-trees. No write-ahead logs with checkpoints. No buffer pool management. Just append to a file and let the OS handle the rest. Most database engineers would call this design naive. They'd be wrong.

Zero-Copy: The sendfile() Trick

When a consumer reads data from Kafka, here's what a naive implementation does:

1. Read data from disk into kernel buffer      (disk → kernel)
2. Copy from kernel buffer to application      (kernel → user space)
3. Copy from application to socket buffer      (user space → kernel)
4. Send from socket buffer to network          (kernel → NIC)

Four copies. Four context switches between user space and the kernel (two system calls, a read and a write, each crossing into the kernel and back). Slow.

Kafka uses sendfile() (or transferTo() in Java), which does this:

1. Read data from disk into kernel buffer      (disk → kernel)
2. Send directly from kernel buffer to NIC     (kernel → NIC)

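Two copies, with one extra in-kernel copy if the NIC can't gather data directly from the page cache. One system call. Zero user-space involvement. The data never touches Kafka's JVM process. It goes straight from the OS page cache to the network card.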

This is called zero-copy and it's available on Linux and most Unix systems. It's not Kafka-specific -- any application can use sendfile(). But Kafka's architecture is designed to make zero-copy possible. Messages are stored on disk in the same format they're sent over the network. No serialization, no transformation, no per-message processing. Just raw bytes from disk to wire.
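Python exposes the same system call, which makes the idea easy to try. The sketch below is only an illustration of sendfile(), not Kafka's code (Kafka is Java and uses transferTo()). The function names, the segment file name, and the socketpair standing in for a broker-to-consumer connection are assumptions.

```python
import os
import socket
import tempfile

def send_file_range(sock, path, offset=0, count=None):
    """Send bytes from a file to a connected stream socket.

    socket.sendfile() uses os.sendfile() where the platform provides it, so
    the bytes are not read into a Python buffer; elsewhere it falls back to
    plain send().
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=count)

def demo():
    payload = b"raw-record-bytes" * 64   # 1 KiB, small enough for socket buffers
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "00000000000000000000.log")
        with open(path, "wb") as f:
            f.write(payload)
        broker_side, consumer_side = socket.socketpair()
        with broker_side, consumer_side:
            sent = send_file_range(broker_side, path)
            broker_side.shutdown(socket.SHUT_WR)   # signal end of stream
            received = b""
            while chunk := consumer_side.recv(4096):
                received += chunk
    return sent, payload, received
```

The sending side never holds the payload in a variable: it passes a file and a socket to the kernel and gets back a byte count.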

Why This Matters at Scale

A Kafka broker serving 10 consumers reading the same topic at 50MB/s each would normally need to copy 500MB/s through the JVM heap. With zero-copy, the JVM does almost nothing. The OS handles it. Kafka brokers are often CPU-idle even at high throughput. The bottleneck moves to disk I/O and network bandwidth -- exactly where it should be.

Page Cache: Let the OS Do Its Job

Most databases manage their own memory. PostgreSQL has shared buffers. MySQL has the InnoDB buffer pool. They allocate a fixed chunk of RAM and manage caching themselves.

Kafka doesn't do this. It writes to files and relies entirely on the OS page cache.

How Page Cache Works

When Kafka writes a message to a segment file:

The write goes to the OS page cache (RAM).
The OS flushes dirty pages to disk asynchronously.
When a consumer reads that message, if it's still in the page cache, the read is served from RAM. No disk I/O.

For real-time consumers (reading messages seconds after they're produced), the data is almost always in the page cache. The producer writes it, the page cache holds it, the consumer reads it from memory. In that case the read path never touches disk.

Why Not Use JVM Heap?

Two reasons:

Garbage collection. A 32GB JVM heap with millions of cached messages means GC pauses. Long GC pauses mean the broker stops responding. Other brokers think it's dead and trigger partition reassignment. Cascading failure.

Double caching. If Kafka cached messages in the JVM heap AND the OS cached the same data in the page cache, you'd store everything twice. Wasted RAM.

By using only page cache, Kafka keeps message data out of the garbage collector's reach. The JVM heap stays small (4-8GB is typical). The rest of the machine's RAM is available for page cache. A broker with 64GB RAM might have 6GB for JVM and 58GB for page cache. That 58GB acts as a transparent read cache.

This is also why a broker restart doesn't cost you the cache. The page cache belongs to the OS, not to the Kafka process, so it survives a process restart (though not a machine reboot). The data is already in RAM, and the broker can serve reads from it as soon as it is back up.

Batching: Amortize Everything

Kafka batches at every layer. This is where the throughput numbers come from.

Producer Batching

The producer collects messages in a buffer before sending them to the broker. Two settings control this:

# Send when batch reaches this size (bytes)
batch.size=16384

# Or when this much time passes (milliseconds)
linger.ms=5

With linger.ms=5, the producer waits up to 5 milliseconds to fill a batch before sending. This converts many small network calls into one large one. The overhead of a network round trip is amortized across hundreds or thousands of messages.
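The size-or-time rule can be sketched as a small accumulator. This is a toy model under stated assumptions: a single partition, no compression, and a caller that invokes poll() periodically. The class and its methods are hypothetical; only batch_size and linger_ms mirror the settings above.

```python
import time

class BatchAccumulator:
    """Toy accumulator: flush when the batch is full or has waited long enough."""

    def __init__(self, batch_size=16384, linger_ms=5, clock=time.monotonic):
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000.0
        self._clock = clock
        self._records = []
        self._bytes = 0
        self._first_append_at = None

    def append(self, record: bytes):
        """Add a record; return the batch if it just became full, else None."""
        if self._first_append_at is None:
            self._first_append_at = self._clock()
        self._records.append(record)
        self._bytes += len(record)
        return self._drain() if self._bytes >= self.batch_size else None

    def poll(self):
        """Return the pending batch if the linger time has elapsed, else None."""
        if self._records and self._clock() - self._first_append_at >= self.linger_s:
            return self._drain()
        return None

    def _drain(self):
        batch, self._records = self._records, []
        self._bytes = 0
        self._first_append_at = None
        return batch
```

Whichever condition fires first wins, so a busy producer sends full batches and a quiet one never holds a record longer than the linger time.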

Broker Batching

The broker writes batches to disk as a unit. One disk write for a batch of 1000 messages vs 1000 separate writes. Orders of magnitude difference.

Important nuance: By default, Kafka does NOT call fsync per batch -- it relies on replication for durability and lets the OS flush page cache to disk asynchronously. You can configure log.flush.interval.messages to force fsync, but most production deployments leave this disabled.
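The difference between handing bytes to the OS and forcing them to disk can be sketched as follows. This is an illustrative model, not broker code. SegmentWriter and its flush_interval_messages parameter are hypothetical names, loosely modelled on the setting mentioned above, and None means "never fsync explicitly".

```python
import os
import tempfile

class SegmentWriter:
    """Append whole batches to a file; optionally fsync every N messages."""

    def __init__(self, path, flush_interval_messages=None):
        self._file = open(path, "ab")
        self._flush_interval = flush_interval_messages
        self._unflushed = 0
        self.fsync_calls = 0

    def append_batch(self, records):
        # One buffered write for the whole batch rather than one per record.
        self._file.write(b"".join(records))
        # Push Python's buffer to the OS page cache; not yet forced to disk.
        self._file.flush()
        self._unflushed += len(records)
        if self._flush_interval is not None and self._unflushed >= self._flush_interval:
            os.fsync(self._file.fileno())   # ask the OS to write dirty pages out
            self.fsync_calls += 1
            self._unflushed = 0

    def close(self):
        self._file.close()

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment.log")
        writer = SegmentWriter(path, flush_interval_messages=3)
        writer.append_batch([b"a", b"b"])   # 2 unflushed, below the threshold
        writer.append_batch([b"c", b"d"])   # 4 unflushed, triggers one fsync
        writer.close()
        with open(path, "rb") as f:
            return writer.fsync_calls, f.read()
```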

Consumer Batching

The consumer fetches in batches too:

# Minimum bytes to fetch (wait until this much data is available)
fetch.min.bytes=1

# Maximum time to wait for fetch.min.bytes
fetch.max.wait.ms=500

# Maximum bytes per partition per fetch
max.partition.fetch.bytes=1048576

One network round trip returns many messages. The consumer processes them in memory without additional I/O.
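How the three settings interact can be sketched as a single decision: answer now, or keep the fetch request waiting. This is a simplified single-partition model and the function is hypothetical; in particular, it treats max.partition.fetch.bytes as a hard cap, which is an assumption of the sketch.

```python
def fetch_response_size(available_bytes, waited_ms,
                        fetch_min_bytes=1,
                        fetch_max_wait_ms=500,
                        max_partition_fetch_bytes=1048576):
    """Return how many bytes to send now, or None to keep waiting.

    The request is answered once enough data is available or the wait
    limit is reached, whichever comes first.
    """
    if available_bytes < fetch_min_bytes and waited_ms < fetch_max_wait_ms:
        return None
    return min(available_bytes, max_partition_fetch_bytes)
```

Raising fetch_min_bytes trades latency for fewer, fuller responses, which is the same trade the producer makes with linger.ms.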

The Batching Trade-Off

More batching = higher throughput, higher latency. Less batching = lower throughput, lower latency.

For logging and analytics, set linger.ms=100 and batch aggressively. Nobody cares if a log line is 100ms late.

For real-time payments, set linger.ms=0 and accept lower throughput. Latency matters more.

Uber tunes these settings differently for their pricing pipeline (low latency) vs their analytics pipeline (high throughput). Same Kafka cluster, different producer configurations.

Compression: Shrink the Batch

Kafka compresses at the batch level, not per message. This gives much better compression ratios because similar messages share patterns.
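The effect is easy to reproduce with the standard library. In this sketch zlib (DEFLATE, the algorithm behind gzip) stands in for Kafka's codecs, and the click events are made-up sample data; the exact sizes will differ with real payloads.

```python
import json
import zlib

def compressed_sizes(messages):
    """Return (per_message_total, whole_batch) compressed sizes in bytes."""
    per_message = sum(len(zlib.compress(m)) for m in messages)
    whole_batch = len(zlib.compress(b"".join(messages)))
    return per_message, whole_batch

def sample_messages(n=200):
    """Hypothetical click events that share field names and most values."""
    return [
        json.dumps({"event": "click", "page": "/home", "user_id": i}).encode()
        for i in range(n)
    ]
```

Calling compressed_sizes(sample_messages()) shows the batch compressing far better than the sum of individually compressed messages, because the compressor can reuse the repeated field names across records.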

Algorithm	Compression Ratio	CPU Cost	Best For
`none`	1:1	Zero	CPU-bound workloads
`snappy`	~2:1	Very low	Common choice when low latency matters
`lz4`	~2.5:1	Low	Better ratio than snappy, still fast
`zstd`	~3:1	Medium	Best ratio, worth it for high-volume topics
`gzip`	~3:1	High	Avoid -- slow compression, similar ratio to zstd

How Compression Flows

Producer: compress batch → send compressed batch to broker
Broker:   store compressed batch as-is (no decompression!)
Consumer: fetch compressed batch → decompress

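In the common case the broker does not re-encode anything: it stores the batch as the producer compressed it and serves those same bytes to consumers. (The exception is a topic whose compression.type is set to a codec different from the producer's, which forces the broker to recompress.) This saves disk space AND network bandwidth AND disk I/O. The CPU cost of compression and decompression is paid by producers and consumers, not brokers.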

Spicy take: If you're running Kafka without compression enabled, you're wasting 50-70% of your disk and network capacity. There's almost no reason to use compression.type=none in production. Set it to lz4 and move on.

Putting It All Together: Why Kafka Is Fast

No single trick makes Kafka fast. It's the combination:

Sequential writes (append-only log)
   + Zero-copy (sendfile, no JVM involvement)
   + Page cache (OS handles caching, no GC)
   + Batching (amortize network + disk overhead)
   + Compression (reduce I/O at batch level)
   = Throughput that embarrasses traditional message brokers

Benchmark Context

Rough numbers for a single Kafka broker (3x replication, modern hardware):

Metric	Value
Write throughput	100K-200K messages/sec
Read throughput	300K+ messages/sec (serving from page cache)
Data throughput	100-200 MB/sec writes, 300+ MB/sec reads
Latency (p99)	5-15ms (end-to-end, producer to consumer)
Latency with `acks=all`	15-30ms (waits for replication)

These numbers depend on message size, replication factor, compression, hardware, and network. But they're in the right ballpark for interview discussions.

RabbitMQ, by comparison, is usually quoted at roughly 20K-50K messages/sec on similar hardware, and SQS is typically slower still. Against the figures above, that puts Kafka somewhere between 2x and 10x ahead, because of the architectural decisions described above.

Patterns for System Design Interviews

Pattern 1: High-Throughput Event Ingestion

"We use Kafka as the entry point for all events -- clicks, page views, API calls. Producers batch with linger.ms=50 and compression.type=lz4. A single Kafka cluster handles 500K events/sec. Consumers read from Kafka and write to different sinks: Elasticsearch for search, S3 for archival, Flink for real-time aggregation."

Pattern 2: Buffer Between Microservices

"Kafka decouples the order service from inventory, billing, and notification services. If the billing service goes down, messages accumulate in Kafka. When it recovers, it processes the backlog. No data loss. No retry logic in the order service."

Pattern 3: Replay and Reprocessing

"Because Kafka retains messages (configurable retention, default 7 days), we can replay events when we deploy a new consumer version. Reset the consumer offset to midnight, reprocess the day's events, compare output with the old version. This is impossible with traditional queues that delete messages after consumption."

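The replay pattern comes down to "find the first offset at or after a timestamp, then read forward". A minimal sketch, assuming records arrive with non-decreasing timestamps; the class and function names are illustrative. (The Java consumer offers offsetsForTimes() and seek() for the same purpose.)

```python
import bisect

class TimestampedLog:
    """Toy partition log that keeps (timestamp_ms, value) records in order."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp_ms, value):
        self._timestamps.append(timestamp_ms)
        self._values.append(value)

    def offset_for_time(self, timestamp_ms):
        """Earliest offset whose timestamp is >= timestamp_ms."""
        return bisect.bisect_left(self._timestamps, timestamp_ms)

    def read_from(self, offset):
        return self._values[offset:]

def replay(log, since_ms, process):
    """Re-run process over every record at or after since_ms."""
    start = log.offset_for_time(since_ms)
    return [process(value) for value in log.read_from(start)]
```

Running replay() twice with the old and new version of process gives the two outputs to compare, and the log itself is untouched by either run.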

Trade-Offs Table

Factor	Kafka	Traditional Message Queue (RabbitMQ/SQS)
Throughput	100K+ msg/sec per broker	10K-50K msg/sec
Latency	5-15ms p99	1-5ms p99 (lower for small messages)
Message retention	Configurable (hours to forever)	Until consumed (then deleted)
Replay	Yes (seek to any offset)	No (message gone after ack)
Ordering	Per-partition guaranteed	Per-queue (single consumer)
Consumer model	Pull (consumer controls pace)	Push (RabbitMQ) or polling (SQS)
Operational complexity	High (ZooKeeper/KRaft, brokers, topics)	Low (managed service, simple config)
Message routing	Topic + partition key	Exchange + routing key + bindings
Priority queues	Not supported	Supported (RabbitMQ)
Dead letter queues	Manual (write to DLQ topic)	Built-in (RabbitMQ, SQS)

Interview Gotchas

"Why not just use a database?"

A database is optimized for random reads and writes with indexes. Kafka is optimized for sequential writes and sequential reads. If you wrote events to a PostgreSQL table, you'd need B-tree index maintenance on every insert, random reads for different consumers at different positions, and manual cleanup of old events. Kafka's log structure avoids all of this.

"Why not use Kafka for everything?"

Kafka excels at streaming data between systems. It's terrible for request-reply patterns (use HTTP or gRPC), priority queues (use RabbitMQ), or small-scale messaging (use SQS -- zero ops overhead). Using Kafka to send 10 messages per hour is like using a semi-truck to deliver a pizza.

"What happens when a broker dies?"

Partitions are replicated across brokers. If a broker dies, the controller detects it (via ZooKeeper or KRaft) and elects new leaders for the partitions that were on the dead broker. Producers and consumers reconnect to the new leaders. Data is not lost because replicas have copies. Recovery is automatic. Downtime is typically seconds.
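The failover step can be sketched as a pure function over partition state. This is a deliberately simplified model: it promotes the first surviving in-sync replica, whereas the real controller also consults the assigned replica order and other settings. The data layout and function name are assumptions.

```python
def elect_leaders(partitions, dead_broker):
    """Return new partition state after dead_broker is removed.

    partitions maps a partition name to {"leader": broker_id, "isr": [ids]}.
    A partition that loses its leader gets the first surviving in-sync
    replica; if none survives, the leader is None (partition offline).
    """
    result = {}
    for name, state in partitions.items():
        isr = [b for b in state["isr"] if b != dead_broker]
        leader = state["leader"]
        if leader == dead_broker:
            leader = isr[0] if isr else None
        result[name] = {"leader": leader, "isr": isr}
    return result
```

Partitions led by other brokers keep their leader and only shrink their in-sync set, which is why a single broker failure usually disturbs just a fraction of the traffic.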

"Does Kafka guarantee no message loss?"

With acks=all (the leader waits for all in-sync replicas to acknowledge), replication.factor=3, and min.insync.replicas=2, every acknowledged message is on at least 2 replicas, so it survives the loss of any single broker (assuming unclean leader election stays disabled). With acks=1, the leader can die after acknowledging but before replicating -- message lost. With acks=0, the producer doesn't wait at all -- fastest, but no guarantees.
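The three acks levels can be compared by asking how many replicas are guaranteed to hold a record at the moment the producer gets its acknowledgement. This is a worst-case sketch with hypothetical function names: it assumes the brokers that fail are exactly the ones holding the record and that no replication catches up in between.

```python
def copies_at_ack(acks, in_sync_replicas, min_insync_replicas=2):
    """Minimum number of replicas holding a record when it is acknowledged.

    Returns 0 for acks="0" (no guarantee), 1 for acks="1" (leader only),
    and the current in-sync replica count for acks="all". Raises
    RuntimeError if acks="all" cannot be satisfied.
    """
    if acks == "0":
        return 0
    if acks == "1":
        return 1
    if acks == "all":
        if in_sync_replicas < min_insync_replicas:
            raise RuntimeError("not enough in-sync replicas; write rejected")
        return in_sync_replicas
    raise ValueError(f"unknown acks value: {acks!r}")

def survives(acks, in_sync_replicas, broker_failures, min_insync_replicas=2):
    """Worst case: can an acknowledged record outlive this many failures?"""
    return copies_at_ack(acks, in_sync_replicas, min_insync_replicas) > broker_failures
```

With min.insync.replicas=2, an acks=all write is either rejected or already on two brokers, so one failure cannot erase it.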

"Why does Kafka use pull instead of push?"

Pull lets consumers control their pace. A slow consumer doesn't cause backpressure on the broker. A fast consumer can read as fast as the network allows. With push, the broker must track consumer speed and buffer messages when consumers are slow. Pull is simpler and scales better.
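In code, "the consumer controls its pace" means the consumer owns the offset and decides how much to ask for each time. A bare-bones sketch, assuming the log is any Python sequence and that handle is the consumer's own processing function (both names are hypothetical):

```python
def pull_loop(log, start_offset, max_records_per_poll, handle):
    """Read the log from start_offset to the end in consumer-sized bites.

    Returns the offset to resume from on the next run.
    """
    offset = start_offset
    while offset < len(log):
        batch = log[offset:offset + max_records_per_poll]
        for record in batch:
            handle(record)
        offset += len(batch)   # advance only after the batch is processed
    return offset
```

A slow consumer simply calls this less often or with a smaller max_records_per_poll; nothing on the log side has to change.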

Summary

Kafka's speed comes from five design decisions that reinforce each other: append-only log for sequential disk I/O, zero-copy for efficient data transfer, OS page cache instead of managed heap, batching to amortize overhead, and batch-level compression. None of these are novel techniques. The insight was combining them into a storage engine purpose-built for streaming data. Every system design interview involving event streaming, log aggregation, or inter-service communication should reference these fundamentals.