Partitions, Consumer Groups, and Ordering Guarantees

TL;DR

Kafka guarantees message order within a partition but not across partitions -- every design decision about partition keys, consumer groups, and parallelism flows from this one fact.

What It Is

Partition Key Ordering

A Kafka topic is a named stream of records. But a topic is just a logical concept. The real unit of work is the partition.

A partition is an ordered, immutable sequence of messages. Each message in a partition gets a unique, monotonically increasing offset. Partitions are spread across brokers. They're the unit of parallelism, the unit of replication, and the unit of consumer assignment.

If you remember one thing from this lesson: partition = unit of everything in Kafka.

Topics and Partitions

When you create a topic, you specify the number of partitions:

kafka-topics.sh --create \
  --topic orders \
  --partitions 12 \
  --replication-factor 3

This creates 12 partitions, each replicated to 3 brokers.

Topic: orders
  Partition 0:  Broker 1 (leader), Broker 2, Broker 3
  Partition 1:  Broker 2 (leader), Broker 3, Broker 1
  Partition 2:  Broker 3 (leader), Broker 1, Broker 2
  ...
  Partition 11: Broker 2 (leader), Broker 1, Broker 3

Partitions are distributed across brokers so that no single broker holds all partitions for a topic. This distributes load evenly.

Partition Keys: Controlling Where Messages Land

When a producer sends a message, it can include a key:

producer.send("orders", key=b"user-123", value=b'{"item": "widget"}')

Kafka routes the message to a partition using:

partition = hash(key) % num_partitions

All messages with the same key go to the same partition. Always. This is how you get ordering for related messages.

No Key = Round-Robin

If you don't provide a key, Kafka distributes messages across partitions using a sticky partitioner (batch to one partition, then switch). Messages are evenly distributed but have no ordering relationship.

Key Design Is Schema Design

Choosing the right partition key is as consequential as choosing a database shard key. The key determines:

Ordering -- messages with the same key are ordered.
Locality -- messages with the same key are on the same partition.
Parallelism -- one consumer per partition, so all messages for a key go to one consumer.
Hot partitions -- if one key has 80% of the traffic, one partition (and one consumer) gets 80% of the work.

Partition Key	Ordering Guarantee	Distribution	Best For
`user_id`	All events for a user are ordered	Even (if many users)	User activity streams
`order_id`	All events for an order are ordered	Even	Order state machines
`null` (no key)	No ordering	Round-robin, even	Logs, metrics (order doesn't matter)
`country`	Per-country ordering	Skewed (US dominates)	Bad choice -- hot partition
`sensor_id`	Per-sensor ordering	Even (if many sensors)	IoT data

Spicy take: 90% of Kafka production issues trace back to a bad partition key. Either the key has low cardinality (hot partition), or the developer forgot to set a key and wonders why messages arrive out of order.

Ordering Guarantees

This is the most misunderstood aspect of Kafka.

What Kafka Guarantees

Messages within the same partition are ordered by offset. Producer sends A, B, C. Consumer reads A, B, C. Always.
Messages across different partitions have no ordering. Producer sends A to partition 0 and B to partition 1. Consumer might read B before A.

The Wall-Clock Fallacy

"But message A has an earlier timestamp than message B!"

Doesn't matter. Timestamps don't determine ordering. Offsets do. And offsets are per-partition. There's no global offset across partitions.

Example: Order State Machine

An order goes through states: CREATED -> PAID -> SHIPPED -> DELIVERED.

If these events go to different partitions, a consumer might process SHIPPED before PAID. The state machine breaks.

Fix: use order_id as the partition key. All state transitions for the same order go to the same partition. Order is preserved.

# All events for order-456 go to the same partition
producer.send("order-events", key=b"order-456", value=b'{"status": "CREATED"}')
producer.send("order-events", key=b"order-456", value=b'{"status": "PAID"}')
producer.send("order-events", key=b"order-456", value=b'{"status": "SHIPPED"}')

When You Need Global Ordering

If you truly need global ordering across all messages, use one partition. This limits you to one consumer (no parallelism) and one broker's throughput. That's the trade-off.

Stripe processes payment events with ordering per merchant (partition key = merchant_id). They don't need global ordering across all merchants. A payment for Merchant A and a payment for Merchant B are independent. Different partitions, different consumers, full parallelism.

Consumer Groups

A consumer group is a set of consumers that cooperate to read from a topic.

The Core Rule

Each partition is assigned to exactly one consumer within a group. But one consumer can read from multiple partitions.

Topic: orders (6 partitions)
Consumer Group: order-processing

  Consumer A: reads Partition 0, 1
  Consumer B: reads Partition 2, 3
  Consumer C: reads Partition 4, 5

Three consumers, six partitions, each consumer handles two partitions. Add a fourth consumer:

  Consumer A: reads Partition 0, 1
  Consumer B: reads Partition 2, 3
  Consumer C: reads Partition 4
  Consumer D: reads Partition 5

Add a seventh consumer? It sits idle. Six partitions, six consumers max. The seventh consumer has nothing to read.

This is why partition count limits parallelism. If you create a topic with 3 partitions, you can never have more than 3 active consumers in a group. Plan your partition count based on your expected maximum consumer count.

Multiple Consumer Groups

Different consumer groups read the same topic independently. Each group maintains its own offsets.

Topic: orders (6 partitions)

Consumer Group "order-processing":
  Consumer A: Partitions 0, 1, 2
  Consumer B: Partitions 3, 4, 5

Consumer Group "analytics":
  Consumer X: Partitions 0, 1, 2, 3, 4, 5

The order-processing group and the analytics group both read all messages. They don't interfere with each other. This is how Kafka enables the "write once, read many" pattern.

Rebalancing

When a consumer joins or leaves a group, Kafka rebalances -- reassigns partitions across the remaining consumers.

Triggers

Consumer joins the group (new instance deployed).
Consumer leaves the group (crash, shutdown, network issue).
Consumer fails to send a heartbeat within session.timeout.ms (default: 45 seconds).
Consumer takes too long to process a batch (exceeds max.poll.interval.ms, default: 5 minutes).
Topic gets new partitions.

What Happens During Rebalancing

All consumers in the group stop processing.
The group coordinator (a Kafka broker) collects partition assignments.
The coordinator computes new assignments using the configured strategy.
Each consumer gets its new partition assignment.
Consumers resume processing.

The problem: during rebalancing, the entire group stops. For a group with 20 consumers, a single consumer restart pauses processing for all 20. This can last seconds to minutes.

Rebalancing Strategies

Strategy	How It Works	Pros	Cons
Range	Assigns consecutive partitions to each consumer	Simple	Uneven distribution with few partitions
RoundRobin	Distributes partitions one at a time across consumers	Even distribution	All partitions reassigned on change
Sticky	Minimizes partition movement during rebalance	Fewer reassignments	Slightly more complex
CooperativeSticky	Incremental rebalance -- only affected partitions stop	Minimal disruption	Requires all consumers to support it

Use CooperativeSticky (Kafka 2.4+). It's strictly better than the alternatives. Only the partitions that need to move are reassigned. Other partitions continue processing uninterrupted.

Avoiding Unnecessary Rebalances

# Increase session timeout -- don't trigger rebalance for brief GC pauses
session.timeout.ms=45000

# Increase poll interval -- give slow consumers more time
max.poll.interval.ms=300000

# Set a static group member ID -- survive restarts without rebalancing
group.instance.id=consumer-host-1

Static group membership (Kafka 2.3+) is a game-changer. Assign a fixed group.instance.id to each consumer. When a consumer restarts, Kafka recognizes it as the same member and reassigns the same partitions without triggering a full rebalance.

DoorDash reduced their rebalancing incidents by 80% after switching to static group membership. Their order processing pipeline couldn't afford the 30-second pauses that full rebalances caused during peak hours.

Offsets: Tracking Progress

Each consumer tracks its position in each partition using offsets. The offset is the index of the next message to read.

Where Offsets Are Stored

Offsets are stored in a special Kafka topic: __consumer_offsets. Each consumer group writes its current offsets to this topic. When a consumer restarts or a partition is reassigned, the new consumer reads the last committed offset and resumes from there.

Commit Strategies

Auto-Commit

enable.auto.commit=true
auto.commit.interval.ms=5000

The consumer automatically commits offsets every 5 seconds. Simple, but dangerous.

The problem: the consumer reads messages, starts processing them, and the auto-commit fires before processing finishes. If the consumer crashes, the offset was already committed. Those messages are "consumed" but never processed. Data loss.

Manual Commit (Synchronous)

for message in consumer:
    process(message)
    consumer.commit()  # blocks until offset is committed

Commit after processing. If the consumer crashes before committing, the message is reprocessed on restart. At-least-once delivery. Safe.

The downside: one commit per message is slow. Commit in batches instead:

batch = consumer.poll(timeout_ms=1000)
for message in batch:
    process(message)
consumer.commit()  # commit once per batch

Manual Commit (Asynchronous)

consumer.commit_async(callback=on_commit_complete)

Non-blocking. Higher throughput. But if the async commit fails and you don't retry, you might reprocess messages.

The Offset Reset Question

What happens when a consumer starts and there's no committed offset? The auto.offset.reset setting decides:

Setting	Behavior	Use Case
`earliest`	Start from the beginning of the partition	Process all historical data
`latest`	Start from the latest message	Skip history, only process new data
`none`	Throw an error	Fail explicitly, don't guess

For new consumer groups processing critical data, use earliest. For monitoring/alerting consumers that only care about new events, use latest.

Replication: Fault Tolerance

Each partition is replicated to multiple brokers.

Leader and Followers

One replica is the leader. All reads and writes go through the leader. Other replicas are followers. They replicate data from the leader by fetching from its log.

ISR (In-Sync Replicas)

The ISR is the set of replicas that are "caught up" with the leader. A replica stays in the ISR if it replicates within replica.lag.time.max.ms (default: 30 seconds).

If a follower falls behind (slow disk, network issue), it's removed from the ISR. When it catches up, it rejoins.

Write Acknowledgment

The acks producer setting interacts with ISR:

acks	What It Means	Durability	Latency
`0`	Don't wait for any ack	None	Lowest
`1`	Leader ack	Survives follower failure	Low
`all`	All ISR ack	Survives any single failure	Highest

With acks=all and min.insync.replicas=2 on a topic with replication.factor=3:

All 3 replicas alive: write succeeds after 2 acks (ISR = 3, min = 2).
1 replica down: write succeeds after 2 acks (ISR = 2, min = 2).
2 replicas down: write fails (ISR = 1, min = 2). The broker returns NotEnoughReplicas.

This is the correct production configuration for critical data. It guarantees that at least 2 copies of every message exist before acknowledging.

Unclean Leader Election

If all ISR replicas die and only an out-of-sync replica remains, should it become the leader?

unclean.leader.election.enable=true: Yes. Data loss is possible (the out-of-sync replica may be missing recent messages). But the partition is available.
unclean.leader.election.enable=false (default since Kafka 0.11): No. The partition is unavailable until an ISR replica recovers. No data loss.

Interview shortcut: "We disable unclean leader election because we prefer unavailability over data loss."

Partition Count: The One-Way Door

You can increase partition count. You cannot decrease it. This is a one-way door.

Choosing the Right Partition Count

Factor	Guidance
Consumer parallelism	At least as many partitions as your maximum consumer count
Throughput	More partitions = more parallel I/O. But diminishing returns past 20-30 per broker.
Ordering	More partitions = weaker global ordering. Use partition keys.
End-to-end latency	More partitions = more replication overhead. Slight latency increase.
Broker memory	Each partition uses ~10KB of broker memory for metadata. 10K partitions = 100MB.
Recovery time	More partitions = longer leader election during broker failure

Rule of thumb: Start with max(expected_consumers * 2, expected_throughput_MB / 10). For most topics, 6-24 partitions is the sweet spot. Don't create 1000 partitions "just in case."

What Happens When You Add Partitions

Existing messages stay in their current partitions. New messages are distributed across all partitions (including new ones). But here's the gotcha: if you use partition keys, adding partitions changes the key-to-partition mapping.

hash("user-123") % 12 gives a different result than hash("user-123") % 16. Messages for user-123 that were in partition 3 now go to partition 7. Ordering is broken for in-flight data.

This is why you should over-provision partitions slightly at creation time. Adding partitions later requires careful coordination.

Patterns for System Design Interviews

Pattern 1: Event-Driven Order Processing

"Topic orders with partition key order_id. Each order's events (created, paid, shipped, delivered) land in the same partition, preserving order. Consumer group order-processor with one consumer per partition. Static group membership to avoid rebalancing during deployments."

Pattern 2: Multi-Tenant Activity Feed

"Topic user-activity with partition key tenant_id. All events for a tenant land on the same partition. Each tenant's events are processed in order. Different consumer groups: one for real-time notifications, one for analytics, one for audit logging."

Pattern 3: Fan-Out to Multiple Services

"Topic payments with 3 consumer groups: fraud-detection, accounting, notifications. Each group reads all messages independently. Fraud detection needs low latency (small fetch.max.wait.ms). Accounting needs reliability (enable.auto.commit=false, manual commit after processing). Notifications tolerate occasional duplicates."

Topic Partition Consumer

Trade-Offs Table

Factor	Few Partitions (1-3)	Many Partitions (50+)
Ordering	Stronger (fewer partition boundaries)	Weaker (must rely on keys)
Parallelism	Limited consumers	High parallelism
Throughput	Limited by single partition I/O	High aggregate throughput
Rebalancing	Fast (few assignments)	Slow (many assignments)
Broker memory	Minimal	Noticeable at 10K+
Recovery time	Fast leader election	Slower leader election
Key distribution	More prone to hot partitions	Better key distribution
Operational complexity	Low	Higher (monitoring, tuning)

Interview Gotchas

"How do you guarantee ordering in Kafka?"

"Order is guaranteed within a partition. To order related messages, use a partition key that groups them. For example, all events for order-123 use order-123 as the key. They land in the same partition and are consumed in order. We cannot guarantee ordering across partitions."

"What happens when a consumer is slow?"

If a consumer takes longer than max.poll.interval.ms to process a batch, Kafka assumes it's dead and triggers a rebalance. Its partitions get reassigned to other consumers. The slow consumer then tries to commit offsets for partitions it no longer owns -- commit fails. Fix: increase max.poll.interval.ms or reduce batch size.

"Can two consumers read the same partition?"

Not within the same consumer group. But different consumer groups can. This is how Kafka enables multiple independent consumers on the same data.

"What's the difference between Kafka partitions and database shards?"

Conceptually similar -- both split data across nodes. But Kafka partitions are append-only and immutable. Database shards support random reads, updates, and deletes. Kafka partitions are optimized for sequential access. Database shards are optimized for random access.

"How do you handle partition skew?"

If one partition key has disproportionate traffic (celebrity user, large tenant), that partition becomes a hot spot. Solutions: use a compound key (tenant_id + sub_key), hash the key for even distribution, or dedicate a separate topic for high-volume producers.

Summary

Partitions are the heart of Kafka's architecture. They determine parallelism, ordering, and fault tolerance. The partition key choice is a one-way decision that affects every consumer. Consumer groups enable parallel processing with automatic partition assignment, but rebalancing is expensive -- use CooperativeSticky and static membership. Offsets track consumer progress -- commit manually for at-least-once semantics. Replication with ISR keeps data safe. Get these fundamentals right and the rest of Kafka falls into place.