Exactly-Once Semantics and Transactions

TL;DR

Kafka's exactly-once semantics are real but narrow -- they guarantee each message is processed exactly once within Kafka, but your downstream systems still need idempotent consumers to be truly safe.

What It Is

Delivery Semantics

Delivery semantics describe how many times a message gets processed. Three options exist. Most people only understand two of them, and misunderstand the third.

At-Most-Once

Commit the offset before processing the message. If the consumer crashes after committing but before finishing processing, the message is lost. When the consumer restarts, it picks up from the committed offset and skips the unprocessed message.

1. Receive message at offset 42
2. Commit offset 43  ← "I'm done with 42"
3. Start processing message
4. 💥 Consumer crashes
5. Restart → resume from offset 43
6. Message 42 is never processed

When it's acceptable: metrics collection, logging, monitoring. A missing data point every few hours doesn't break anything.

When it's not: payment processing, inventory updates, user notifications. Losing a single message means real business impact.

At-Least-Once

Commit the offset after processing the message. If the consumer crashes after processing but before committing, the message is reprocessed on restart.

1. Receive message at offset 42
2. Process message (write to database)
3. 💥 Consumer crashes before commit
4. Restart → resume from offset 42
5. Process message AGAIN (duplicate write to database)

This is the default for most Kafka consumers. It's safe -- you never lose data. But you might process the same message twice.

The fix: make your consumer idempotent. If processing a message twice produces the same result as processing it once, duplicates don't matter.

Idempotency patterns:

Database upsert: INSERT ... ON CONFLICT DO UPDATE -- same data written twice, same result.
Deduplication key: Store a (message_id, partition, offset) in a processed-messages table. Skip if already seen.
Conditional update: UPDATE balance SET amount = amount - 10 WHERE transaction_id = 'txn-123' AND NOT EXISTS (SELECT 1 FROM processed WHERE id = 'txn-123').

Exactly-Once

Each message is processed exactly once. No losses, no duplicates. Sounds perfect. But the details matter.

The Idempotent Producer

The first building block. Enabled with one setting:

enable.idempotence=true

Since Kafka 3.0, this is the default. You get it for free.

How It Works

When the producer starts, the broker assigns it a Producer ID (PID) and a sequence number starting at 0. Each message sent includes the PID and a per-partition sequence number.

Producer PID=7, sending to Partition 3:
  Message A: sequence=0
  Message B: sequence=1
  Message C: sequence=2

The broker tracks the last sequence number per (PID, partition). If it receives a message with a sequence number it's already seen, it rejects it as a duplicate. If it receives a sequence number that's too far ahead (gap), it rejects it as an out-of-order error.

What It Solves

Network retries. The producer sends message B, the broker writes it, the acknowledgment is lost. The producer retries. Without idempotency, the broker writes message B again -- duplicate. With idempotency, the broker sees sequence=1 already exists and returns a success without writing again.

What It Doesn't Solve

If the producer crashes and restarts, it gets a new PID. The sequence numbers reset. The broker treats it as a new producer. If the producer had sent a message right before crashing that was acknowledged by the broker but not by the producer, the restarted producer might send it again with a new PID. The broker can't detect this as a duplicate.

Idempotent producers prevent duplicates from retries. They don't prevent duplicates from producer restarts. That's what transactions are for.

Kafka Transactions

Transactions allow atomic writes across multiple partitions and topics.

The Read-Process-Write Pattern

This is the most common use case. A consumer reads from an input topic, processes the message, and writes to an output topic. The offset commit and the output write must happen atomically.

Without transactions:
  1. Read from input topic (offset 42)
  2. Process message
  3. Write result to output topic         ← could fail
  4. Commit offset 43 on input topic      ← could fail independently

If step 3 succeeds but step 4 fails, you reprocess message 42 and write a duplicate to the output topic. If step 4 succeeds but step 3 fails, you skip the message.

With transactions:
  1. Read from input topic (offset 42)
  2. Begin transaction
  3. Write result to output topic
  4. Commit offset 43 on input topic
  5. Commit transaction                   ← atomic: all or nothing

Either both the output write and the offset commit succeed, or neither does. No duplicates. No lost messages.

How Transactions Work

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    transactional_id='order-processor-1',  # must be unique per producer instance
    enable_idempotence=True                 # required for transactions
)

producer.init_transactions()

try:
    producer.begin_transaction()

    # Write to output topic
    producer.send("processed-orders", key=b"order-123", value=b"result")

    # Commit consumer offsets as part of the transaction
    producer.send_offsets_to_transaction(
        offsets={TopicPartition("orders", 0): OffsetAndMetadata(43)},
        group_id="order-processor"
    )

    producer.commit_transaction()
except Exception:
    producer.abort_transaction()

The Transaction Coordinator

Each Kafka broker can act as a transaction coordinator. The coordinator manages the state of active transactions. It writes transaction markers (COMMIT or ABORT) to the relevant partitions.

Transaction lifecycle:
  1. Producer registers with coordinator (via transactional.id)
  2. Producer begins transaction
  3. Producer sends messages to partitions (marked as "uncommitted")
  4. Producer requests commit
  5. Coordinator writes PREPARE marker to transaction log
  6. Coordinator writes COMMIT markers to all involved partitions
  7. Messages become visible to consumers with read_committed

The transactional.id persists across producer restarts. When a producer with the same transactional.id reconnects, the coordinator fences the old producer instance. This prevents zombie producers from committing transactions started by a crashed instance.

Spicy take: Kafka transactions are a two-phase commit protocol disguised as a streaming feature. The coordinator is the transaction manager. The partitions are the participants. It has the same failure modes as any 2PC system -- coordinator failure during the prepare phase can leave transactions in limbo. Kafka handles this with timeout-based resolution, but it's not magic.

Consumer Side: read_committed

Transactions are useless if consumers read uncommitted messages. The consumer needs to opt in:

isolation.level=read_committed

With read_committed:

The consumer only sees messages from committed transactions.
Aborted transaction messages are skipped.
Non-transactional messages are read normally.

Without read_committed (default is read_uncommitted), the consumer sees all messages, including ones from transactions that will be aborted. This defeats the purpose.

The LSO (Last Stable Offset)

The consumer won't read past the Last Stable Offset -- the offset of the earliest open (uncommitted) transaction. This means a long-running transaction can block consumer progress on that partition.

Partition offsets:
  [0] [1] [2] [3] [4] [5] [6] [7] [8]
                ^               ^
            Open txn         Latest message
            (uncommitted)

  LSO = 3. Consumer with read_committed stops here.
  Messages 3-8 are invisible until the transaction commits or aborts.

If a producer starts a transaction and crashes without committing, the LSO is stuck. The transaction eventually times out (transaction.timeout.ms, default 60 seconds), the coordinator aborts it, and the LSO advances.

Interview tip: "Long-running transactions block read_committed consumers. We keep transactions short -- read, process, write, commit. No multi-minute transactions."

End-to-End Exactly-Once

Putting it all together:

Exactly-Once = Idempotent Producer
             + Transactional Writes (atomic offset commit + output)
             + Consumer read_committed
             + Idempotent Processing Logic

Wait -- Idempotent Processing Too?

Yes. And this is the part most articles skip.

Kafka's exactly-once guarantees that a message is delivered and committed exactly once within Kafka. But what if your consumer writes to an external system?

1. Read message from Kafka
2. Write to PostgreSQL
3. Commit Kafka offset (as part of transaction)

If step 2 succeeds (PostgreSQL write committed) and step 3 fails (Kafka transaction aborted), the consumer retries. It reads the message again and writes to PostgreSQL again. Duplicate in PostgreSQL.

Kafka can't roll back a PostgreSQL write. It only controls what happens inside Kafka. External systems are outside the transaction boundary.

Fix: Make the PostgreSQL write idempotent. Use a unique constraint on the message ID. Use INSERT ... ON CONFLICT DO NOTHING. Or store the Kafka offset in the same PostgreSQL transaction as the business write, and check it before processing.

This is why some engineers call it "effectively once" rather than "exactly once." The Kafka side is exact. The end-to-end system needs application-level help.

When Exactly-Once Matters

Not always. The operational complexity and performance cost of transactions are real. Choose based on the use case.

Worth the Cost

Use Case	Why
Financial transactions	Double-charging a credit card is unacceptable
Ad click counting	Advertisers pay per click. Duplicates mean overcharging.
Inventory management	Double-decrementing stock means overselling
Account balance updates	Duplicate credits/debits corrupt balances

Uber uses exactly-once semantics for their driver payment pipeline. A duplicate payment event means paying a driver twice. The cost of a duplicate far exceeds the cost of the transaction overhead.

Not Worth the Cost

Use Case	Why At-Least-Once Is Fine
Logging	A duplicate log line is harmless
Metrics aggregation	Counters are approximate anyway
Search indexing	Re-indexing a document is idempotent
Notifications	A duplicate "your order shipped" email is mildly annoying, not catastrophic
CDC replication	Upserts are idempotent by nature

For these, at-least-once delivery with an idempotent consumer is simpler, faster, and cheaper. Don't pay the transaction tax when you don't need to.

Performance Impact

Transactions aren't free.

Factor	Without Transactions	With Transactions
Throughput	100%	~70-80%
Latency	Baseline	+5-15ms per transaction commit
Broker CPU	Normal	Higher (coordinator overhead)
Consumer lag	Real-time	Bounded by LSO (open transactions)
Complexity	Low	Medium (transactional.id management, fencing)

The throughput hit comes from the two-phase commit protocol and the synchronous nature of transaction commits. Batching multiple messages into one transaction amortizes the cost.

# Bad: one transaction per message (high overhead)
for msg in messages:
    producer.begin_transaction()
    producer.send("output", msg)
    producer.commit_transaction()

# Good: one transaction per batch (amortized overhead)
producer.begin_transaction()
for msg in batch:
    producer.send("output", msg)
producer.commit_transaction()

Patterns for System Design Interviews

Pattern 1: Exactly-Once Stream Processing

"Consumer reads from raw-events, transforms the data, writes to processed-events. Using transactional producer with read_committed consumers. Offset commit and output write are atomic. If the processor crashes, it resumes from the last committed offset without duplicating output."

Pattern 2: Idempotent External Writes

"Consumer reads payment events from Kafka. Writes to PostgreSQL with INSERT INTO payments (id, amount, status) VALUES (\$1, \$2, \$3) ON CONFLICT (id) DO NOTHING. The payment ID is derived from the Kafka message key. Duplicate processing attempts are no-ops. We use at-least-once delivery with idempotent writes instead of Kafka transactions because the external write can't participate in the Kafka transaction."

Pattern 3: Hybrid Approach

"For Kafka-to-Kafka processing (enrichment, filtering), we use exactly-once with transactions. For Kafka-to-database writes, we use at-least-once with idempotent consumers. This minimizes transaction overhead while maintaining correctness."

Anti-Pattern to Call Out

"We would NOT use exactly-once for our logging pipeline. Logging generates millions of messages per second. The 20-30% throughput hit from transactions would require adding brokers. A duplicate log line is harmless. At-least-once is the right choice here."

Idempotent Producer

Trade-Offs Table

Factor	At-Most-Once	At-Least-Once	Exactly-Once
Data loss	Possible	No	No
Duplicates	No	Possible	No (within Kafka)
Throughput	Highest	High	Lower (70-80%)
Latency	Lowest	Low	Higher (+5-15ms)
Complexity	Simple	Simple + idempotent consumer	Complex (transactions + fencing)
External writes	N/A	Needs idempotent consumer	Still needs idempotent consumer
Use case	Metrics, logs	Most applications	Financial, billing, inventory

Interview Gotchas

"Does Kafka guarantee exactly-once delivery?"

"Within Kafka, yes -- with idempotent producers, transactions, and read_committed consumers. But end-to-end exactly-once requires the consumer's external writes to be idempotent too. Kafka can't roll back a PostgreSQL INSERT. So it's exactly-once within the Kafka boundary and effectively-once end-to-end."

"What's the difference between idempotent producer and transactions?"

"Idempotent producers prevent duplicates from network retries on a single partition. Transactions provide atomic writes across multiple partitions and topics. You need idempotent producers for transactions, but you can use idempotent producers without transactions."

"What happens if the transaction coordinator crashes?"

"Another broker takes over as coordinator. Pending transactions are resolved based on the transaction log. If a transaction was in the PREPARE state, the new coordinator aborts it after transaction.timeout.ms. If it was already COMMITTED, the new coordinator completes the commit markers. No data loss."

"How do you handle zombie producers?"

"The transactional.id enables producer fencing. When a new producer instance registers with the same transactional.id, the coordinator increments the epoch. The old producer's requests are rejected with ProducerFenced error. This prevents a crashed producer that comes back to life from committing stale transactions."

"Why not just use exactly-once everywhere?"

"Performance cost (20-30% throughput), operational complexity (transactional.id management, LSO monitoring, fencing), and it doesn't help with external writes anyway. Most systems work perfectly with at-least-once delivery and idempotent consumers. Reserve exactly-once for the critical paths where duplicates have real business cost."

Summary

Kafka's exactly-once semantics are built on three layers: idempotent producers (prevent retry duplicates), transactions (atomic multi-partition writes with offset commits), and read_committed consumers (skip uncommitted data). Together they guarantee each message is processed exactly once within Kafka. But the moment your consumer writes to an external system, you need application-level idempotency. Know when exactly-once is worth the cost -- financial transactions, yes; logging, no. This nuanced understanding is what separates a strong system design answer from a textbook recitation.