Delivery Guarantees and Idempotent Consumers

TL;DR

Exactly-once delivery is a distributed systems myth. What actually works: at-least-once delivery combined with idempotent consumers. Master this pair and you'll never double-charge a customer or lose an order.

What It Is

Dead Letter Queue

Every message system makes a promise about delivery. That promise determines everything downstream — whether you can lose messages, whether you'll see duplicates, and how much defensive code your consumers need to write.

Three guarantees exist. Only two are practical.

The distinction matters enormously in interviews. Candidates who say "we need exactly-once delivery" without explaining how reveal they don't understand the problem. Candidates who say "at-least-once with idempotent consumers" show real engineering maturity.

Stripe processes millions of payment events daily. They don't rely on exactly-once delivery from their message broker. They make every consumer idempotent. That's the industry standard.

At-Most-Once — Fire and Forget

The producer sends a message and moves on. No acknowledgment. No retry. If the network hiccups or the broker crashes, the message is gone forever.

# At-most-once: send and don't wait
producer.send('metrics', value=b'cpu_usage:72')
# Did it arrive? Who knows. Don't care.

This sounds reckless, but it's the right choice when losing a message is cheaper than the overhead of guaranteeing delivery.

How It Works

Producer sends message → Broker receives (maybe) → Consumer processes (maybe)

Failure scenarios:
1. Network drops the message → lost forever
2. Broker crashes before writing to disk → lost forever
3. Consumer crashes after receiving → message gone, no redelivery

The producer doesn't retry because retrying might create duplicates. The consumer doesn't acknowledge because the broker doesn't track delivery. Maximum speed, minimum guarantees.

When At-Most-Once Makes Sense

Metrics and monitoring. If you lose one CPU measurement out of thousands per minute, your p99 graph barely shifts. The alternative — retrying every metrics payload with acknowledgment — adds latency and load to a system that produces millions of data points.

Log aggregation. Losing a log line is annoying. Dropping logs under heavy load to keep the system alive is a valid trade-off. Datadog's agent uses this approach — when the buffer is full, old entries get dropped, not queued.

Real-time gaming. Player position updates happen 60 times per second. A missed frame matters less than a late frame. Retrying stale position data is worse than skipping it.

# Kafka producer configured for at-most-once
producer = KafkaProducer(
    acks=0,           # don't wait for broker acknowledgment
    retries=0,        # never retry
    linger_ms=0       # send immediately
)

Gotcha

At-most-once is NOT acceptable for financial transactions, order processing, or anything where losing a message means losing money or corrupting state. If an interviewer hears "we'll use at-most-once for the payment queue," that's a red flag.

At-Least-Once — The Industry Default

The producer sends a message and waits for an acknowledgment. If the acknowledgment doesn't arrive within a timeout, the producer retries. The message will definitely arrive. It might arrive more than once.

# At-least-once: Kafka producer with retries
producer = KafkaProducer(
    acks='all',       # wait for all replicas to acknowledge
    retries=5,        # retry up to 5 times
    retry_backoff_ms=100
)

# This message WILL be delivered. Possibly twice.
future = producer.send('orders', value=b'{"order_id": 9001}')
future.get(timeout=10)  # block until ack received

Why Duplicates Happen

The tricky part: the producer doesn't know why the ack didn't arrive.

Scenario — The Phantom Duplicate:

1. Producer sends message to broker
2. Broker receives message and writes to disk
3. Broker sends ACK back to producer
4. Network drops the ACK
5. Producer thinks delivery failed
6. Producer retries — sends same message again
7. Broker receives the duplicate and writes it

Result: two copies of the same message in the queue

The broker successfully stored the message the first time. But the producer never got confirmation, so it retried. Now there are two copies. The consumer will process both unless it's built to handle duplicates.

This isn't a bug. It's the fundamental trade-off. You can't distinguish "message lost" from "ack lost" without adding a round of coordination — which is exactly what exactly-once protocols do.

At-Least-Once Everywhere

At-least-once is the default for almost every production messaging system:

SQS Standard: delivers at least once, may deliver duplicates
Kafka with acks=all: producer retries ensure delivery
RabbitMQ with publisher confirms: broker acks receipt, producer retries on failure
Most webhook systems: retry with exponential backoff until 200 response

The pattern is universal because it's the best balance of reliability and complexity. You accept duplicates on the transport layer and handle them on the application layer.

Exactly-Once — The Mirage

Exactly-once delivery means every message is delivered once, processed once, and has its effect applied once. No losses, no duplicates. The holy grail.

Here's the uncomfortable truth: true exactly-once delivery across distributed systems is impossible in the general case. The Two Generals Problem proves this. You can't get two parties to agree on a shared state with an unreliable channel.

But wait — Kafka claims exactly-once semantics. SQS FIFO claims exactly-once. Are they lying?

No. They're narrowing the scope.

What "Exactly-Once" Actually Means

Kafka's exactly-once (Transactional API): guarantees that a produce-consume-produce chain within Kafka is atomic. A consumer reads from topic A, processes, and writes to topic B — either both the read offset commit and the write happen, or neither does. This is exactly-once within the Kafka ecosystem. The moment your consumer calls an external API or writes to a database, you're back to at-least-once.

# Kafka transactional producer
producer = KafkaProducer(
    transactional_id='order-processor-1',
    enable_idempotence=True  # dedup on producer side
)

producer.init_transactions()

try:
    producer.begin_transaction()

    # Read from input topic
    records = consumer.poll(timeout_ms=1000)

    for record in records:
        result = process(record.value)
        # Write to output topic
        producer.send('processed-orders', value=result)

    # Commit consumer offsets AND producer writes atomically
    producer.send_offsets_to_transaction(
        consumer.position(),
        consumer.group_metadata()
    )
    producer.commit_transaction()

except Exception:
    producer.abort_transaction()

SQS FIFO exactly-once: deduplication based on a MessageDeduplicationId. If SQS receives two messages with the same dedup ID within a 5-minute window, it discards the second one. This prevents producer-side duplicates. But if your consumer crashes after processing but before deleting the message, it'll be processed again after the visibility timeout.

The Practical Answer

In every real production system, "exactly-once" is actually: at-least-once delivery + idempotent processing. The transport layer might deliver duplicates. The application layer ensures that processing a message twice produces the same result as processing it once.

This is what you should say in an interview. Not "we'll use exactly-once." Instead: "We'll use at-least-once delivery and make our consumers idempotent."

The Transactional Outbox Pattern

The practical answer above works when your consumer reads from a queue and writes to a database. But what about the producer side? You often need to atomically write to your database AND publish a message to a queue. This is the dual-write problem -- if the DB write succeeds but the queue publish fails (network blip, broker down), your system is inconsistent. The downstream consumer never learns about the change.

The Transactional Outbox solves this by never publishing directly to the queue. Instead, you write your business data AND the message to an "outbox" table in the same database transaction. A separate process reads the outbox and publishes to the queue. If publishing fails, it retries. The message is already durably stored in your DB -- it cannot be lost.

# Single database transaction -- atomic, no dual-write risk
with db.transaction():
    db.execute("INSERT INTO orders (order_id, amount) VALUES (%s, %s)",
               (order_id, amount))
    db.execute(
        "INSERT INTO outbox (id, topic, payload, created_at) "
        "VALUES (%s, %s, %s, NOW())",
        (msg_id, 'order-events', json.dumps({'order_id': order_id, 'amount': amount}))
    )

# Separate relay process (polling or CDC) publishes outbox rows to Kafka/SQS
# On successful publish, mark the outbox row as sent or delete it

Two relay strategies: (1) Polling -- a background job queries SELECT * FROM outbox WHERE sent = false ORDER BY created_at every 500ms. Simple but adds latency and DB load. (2) CDC (Change Data Capture) -- tools like Debezium tail the database's transaction log and publish outbox inserts to Kafka in near-real-time with sub-second latency. CDC is the production-grade approach for high-throughput systems.

Interview signal

If the interviewer asks "what happens if your service writes to the DB but crashes before publishing the event?" -- the Outbox Pattern is the answer. It shows you understand that distributed transactions between a DB and a message broker are impractical, and you know the standard workaround.

Idempotent Consumer Patterns

An operation is idempotent if applying it multiple times produces the same result as applying it once. SET balance = 500 is idempotent. INCREMENT balance BY 50 is not.

Here are the battle-tested patterns.

Pattern 1: Deduplication Table

Store every processed message ID. Before processing a new message, check if you've seen it before.

def process_message(message):
    msg_id = message['id']

    # Check if already processed (atomic read)
    if db.execute(
        "SELECT 1 FROM processed_messages WHERE id = %s",
        (msg_id,)
    ).fetchone():
        log.info(f"Duplicate message {msg_id}, skipping")
        return  # already handled

    # Process the message
    result = handle_order(message['payload'])

    # Record processing and apply effect in one transaction
    with db.transaction():
        db.execute(
            "INSERT INTO processed_messages (id, processed_at) "
            "VALUES (%s, NOW())",
            (msg_id,)
        )
        db.execute(
            "INSERT INTO orders (order_id, amount) VALUES (%s, %s)",
            (result['order_id'], result['amount'])
        )

Critical Detail

The dedup check and the business logic must be in the same transaction. If you check, then process, then insert the dedup record — and crash between process and insert — you'll reprocess on recovery. The dedup insert and the business effect must be atomic.

Cleanup: processed message IDs accumulate. Run a background job to delete records older than your maximum possible retry window (for SQS, that's the maximum retention period of 14 days).

Pattern 2: Natural Idempotency — Upsert Instead of Insert

If your operation is naturally idempotent, you don't need a dedup table at all.

-- NOT idempotent: fails on duplicate, or inserts twice
INSERT INTO orders (order_id, amount) VALUES ('ORD-123', 99.99);

-- Idempotent: same result whether it runs once or ten times
INSERT INTO orders (order_id, amount) VALUES ('ORD-123', 99.99)
ON CONFLICT (order_id) DO NOTHING;

-- Also idempotent: SET is idempotent, INCREMENT is not
UPDATE accounts SET balance = 500 WHERE user_id = 1001;

-- NOT idempotent: balance grows with each duplicate
UPDATE accounts SET balance = balance + 50 WHERE user_id = 1001;

Stripe uses this pattern extensively. Every API call includes an Idempotency-Key header. The server stores the result keyed by that ID. Repeat the same request, get the same response, no side effects.

Pattern 3: Version Checks / Optimistic Locking

Only apply an update if the current state matches what you expect.

-- Only update if version hasn't changed since we read it
UPDATE inventory
SET quantity = 45, version = version + 1
WHERE product_id = 'SKU-9912' AND version = 7;

-- If version is now 8 (someone else updated), this affects 0 rows
-- Consumer detects 0 affected rows and handles accordingly

def process_inventory_update(message):
    product_id = message['product_id']
    new_quantity = message['new_quantity']
    expected_version = message['version']

    rows_affected = db.execute(
        "UPDATE inventory SET quantity = %s, version = version + 1 "
        "WHERE product_id = %s AND version = %s",
        (new_quantity, product_id, expected_version)
    )

    if rows_affected == 0:
        log.warning(f"Version conflict for {product_id}, "
                    f"expected version {expected_version}")
        # Either retry with fresh version or skip

This pattern shines when multiple consumers might update the same record. The version check ensures that stale or duplicate updates don't silently overwrite newer data.

Dead Letter Queues

A dead letter queue (DLQ) catches messages that fail repeatedly. After N processing attempts, the message moves to a separate queue for inspection instead of blocking the main queue forever.

Main Queue → Consumer tries → fails → retry → fails → retry → fails
                                                              ↓
                                                     Dead Letter Queue
                                                     (manual inspection)
                                                              ↓
                                                     Alert fires → 
                                                     engineer investigates

Why Every Queue Needs a DLQ

Without a DLQ, a poison pill message — one that always fails — retries infinitely. It blocks the consumer. Other messages pile up behind it. Your entire pipeline stalls because of one bad record.

# SQS dead letter queue configuration
sqs.create_queue(
    QueueName='orders-dlq'
)

sqs.set_queue_attributes(
    QueueUrl=main_queue_url,
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            'maxReceiveCount': '3'  # DLQ after 3 failures
        })
    }
)

Poison Pill Messages

A poison pill is a message that can never be processed. Corrupted JSON. A reference to a deleted resource. A schema the consumer doesn't understand.

def process_with_poison_pill_detection(message):
    try:
        payload = json.loads(message.body)
    except json.JSONDecodeError:
        # This will NEVER succeed on retry
        # Move to DLQ immediately, don't waste retry attempts
        log.error(f"Unparseable message: {message.body[:200]}")
        move_to_dlq(message)
        return

    if 'order_id' not in payload:
        # Missing required field — retrying won't fix it
        log.error(f"Missing order_id in message: {payload}")
        move_to_dlq(message)
        return

    # Legitimate processing — transient failures will retry
    process_order(payload)

The key insight: distinguish between transient failures (retry) and permanent failures (DLQ immediately). A database timeout is transient — retry after backoff. A malformed message is permanent — retrying wastes time and clogs the queue.

Backpressure

Backpressure happens when consumers can't keep up with producers. Messages pile up. Queue depth grows. Latency spikes. Eventually, the queue hits its storage limit and starts dropping messages or rejecting publishes.

Detecting Backpressure

Monitor these metrics:

Queue depth (message count) — growing = consumer is behind
Consumer lag (Kafka) — offset difference between latest and committed
Processing latency — how long each message takes
Error rate — failing messages get retried, adding more load

Handling Backpressure

Scale consumers horizontally. Add more workers. In Kafka, each partition can have one consumer per group — so more partitions means more parallelism. In SQS, just start more pollers.

Rate limit the producer. If your producer generates work faster than consumers can handle, throttle it. Better to reject requests at the API layer with a 429 than to pile up a million messages.

# Producer-side rate limiting
from ratelimit import limits

@limits(calls=1000, period=60)  # max 1000 messages per minute
def publish_event(event):
    producer.send('events', value=event)

Circuit breaker. If the downstream service is consistently failing, stop sending messages temporarily. Let the system recover instead of piling up failures.

The pattern: track consecutive failures per downstream service. After N failures, reject requests immediately for a cooldown period. After cooldown, send one test request. If it succeeds, resume normal traffic. If it fails, extend the cooldown. This prevents a failing service from being hammered into oblivion by retries.

Visibility Timeout Deep Dive (SQS)

The visibility timeout is SQS's mechanism for preventing duplicate processing without requiring consumer coordination. It deserves special attention because getting it wrong is a common source of bugs.

Timeline of a visibility timeout:

t=0s   Consumer A receives message. Visibility timeout = 30s.
       Message becomes INVISIBLE to all other consumers.

t=20s  Consumer A is still processing.
       28 other consumers polling the queue do NOT see this message.

t=25s  Consumer A finishes processing. Deletes the message.
       Done. Everything worked.

--- BUT if Consumer A crashes: ---

t=0s   Consumer A receives message. Visibility timeout = 30s.
t=15s  Consumer A crashes.
t=30s  Visibility timeout expires. Message becomes VISIBLE again.
t=31s  Consumer B receives the message and processes it.

Setting the Right Timeout

Too short (10s for a job that takes 60s):
  → Message reappears while Consumer A is still processing
  → Consumer B picks it up → duplicate processing
  → Now both consumers write results → data corruption

Too long (300s for a job that takes 5s):
  → If consumer crashes at t=1s, message is stuck invisible for 299s
  → That's 5 minutes of unnecessary delay before retry

Right approach:
  → Set timeout to 1.5x-2x the expected processing time
  → For variable workloads, use heartbeats (extend timeout while processing)

# Extending visibility timeout during long processing
def process_long_running_job(message, sqs, queue_url):
    receipt_handle = message['ReceiptHandle']

    for chunk in process_in_chunks(message['Body']):
        handle_chunk(chunk)
        # Extend visibility timeout — "I'm still working on this"
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=30  # reset the clock
        )

    # Done — delete the message
    sqs.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle
    )

Patterns for System Design Interviews

Pattern 1: Payment Processing Pipeline

[API] → [SQS FIFO Queue] → [Payment Worker]
              ↓                    ↓
         dedup by              upsert result
         idempotency key       (natural idempotency)
              ↓                    ↓
         [DLQ after 3]       [Payments DB]
         → alert →           ON CONFLICT DO NOTHING
         manual review

Why FIFO? Ordering within a customer matters — charge before refund. Why upsert? If the worker processes the same payment twice, the second write is a no-op. Why DLQ? A malformed payment request shouldn't block legitimate ones.

Pattern 2: Event Sourcing with Idempotent Projections

[Services] → [Kafka: events] → [Projection Worker]
                                     ↓
                              Read event version
                              Compare with stored version
                              Apply only if newer
                                     ↓
                               [Read Model DB]
                               (versioned records)

The projection worker maintains a materialized view from the event stream. Each record stores its last-applied event version. If a duplicate event arrives, the version check catches it. This is how many CQRS systems handle the read side.

Pattern 3: Webhook Delivery System

[Event] → [Producer] → [Queue]
                          ↓
                    [Webhook Worker]
                    POST to customer URL
                          ↓
                    Success (2xx)? → ACK message → done
                    Failure (5xx)? → NACK → retry with backoff
                    Failure (4xx)? → DLQ → customer misconfigured

Shopify sends millions of webhooks daily. Each delivery attempt is idempotent from the broker's perspective — the worker sends and checks the response. The customer's endpoint must also be idempotent, which is why webhook documentation always says "you may receive the same event multiple times."

Trade-offs Table

Trade-off	Choose A	Choose B
Speed vs Safety	At-most-once (fastest, may lose data)	At-least-once (slower, never loses data)
Simplicity vs Correctness	Skip dedup (simpler code)	Dedup table (guaranteed correctness)
Throughput vs Ordering	Unordered (parallel processing)	Ordered (sequential, slower)
Fast retry vs Backoff	Immediate retry (faster recovery)	Exponential backoff (protects downstream)
Short timeout vs Long timeout	Short visibility timeout (fast retry)	Long timeout (prevents duplicates)
Generic dedup vs Natural idempotency	Dedup table (works everywhere)	Upsert/SET (no extra table, but not always possible)

Delivery Guarantees

Interview Gotchas

Gotcha 1: Exactly-Once Is a Lie (In the General Case)

If a candidate says "Kafka supports exactly-once" without qualification, push back. Kafka's exactly-once is scoped to its transactional API — reading from one topic and writing to another within Kafka. The moment you call an HTTP API or write to a database from your consumer, you're back to at-least-once. The right answer is always: at-least-once + idempotent consumer.

Gotcha 2: The Dedup Check Isn't Enough Alone

Checking a dedup table and then applying the business logic in separate transactions is a race condition. If the process crashes between the two steps, you'll reprocess on recovery. The dedup insert and the business effect must happen in the same database transaction.

Gotcha 3: INCREMENT Is Not Idempotent

UPDATE balance SET amount = amount + 50 is not idempotent. Processing the same message twice adds 100 instead of 50. Either use SET with absolute values, or use a dedup table to prevent double-processing. This trips up many candidates who confuse "idempotent operations" with "retry-safe operations."

Gotcha 4: DLQ Is Not Optional

Every production queue needs a dead letter queue. Every single one. Skipping the DLQ in your system design means a single bad message can halt your entire pipeline. Interviewers will notice if you forget it.

Gotcha 5: Visibility Timeout Must Match Processing Time

Setting SQS visibility timeout to 30 seconds for a job that takes 2 minutes guarantees duplicate processing. The message reappears while the first consumer is still working on it. Another consumer picks it up. Now you have two workers charging the same customer. Always set timeout to at least 2x expected processing time, or use heartbeats.