Retries, Dead Letters, Poison Messages

TL;DR

Messages fail. Your retry strategy determines whether failures recover gracefully or cascade into outages. Use exponential backoff with jitter to avoid thundering herds. Route messages that exhaust retries to a dead letter queue for investigation. Watch for poison messages that crash consumers on every attempt. And the question interviewers love most: the dual-write problem -- what happens when the DB write succeeds but the queue write fails?

Why Messages Fail

Before picking a retry strategy, understand why messages fail. The failure type determines the correct response.

Failure Type	Example	Retryable?	Strategy
Transient	Network blip, DB connection reset, 503 from downstream	Yes	Retry with backoff
Throttled	Rate limit hit (429), API quota exceeded	Yes	Retry with longer backoff
Bug in consumer	NullPointerException, missing field	No	Fix code, replay from DLQ
Invalid message	Corrupted JSON, missing required fields	No	Send to DLQ immediately
Downstream outage	Payment provider is down for 2 hours	Yes	Retry with long delays, circuit breaker

Retrying non-retryable errors wastes resources

If a message has invalid JSON, retrying it 5 times with exponential backoff just delays the inevitable. Validate early and route bad messages to the DLQ immediately.
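The table above can be turned into a small routing function. This is a minimal sketch, not a complete taxonomy: the `Action` enum, `InvalidMessageError`, and the choice of which exception types count as transient are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"                # transient: retry with backoff
    RETRY_SLOWLY = "retry_slowly"  # throttled: retry with longer backoff
    DEAD_LETTER = "dead_letter"    # not retryable: send to the DLQ


class InvalidMessageError(Exception):
    """Hypothetical error raised when a payload fails validation."""


def classify_failure(error: Exception, http_status: Optional[int] = None) -> Action:
    """Map a failure to the response suggested by the failure-type table."""
    if http_status == 429:
        return Action.RETRY_SLOWLY
    if http_status is not None and 500 <= http_status < 600:
        return Action.RETRY
    if isinstance(error, (ConnectionError, TimeoutError)):
        return Action.RETRY
    if isinstance(error, (InvalidMessageError, ValueError, KeyError)):
        # Bad data never fixes itself; json.JSONDecodeError is a ValueError.
        return Action.DEAD_LETTER
    # Unknown errors are treated as consumer bugs: triage them from the DLQ.
    return Action.DEAD_LETTER
```

Defaulting unknown errors to the DLQ is a conservative choice; a system that sees mostly transient failures might prefer to retry unknown errors a bounded number of times first.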

Retry Strategies Compared

Immediate Retry

Attempt 1: fail → Attempt 2: fail → Attempt 3: fail → DLQ
           0s              0s              0s

Only useful for genuinely transient errors (TCP reset mid-connection). If the downstream service is struggling, hammering it immediately makes things worse.

Fixed Delay

Attempt 1: fail → wait 5s → Attempt 2: fail → wait 5s → Attempt 3: fail → DLQ

Better, but if 1,000 messages fail at the same time, they all retry at exactly the same time. You've turned one spike into periodic spikes.

Exponential Backoff

Attempt 1: fail → wait 1s → Attempt 2: fail → wait 2s → Attempt 3: fail → wait 4s → Attempt 4: fail → DLQ

Each retry waits twice as long. This gives the downstream service progressively more breathing room. But there's still a problem: synchronized retries.
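To make the timing concrete, here is a small sketch that computes the waits and the clock times at which the retries fire. The function name and the `max_delay` cap are assumptions for illustration; the point is that the waits are cumulative, so every message that failed at t=0 retries at the same instants.

```python
from itertools import accumulate


def backoff_delays(retries: int, base_delay: float = 1.0, max_delay: float = 60.0) -> list:
    """Wait before each retry: base_delay doubled each time, capped at max_delay."""
    return [min(max_delay, base_delay * 2 ** n) for n in range(retries)]


delays = backoff_delays(3)
retry_times = list(accumulate(delays))  # seconds after the first failure

print(delays)       # [1.0, 2.0, 4.0]
print(retry_times)  # [1.0, 3.0, 7.0]
```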

Exponential Backoff with Jitter (The Right Answer)

When 500 messages fail at t=0, exponential backoff alone means they all retry together at t=1s, then t=3s, then t=7s (the waits of 1s, 2s, and 4s add up). Adding jitter spreads them out randomly.

import random
import time

def retry_with_jitter(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """
    Full jitter strategy (recommended by AWS).
    Delay = random between 0 and min(max_delay, base_delay * 2^attempt)
    """
    exponential = min(max_delay, base_delay * (2 ** attempt))
    jitter = random.uniform(0, exponential)
    return jitter

# Attempt 0: random(0, 1)   → e.g., 0.7s
# Attempt 1: random(0, 2)   → e.g., 1.3s
# Attempt 2: random(0, 4)   → e.g., 2.9s
# Attempt 3: random(0, 8)   → e.g., 5.1s
# Attempt 4: random(0, 16)  → e.g., 11.7s
# Attempt 5: random(0, 32)  → e.g., 28.4s

Retry with jitter: spreading retries randomly vs synchronized spikes

The AWS Architecture Blog's comparison of backoff strategies found that full jitter does less total work than equal jitter, with decorrelated jitter performing similarly. The randomness is the feature, not a bug.
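A delay function is only half of the picture; something has to call it between attempts. The wrapper below is a minimal sketch under a few assumptions: `call_with_retries` and its parameters are hypothetical names, only `ConnectionError` and `TimeoutError` are treated as retryable, and `sleep` is injectable so the loop can be tested without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def full_jitter_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Full jitter: uniform between 0 and the capped exponential delay."""
    return random.uniform(0, min(max_delay, base_delay * 2 ** attempt))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying retryable errors with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller route to the DLQ
            sleep(full_jitter_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Non-retryable exceptions are not caught at all, so they propagate on the first attempt instead of burning the retry budget.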

Strategy Summary

Strategy	When to Use
Immediate	Rarely in production: at most one quick retry for a genuinely transient error (fine for local dev)
Fixed delay	Simple tasks with independent failures
Exponential backoff	Standard choice for most systems
Exponential + jitter	Default for everything in production

Dead Letter Queues: Where Failed Messages Go to Wait

A dead letter queue (DLQ) catches messages that have exhausted all retry attempts. Without a DLQ, failed messages either disappear (data loss) or retry forever (resource waste).

Dead letter queue: failed messages routed after exhausting retries

SQS DLQ Configuration

{
  "QueueName": "report-jobs",
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789:report-jobs-dlq",
    "maxReceiveCount": 5
  }
}

After 5 receives without deletion, SQS automatically moves the message to the DLQ.
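When the queue is created through the SQS API rather than a template, the redrive policy is passed as a JSON-encoded string inside the queue's `Attributes`. The helper below is a sketch of building that value; `dlq_redrive_attributes` is a hypothetical name, and the boto3 call is shown only as a comment so the snippet runs without AWS credentials.

```python
import json


def dlq_redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the Attributes dict that attaches a DLQ to a source queue."""
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects the policy serialized to a string, not a nested object.
    return {"RedrivePolicy": json.dumps(policy)}


attrs = dlq_redrive_attributes("arn:aws:sqs:us-east-1:123456789:report-jobs-dlq")

# Assumed usage with a boto3 SQS client:
#   boto3.client("sqs").create_queue(QueueName="report-jobs", Attributes=attrs)
```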

What to Do With DLQ Messages

A DLQ is not a graveyard -- it's a triage room.

import json
import logging

import boto3

log = logging.getLogger(__name__)
sqs = boto3.client("sqs")

def process_dlq():
    """Periodic job: inspect and redrive DLQ messages."""
    messages = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10)

    for msg in messages.get("Messages", []):
        try:
            body = json.loads(msg["Body"])
        except json.JSONDecodeError:
            # Unparseable bodies are exactly what ends up in a DLQ;
            # don't let them crash the triage job itself.
            body = None

        if body is None or is_invalid_format(body):
            # Bad data -- log, alert, delete
            log.error(f"Invalid message format: {msg['Body']}")
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])

        elif is_known_bug(body):
            # Bug has been fixed -- redrive to main queue
            sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])

        else:
            # Unknown failure -- leave for manual investigation
            log.warning(f"Unknown DLQ message: {body}")

Set up alerts on DLQ depth

A growing DLQ means something is broken. Set a CloudWatch alarm for ApproximateNumberOfMessagesVisible > 0 on your DLQ. If it fires, you have messages that need attention.
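One way to define that alarm in code is to build the parameters for CloudWatch's `put_metric_alarm` call. This is a sketch: the alarm name, the one-minute period, and the SNS topic used for notifications are assumptions, and the boto3 call itself is left as a comment.

```python
def dlq_alarm_params(dlq_name: str, sns_topic_arn: str) -> dict:
    """Parameters for an alarm that fires when the DLQ holds any message."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,               # seconds per evaluation window (assumed)
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Assumed usage with a boto3 CloudWatch client:
#   boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params(name, topic_arn))
```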

Poison Messages: The Queue Blocker

A poison message is one that always crashes the consumer. Without safeguards, here's what happens:

Worker picks up message → crash → message returns to queue
Worker picks up message → crash → message returns to queue
Worker picks up message → crash → message returns to queue
... forever

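On an ordered queue, or with a single worker, the poison message blocks everything behind it because the worker crashes before it can process anything else. With several workers, it keeps crashing whichever one picks it up. Either way, legitimate messages pile up behind it.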

How Poison Messages Happen

Malformed data that triggers an unhandled exception
A message referencing a deleted database record
Payload size exceeding a downstream limit
Character encoding issues (emoji in a field expecting ASCII)
Integer overflow (quantity: 99999999999999)
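Several of these causes can be caught before the job runs. The check below is a sketch for a hypothetical order payload with `sku` and `quantity` fields; the size limit and the 32-bit quantity bound are assumed values, not limits taken from any particular system. It does not cover references to deleted records, which need a database lookup.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024  # assumed downstream size limit
MAX_QUANTITY = 2**31 - 1        # assumed 32-bit integer column


def find_payload_problems(raw: bytes) -> list:
    """Return a list of reasons the payload should go straight to the DLQ."""
    problems = []
    if len(raw) > MAX_PAYLOAD_BYTES:
        problems.append("payload too large")
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return problems + ["not valid UTF-8 JSON"]
    if not isinstance(payload, dict):
        return problems + ["payload is not a JSON object"]

    sku = payload.get("sku")
    if not isinstance(sku, str) or not sku.isascii():
        problems.append("sku must be an ASCII string")

    quantity = payload.get("quantity")
    is_int = isinstance(quantity, int) and not isinstance(quantity, bool)
    if not is_int or not 0 < quantity <= MAX_QUANTITY:
        problems.append("quantity out of range")
    return problems
```

An empty list means the payload passed these checks; anything else is a candidate for the DLQ without a single retry.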

Defenses Against Poison Messages

MAX_RETRIES = 5

def safe_process(message):
    """Worker wrapper that prevents poison messages from blocking the queue."""

    # Defense 1: Content validation before processing
    try:
        payload = json.loads(message.body)
        validate_schema(payload)  # throws if invalid
    except (json.JSONDecodeError, ValidationError) as e:
        log.error(f"Invalid message, sending to DLQ: {e}")
        send_to_dlq(message)
        acknowledge(message)
        return

    # Defense 2: Max retry tracking
    # ApproximateReceiveCount counts every delivery, including the first,
    # so the number of retries so far is one less.
    receive_count = int(message.attributes.get("ApproximateReceiveCount", 1))
    retry_count = receive_count - 1
    if retry_count > MAX_RETRIES:
        log.error(f"Message exceeded {MAX_RETRIES} retries, sending to DLQ")
        send_to_dlq(message)
        acknowledge(message)
        return

    # Defense 3: Timeout wrapper
    try:
        with timeout(seconds=300):
            process_job(payload)
        acknowledge(message)
    except TimeoutError:
        log.error("Job timed out, will be retried via visibility timeout")
    except Exception as e:
        log.error(f"Processing failed (attempt {receive_count}): {e}")
        # Don't acknowledge -- let visibility timeout return it to queue
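The `timeout(seconds=300)` wrapper used above is not a standard-library function. One possible implementation, sketched here under the assumption of a Unix system and a worker that runs in the main thread, uses `signal.setitimer` to interrupt the job:

```python
import signal
from contextlib import contextmanager


@contextmanager
def timeout(seconds: float):
    """Raise TimeoutError if the body runs longer than `seconds`.

    Unix-only sketch: relies on SIGALRM and must run in the main thread.
    """
    def _raise_timeout(signum, frame):
        raise TimeoutError(f"timed out after {seconds}s")

    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

A signal-based timeout cannot interrupt code stuck inside a C extension that never returns to the interpreter, which is one reason the process isolation approach exists.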

Worker Isolation

For critical queues, run each message in an isolated process. If the message causes a segfault or memory leak, only the child process dies.

import multiprocessing

def isolated_worker(message):
    """Process each message in a child process."""
    process = multiprocessing.Process(target=process_job, args=(message,))
    process.start()
    process.join(timeout=300)

    if process.exitcode == 0:
        acknowledge(message)
    elif process.exitcode is None:
        # Timed out: kill the child, then join so it is reaped
        process.kill()
        process.join()
        log.error("Worker process timed out")
    else:
        # Crashed
        log.error(f"Worker crashed with exit code {process.exitcode}")

The Dual-Write Problem

This is one of the subtlest failure modes in async systems, and one that interviewers ask about often. It comes up in most system design interviews that involve queues.

The Scenario

Your API needs to do two things atomically:

Write a record to the database
Publish a message to the queue

@app.route("/api/orders", methods=["POST"])
def create_order():
    # Step 1: Write to database
    order = db.orders.insert(request.json)  # succeeds

    # Step 2: Publish to queue
    queue.send_message({"order_id": order.id})  # FAILS -- network error

    return jsonify(order), 201

The order is in the database, but the queue message was never sent. The downstream service that ships orders never finds out about it. The customer waits forever.

The reverse is equally bad: the queue message is sent, but the DB write fails. Now the shipping service tries to ship an order that doesn't exist.

This shows up in every interview

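Interviewers love this because it tests whether you understand distributed systems fundamentals. The database and the queue are separate systems, so an ordinary database transaction can't cover both the write and the publish. Making them atomic would take a distributed transaction (such as two-phase commit), which many message brokers don't offer and which brings its own failure modes.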

Solution 1: The Transactional Outbox Pattern

Instead of writing to the queue directly, write to an outbox table in the same database transaction as your business data.

Transactional outbox pattern: order and event in same DB transaction

-- Single transaction: both succeed or both fail
BEGIN;

INSERT INTO orders (id, customer_id, total)
VALUES ('ord_123', 'cust_456', 99.99);

INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload, published)
VALUES ('evt_789', 'order', 'ord_123', 'order.created',
        '{"order_id": "ord_123", "total": 99.99}', FALSE);

COMMIT;

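A separate poller process reads unpublished outbox events and sends them to the queue. Delivery is at-least-once: if the poller crashes after sending an event but before marking it published, the event is sent again on the next pass, so consumers must be idempotent.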

def outbox_poller():
    """Runs every few seconds. Reads unpublished events, sends to queue."""
    while True:
        events = db.query(
            "SELECT * FROM outbox WHERE published = FALSE ORDER BY created_at LIMIT 100"
        )

        for event in events:
            queue.send_message(event.payload)
            db.execute(
                "UPDATE outbox SET published = TRUE WHERE id = %s", (event.id,)
            )

        time.sleep(2)
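The same pattern can be tried end to end with an in-memory SQLite database standing in for the real one. This is a sketch under stated assumptions: the table layout is simplified, `send` is any callable that publishes a payload (hypothetical, a list's `append` works for experiments), and the outbox uses an auto-incrementing id for ordering instead of `created_at`.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT, total REAL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def create_order(conn: sqlite3.Connection, order_id: str, customer_id: str, total: float) -> None:
    """Write the order and its event in one transaction: both or neither."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"order_id": order_id, "total": total})),
        )


def publish_pending(conn: sqlite3.Connection, send, batch_size: int = 100) -> int:
    """One poller pass: send unpublished events, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, payload in rows:
        send(payload)  # if this raises, the row stays unpublished for the next pass
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)
```

Because each row is marked only after `send` returns, a failed publish leaves the event in place for the next pass rather than losing it.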

Solution 2: Change Data Capture (CDC)

Instead of polling, use CDC to stream database changes directly to the queue. Tools like Debezium read the database's transaction log (WAL in Postgres, binlog in MySQL) and publish changes.

CDC pipeline: application writes to DB, Debezium streams WAL to Kafka

Approach	Pros	Cons
Outbox + Poller	Simple, works with any queue, easy to debug	Polling latency (seconds), outbox table grows
Outbox + CDC	Near real-time, no polling overhead	More infrastructure (Debezium, Kafka Connect)
CDC on business tables	No application changes needed	Exposes internal schema, harder to evolve

Putting It All Together

Here's the complete failure handling pipeline:

Complete failure handling pipeline: validation, retries, DLQ, and redrive

The Retry Budget Mental Model

Every system has a retry budget -- the total number of retries you can afford before things start cascading.

Original messages/sec:     1,000
Failure rate:                  5%
Retries per failure:           3
Retry traffic:          1,000 × 5% × 3 = 150 msg/s
Total traffic:          1,000 + 150 = 1,150 msg/s (15% increase)

With 50% failure rate and 5 retries:
Retry traffic:          1,000 × 50% × 5 = 2,500 msg/s
Total traffic:          1,000 + 2,500 = 3,500 msg/s (250% increase!)
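The figures above are an upper bound: they assume every failed message goes on to use all of its retries. The sketch below computes that bound and, for comparison, the expected load under the simplifying assumption that each attempt fails independently with the same probability. Both function names are hypothetical.

```python
def worst_case_load(base_rate: float, failure_rate: float, max_retries: int) -> float:
    """Upper bound: every failed message consumes all of its retries."""
    return base_rate * (1 + failure_rate * max_retries)


def expected_load(base_rate: float, failure_rate: float, max_retries: int) -> float:
    """Expected load if each attempt fails independently (assumed model).

    A message reaches attempt k only if its previous k attempts all failed.
    """
    return base_rate * sum(failure_rate ** k for k in range(max_retries + 1))


print(round(worst_case_load(1000, 0.05, 3)))  # 1150
print(round(worst_case_load(1000, 0.50, 5)))  # 3500
print(expected_load(1000, 0.50, 5))           # 1968.75
```

During a real outage failures are correlated rather than independent, so actual load tends toward the worst-case figure.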

When a downstream service is struggling, unlimited retries can turn a partial outage into a complete one. This is why max retry counts and circuit breakers exist.
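A circuit breaker enforces that limit across messages rather than per message: after enough consecutive failures it stops calling the downstream service for a while. The class below is a minimal single-threaded sketch; the names, the thresholds, and the injectable `clock` are assumptions for illustration, not a production implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream marked unhealthy; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, a worker can leave the message unacknowledged so it returns to the queue later instead of spending a retry on a call that is expected to fail.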

Key Takeaways

Concept	Details
Exponential + jitter	Default retry strategy for production systems
DLQ	Catches messages after max retries; triage, don't ignore
Poison messages	Validate early, track retry counts, isolate workers
Dual-write problem	DB + Queue can't share a transaction; use outbox pattern
Retry budget	High retry counts + high failure rate = cascading failure