RabbitMQ vs Kafka vs SQS — Choosing the Right Queue

TL;DR

RabbitMQ routes messages smartly so consumers stay dumb. Kafka keeps a dumb log so consumers can be smart. SQS lets you skip the infrastructure argument entirely. Pick wrong and you'll spend six months migrating.

What It Is

A message queue sits between producers and consumers. Producer drops a message in, consumer picks it out, work gets done. Sounds simple. The devil is in the details — and the details differ wildly between these three.

The core tension: how much intelligence lives in the broker vs the consumer?

RabbitMQ is a traditional message broker. It routes, filters, prioritizes, and delivers. The broker does the thinking. Consumers just receive and process.

Kafka is a distributed commit log. It appends events, stores them for days or weeks, and lets consumers pull at their own pace. The broker is a glorified append-only file. Consumers track their own position.

SQS is AWS saying "you don't need to manage any of this." Fully managed, pay-per-message, invisible infrastructure. You trade control for operational simplicity.

LinkedIn built Kafka because the message brokers available at the time couldn't handle their event throughput. But most companies aren't LinkedIn. Shopify ran RabbitMQ for years and it worked fine. The right choice depends on your actual problem, not your aspirations.

RabbitMQ — The Smart Broker

RabbitMQ implements AMQP (Advanced Message Queuing Protocol). It's a traditional message broker with rich routing capabilities that date back to enterprise messaging patterns from the 2000s.

Exchanges and Routing

Messages don't go directly to queues. They go to exchanges. Exchanges route to queues based on rules. This indirection is what gives RabbitMQ its flexibility.

Producer → Exchange → Binding Rules → Queue(s) → Consumer(s)

Four exchange types:

Direct Exchange:
  routing_key = "payment.success"
  → routes to queue bound with key "payment.success"
  → one-to-one mapping, like addressing a letter

Fanout Exchange:
  → routes to ALL bound queues, ignores routing key
  → broadcast pattern: order placed → notify inventory,
    billing, shipping simultaneously

Topic Exchange:
  routing_key = "order.us.premium"
  → queue bound with "order.us.*" receives it
  → queue bound with "order.#" receives it
  → queue bound with "order.eu.*" does NOT
  → wildcard matching, the most flexible option

Headers Exchange:
  → routes based on message headers, not routing key
  → rarely used, but useful for complex routing logic
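
The topic wildcard rules above are easy to get wrong, so here is a minimal sketch of the matching logic in plain Python. It is not RabbitMQ's implementation; topic_matches is a hypothetical helper that only follows the documented rule that * matches exactly one word and # matches zero or more.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP-style topic binding pattern matches a routing key.

    '*' matches exactly one dot-separated word; '#' matches zero or more words.
    """
    def match(p: list[str], k: list[str]) -> bool:
        if not p:
            return not k  # pattern exhausted: match only if the key is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero words, or one word and then try again
            return match(rest, k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


print(topic_matches("order.us.*", "order.us.premium"))  # True
print(topic_matches("order.#", "order.us.premium"))     # True
print(topic_matches("order.eu.*", "order.us.premium"))  # False
```

Note that "order.#" also matches the bare key "order", because # is allowed to match zero words.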

Per-Message Acknowledgment

RabbitMQ tracks which messages have been delivered and acknowledged. Consumer processes a message, sends an ACK back, broker removes it from the queue. If the consumer crashes mid-processing, the message gets redelivered.

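# Consumer with manual acknowledgment
import pika

# Assumes a broker on localhost and a process_order() defined elsewhere
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

def callback(ch, method, properties, body):
    try:
        process_order(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Reject and requeue. A message that always fails will loop
        # forever unless you dead-letter it or set requeue=False.
        ch.basic_nack(delivery_tag=method.delivery_tag,
                      requeue=True)

channel.basic_consume(queue='orders', on_message_callback=callback)
channel.start_consuming()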

This per-message tracking is what makes RabbitMQ a true queue: once a message is consumed and acknowledged, it's gone. No replay, no going back.

Message Priority

RabbitMQ supports priority queues natively. You declare a queue with x-max-priority, and higher-priority messages jump ahead. Kafka has no concept of priority. SQS doesn't either.

# Declare a priority queue (max 10 priority levels)
channel.queue_declare(
    queue='tasks',
    arguments={'x-max-priority': 10}
)

# Publish with priority
channel.basic_publish(
    exchange='',
    routing_key='tasks',
    body='urgent-report',
    properties=pika.BasicProperties(priority=9)
)

This matters for job scheduling. If you need critical alerts to jump ahead of batch reports in the same queue, RabbitMQ handles it natively. With Kafka, you'd need separate topics and consumer logic.
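
To make the "jump ahead" behavior concrete, here is a small in-memory sketch of priority ordering. It is a stand-in built on heapq, not how RabbitMQ stores messages; PriorityTaskQueue and its method names are made up for illustration.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """In-memory stand-in for a priority queue; not RabbitMQ's implementation."""

    def __init__(self, max_priority: int = 10):
        self.max_priority = max_priority
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def publish(self, body: str, priority: int = 0) -> None:
        # Clamp to the declared maximum, as a queue with x-max-priority would
        priority = max(0, min(priority, self.max_priority))
        # heapq is a min-heap, so negate the priority to pop the highest first
        heapq.heappush(self._heap, (-priority, next(self._seq), body))

    def consume(self) -> str:
        _, _, body = heapq.heappop(self._heap)
        return body


q = PriorityTaskQueue()
q.publish("batch-report-1", priority=1)
q.publish("batch-report-2", priority=1)
q.publish("urgent-alert", priority=9)

order = [q.consume() for _ in range(3)]
print(order)  # ['urgent-alert', 'batch-report-1', 'batch-report-2']
```

The urgent message was published last and consumed first, while the two batch reports keep their arrival order.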

RabbitMQ Strengths

Complex routing without code changes (just rebind queues)
Message priority for mixed-urgency workloads
Per-message acknowledgment and redelivery
Low latency for individual messages (~1ms)
Mature ecosystem with every language supported

RabbitMQ Weaknesses

Messages deleted after consumption — no replay
Throughput typically caps around 50K-80K msg/sec per node (depends heavily on message size and durability settings)
Clustering is painful and partition-prone
Not built for event sourcing or stream processing

Kafka — The Distributed Log

Kafka isn't a queue. Stop calling it one. It's a distributed, partitioned, replicated commit log. This distinction matters because it changes what you can do with it.

The Log Abstraction

Every Kafka topic is divided into partitions. Each partition is an append-only, ordered, immutable sequence of records. Records are assigned sequential offsets.

Topic: user-events
  Partition 0: [offset 0][offset 1][offset 2]...[offset 9847]
  Partition 1: [offset 0][offset 1]...[offset 7213]
  Partition 2: [offset 0][offset 1]...[offset 11402]

Consumer group "analytics":
  Consumer A reads Partition 0 (currently at offset 9500)
  Consumer B reads Partition 1 (currently at offset 7200)
  Consumer C reads Partition 2 (currently at offset 11000)

Consumers pull messages. They control their own offset — their position in the log. The broker doesn't track whether a message was "consumed." It just stores data and serves reads.

This is the fundamental difference. RabbitMQ pushes and tracks. Kafka stores and serves.

Why the Log Model Wins for Streaming

Replay. A new analytics service joins the company? Point it at offset 0 and replay every event from the beginning. Try that with RabbitMQ — the messages are gone.

Multiple consumers. Five teams need the same event stream? Five consumer groups, each with their own offset. No message duplication on the broker side. RabbitMQ requires fanout exchanges and copies.

Retention. Configure Kafka to keep messages for 7 days, 30 days, or forever. It's a database of events. Uber keeps certain Kafka topics for months for auditing.
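
Replay and independent consumer groups both fall out of one idea: reading never removes anything. A rough sketch with a Python list standing in for a single partition (PartitionLog is hypothetical, and it keeps group offsets in the same object purely for brevity):

```python
class PartitionLog:
    """Append-only list standing in for a single Kafka partition."""

    def __init__(self):
        self._records = []
        self._committed = {}  # consumer group name -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        start = self._committed.get(group, 0)
        batch = self._records[start:start + max_records]
        self._committed[group] = start + len(batch)  # auto-commit for brevity
        return batch

    def seek(self, group, offset):
        self._committed[group] = offset


log = PartitionLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

billing_first = log.poll("billing")      # billing reads everything
analytics_first = log.poll("analytics")  # analytics gets the same records
log.seek("billing", 0)                   # rewind: replay from the beginning
billing_replay = log.poll("billing")

print(billing_first, analytics_first, billing_replay)
```

Each group moves through the same records at its own pace, and rewinding one group does not affect the other.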

Partitions and Ordering

Kafka guarantees ordering within a partition, not across partitions. If you need all events for user 12345 processed in order, you must ensure they land in the same partition.

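from kafka import KafkaProducer  # kafka-python client (assumed)

# Assumed broker address; adjust for your cluster
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Partition key ensures all events for the same user
# go to the same partition
producer.send(
    'user-events',
    key=b'user-12345',      # partition key
    value=b'{"action": "purchase", "amount": 99.99}'
)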

# Kafka hashes the key: hash("user-12345") % num_partitions
# Same key always hits the same partition
# Within that partition, order is guaranteed

Gotcha

Adding partitions to a running topic changes the hash mapping. Key "user-12345" might route to a different partition after the change. Plan your partition count upfront — 12 is a reasonable starting point for most topics.
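
You can see the remapping problem with a few lines of Python. This sketch uses zlib.crc32 as a stand-in hash (the Java client's default partitioner uses murmur2, so real partition numbers will differ), and partition_for is a hypothetical helper.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # crc32 is a stand-in hash, chosen because it is stable across runs
    return zlib.crc32(key) % num_partitions


keys = [f"user-{i}".encode() for i in range(1000)]
before = {k: partition_for(k, 12) for k in keys}  # topic created with 12 partitions
after = {k: partition_for(k, 16) for k in keys}   # later grown to 16

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed partition")
```

Any key that moves now has its older events in one partition and its newer events in another, which is exactly where per-key ordering breaks.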

Kafka Strengths

Millions of messages per second per cluster
Message replay from any point in time
Multiple independent consumer groups
Built-in replication and fault tolerance
Natural fit for event sourcing and CQRS

Kafka Weaknesses

Operational overhead — ZooKeeper (or KRaft), brokers, topics, partitions
No per-message priority
No complex routing (consumers read entire topics)
Higher latency for single messages (~5-10ms vs ~1ms)
Overkill for simple task queues

SQS — The Managed Escape Hatch

AWS SQS exists so you never have to page someone at 3 AM because a broker node ran out of disk. No clusters to manage. No replication to configure. No capacity planning. Push messages in, pull them out.

Standard vs FIFO Queues

SQS comes in two flavors with very different guarantees:

Standard Queue:
  - At-least-once delivery (may deliver duplicates)
  - Best-effort ordering (NOT guaranteed)
  - Nearly unlimited throughput
  - Use for: work that can tolerate duplicates and disorder

FIFO Queue:
  - Exactly-once processing (dedup within 5-minute window)
  - Strict ordering within message groups
  - 3,000 messages/sec with batching (70K with high throughput mode)
  - Use for: order processing, financial transactions
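
The FIFO deduplication window is worth picturing. A minimal model, assuming each message carries a deduplication ID and that the window is 300 seconds (DedupWindow is a hypothetical class, not an AWS API):

```python
class DedupWindow:
    """Drop messages whose deduplication ID was already accepted within the window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._seen = {}  # dedup id -> time it was first accepted

    def accept(self, dedup_id: str, now: float) -> bool:
        first_seen = self._seen.get(dedup_id)
        if first_seen is not None and now - first_seen < self.window_seconds:
            return False  # duplicate inside the window: ignore it
        self._seen[dedup_id] = now
        return True


window = DedupWindow()
results = [
    window.accept("order-42", now=0.0),    # first send
    window.accept("order-42", now=10.0),   # producer retry 10 seconds later
    window.accept("order-42", now=400.0),  # same ID after the window has passed
]
print(results)  # [True, False, True]
```

The third send gets through, which is why a retry that can arrive later than five minutes still needs an idempotent consumer.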

Visibility Timeout

When a consumer receives a message from SQS, the message becomes invisible to other consumers for a configurable timeout period. This prevents two workers from processing the same job.

import boto3

sqs = boto3.client('sqs')
# Placeholder URL; substitute your own queue
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/payments'

# Receive a message; it becomes invisible for 30 seconds
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    VisibilityTimeout=30  # seconds
)

# The 'Messages' key is absent when the queue is empty
for message in response.get('Messages', []):
    try:
        # process_payment() is assumed to be defined elsewhere
        process_payment(message['Body'])
        # Success: delete the message
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message['ReceiptHandle']
        )
    except Exception:
        # Don't delete: message reappears after 30 seconds
        # Another worker will pick it up
        pass

If the consumer crashes, the message reappears after the timeout expires. Set it too short and you get duplicate processing. Set it too long and failed messages take forever to retry.
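
Here is a toy model of that lifecycle with a fake clock, so you can watch a message disappear and come back. It is only a sketch: VisibilityQueue is hypothetical, and it identifies messages by an internal ID rather than the receipt handle real SQS uses.

```python
import uuid


class VisibilityQueue:
    """In-memory model of receive / delete with a visibility timeout."""

    def __init__(self, visibility_timeout: float = 30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, time it becomes visible]

    def send(self, body: str) -> None:
        self._messages[str(uuid.uuid4())] = [body, 0.0]

    def receive(self, now: float):
        for msg_id, entry in self._messages.items():
            body, visible_at = entry
            if visible_at <= now:
                entry[1] = now + self.visibility_timeout  # hide from other workers
                return msg_id, body
        return None

    def delete(self, msg_id: str) -> None:
        self._messages.pop(msg_id, None)


queue = VisibilityQueue(visibility_timeout=30.0)
queue.send("charge-card-1001")

first = queue.receive(now=0.0)    # worker A gets the message
hidden = queue.receive(now=10.0)  # worker B sees nothing: still invisible
retry = queue.receive(now=31.0)   # A never deleted it, so it reappears
queue.delete(retry[0])            # this time processing succeeded
empty = queue.receive(now=100.0)  # gone for good

print(first[1], hidden, retry[1], empty)
```

If worker A was merely slow rather than dead at second 31, both workers would now be processing the same payment, which is the "too short" failure mode.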

Dead Letter Queues

After a message fails N times (configurable), SQS automatically moves it to a dead letter queue. This prevents poison pills from blocking your queue forever.

{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    "maxReceiveCount": 3
  }
}

Every production SQS queue needs a DLQ. No exceptions. If you skip this, a single malformed message can clog your pipeline for hours while it retries endlessly.
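
One way to attach that policy from Python, as a sketch. set_queue_attributes is a real boto3 SQS client method and expects RedrivePolicy as a JSON string; the helper names, the queue URL you pass in, and the ARN below are placeholders.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Build the Attributes dict for a source queue that should use a DLQ."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS takes the policy as a JSON-encoded string, not a nested object
    return {"RedrivePolicy": json.dumps(policy)}


def attach_dlq(sqs_client, queue_url: str, dlq_arn: str,
               max_receive_count: int = 3) -> None:
    # sqs_client is assumed to be boto3.client("sqs")
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=build_redrive_attributes(dlq_arn, max_receive_count),
    )


# Placeholder ARN for illustration only
attrs = build_redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq")
print(attrs)
```

Keeping the attribute building separate from the AWS call makes the policy easy to unit test without credentials.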

SNS + SQS Fan-Out

SNS is the pub/sub complement to SQS. One SNS topic can push to multiple SQS queues, Lambda functions, HTTP endpoints, or email addresses simultaneously.

Order Placed event
  → SNS Topic: "order-events"
     → SQS Queue: "inventory-updates"  (inventory service)
     → SQS Queue: "billing-jobs"       (billing service)
     → SQS Queue: "shipping-prep"      (shipping service)
     → Lambda: "analytics-ingest"      (data pipeline)

This is AWS's answer to Kafka's multiple consumer groups. The difference: SNS copies the message to each subscriber. Kafka stores one copy and lets multiple groups read it independently. At small scale, the difference doesn't matter. At millions of messages per day, the SNS approach costs more.

Redis Streams — The Forgotten Middle Ground

Here's a spicy take: for most startups, Redis Streams is the right choice and nobody picks it.

Redis Streams gives you a Kafka-like append-only log with consumer groups, acknowledgment, and replay — inside the Redis you're already running. No new infrastructure. No new operational playbook.

# Producer: append to stream
XADD orders * user_id 1001 product "Widget" amount 29.99

# Consumer group: create and read
XGROUP CREATE orders analytics-group 0
XREADGROUP GROUP analytics-group consumer-1 COUNT 10 BLOCK 5000 STREAMS orders >

# Acknowledge processed messages
XACK orders analytics-group 1681234567890-0

Redis Streams won't replace Kafka at Netflix scale. But if you have fewer than 100K messages per minute and you already have Redis, adding Kafka is adding complexity for bragging rights.
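
The same read-then-acknowledge loop from application code might look like this. It assumes the redis-py client, a consumer group already created with XGROUP CREATE as shown earlier, and the default (RESP2) reply shape of a list of [stream, entries] pairs; consume_batch is a hypothetical helper.

```python
def consume_batch(client, handler, stream="orders", group="analytics-group",
                  consumer="consumer-1"):
    """Read up to 10 new entries for this consumer, handle them, then XACK each."""
    # client is assumed to be redis.Redis(decode_responses=True)
    response = client.xreadgroup(group, consumer, {stream: ">"},
                                 count=10, block=5000)
    handled = 0
    for _stream_name, entries in response or []:
        for entry_id, fields in entries:
            handler(fields)
            # Acknowledge only after the handler returned without raising
            client.xack(stream, group, entry_id)
            handled += 1
    return handled
```

If the handler raises, the entry is never acknowledged and stays in the group's pending list, where it can be inspected or claimed by another consumer.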

Comparison Table

Feature	RabbitMQ	Kafka	SQS	Redis Streams
Model	Push (broker routes)	Pull (consumer reads log)	Pull (poll-based)	Pull (consumer groups)
Ordering	Per-queue FIFO	Per-partition only	FIFO queues only	Per-stream FIFO
Delivery	At-least-once (ACK)	At-least-once (offsets)	At-least-once / exactly-once (FIFO)	At-least-once (XACK)
Replay	No (deleted after ACK)	Yes (retained by time/size)	No (deleted after delete)	Yes (retained by size/time)
Throughput	~50-80K msg/sec	Millions msg/sec	Nearly unlimited (managed)	~100K-500K msg/sec
Priority	Yes (native)	No	No	No
Routing	Exchanges (direct, topic, fanout, headers)	Topics only	Queue per use case	Stream key only
Message size	Configurable cap (128MB default in 3.x)	1MB default (configurable)	256KB	~100MB (memory bound)
Retention	Until consumed	Configurable (days/size/forever)	4 days default (max 14)	Configurable (count/memory)
Ops overhead	Medium (clustering is fragile)	High (brokers + ZK/KRaft)	Zero (managed)	Low (use existing Redis)
Cost model	Self-hosted or CloudAMQP	Self-hosted or Confluent Cloud	Pay per million requests	Self-hosted (part of Redis)

Decision Flowchart

When an interviewer asks "which queue would you use?" — don't just pick Kafka because it's trendy. Walk through this:

Need message replay or event sourcing?
  → YES → Kafka (or Redis Streams for small scale)
  → NO  ↓

Need message priority?
  → YES → RabbitMQ
  → NO  ↓

Need complex routing (topic patterns, header-based)?
  → YES → RabbitMQ
  → NO  ↓

Want zero ops burden?
  → YES → SQS (+SNS for fan-out)
  → NO  ↓

Already running Redis, < 100K msg/min?
  → YES → Redis Streams
  → NO  ↓

High throughput, multiple consumer groups?
  → YES → Kafka
  → NO  → SQS (simplest default)
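
If it helps to rehearse, the flowchart collapses into one function. This is just the chart above written as code; choose_queue and its parameter names are made up, and treating "small scale" as "already running Redis, under 100K msg/min" is an assumption borrowed from the later branch.

```python
def choose_queue(*, needs_replay=False, needs_priority=False,
                 needs_complex_routing=False, wants_zero_ops=False,
                 has_redis=False, msgs_per_minute=0,
                 high_throughput_multi_group=False) -> str:
    """Walk the decision flowchart top to bottom; the first match wins."""
    small_scale = has_redis and msgs_per_minute < 100_000  # assumption, see lead-in

    if needs_replay:
        return "Redis Streams" if small_scale else "Kafka"
    if needs_priority or needs_complex_routing:
        return "RabbitMQ"
    if wants_zero_ops:
        return "SQS"  # add SNS in front for fan-out
    if small_scale:
        return "Redis Streams"
    if high_throughput_multi_group:
        return "Kafka"
    return "SQS"  # simplest default


print(choose_queue(needs_priority=True))                            # RabbitMQ
print(choose_queue(needs_replay=True, msgs_per_minute=5_000_000))   # Kafka
print(choose_queue())                                               # SQS
```

The order of the checks matters: a system that needs both replay and priority lands on Kafka here, and you would then have to build priority yourself.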

When to Use Each — Real-World Examples

RabbitMQ fits when:

Email sending queue with priority (transactional emails jump ahead of marketing blasts)
Task distribution where the broker should round-robin across workers
Complex event routing where producers shouldn't know about consumers

Kafka fits when:

Event sourcing (order-created, order-paid, order-shipped — the full audit trail)
Multiple teams consuming the same event stream independently
Real-time + batch processing from the same data (Lambda architecture)
Change data capture (Debezium streams database changes to Kafka)

SQS fits when:

Background job processing (image resizing, PDF generation)
Decoupling microservices on AWS without managing infrastructure
Workloads with unpredictable spikes (standard queues absorb them with no capacity planning)

Redis Streams fits when:

You already have Redis and need lightweight streaming
Chat messages, activity feeds, notification queues
Prototyping before committing to Kafka

Patterns for System Design Interviews

Pattern 1: Async Task Processing

[Web Server] → [SQS Queue] → [Worker Fleet]
                    ↓ (after 3 failures)
              [Dead Letter Queue] → [Alert + Manual Review]

User uploads a video. Web server pushes a message to SQS. Worker fleet pulls, transcodes, stores. If a message fails 3 times, it hits the DLQ. An alarm triggers. Someone investigates. This is the most common messaging pattern in the industry.

Pattern 2: Event-Driven Microservices

[Order Service] → [Kafka: order-events]
                       ↓         ↓          ↓
              [Inventory]  [Billing]  [Analytics]
              (group: inv) (group: bill) (group: data)

Each service has its own consumer group. Each reads the same events independently. If analytics is down for an hour, it catches up by reading from its last committed offset. Nothing is lost as long as the outage is shorter than the topic's retention period. Uber and Netflix both use this pattern extensively.

Pattern 3: Priority Task Queue

[API Server] → [RabbitMQ Exchange] → [Priority Queue]
                                     priority 9: critical alerts
                                     priority 5: standard jobs
                                     priority 1: batch reports
                                         ↓
                                    [Worker Pool]

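Workers take the highest-priority ready message first, so a batch report won't block a critical security alert. Keep the consumer prefetch count low, though: messages already handed to a worker are no longer reordered by the broker. You can't do this with Kafka or SQS without maintaining separate queues and custom consumer logic.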
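
For comparison, the "separate queues and custom consumer logic" workaround usually boils down to strict-priority polling. A sketch with in-memory deques standing in for three SQS queues or Kafka topics (next_job and the lane names are hypothetical):

```python
from collections import deque


def next_job(queues_by_priority):
    """Return the next job, always draining higher-priority queues first.

    queues_by_priority is a list of (name, deque) ordered most to least urgent.
    """
    for name, queue in queues_by_priority:
        if queue:
            return name, queue.popleft()
    return None


critical = deque(["security-alert"])
standard = deque(["resize-image"])
batch = deque(["monthly-report"])
lanes = [("critical", critical), ("standard", standard), ("batch", batch)]

first = next_job(lanes)
critical.append("disk-full-alert")  # arrives while standard work is still waiting
second = next_job(lanes)
third = next_job(lanes)

print(first, second, third)
```

Strict polling like this can starve the batch lane under sustained load, which is one more thing you own when the broker doesn't do priority for you.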

Trade-offs Table

Trade-off	Choose A	Choose B
Replay vs Simplicity	Kafka (replay, audit trail)	RabbitMQ (simpler, messages gone after ACK)
Throughput vs Routing	Kafka (millions/sec, no routing)	RabbitMQ (lower throughput, rich routing)
Control vs Ops	Self-hosted Kafka/RabbitMQ (full control)	SQS (zero ops, less flexibility)
Latency vs Durability	RabbitMQ (1ms, in-memory delivery)	Kafka (5-10ms, always written to disk)
Single consumer vs Multi	SQS (one consumer per message)	Kafka (multiple consumer groups)
Priority vs Parallelism	RabbitMQ (priority queues)	Kafka (partition-level parallelism)

Interview Gotchas

Gotcha 1: Kafka Is Not a Queue

If you say "we'll use Kafka as a message queue," you're already confused. Kafka is a log. Messages aren't deleted after consumption. Multiple consumer groups can read the same data. This changes your architecture fundamentally — embrace it, don't fight it.

Gotcha 2: SQS Doesn't Guarantee Ordering (Standard)

Standard SQS queues are best-effort ordering only. If you need strict FIFO, use FIFO queues. Without high throughput mode they cap at 300 API calls per second, or 3,000 msg/sec with batching. Interviewers love asking about this gap.

Gotcha 3: RabbitMQ Clustering Is Not Kafka Replication

RabbitMQ can cluster, but it's not designed for the kind of horizontal scaling Kafka does. A RabbitMQ cluster is primarily for HA, not throughput. If you need 500K msg/sec, RabbitMQ will struggle even with clustering.

Gotcha 4: Kafka Partition Count Is (Mostly) Permanent

You can add partitions but you can't remove them. And adding partitions changes key-to-partition mapping. Plan your partition count carefully upfront, because changing it later means reprocessing or data skew.

Gotcha 5: Don't Default to Kafka

In every queue-related interview question, candidates default to Kafka. It's the safe-sounding answer. But if the problem is "send emails in the background," you don't need a distributed commit log with ZooKeeper. You need SQS or even a simple database-backed queue. Show you can pick the right tool, not the fanciest one.