RabbitMQ vs Kafka vs SQS — Choosing the Right Queue
TL;DR
RabbitMQ routes messages smartly so consumers stay dumb. Kafka keeps a dumb log so consumers can be smart. SQS lets you skip the infrastructure argument entirely. Pick wrong and you'll spend six months migrating.
What It Is

A message queue sits between producers and consumers. Producer drops a message in, consumer picks it out, work gets done. Sounds simple. The devil is in the details — and the details differ wildly between these three.
The core tension: how much intelligence lives in the broker vs the consumer?
RabbitMQ is a traditional message broker. It routes, filters, prioritizes, and delivers. The broker does the thinking. Consumers just receive and process.
Kafka is a distributed commit log. It appends events, stores them for days or weeks, and lets consumers pull at their own pace. The broker is a glorified append-only file. Consumers track their own position.
SQS is AWS saying "you don't need to manage any of this." Fully managed, pay-per-message, invisible infrastructure. You trade control for operational simplicity.
LinkedIn built Kafka because the message brokers available at the time couldn't keep up with its event throughput. But most companies aren't LinkedIn. Shopify ran RabbitMQ for years and it worked fine. The right choice depends on your actual problem, not your aspirations.
RabbitMQ — The Smart Broker
RabbitMQ implements AMQP (Advanced Message Queuing Protocol). It's a traditional message broker with rich routing capabilities that date back to enterprise messaging patterns from the 2000s.
Exchanges and Routing
Messages don't go directly to queues. They go to exchanges. Exchanges route to queues based on rules. This indirection is what gives RabbitMQ its flexibility.
Four exchange types:
Direct Exchange:
routing_key = "payment.success"
→ routes to queue bound with key "payment.success"
→ one-to-one mapping, like addressing a letter
Fanout Exchange:
→ routes to ALL bound queues, ignores routing key
→ broadcast pattern: order placed → notify inventory, billing, shipping simultaneously
Topic Exchange:
routing_key = "order.us.premium"
→ queue bound with "order.us.*" receives it
→ queue bound with "order.#" receives it
→ queue bound with "order.eu.*" does NOT
→ wildcard matching, the most flexible option
Headers Exchange:
→ routes based on message headers, not routing key
→ rarely used, but useful for complex routing logic
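The topic-exchange wildcard rules are easy to get wrong from memory, so here is a minimal sketch of the matching logic in plain Python. It is not RabbitMQ's implementation; the function name `topic_matches` is made up for illustration. It assumes the usual AMQP semantics: `*` stands for exactly one dot-separated word, `#` for zero or more.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if a topic binding pattern matches a routing key.

    '*' matches exactly one dot-separated word; '#' matches zero or more.
    """
    def match(p: list, k: list) -> bool:
        if not p:
            # Pattern exhausted: match only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' either absorbs nothing, or absorbs one word and stays active.
            return match(rest, k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


for binding in ("order.us.*", "order.#", "order.eu.*"):
    print(binding, topic_matches(binding, "order.us.premium"))
```

The loop reproduces the three bindings from the topic exchange example: the first two match `order.us.premium`, the third does not.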
Per-Message Acknowledgment
RabbitMQ tracks which messages have been delivered and acknowledged. Consumer processes a message, sends an ACK back, broker removes it from the queue. If the consumer crashes mid-processing, the message gets redelivered.
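import pika

# Assumes a broker on localhost and an application-defined process_order()
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Consumer with manual acknowledgment
def callback(ch, method, properties, body):
    try:
        process_order(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Reject and requeue (a message that always fails will loop;
        # pair this with a dead-letter exchange in production)
        ch.basic_nack(delivery_tag=method.delivery_tag,
                      requeue=True)

channel.basic_consume(queue='orders', on_message_callback=callback)
channel.start_consuming()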
This per-message tracking is what makes RabbitMQ a true queue: once a message is consumed and acknowledged, it's gone. No replay, no going back.
Message Priority
RabbitMQ supports priority queues natively. You declare a queue with x-max-priority, and higher-priority messages jump ahead. Kafka has no concept of priority. SQS doesn't either.
# Declare a priority queue (highest priority: 10)
# Assumes `import pika` and an open `channel`
channel.queue_declare(
    queue='tasks',
    arguments={'x-max-priority': 10}
)
# Publish with priority
channel.basic_publish(
    exchange='',
    routing_key='tasks',
    body='urgent-report',
    properties=pika.BasicProperties(priority=9)
)
This matters for job scheduling. If you need critical alerts to jump ahead of batch reports in the same queue, RabbitMQ handles it natively. With Kafka, you'd need separate topics and consumer logic.
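To make the ordering concrete, here is a toy model of a priority queue built on `heapq`. It is a sketch of the observable behavior only (higher priority first, FIFO among equal priorities, priorities above the maximum clamped to it), not how RabbitMQ stores messages. The class name `PriorityBuffer` is hypothetical.

```python
import heapq
import itertools


class PriorityBuffer:
    """In-memory stand-in for a queue declared with x-max-priority."""

    def __init__(self, max_priority: int = 10):
        self.max_priority = max_priority
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def publish(self, body: str, priority: int = 0) -> None:
        # Priorities above the declared maximum are treated as the maximum.
        priority = min(priority, self.max_priority)
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), body))

    def get(self):
        """Return the next message body, or None if the buffer is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


tasks = PriorityBuffer(max_priority=10)
tasks.publish("batch-report", priority=1)
tasks.publish("standard-job", priority=5)
tasks.publish("urgent-report", priority=9)
print(tasks.get())  # the priority-9 message comes out first
```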
RabbitMQ Strengths
- Complex routing without code changes (just rebind queues)
- Message priority for mixed-urgency workloads
- Per-message acknowledgment and redelivery
- Low latency for individual messages (~1ms)
- Mature ecosystem with every language supported
RabbitMQ Weaknesses
- Messages deleted after consumption — no replay
- Throughput tops out around 50K-80K msg/sec per node (lower with large messages or persistence enabled)
- Clustering is painful and partition-prone
- Not built for event sourcing or stream processing
Kafka — The Distributed Log
Kafka isn't a queue. Stop calling it one. It's a distributed, partitioned, replicated commit log. This distinction matters because it changes what you can do with it.
The Log Abstraction
Every Kafka topic is divided into partitions. Each partition is an append-only, ordered, immutable sequence of records. Records are assigned sequential offsets.
Topic: user-events
Partition 0: [offset 0][offset 1][offset 2]...[offset 9847]
Partition 1: [offset 0][offset 1]...[offset 7213]
Partition 2: [offset 0][offset 1]...[offset 11402]
Consumer group "analytics":
Consumer A reads Partition 0 (currently at offset 9500)
Consumer B reads Partition 1 (currently at offset 7200)
Consumer C reads Partition 2 (currently at offset 11000)
Consumers pull messages. They control their own offset, which is their position in the log. The broker doesn't track per-message acknowledgments. It stores data, serves reads, and keeps only each consumer group's last committed offset.
This is the fundamental difference. RabbitMQ pushes and tracks. Kafka stores and serves.
Why the Log Model Wins for Streaming
Replay. A new analytics service joins the company? Point it at offset 0 and replay every event from the beginning. Try that with RabbitMQ — the messages are gone.
Multiple consumers. Five teams need the same event stream? Five consumer groups, each with their own offset. No message duplication on the broker side. RabbitMQ requires fanout exchanges and copies.
Retention. Configure Kafka to keep messages for 7 days, 30 days, or forever. It's a database of events. Uber keeps certain Kafka topics for months for auditing.
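The replay and multiple-consumer points both fall out of one idea: the log keeps the records, and each group only owns a number. A minimal in-memory sketch of a single partition, assuming nothing about Kafka's real storage or client API (`PartitionLog` and its methods are invented for illustration):

```python
class PartitionLog:
    """One append-only partition with an independent offset per consumer group."""

    def __init__(self):
        self._records = []
        self._committed = {}  # group name -> next offset to read

    def append(self, value) -> int:
        """Append a record and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10):
        """Return [(offset, value), ...] starting at the group's committed offset."""
        start = self._committed.get(group, 0)
        batch = self._records[start:start + max_records]
        return list(enumerate(batch, start=start))

    def commit(self, group: str, next_offset: int) -> None:
        self._committed[group] = next_offset

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or fast-forward) a group; the records themselves never move."""
        self._committed[group] = offset


log = PartitionLog()
for event in ("order-created", "order-paid", "order-shipped"):
    log.append(event)

log.commit("analytics", 3)   # analytics has processed everything
print(log.poll("analytics"))  # nothing new for analytics
print(log.poll("billing"))    # billing still sees all three records
log.seek("analytics", 0)     # replay from the beginning
print(log.poll("analytics"))
```

Reading never deletes anything, which is why a new group can start from offset 0 long after the events were produced, as long as retention has not expired them.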
Partitions and Ordering
Kafka guarantees ordering within a partition, not across partitions. If you need all events for user 12345 processed in order, you must ensure they land in the same partition.
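from kafka import KafkaProducer  # kafka-python client (assumed)

# Assumes a broker reachable at localhost:9092
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Partition key ensures all events for the same user
# go to the same partition
producer.send(
    'user-events',
    key=b'user-12345',  # partition key
    value=b'{"action": "purchase", "amount": 99.99}'
)
producer.flush()  # block until buffered records are sent
# Default partitioner: murmur2("user-12345") % num_partitions
# Same key always hits the same partition while the partition count is unchanged
# Within that partition, order is guaranteed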
Gotcha
Adding partitions to a running topic changes the hash mapping. Key "user-12345" might route to a different partition after the change. Plan your partition count upfront — 12 is a reasonable starting point for most topics.
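A quick way to see the remapping problem is to hash a batch of keys against two partition counts and count how many land somewhere new. This sketch uses CRC32 as a stand-in hash (an assumption for portability; Kafka's default partitioner uses murmur2), so the exact partitions differ from a real cluster, but the modulo effect is the same.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key) % num_partitions


keys = [f"user-{i}".encode() for i in range(1000)]

# Keys whose partition changes when the topic grows from 12 to 16 partitions.
moved = [k for k in keys if partition_for(k, 12) != partition_for(k, 16)]

print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Any key in `moved` would have its older events in one partition and its newer events in another, so per-key ordering across the resize is no longer guaranteed.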
Kafka Strengths
- Millions of messages per second per cluster
- Message replay from any offset still inside the retention window
- Multiple independent consumer groups
- Built-in replication and fault tolerance
- Natural fit for event sourcing and CQRS
Kafka Weaknesses
- Operational overhead: brokers, topics, partitions, plus ZooKeeper on older clusters (KRaft replaces it in newer releases)
- No per-message priority
- No complex routing (consumers read entire topics)
- Higher latency for single messages (~5-10ms vs ~1ms)
- Overkill for simple task queues
SQS — The Managed Escape Hatch
AWS SQS exists so you never have to page someone at 3 AM because a broker node ran out of disk. No clusters to manage. No replication to configure. No capacity planning. Push messages in, pull them out.
Standard vs FIFO Queues
SQS comes in two flavors with very different guarantees:
Standard Queue:
- At-least-once delivery (may deliver duplicates)
- Best-effort ordering (NOT guaranteed)
- Nearly unlimited throughput
- Use for: work that can tolerate duplicates and disorder
FIFO Queue:
- Exactly-once processing (dedup within 5-minute window)
- Strict ordering within message groups
- 300 API calls/sec per action, or 3,000 messages/sec with batching (up to roughly 70K in high throughput mode, depending on region)
- Use for: order processing, financial transactions
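The 5-minute deduplication window on FIFO queues can be sketched as a small lookup keyed by deduplication ID. This is an illustrative model only; `FifoDedup` is a made-up class, time is passed in explicitly so the behavior is easy to follow, and real SQS does this on the service side.

```python
class FifoDedup:
    """Drops a message if the same deduplication ID was accepted within the window."""

    WINDOW_SECONDS = 300  # 5 minutes

    def __init__(self):
        self._first_seen = {}  # dedup id -> time the ID was last accepted

    def accept(self, dedup_id: str, now: float) -> bool:
        """Return True if the message should be enqueued, False if it is a duplicate."""
        seen_at = self._first_seen.get(dedup_id)
        if seen_at is not None and now - seen_at < self.WINDOW_SECONDS:
            return False
        self._first_seen[dedup_id] = now
        return True


dedup = FifoDedup()
print(dedup.accept("payment-42", now=0))    # first send is accepted
print(dedup.accept("payment-42", now=120))  # retry inside the window is dropped
print(dedup.accept("payment-42", now=301))  # after the window, it counts as new
```

The last call is the part people miss: a producer that retries after the window has passed will create a second message, so consumers still benefit from being idempotent.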
Visibility Timeout
When a consumer receives a message from SQS, the message becomes invisible to other consumers for a configurable timeout period. This keeps two workers from picking up the same job at the same time, as long as processing finishes before the timeout expires.
import boto3

sqs = boto3.client('sqs')
# Queue name is a placeholder; process_payment() is application-defined
queue_url = sqs.get_queue_url(QueueName='payments')['QueueUrl']

# Receive a message; it becomes invisible for 30 seconds
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    VisibilityTimeout=30  # seconds
)
# 'Messages' is absent from the response when the queue is empty
for message in response.get('Messages', []):
    try:
        process_payment(message['Body'])
        # Success: delete the message
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message['ReceiptHandle']
        )
    except Exception:
        # Don't delete: message reappears after 30 seconds
        # Another worker will pick it up
        pass
If the consumer crashes, the message reappears after the timeout expires. Set it too short and you get duplicate processing. Set it too long and failed messages take forever to retry.
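The timeout trade-off is easier to reason about with a model you can step through. Below is a minimal in-memory sketch of visibility-timeout behavior with an explicit clock. `VisibilityQueue` is hypothetical, and it simplifies real SQS in at least one way: the handle here is a fixed integer, whereas SQS issues a new receipt handle on every receive.

```python
class VisibilityQueue:
    """Toy queue where a received message is hidden until its timeout passes."""

    def __init__(self, visibility_timeout: float = 30):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # handle -> [body, invisible_until]
        self._next_handle = 0

    def send(self, body: str) -> None:
        self._messages[self._next_handle] = [body, 0.0]
        self._next_handle += 1

    def receive(self, now: float):
        """Return (handle, body) for the first visible message, or None."""
        for handle, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # hide it from other workers
                return handle, entry[0]
        return None

    def delete(self, handle: int) -> None:
        self._messages.pop(handle, None)


queue = VisibilityQueue(visibility_timeout=30)
queue.send("resize-image-17")
print(queue.receive(now=0))   # worker A gets the message
print(queue.receive(now=10))  # worker B sees nothing: still invisible
print(queue.receive(now=30))  # A never deleted it, so it is redelivered
```

If worker A is merely slow rather than dead, the third call is exactly the duplicate-processing case: both workers now hold the same job.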
Dead Letter Queues
Once a message has been received N times without being deleted (maxReceiveCount in the redrive policy), SQS moves it to the configured dead letter queue. This prevents poison pills from blocking your queue forever.
{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456:orders-dlq",
    "maxReceiveCount": 3
  }
}
Every production SQS queue needs a DLQ. No exceptions. If you skip this, a single malformed message can clog your pipeline for hours while it retries endlessly.
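One detail that trips people up when wiring this through the API: the `RedrivePolicy` queue attribute is passed as a JSON string, not a nested object. A small helper sketch follows; the function name `redrive_attributes` is an assumption for illustration, and the ARN is the placeholder from the example above.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Build the queue attributes dict that attaches a dead letter queue."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS attribute values are strings, so the policy is serialized to JSON.
    return {"RedrivePolicy": json.dumps(policy)}


attributes = redrive_attributes("arn:aws:sqs:us-east-1:123456:orders-dlq", 3)
print(attributes)
```

With boto3, the result would typically be passed as `sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)`, where `queue_url` points at the source queue rather than the DLQ.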
SNS + SQS — Fan-Out Pattern
SNS is the pub/sub complement to SQS. One SNS topic can push to multiple SQS queues, Lambda functions, HTTP endpoints, or email addresses simultaneously.
Order Placed event
→ SNS Topic: "order-events"
→ SQS Queue: "inventory-updates" (inventory service)
→ SQS Queue: "billing-jobs" (billing service)
→ SQS Queue: "shipping-prep" (shipping service)
→ Lambda: "analytics-ingest" (data pipeline)
This is AWS's answer to Kafka's multiple consumer groups. The difference: SNS copies the message to each subscriber. Kafka stores one copy and lets multiple groups read it independently. At small scale, the difference doesn't matter. At millions of messages per day, the SNS approach costs more.
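The copy-per-subscriber behavior can be shown in a few lines. This is a toy in-memory model, not the SNS API; `FanoutTopic` is an invented name and the queues are plain deques.

```python
from collections import deque


class FanoutTopic:
    """Toy pub/sub topic that delivers an independent copy to every subscriber."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, name: str) -> deque:
        queue = deque()
        self._queues[name] = queue
        return queue

    def publish(self, message: dict) -> int:
        """Copy the message into every subscribed queue; return the copy count."""
        for queue in self._queues.values():
            queue.append(dict(message))  # one copy per subscriber
        return len(self._queues)


topic = FanoutTopic()
inventory = topic.subscribe("inventory-updates")
billing = topic.subscribe("billing-jobs")
shipping = topic.subscribe("shipping-prep")

copies = topic.publish({"event": "order-placed", "order_id": 1001})
print(copies, len(inventory), len(billing), len(shipping))
```

Storage and delivery work grow with the number of subscribers here, whereas a log-based design stores each record once and lets every group keep its own offset.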
Redis Streams — The Forgotten Middle Ground
Here's a spicy take: for most startups, Redis Streams is the right choice and nobody picks it.
Redis Streams gives you a Kafka-like append-only log with consumer groups, acknowledgment, and replay — inside the Redis you're already running. No new infrastructure. No new operational playbook.
# Producer: append to stream
XADD orders * user_id 1001 product "Widget" amount 29.99
# Consumer group: create and read
XGROUP CREATE orders analytics-group 0
XREADGROUP GROUP analytics-group consumer-1 COUNT 10 BLOCK 5000 STREAMS orders >
# Acknowledge processed messages
XACK orders analytics-group 1681234567890-0
Redis Streams won't replace Kafka at Netflix scale. But if you have fewer than 100K messages per minute and you already have Redis, adding Kafka is adding complexity for bragging rights.
Comparison Table
| Feature | RabbitMQ | Kafka | SQS | Redis Streams |
|---|---|---|---|---|
| Model | Push (broker routes) | Pull (consumer reads log) | Pull (poll-based) | Pull (consumer groups) |
| Ordering | Per-queue FIFO | Per-partition only | FIFO queues only | Per-stream FIFO |
| Delivery | At-least-once (ACK) | At-least-once (offsets) | At-least-once / exactly-once (FIFO) | At-least-once (XACK) |
| Replay | No (deleted after ACK) | Yes (retained by time/size) | No (gone once deleted) | Yes (retained by size/time) |
| Throughput | ~50-80K msg/sec | Millions msg/sec | Nearly unlimited (managed) | ~100K-500K msg/sec |
| Priority | Yes (native) | No | No | No |
| Routing | Exchanges (direct, topic, fanout, headers) | Topics only | Queue per use case | Stream key only |
| Message size | Configurable cap (128MB default in 3.x, lower in 4.x) | 1MB default (configurable) | 256KB | ~100MB (memory bound) |
| Retention | Until consumed | Configurable (days/size/forever) | 4 days default (max 14) | Configurable (count/memory) |
| Ops overhead | Medium (clustering is fragile) | High (brokers + ZK/KRaft) | Zero (managed) | Low (use existing Redis) |
| Cost model | Self-hosted or CloudAMQP | Self-hosted or Confluent Cloud | Pay per million requests | Self-hosted (part of Redis) |
Decision Flowchart
When an interviewer asks "which queue would you use?" — don't just pick Kafka because it's trendy. Walk through this:
Need message replay or event sourcing?
→ YES → Kafka (or Redis Streams for small scale)
→ NO ↓
Need message priority?
→ YES → RabbitMQ
→ NO ↓
Need complex routing (topic patterns, header-based)?
→ YES → RabbitMQ
→ NO ↓
Want zero ops burden?
→ YES → SQS (+SNS for fan-out)
→ NO ↓
Already running Redis, < 100K msg/min?
→ YES → Redis Streams
→ NO ↓
High throughput, multiple consumer groups?
→ YES → Kafka
→ NO → SQS (simplest default)
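The flowchart reads naturally as a chain of early returns. Here is one way to write it down, purely as a memory aid. The function and parameter names are invented, and the 100K msg/min threshold is the one quoted in the flowchart.

```python
def choose_queue(
    *,
    needs_replay: bool = False,
    needs_priority: bool = False,
    needs_complex_routing: bool = False,
    wants_zero_ops: bool = False,
    has_redis: bool = False,
    msgs_per_minute: int = 0,
    high_throughput_multi_group: bool = False,
) -> str:
    """Walk the decision flowchart top to bottom and return the first match."""
    if needs_replay:
        return "Kafka"  # or Redis Streams at small scale
    if needs_priority or needs_complex_routing:
        return "RabbitMQ"
    if wants_zero_ops:
        return "SQS"  # add SNS for fan-out
    if has_redis and msgs_per_minute < 100_000:
        return "Redis Streams"
    if high_throughput_multi_group:
        return "Kafka"
    return "SQS"  # simplest default


print(choose_queue(needs_priority=True))
print(choose_queue(has_redis=True, msgs_per_minute=20_000))
print(choose_queue())
```

The order of the checks matters: a system that needs both replay and priority returns Kafka here, which is a prompt to talk through the trade-off rather than a final answer.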
When to Use Each — Real-World Examples
RabbitMQ fits when:
- Email sending queue with priority (transactional emails jump ahead of marketing blasts)
- Task distribution where the broker should round-robin across workers
- Complex event routing where producers shouldn't know about consumers
Kafka fits when:
- Event sourcing (order-created, order-paid, order-shipped — the full audit trail)
- Multiple teams consuming the same event stream independently
- Real-time + batch processing from the same data (Lambda architecture)
- Change data capture (Debezium streams database changes to Kafka)
SQS fits when:
- Background job processing (image resizing, PDF generation)
- Decoupling microservices on AWS without managing infrastructure
- Workloads with unpredictable spikes (standard queues absorb bursts without capacity planning)
Redis Streams fits when:
- You already have Redis and need lightweight streaming
- Chat messages, activity feeds, notification queues
- Prototyping before committing to Kafka
Patterns for System Design Interviews
Pattern 1: Async Task Processing
[Web Server] → [SQS Queue] → [Worker Fleet]
                   ↓ (after 3 failures)
          [Dead Letter Queue] → [Alert + Manual Review]
User uploads a video. Web server pushes a message to SQS. Worker fleet pulls, transcodes, stores. If a message fails 3 times, it hits the DLQ. An alarm triggers. Someone investigates. This is the most common messaging pattern in the industry.
Pattern 2: Event-Driven Microservices
[Order Service] → [Kafka: order-events]
                    ↓             ↓              ↓
               [Inventory]    [Billing]     [Analytics]
               (group: inv)   (group: bill) (group: data)
Each service has its own consumer group. Each reads the same events independently. If analytics is down for an hour, it catches up by reading from its last committed offset. No data lost. Uber and Netflix both use this pattern extensively.
Pattern 3: Priority Task Queue
[API Server] → [RabbitMQ Exchange] → [Priority Queue]
                                       priority 9: critical alerts
                                       priority 5: standard jobs
                                       priority 1: batch reports
                                            ↓
                                       [Worker Pool]
Workers always grab the highest-priority message first. A batch report generation won't block a critical security alert. You can't do this with Kafka or SQS without maintaining separate queues and custom consumer logic.
Trade-offs Table
| Trade-off | Choose A | Choose B |
|---|---|---|
| Replay vs Simplicity | Kafka (replay, audit trail) | RabbitMQ (simpler, messages gone after ACK) |
| Throughput vs Routing | Kafka (millions/sec, no routing) | RabbitMQ (lower throughput, rich routing) |
| Control vs Ops | Self-hosted Kafka/RabbitMQ (full control) | SQS (zero ops, less flexibility) |
| Latency vs Durability | RabbitMQ (1ms, in-memory delivery) | Kafka (5-10ms, always written to disk) |
| Single consumer vs Multi | SQS (one consumer per message) | Kafka (multiple consumer groups) |
| Priority vs Parallelism | RabbitMQ (priority queues) | Kafka (partition-level parallelism) |

Interview Gotchas
Gotcha 1: Kafka Is Not a Queue
If you say "we'll use Kafka as a message queue," you're already confused. Kafka is a log. Messages aren't deleted after consumption. Multiple consumer groups can read the same data. This changes your architecture fundamentally — embrace it, don't fight it.
Gotcha 2: SQS Doesn't Guarantee Ordering (Standard)
Standard SQS queues are best-effort ordering only. If you need strict FIFO, use FIFO queues, but without high throughput mode they cap at 300 msg/sec per API action (3,000 with batching). Interviewers love asking about this gap.
Gotcha 3: RabbitMQ Clustering Is Not Kafka Replication
RabbitMQ can cluster, but it's not designed for the kind of horizontal scaling Kafka does. A RabbitMQ cluster is primarily for HA, not throughput. If you need 500K msg/sec, RabbitMQ will struggle even with clustering.
Gotcha 4: Kafka Partition Count Is (Mostly) Permanent
You can add partitions but you can't remove them. And adding partitions changes key-to-partition mapping. Plan your partition count carefully upfront, because changing it later means reprocessing or data skew.
Gotcha 5: Don't Default to Kafka
In every queue-related interview question, candidates default to Kafka. It's the safe-sounding answer. But if the problem is "send emails in the background," you don't need a distributed commit log with ZooKeeper. You need SQS or even a simple database-backed queue. Show you can pick the right tool, not the fanciest one.