
Queue Architectures

TL;DR

Four queue technologies dominate system design: SQS (managed, nearly infinite scale, zero ops), RabbitMQ (rich routing with exchanges, priority queues, ACK-based delivery), Redis/BullMQ (fast and simple with a built-in dashboard, but memory-limited), and Kafka (500K+ msg/s, persistent log you can replay, but no per-message ACK, no priority queues, no delayed delivery). Pick based on your actual constraints, not hype.


The Core Pattern

Every queue-based system follows the same skeleton. The technology choice changes the capabilities, not the shape.

Producer-queue-consumer pattern: API servers publishing to queue, workers consuming

Producer sends a message. Queue stores it durably. Consumer picks it up, processes it, and acknowledges completion. If the consumer crashes before acknowledging, the message goes back to the queue.

That's the contract. Everything else is implementation detail.
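
To make the contract concrete, here is a minimal in-memory sketch. The class and method names (`InMemoryQueue`, `requeue_unacked`) are made up for illustration and do not belong to any real broker; the point is only that a message leaves the system on acknowledgment, not on delivery.

```python
from collections import deque


class InMemoryQueue:
    """Toy queue: a message is removed only once it is acknowledged."""

    def __init__(self):
        self._ready = deque()   # messages waiting for a consumer
        self._in_flight = {}    # delivery_id -> message handed out, not yet acked
        self._next_id = 0

    def send(self, body):
        self._ready.append(body)

    def receive(self):
        """Hand out the next message, or None if nothing is ready."""
        if not self._ready:
            return None
        body = self._ready.popleft()
        self._next_id += 1
        self._in_flight[self._next_id] = body
        return self._next_id, body

    def ack(self, delivery_id):
        """Consumer finished: forget the message for good."""
        del self._in_flight[delivery_id]

    def requeue_unacked(self):
        """Simulate a consumer crash: unacked messages return to the queue."""
        self._ready.extend(self._in_flight.values())
        self._in_flight.clear()


queue = InMemoryQueue()
queue.send("job-1")
delivery_id, body = queue.receive()             # consumer A takes job-1 ...
queue.requeue_unacked()                         # ... and crashes before acking
redelivered_id, redelivered = queue.receive()   # consumer B gets it again
queue.ack(redelivered_id)                       # now it is really gone
```

The same job is delivered twice here, which is why every technology in this article is described as at-least-once and why consumers should be idempotent.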


The Big Four: Head-to-Head

| Feature | SQS | RabbitMQ | Redis / BullMQ | Kafka |
|---|---|---|---|---|
| Type | Managed service | Self-hosted broker | In-memory store + lib | Distributed log |
| Throughput | ~120K msg/s | ~20-50K msg/s | ~100K+ msg/s | 500K+ msg/s |
| Delivery guarantee | At-least-once | At-least-once (ACK) | At-least-once | At-least-once |
| Ordering | FIFO queues only | Per-queue | Per-queue | Per-partition |
| Priority queues | No | Yes (0-255 levels) | Yes (via BullMQ) | No |
| Delayed delivery | Yes (up to 15 min) | Yes (via plugin/TTL) | Yes (native in BullMQ) | No |
| Dead letter queue | Yes (native) | Yes (native) | Yes (via BullMQ) | Manual (topic redirect) |
| Message replay | No (deleted on ACK) | No (deleted on ACK) | No (deleted on ACK) | Yes (retained by offset) |
| Per-message ACK | Yes | Yes | Yes | No (offset-based) |
| Ops burden | Zero (AWS managed) | Medium (clustering) | Low-Medium | High (ZooKeeper/KRaft) |
| Best for | General job queues | Complex routing | Fast jobs + dashboard | Event streaming |

SQS: The "Just Works" Queue

Amazon SQS is the default choice when you're on AWS and need a job queue with zero operational overhead.

How It Works

SQS visibility timeout: message becomes invisible during processing

Visibility Timeout: The Key Concept

When a worker receives a message, SQS doesn't delete it -- it hides it for a configurable duration (the visibility timeout). If the worker finishes and deletes the message, great. If the worker crashes, the timeout expires and the message reappears for another worker.
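
The hide-then-reappear behavior can be modeled without AWS. This is a toy sketch, not how SQS is implemented: `VisibilityQueue` is a hypothetical class, the clock is passed in explicitly as `now`, and the handle stays the same across receives (real SQS issues a new receipt handle each time).

```python
class VisibilityQueue:
    """Toy model of a visibility timeout, driven by an explicit clock value."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._messages = {}     # handle -> [body, time at which it is visible again]
        self._next_handle = 0

    def send(self, body: str):
        self._next_handle += 1
        self._messages[self._next_handle] = [body, 0.0]

    def receive(self, now: float):
        """Return (handle, body) for the first visible message and hide it."""
        for handle, entry in self._messages.items():
            body, visible_at = entry
            if visible_at <= now:
                entry[1] = now + self.visibility_timeout
                return handle, body
        return None

    def delete(self, handle: int):
        """Worker finished: remove the message permanently."""
        self._messages.pop(handle, None)


q = VisibilityQueue(visibility_timeout=300)
q.send("report")
first = q.receive(now=0)       # worker A gets the message
hidden = q.receive(now=100)    # worker B sees nothing: message is invisible
again = q.receive(now=300)     # worker A never deleted it, so it reappears
q.delete(again[0])             # worker B finishes and deletes it
```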

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789/reports"

# Send a message
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"job_id": "abc123", "type": "report"}',
    DelaySeconds=0  # up to 900 seconds (15 minutes)
)

# Receive and process
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    VisibilityTimeout=300,  # 5 minutes to process
    WaitTimeSeconds=20      # long polling -- reduces empty responses
)

for msg in response.get("Messages", []):
    process(msg["Body"])
    sqs.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=msg["ReceiptHandle"]
    )

Long polling saves money

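WaitTimeSeconds=20 means "wait up to 20 seconds for a message before returning empty." Without it, SQS uses short polling: ReceiveMessage returns immediately even when the queue is empty, so a worker polling in a loop (say, every 100ms) makes a stream of empty requests, and you pay per request.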
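
A quick back-of-the-envelope check of that claim for a single idle worker. The 100ms loop interval is the figure used above; pricing is deliberately left out, so this only counts requests.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def receive_calls_per_day(seconds_between_calls: float) -> int:
    """ReceiveMessage calls one idle worker makes in a day."""
    return round(SECONDS_PER_DAY / seconds_between_calls)


short_polling = receive_calls_per_day(0.1)   # one call every 100 ms
long_polling = receive_calls_per_day(20)     # one call per 20 s long-poll wait

print(short_polling)                  # 864000
print(long_polling)                   # 4320
print(short_polling // long_polling)  # 200
```

On an empty queue, long polling cuts the request count by a factor of 200 per worker.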

SQS FIFO Queues

Standard SQS doesn't guarantee ordering. FIFO queues do, but with trade-offs:

  • Throughput cap: 300 msg/s (3,000 with batching) vs. nearly unlimited for standard
  • Exactly-once processing: Deduplication via MessageDeduplicationId (a repeat sent within the 5-minute deduplication window is dropped)
  • Message groups: Order is guaranteed within a group, parallelism across groups

Use FIFO when order matters (financial transactions). Use Standard when it doesn't (report generation, image processing).
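
A sketch of how the two FIFO fields might be filled in. `build_fifo_send_kwargs` is a hypothetical helper, and hashing the body to get the deduplication ID is one possible choice, not the only one; the dictionary keys are the parameter names boto3's `send_message` accepts.

```python
import hashlib
import json


def build_fifo_send_kwargs(queue_url: str, payload: dict, group_id: str) -> dict:
    """Build keyword arguments for boto3's sqs.send_message on a FIFO queue."""
    body = json.dumps(payload, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Messages sharing a group ID are delivered in order.
        "MessageGroupId": group_id,
        # Same body -> same ID, so an accidental resend counts as a duplicate.
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }


# Hypothetical queue URL; FIFO queue names end in ".fifo".
kwargs = build_fifo_send_kwargs(
    "https://sqs.us-east-1.amazonaws.com/123456789/payments.fifo",
    {"txn_id": "t_1", "amount": 4200},
    group_id="account_42",
)
# With a real client: sqs.send_message(**kwargs)
```

Using the account as the group ID keeps each account's transactions in order while letting different accounts be processed in parallel.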


RabbitMQ: The Routing Powerhouse

RabbitMQ shines when you need flexible message routing. Its exchange system lets you build patterns that SQS can't touch.

Exchange Types

RabbitMQ exchange types: direct, fanout, and topic routing

| Exchange Type | Routing Logic | Use Case |
|---|---|---|
| Direct | Exact match on routing key | Job type routing (pdf, csv, email) |
| Fanout | Broadcast to all bound queues | Event notifications, audit logging |
| Topic | Pattern match with * and # | Region/tier-based routing |
| Headers | Match on message headers | Complex multi-attribute routing |
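
Topic matching is the least obvious of the four, so here is a small sketch of the rule: routing keys are dot-separated words, `*` matches exactly one word, and `#` matches zero or more. `topic_matches` is an illustrative reimplementation, not RabbitMQ's own code, and the routing keys are made up.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if a dot-separated routing key matches a binding pattern.

    '*' matches exactly one word; '#' matches zero or more words.
    """
    def match(p: list, k: list) -> bool:
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb anywhere from 0 to len(k) words.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


print(topic_matches("eu.*.premium", "eu.orders.premium"))  # True
print(topic_matches("eu.*", "eu.orders.premium"))          # False
print(topic_matches("eu.#", "eu.orders.premium"))          # True
```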

Prefetch Count: Flow Control

Prefetch controls how many unacknowledged messages a worker can hold. This is critical for balancing throughput and fairness.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Each worker gets at most 5 messages at a time
channel.basic_qos(prefetch_count=5)

def callback(ch, method, properties, body):
    result = process_job(body)
    # Only ACK after successful processing
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="reports", on_message_callback=callback)
channel.start_consuming()

| Prefetch Count | Behavior |
|---|---|
| 1 | Fair dispatch (one message in flight per worker), but low throughput |
| 5-20 | Good balance for most workloads |
| 100+ | High throughput, but one slow consumer hoards messages |
| 0 (unlimited) | Consumer grabs everything -- defeats load balancing |
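
The hoarding effect is easy to see in a simplified model. This sketch assumes the broker hands queued messages to consumers in turn until each one reaches its prefetch limit; it ignores network timing and does not model the special value 0 (unlimited). `initial_dispatch` is a hypothetical function, not part of pika.

```python
def initial_dispatch(queued: int, consumers: int, prefetch: int):
    """Hand out queued messages in turn until every consumer is at its
    prefetch limit. Returns (messages held per consumer, messages left queued).
    Assumes prefetch >= 1.
    """
    held = [0] * consumers
    remaining = queued
    while remaining > 0:
        progressed = False
        for i in range(consumers):
            if remaining > 0 and held[i] < prefetch:
                held[i] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every consumer is full; the rest wait in the queue
    return held, remaining


print(initial_dispatch(queued=10, consumers=2, prefetch=1))    # ([1, 1], 8)
print(initial_dispatch(queued=10, consumers=2, prefetch=100))  # ([5, 5], 0)
```

With prefetch 1, eight messages stay in the queue for whichever worker frees up first. With prefetch 100, a slow worker already owns five of them and the fast worker cannot help.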

Priority Queues

RabbitMQ supports up to 255 priority levels. Higher priority messages get delivered first.

# Declare queue with max priority
channel.queue_declare(
    queue="jobs",
    arguments={"x-max-priority": 10}
)

# Publish with priority
channel.basic_publish(
    exchange="",
    routing_key="jobs",
    body='{"type": "payment_refund"}',
    properties=pika.BasicProperties(priority=9)  # urgent
)
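
What "higher priority first" means for delivery order, as a toy model built on Python's `heapq`. `PriorityJobQueue` is hypothetical and only mimics the ordering (highest priority first, arrival order among equals); it says nothing about how RabbitMQ stores messages internally.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Toy model: highest priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def publish(self, body: str, priority: int = 0):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), body))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]


jobs = PriorityJobQueue()
jobs.publish("analytics-rollup", priority=1)
jobs.publish("payment_refund", priority=9)
jobs.publish("thumbnail", priority=1)

delivery_order = [jobs.get() for _ in range(3)]
print(delivery_order)  # ['payment_refund', 'analytics-rollup', 'thumbnail']
```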

Redis / BullMQ: The Developer's Favorite

Redis-backed job queues (BullMQ for Node.js, RQ for Python, Sidekiq for Ruby) offer the best developer experience: simple API, built-in dashboard, and Redis speeds.

import { Queue, Worker } from "bullmq";
import Redis from "ioredis";

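// BullMQ workers require maxRetriesPerRequest: null on the ioredis connection
const connection = new Redis({ host: "localhost", port: 6379, maxRetriesPerRequest: null });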

// Create a queue
const reportQueue = new Queue("reports", { connection });

// Add a job with options
await reportQueue.add(
  "generate-pdf",
  { userId: "u_123", reportType: "annual" },
  {
    priority: 1,           // lower number = higher priority
    delay: 5000,           // wait 5s before processing
    attempts: 3,           // 3 attempts in total (first try + 2 retries)
    backoff: {
      type: "exponential",
      delay: 1000           // retry delays: 1s, then 2s
    },
    removeOnComplete: 1000, // keep last 1000 completed jobs
    removeOnFail: 5000      // keep last 5000 failed jobs
  }
);

// Process jobs
const worker = new Worker("reports", async (job) => {
  const pdf = await generateReport(job.data);
  await uploadToS3(pdf);
  return { url: pdf.url };
}, { connection, concurrency: 5 });
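
To see what retry schedule the `attempts` and `backoff` options imply, here is a small calculation. It assumes the exponential strategy doubles the base delay on each retry (delay × 2^(retry − 1)) and that `attempts` counts the first try; treat both as assumptions to check against the BullMQ version you run. `exponential_backoff_delays` is a hypothetical helper.

```python
def exponential_backoff_delays(attempts: int, base_delay_ms: int) -> list:
    """Delays before each retry, assuming delay * 2**(retry_number - 1).

    `attempts` counts the first try too, so there are attempts - 1 retries.
    """
    return [base_delay_ms * 2 ** (retry - 1) for retry in range(1, attempts)]


print(exponential_backoff_delays(attempts=3, base_delay_ms=1000))  # [1000, 2000]
print(exponential_backoff_delays(attempts=5, base_delay_ms=1000))  # [1000, 2000, 4000, 8000]
```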

BullMQ Dashboard

Bull Board, an open-source UI that plugs into BullMQ, gives you a real-time view of active, waiting, completed, and failed jobs. During incidents, this visibility is worth its weight in gold compared to staring at SQS metrics in CloudWatch.

The Memory Ceiling

Redis stores everything in RAM. A million messages at 1KB each = 1GB of memory. For most job queues this is fine, but if your queue can grow unbounded during an outage, you need to plan for it.

# Redis memory limit
maxmemory 2gb
# CRITICAL: never evict queue data
maxmemory-policy noeviction

Never use allkeys-lru with job queues

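If Redis runs out of memory with an LRU eviction policy, it will silently drop your job messages. With noeviction, writes fail with an error once the limit is reached, so producers see the failure instead of losing jobs. Always use noeviction and set up memory alerts.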
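
Planning for the outage case comes down to simple arithmetic. The sketch below estimates how much payload piles up while workers are down; the load figures are hypothetical, and the result ignores Redis per-key and per-job metadata, so read it as a lower bound.

```python
def backlog_memory_bytes(jobs_per_second: float, outage_seconds: float,
                         bytes_per_job: int) -> int:
    """Payload bytes that pile up while no worker is consuming.

    Ignores Redis per-key and per-job metadata overhead, so treat the
    result as a lower bound.
    """
    return round(jobs_per_second * outage_seconds * bytes_per_job)


# Hypothetical load: 500 jobs/s, 1 KB payloads, 30-minute worker outage.
backlog = backlog_memory_bytes(500, 30 * 60, 1024)
print(backlog)                        # 921600000
print(round(backlog / 1024 ** 3, 2))  # 0.86 (GiB)
```

Under these assumptions a 30-minute outage already uses close to half of a 2gb maxmemory budget, which is the kind of number to alert on.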


Kafka: The Distributed Log

Kafka is fundamentally different from the others. It's not a job queue -- it's a persistent, ordered, replayable event log. People use it as a job queue, but they should understand what they're giving up.

How Kafka Actually Works

Kafka partitions: topic split into ordered partitions with consumer group

Key concepts:

  • Topics are split into partitions
  • Each partition is an ordered, append-only log
  • Consumer groups divide partitions among consumers -- each partition goes to exactly one consumer in the group
  • Consumers track their offset (position in the log) -- they can rewind and replay
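
The concepts above can be sketched in a few lines. Two loud assumptions: `zlib.crc32` stands in for the hash (Kafka's default partitioner uses a different function, so real partition numbers will differ), and the assignment is a plain round-robin rather than any of Kafka's actual assignor strategies. Both function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. crc32 is a stand-in hash for this sketch;
    Kafka's default partitioner uses a different hash function."""
    return zlib.crc32(key.encode()) % num_partitions


def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Spread partitions over a consumer group, one owner per partition."""
    assignment = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Same key always lands on the same partition, which is what gives ordering.
same_partition = partition_for("user_123", 6) == partition_for("user_123", 6)

print(assign_partitions(4, ["worker-a", "worker-b"]))
# {'worker-a': [0, 2], 'worker-b': [1, 3]}
print(assign_partitions(2, ["worker-a", "worker-b", "worker-c"]))
# {'worker-a': [0], 'worker-b': [1], 'worker-c': []}
```

The second call shows the scaling limit: with two partitions, a third consumer in the group sits idle.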

What Kafka Gives You That Others Don't

Replay: A consumer can rewind to any offset still within the topic's retention window and reprocess. Deployed a bug that corrupted results? Fix the bug, reset the offset, reprocess everything. SQS and RabbitMQ delete messages on acknowledgment -- there's no going back.

Throughput: LinkedIn processes 7+ trillion messages per day through Kafka. The append-only log design is fundamentally faster than random-access broker queues.
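
The replay property described above falls out of the data structure: reading from a log never removes anything, so "rewinding" is just reading from an earlier offset. A toy single-partition sketch, with a hypothetical `PartitionLog` class and no retention limit:

```python
class PartitionLog:
    """Toy append-only log: reading never deletes anything."""

    def __init__(self):
        self._records = []

    def append(self, value) -> int:
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset: int) -> list:
        return self._records[offset:]


log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

consumer_offset = 0
first_pass = log.read_from(consumer_offset)
consumer_offset += len(first_pass)          # caught up: nothing new to read

consumer_offset = 0                         # rewind after fixing a bug
replayed = log.read_from(consumer_offset)   # same records, delivered again
```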

What Kafka Takes Away

| Missing Feature | Impact on Job Queues |
|---|---|
| No per-message ACK | Can't selectively retry message #47. You commit an offset, meaning "everything up to here is done." |
| No priority queues | Can't rush a payment refund ahead of analytics jobs |
| No delayed delivery | Can't say "process this in 5 minutes" |
| No automatic redelivery | If a consumer crashes, it replays from last committed offset -- possibly reprocessing already-done work |
| Partition = parallelism | Want 20 consumers? Need at least 20 partitions. Can't dynamically scale consumers beyond partition count. |

from confluent_kafka import Consumer, Producer

# Producer
producer = Producer({"bootstrap.servers": "kafka:9092"})
producer.produce(
    topic="report-jobs",
    key="user_123",     # determines partition (same key = same partition = ordering)
    value='{"job_id": "abc", "type": "report"}'
)
producer.flush()

# Consumer
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "report-workers",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False  # manual commit for control
})
consumer.subscribe(["report-jobs"])

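while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue  # no message within the timeout
    if msg.error():
        # Broker or partition-level error delivered in place of a record
        print(f"Consumer error: {msg.error()}")
        continue
    process(msg.value())
    consumer.commit(message=msg)  # "everything up to here is done"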
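
Why an offset commit can't skip over a failed message: the committed offset is a single number, so it can only describe a contiguous prefix of finished work. The sketch below is illustrative (`next_offset_to_commit` is a hypothetical helper, not part of confluent_kafka) and computes the furthest position that is safe to commit.

```python
def next_offset_to_commit(start_offset: int, processed: set) -> int:
    """Return the first offset that is NOT done, scanning forward from
    start_offset. Committing this value marks everything below it as done."""
    offset = start_offset
    while offset in processed:
        offset += 1
    return offset


# Offsets 45, 46, 48 and 49 succeeded; message 47 failed.
print(next_offset_to_commit(45, {45, 46, 48, 49}))  # 47
```

Messages 48 and 49 are finished, but the commit is stuck at 47. After a restart they will be read again unless the consumer tracks them elsewhere or routes 47 to a retry topic.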

Don't use Kafka as a job queue unless you understand the trade-offs

Kafka is brilliant for event streaming, CDC, and log aggregation. But as a job queue, the lack of per-message ACK, priority, and delayed delivery means you'll end up building those features yourself on top of Kafka. At that point you've built a worse version of RabbitMQ.


Decision Framework

Queue technology decision framework: SQS, RabbitMQ, Kafka, or Redis/BullMQ

Quick Reference

| Scenario | Best Choice | Why |
|---|---|---|
| Standard job queue on AWS | SQS | Zero ops, scales automatically, DLQ built in |
| Multi-tenant with priority | RabbitMQ | Priority queues + exchange routing |
| Startup, Node.js stack | BullMQ | Best DX, great dashboard, fast iteration |
| Event sourcing / CDC | Kafka | Replay, retention, ordering guarantees |
| Analytics pipeline | Kafka | High throughput, consumer groups, reprocessing |
| Hybrid (jobs + events) | SQS + Kafka | SQS for jobs, Kafka for event streaming |

Combining Technologies

In practice, mature systems use multiple queue technologies for different workloads.

Uber uses Kafka for real-time trip event streaming (millions of events/sec), but uses Cherami (their custom queue, later open-sourced) for task-oriented work like driver notifications and payment processing where per-message delivery guarantees matter.

Stripe routes payment events through Kafka for downstream consumers, but uses a custom Redis-based job queue for webhook delivery where retry schedules and per-message control are essential.


Key Takeaways

| Concept | Details |
|---|---|
| SQS | Managed, ~120K msg/s, visibility timeout, FIFO optional |
| RabbitMQ | Exchange routing, prefetch control, priority 0-255 |
| Redis/BullMQ | In-memory speed, dashboard, noeviction policy required |
| Kafka | Append-only log, offset tracking, replay, no per-message ACK |
| Key question | "Do I need a job queue or an event stream?" |