Design a Multi-Channel Notification Service

TL;DR

A notification service delivers messages to users across multiple channels: push (APNs, FCM), SMS (Twilio), email (SES), and in-app (WebSocket). The hard part is not sending a notification -- it is deciding not to send one. Per-user rate limiting, quiet hours, aggregation ("John and 5 others liked your post"), priority levels, and retry with dead letter queues are what separate a production notification system from a spam cannon. Amazon's notification system sends billions of messages daily -- order confirmations, delivery updates, deal alerts -- and getting the delivery guarantees, deduplication, and channel routing right at that scale is a legitimate engineering challenge.

The System

A notification service takes an event from upstream (a new follower, a price drop, an order shipped) and delivers a message to the right user through the right channel at the right time. The user might receive a push notification on their phone, an email in their inbox, an SMS to their phone number, or an in-app badge -- or some combination of these.

Amazon sends over 1 billion notifications per day across push, email, and SMS. Facebook's notification system handles trillions of events per week (likes, comments, friend requests, group posts). Uber sends real-time push notifications for ride status updates to millions of concurrent users. Slack delivers in-app notifications via WebSocket to millions of connected clients simultaneously. Each of these systems manages complex routing logic: what channel to use, when to batch notifications, when to suppress them entirely, and how to handle delivery failures across unreliable third-party providers like APNs, FCM, and Twilio.

Requirements

Functional

Multi-channel delivery: Send notifications via push (iOS APNs, Android FCM), SMS, email, and in-app (WebSocket/SSE)
User preferences: Users configure per-channel opt-in/opt-out and per-notification-type preferences
Priority levels: Support critical (deliver immediately, bypass rate limits), high (deliver within seconds), medium (batch-eligible), and low (digest-eligible) priorities
Aggregation: Group related notifications -- "John, Sarah, and 3 others liked your photo" instead of 5 separate notifications
Quiet hours: Suppress non-critical notifications during user-configured quiet hours (e.g., 10 PM - 7 AM in the user's timezone)
Template engine: Notifications are rendered from templates with variable substitution, supporting localization

Non-Functional

Delivery latency: Critical notifications delivered within 2 seconds. High-priority within 10 seconds. Medium batched up to 5 minutes
Throughput: 10 billion notifications/day = ~115,000 notifications/sec
At-least-once delivery: Every notification must be delivered at least once (duplicates are preferable to drops)
Deduplication: Do not send the same notification twice to the same user on the same channel within a dedup window
Channel reliability: Handle provider outages (APNs down, Twilio rate limited) with automatic failover or queued retry
Deliverability: Email sender reputation management. SMS opt-in compliance. Push token lifecycle management

Back-of-Envelope Math

Volume:
  10B notifications/day = 115K notifications/sec
  Peak (3x during events like Black Friday): 345K/sec

Channel distribution (typical):
  Push: 50% = 57.5K/sec
  Email: 30% = 34.5K/sec
  In-app: 15% = 17.25K/sec
  SMS: 5% = 5.75K/sec

Third-party API limits:
  APNs: ~50K concurrent connections, ~10K notifications/sec per connection
  FCM: 5000 topics, 1000 messages/sec per topic, bulk send up to 500/request
  Twilio: 100 messages/sec per phone number (10DLC), higher with short codes
  SES: 50K emails/sec (with warming), 14 emails/sec per recipient per 24 hrs

Worker sizing:
  Each worker processes 1 notification in ~50ms (template render + API call)
  1 worker = 20 notifications/sec
  115K/sec / 20 = 5,750 workers (steady state)
  Peak: ~17,250 workers

Storage:
  Notification log: 10B/day * 200 bytes = 2 TB/day
  Retention: 30 days = 60 TB
  User preferences: 500M users * 500 bytes = 250 GB

The Naive Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Event Source │────>│  Notif API   │────>│  Worker      │
│  (service)   │     │              │     │  (sends)     │
└──────────────┘     └──────────────┘     └──────────────┘

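def send_notification(user_id, message, channel):
    # get_user is a placeholder for your user store lookup
    user = get_user(user_id)
    # Synchronous, blocking call straight to the provider: no queue, no retry
    if channel == "push":
        apns.send(user.device_token, message)
    elif channel == "email":
        ses.send(user.email, message)
    elif channel == "sms":
        twilio.send(user.phone, message)
    else:
        raise ValueError(f"unsupported channel: {channel}")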

One API, one worker, direct calls to providers. For a startup sending 1,000 notifications/day, this works.

Where Does This Break First?

The synchronous provider call. APNs takes 50-200ms to respond. If your API call blocks on APNs for 200ms, your throughput is 5 notifications/sec per worker thread. At 115K/sec, you need 23,000 threads. And when APNs has a 30-second timeout (which happens), those threads pile up, your thread pool exhausts, and the entire notification path for all channels -- including email and SMS -- goes down because of one provider.

Where It Breaks

Problem 1: Provider coupling. Synchronous calls to external providers (APNs, FCM, Twilio, SES) couple your notification system's availability to theirs. APNs has documented outages lasting 30+ minutes. During an outage, all push notifications queue up, and if your queue is in-memory, they are lost. If you share workers across channels, an APNs outage blocks SMS and email too.

Problem 2: No rate limiting per user. Without per-user rate limiting, a user who gets 50 likes in 1 minute receives 50 push notifications. This is how you get 1-star app reviews. Users call it spam. It is spam.

Problem 3: No deduplication. If a service sends the same notification event twice (at-least-once upstream), the user gets duplicate notifications. "Your order has shipped" x2 is annoying. "Your account has been charged $499" x2 causes panic.

Problem 4: No quiet hours or preferences. Waking someone at 3 AM with "Someone liked your comment!" is a guaranteed uninstall. Users need to control when and how they are notified.

Problem 5: Flat priority model. A "your house is on fire" alert from a security system and a "weekly newsletter" are treated identically. Under load, the newsletter might be delivered before the security alert.

The Real Design

┌──────────────────────────────────────────────────────────────┐
│                      Event Sources                            │
│  (Order Service, Social Service, Security Service, etc.)      │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           v
┌──────────────────────────────────────────────────────────────┐
│                  Notification API Service                      │
│  - Validate event                                             │
│  - Check user preferences (opt-in/opt-out)                    │
│  - Assign priority                                            │
│  - Dedup check (Redis: SETNX with TTL)                        │
│  - Enqueue to priority-routed Kafka topics                    │
└──────────────┬───────────┬───────────┬───────────────────────┘
               │           │           │
      ┌────────v──┐  ┌─────v────┐  ┌───v───────┐
      │ Kafka     │  │ Kafka    │  │ Kafka     │
      │ critical  │  │ high     │  │ medium    │
      │ (P0)      │  │ (P1)     │  │ (P2)      │
      └────────┬──┘  └─────┬────┘  └───┬───────┘
               │           │           │
               v           v           v
┌──────────────────────────────────────────────────────────────┐
│                    Channel Router                             │
│  - Quiet hours check (user timezone)                          │
│  - Per-user rate limiting (Redis sliding window)              │
│  - Aggregation buffer (for medium/low priority)               │
│  - Route to per-channel worker queues                         │
└──────┬──────────┬──────────┬──────────┬──────────────────────┘
       │          │          │          │
  ┌────v───┐ ┌───v────┐ ┌───v────┐ ┌───v────────┐
  │ Push   │ │ Email  │ │ SMS    │ │ In-App     │
  │ Workers│ │ Workers│ │ Workers│ │ (WebSocket)│
  └────┬───┘ └───┬────┘ └───┬────┘ └───┬────────┘
       │         │         │          │
       v         v         v          v
  ┌────────┐ ┌───────┐ ┌───────┐ ┌──────────┐
  │ APNs / │ │  SES  │ │Twilio │ │ WS Conn  │
  │ FCM    │ │       │ │       │ │ Manager  │
  └────────┘ └───────┘ └───────┘ └──────────┘
       │         │         │          │
       v         v         v          v
  ┌──────────────────────────────────────────────┐
  │              Dead Letter Queue               │
  │  (failed after max retries)                  │
  └──────────────────────────────────────────────┘
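
The dedup check in the Notification API box can be sketched as follows. To keep the example runnable without a Redis server, `DedupStore` is a hypothetical in-memory stand-in for Redis `SET key value NX EX ttl`; the key layout, the `event_id` field, and the 1-hour window are assumptions, not something the design above fixes.

```python
import time


class DedupStore:
    """In-memory stand-in for Redis `SET key value NX EX ttl` (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def set_if_absent(self, key, ttl_seconds):
        """Store the key and return True only if it is absent or already expired."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # key still live: this is a repeat
        self._expires_at[key] = now + ttl_seconds
        return True


def dedup_key(user_id, channel, event_id):
    # Hypothetical key layout: one entry per (user, channel, upstream event)
    return f"dedup:{user_id}:{channel}:{event_id}"


def is_duplicate(store, user_id, channel, event_id, window_seconds=3600):
    """Return True if the same event was already accepted within the dedup window."""
    return not store.set_if_absent(dedup_key(user_id, channel, event_id), window_seconds)
```

With a real Redis client (redis-py), the equivalent call would be `client.set(key, 1, nx=True, ex=window_seconds)`, which returns a truthy value only for the first writer. A duplicate does not extend the window, which matches `SET ... NX` semantics.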

Priority Levels and Routing

Not all notifications are equal. A security alert ("Unusual login from Russia") must bypass every optimization and reach the user instantly. A "Weekly digest" can wait hours.

P0 (Critical):
  - Security alerts, fraud detection, emergency notifications
  - Bypass: quiet hours, rate limits, aggregation
  - Delivery target: < 2 seconds
  - Channel: all enabled channels simultaneously

P1 (High):
  - Order updates, direct messages, mentions
  - Subject to: quiet hours (downgrade to in-app if in quiet hours)
  - Delivery target: < 10 seconds
  - Channel: primary channel (push or in-app)

P2 (Medium):
  - Social interactions (likes, follows, comments)
  - Subject to: rate limiting, aggregation, quiet hours
  - Delivery target: < 5 minutes (batch window)
  - Channel: single best channel

P3 (Low):
  - Recommendations, newsletters, re-engagement
  - Subject to: aggressive batching, daily digest
  - Delivery target: next morning or digest window
  - Channel: email or in-app only

Notifications are routed to separate Kafka topics by priority (separate topics, not partitions within one topic). P0 has its own consumer group with 10x the worker count relative to volume, so it never queues behind lower-priority traffic. P2 and P3 share a consumer group with fewer workers, allowing natural batching through backlog.
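
One way to keep these rules in a single place is a small policy table that the API service and channel router both read. The sketch below is illustrative: the topic names are hypothetical, and placing P3 on the medium topic is an assumption based on the three-topic diagram and the shared P2/P3 consumer group.

```python
# Hypothetical topic names; only the priority semantics come from the text.
PRIORITY_POLICY = {
    "P0": {
        "topic": "notif.critical",
        "checks": frozenset(),  # bypasses quiet hours, rate limits, aggregation
        "target_seconds": 2,
    },
    "P1": {
        "topic": "notif.high",
        "checks": frozenset({"quiet_hours"}),
        "target_seconds": 10,
    },
    "P2": {
        "topic": "notif.medium",
        "checks": frozenset({"rate_limit", "aggregation", "quiet_hours"}),
        "target_seconds": 300,  # 5-minute batch window
    },
    "P3": {
        "topic": "notif.medium",  # assumption: shares the medium topic with P2
        "checks": frozenset({"batching", "digest"}),
        "target_seconds": None,  # delivered in the next digest window
    },
}


def topic_for(priority):
    """Kafka topic a notification of this priority is published to."""
    return PRIORITY_POLICY[priority]["topic"]


def checks_for(priority):
    """Suppression and batching checks the router applies for this priority."""
    return PRIORITY_POLICY[priority]["checks"]
```

Keeping the bypass rules as data rather than scattered `if priority == ...` branches makes it easier to audit that P0 really skips every optimization.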

Per-User Rate Limiting

A user's phone buzzes 50 times because 50 people liked their photo? That is a product failure, not a success.

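# Per-channel hourly limits, applied per user and per notification type
RATE_LIMITS = {
    "push": 10,    # max 10 push notifications per hour per type
    "sms": 3,      # max 3 SMS per hour (SMS costs money!)
    "email": 5,    # max 5 emails per hour per type
    "in_app": 50,  # in-app is cheap, higher limit
}

def should_rate_limit(user_id, channel, notification_type):
    """Return True if this notification exceeds the user's hourly limit."""
    key = f"rate:{user_id}:{channel}:{notification_type}"
    # Fixed-window counter: the first increment creates the key and starts a 1-hour window.
    # INCR and EXPIRE run in one MULTI/EXEC transaction so a crash between the two
    # calls cannot leave a counter that never expires.
    pipe = redis.pipeline()
    pipe.incr(key)
    pipe.expire(key, 3600, nx=True)  # NX: set the TTL only if the key has none (Redis 7.0+)
    count, _ = pipe.execute()
    return count > RATE_LIMITS.get(channel, 10)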

When the rate limit is hit, the notification is not dropped -- it is aggregated. Instead of notification #11 being "Alice liked your photo," it becomes part of the aggregated notification: "Alice and 42 others liked your photo."
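
The counter above resets once per hour, so a burst that straddles a window boundary can briefly exceed the limit. The architecture diagram names a sliding window instead; the sketch below shows that variant as an in-memory sliding-window log, which is a different technique from the fixed-window counter. The class name and the injectable clock are illustrative assumptions. In Redis, the same idea is usually built on a sorted set of timestamps per key.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Sliding-window log: allow at most `limit` events per key in any `window_seconds` span."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._events = defaultdict(deque)  # key -> timestamps of accepted events

    def allow(self, key):
        """Record the event and return True if it is within the limit, else False."""
        now = self._clock()
        events = self._events[key]
        cutoff = now - self._window
        # Drop timestamps that have slid out of the window
        while events and events[0] <= cutoff:
            events.popleft()
        if len(events) >= self._limit:
            return False  # caller should aggregate instead of sending
        events.append(now)
        return True


# Usage: at most 10 push notifications per user and type in any rolling hour
limiter = SlidingWindowLimiter(limit=10, window_seconds=3600)
allowed = limiter.allow("rate:user42:push:like")
```

The trade-off is memory: the log keeps one timestamp per accepted event, whereas the fixed-window counter keeps a single integer per key.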

Aggregation Engine

Aggregation groups related notifications within a time window and sends a single summary instead of N individual notifications.

# Aggregation buffer (Redis)
# Hash   "agg:{user_id}:{notification_type}:{object_id}" -> field "count"
# List   "{agg_key}:actors" -> actor IDs in arrival order
# String "{agg_key}:timer"  -> marker that a flush is already scheduled

def aggregate_or_send(notification):
    agg_key = f"agg:{notification.user_id}:{notification.type}:{notification.object_id}"

    # Add actor to aggregation buffer
    redis.hincrby(agg_key, "count", 1)
    redis.rpush(f"{agg_key}:actors", notification.actor_id)

    # SET NX EX is atomic, so only the first notification in a window schedules the flush
    if redis.set(f"{agg_key}:timer", 1, nx=True, ex=300):  # 5-minute window
        schedule_flush(agg_key, delay=300)

def flush_aggregation(agg_key):
    count = int(redis.hget(agg_key, "count") or 0)  # Redis returns strings, not ints
    if count == 0:
        return  # nothing buffered, or already flushed
    actor_ids = redis.lrange(f"{agg_key}:actors", 0, 1)  # only the first 2 actors are named
    # lookup_display_name is a placeholder for your user-profile lookup
    names = [lookup_display_name(actor_id) for actor_id in actor_ids]

    if count == 1:
        send(f"{names[0]} liked your photo")
    elif count == 2:
        send(f"{names[0]} and {names[1]} liked your photo")
    else:
        others = count - 2
        noun = "other" if others == 1 else "others"
        send(f"{names[0]}, {names[1]}, and {others} {noun} liked your photo")

    redis.delete(agg_key, f"{agg_key}:actors", f"{agg_key}:timer")

Aggregation flush trigger: Either a timer (5 minutes for medium priority) or a threshold (send immediately if 10+ actors accumulate, because "10 people liked your photo" is exciting enough to push immediately).
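
The two flush triggers can be sketched with an in-memory buffer. This is a simplified illustration, not the Redis-backed implementation: the class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to test. The 5-minute window and the 10-actor threshold are the values from the text.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 300   # timer trigger: 5 minutes for medium priority
ACTOR_THRESHOLD = 10   # threshold trigger: flush as soon as 10 actors accumulate


@dataclass
class Bucket:
    first_seen: float
    actors: list = field(default_factory=list)


class AggregationBuffer:
    def __init__(self):
        self._buckets = {}

    def add(self, key, actor_id, now):
        """Record an actor. Return the actor list if the threshold flush fires, else None."""
        bucket = self._buckets.setdefault(key, Bucket(first_seen=now))
        if actor_id not in bucket.actors:  # the same actor liking twice counts once
            bucket.actors.append(actor_id)
        if len(bucket.actors) >= ACTOR_THRESHOLD:
            return self._buckets.pop(key).actors
        return None

    def flush_due(self, now):
        """Return {key: actors} for every bucket whose window has elapsed, and clear them."""
        due = {
            key: bucket.actors
            for key, bucket in self._buckets.items()
            if now - bucket.first_seen >= WINDOW_SECONDS
        }
        for key in due:
            del self._buckets[key]
        return due
```

A periodic job would call `flush_due` every few seconds, while `add` handles the early flush inline on the write path.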

Quiet Hours

Users set quiet hours in their preferences: "No push/SMS between 10 PM and 7 AM."

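from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def check_quiet_hours(user_id, channel, priority):
    """Return True if the notification should be held because of quiet hours."""
    if priority == "P0":
        return False  # never suppress critical

    prefs = get_user_preferences(user_id)
    if not prefs.quiet_hours_enabled:
        return False

    # prefs.timezone is an IANA name such as "America/New_York"
    hour = datetime.now(timezone.utc).astimezone(ZoneInfo(prefs.timezone)).hour
    start, end = prefs.quiet_start, prefs.quiet_end  # e.g., 22 and 7
    if start <= end:
        return start <= hour < end       # same-day window, e.g., 13-15
    return hour >= start or hour < end   # window wraps past midnight, e.g., 22-7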

During quiet hours: push and SMS are suppressed. Email is queued for delivery at quiet hours end. In-app is delivered (silently -- no badge increment, just available when the user opens the app).

Timezone handling: Store the user's timezone (e.g., "America/New_York"), not an offset. Offsets change with DST. When checking quiet hours, convert UTC to the user's local time using the timezone database. This is a common source of bugs -- many notification systems have woken users at 2 AM because DST shifted and the system used a hardcoded offset.
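
A small example makes the DST failure concrete. It assumes a user in "America/New_York" with quiet hours starting at 10 PM, and compares the timezone database against a hardcoded UTC-5 offset; the dates and the helper name `local_hour` are illustrative.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local_hour(utc_dt, tz_name):
    """Hour of day in the user's IANA timezone for a given UTC instant."""
    return utc_dt.astimezone(ZoneInfo(tz_name)).hour


# Same UTC clock time, once in winter (EST, UTC-5) and once in summer (EDT, UTC-4)
winter_utc = datetime(2024, 1, 15, 2, 30, tzinfo=timezone.utc)
summer_utc = datetime(2024, 7, 15, 2, 30, tzinfo=timezone.utc)

# A hardcoded offset that was correct in winter
FIXED_EST = timezone(timedelta(hours=-5))

winter_hour = local_hour(winter_utc, "America/New_York")  # 21: before quiet hours
summer_hour = local_hour(summer_utc, "America/New_York")  # 22: quiet hours have started
stale_hour = summer_utc.astimezone(FIXED_EST).hour        # 21: the offset is an hour behind
```

In summer the hardcoded offset still reports 9:30 PM while the user's clock reads 10:30 PM, so a push would go out an hour into their quiet window.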

Template Engine

Notification content is defined as templates, not hardcoded strings. This allows localization, A/B testing, and content changes without code deployment.

{
  "template_id": "order_shipped",
  "channel": "push",
  "locale": "en_US",
  "title": "Your order is on its way!",
  "body": "Your order #{{order_id}} shipped via {{carrier}}. Track: {{tracking_url}}"
}

Templates are stored in a database and cached (5-minute TTL). The notification worker renders the template by substituting variables from the event payload:

def render_template(template_id, channel, locale, variables):
    template = template_cache.get(f"{template_id}:{channel}:{locale}")
    if not template:
        template = db.get_template(template_id, channel, locale)
        template_cache.set(f"{template_id}:{channel}:{locale}", template, ttl=300)

    return template.render(**variables)

Fallback chain: try en_US -> en -> default locale. Never send a notification in the wrong language.
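
The fallback chain can be sketched as a lookup over progressively less specific locales. The helper names and the `(template_id, channel, locale)` dictionary key are assumptions for illustration; in the design above the lookup would go through the template cache and database instead of a plain dict.

```python
def locale_chain(locale, default="en"):
    """Return locales to try in order: exact, language only, then the default."""
    chain = [locale]
    if "_" in locale:
        chain.append(locale.split("_", 1)[0])  # "en_US" -> "en"
    if default not in chain:
        chain.append(default)
    return chain


def find_template(templates, template_id, channel, locale, default="en"):
    """Return the first template found along the fallback chain, or None."""
    for candidate in locale_chain(locale, default):
        template = templates.get((template_id, channel, candidate))
        if template is not None:
            return template
    return None


# Usage with a tiny in-memory template store
templates = {
    ("order_shipped", "push", "en"): "Your order #{{order_id}} shipped via {{carrier}}.",
    ("order_shipped", "push", "pt_BR"): "Seu pedido #{{order_id}} foi enviado via {{carrier}}.",
}
body = find_template(templates, "order_shipped", "push", "en_US")  # falls back to "en"
```

Returning `None` rather than raising leaves the caller free to decide whether a missing template should drop the notification or alert an operator.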

Deep Dives

Deep Dive 1: Retry Strategy and Dead Letter Queue

Third-party providers fail. APNs returns 503. Twilio rate limits you. SES bounces. You need a retry strategy that does not make things worse.

Retry with exponential backoff + jitter:

Attempt 1: immediate
Attempt 2: 1 second + random(0, 500ms)
Attempt 3: 4 seconds + random(0, 2s)
Attempt 4: 16 seconds + random(0, 8s)
Attempt 5: 60 seconds + random(0, 30s)
Max 5 attempts total.
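
The schedule above can be written as a small function. This is a sketch under the assumption that jitter is drawn uniformly between zero and half the base delay, which matches the listed ranges; the function name and the injectable `rng` parameter are illustrative.

```python
import random

# Base delay in seconds before attempts 1 through 5, as listed above
BASE_DELAYS = [0, 1, 4, 16, 60]


def retry_delay(attempt, rng=random):
    """Seconds to wait before the given attempt (1-based), including jitter."""
    if not 1 <= attempt <= len(BASE_DELAYS):
        raise ValueError(f"attempt must be between 1 and {len(BASE_DELAYS)}")
    base = BASE_DELAYS[attempt - 1]
    # Jitter spreads retries out so failed sends do not all hit the provider at once
    return base + rng.uniform(0, base / 2)


# Usage: delays for one notification's full retry sequence
delays = [retry_delay(attempt) for attempt in range(1, 6)]
```

When `retry_delay` raises for attempt 6, the caller should stop retrying and hand the notification to the dead letter queue.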

Error classification matters:

def should_retry(error):
    if error.status == 400:  # Bad request (invalid token, wrong format)
        return False          # Permanent failure, do not retry
    if error.status == 410:  # Gone (user uninstalled app, token invalid)
        invalidate_device_token(error.device_token)
        return False          # Permanent, clean up
    if error.status == 429:  # Rate limited
        return True           # Transient, retry with backoff
    if error.status >= 500:  # Server error
        return True           # Transient, retry
    return False

Do not retry on 400s (the request is malformed, retrying sends the same bad request). Do not retry on 410s (the device token is invalid, the user uninstalled the app -- clean up the token). Only retry on 429s and 5xx errors.

Dead letter queue: After 5 failed attempts, move the notification to a DLQ table. An alert fires if the DLQ grows beyond a threshold. On-call engineers investigate. Common causes: expired device tokens (clean up in bulk), provider outage (wait for recovery, then replay), configuration error (fix and replay).

DLQ replay: The DLQ supports manual replay. An operator can select notifications by time range and provider error code, fix the root cause, and replay them. Replayed notifications go through the full pipeline (including dedup) so users do not get double-notified if the original notification actually delivered but the ACK was lost.
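
A replay tool along these lines might look like the sketch below. The `DeadLetter` record, its field names, and the `already_delivered` set (standing in for the pipeline's dedup check) are all hypothetical; a real implementation would query the DLQ table and re-enqueue into Kafka.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadLetter:
    notification_id: str
    provider: str
    error_code: int
    failed_at: float  # epoch seconds of the final failed attempt


def select_for_replay(dlq, start, end, error_code):
    """Pick entries that failed within [start, end) with the given provider error code."""
    return [
        entry for entry in dlq
        if start <= entry.failed_at < end and entry.error_code == error_code
    ]


def replay(entries, already_delivered, enqueue):
    """Re-enqueue entries, skipping any the dedup check has already seen delivered."""
    replayed = []
    for entry in entries:
        if entry.notification_id in already_delivered:
            continue  # original send succeeded but the ACK was lost
        enqueue(entry)
        replayed.append(entry.notification_id)
    return replayed
```

Filtering by error code matters because replaying permanent failures (400, 410) only refills the DLQ with the same entries.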

Deep Dive 2: Push Notification Delivery Guarantees

Push notifications (APNs, FCM) are fundamentally unreliable. Understanding why is important.

APNs (Apple Push Notification Service):

APNs uses HTTP/2 with persistent connections
Delivery is best-effort: APNs does NOT guarantee delivery. If the device is offline, APNs stores the most recent notification per app (not all notifications) and delivers it when the device comes online
If you send 5 notifications while the user is offline, only the last one is delivered
Token lifecycle: when a user uninstalls the app, the device token becomes invalid. APNs returns 410 Gone. You must remove the token from your database
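Token cleanup: with the HTTP/2 API, invalid tokens are reported inline. The 410 response includes a timestamp of when the token became invalid, so remove tokens as those responses arrive. The separate feedback service that had to be polled belonged to the legacy binary interface, which Apple has retired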

FCM (Firebase Cloud Messaging):

Supports up to 100 pending messages per device (vs. APNs' 1)
Collapsible messages (same collapse_key): only the latest is delivered if the device is offline
Non-collapsible messages: all are queued and delivered when the device comes online (up to 100)
FCM topics allow sending to groups of users without enumerating device tokens

Handling unreliable delivery:

For critical notifications (security alerts, order updates), do not rely on push alone. Use a multi-channel strategy:

1. Send push notification (immediate, best-effort)
2. Send in-app notification (WebSocket if connected, stored for next open)
3. If push delivery is not confirmed within 5 minutes (for example, by an app-side receipt or open callback): send email
4. If extremely critical (security): also send SMS

The in-app notification is the reliable channel. It is stored in your database and shown when the user opens the app, regardless of push delivery status. Push is the "tap on the shoulder" that gets the user to open the app.
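
The escalation steps can be expressed as two small decisions: which channels to use immediately, and when to fall back to email. The sketch assumes SMS for security alerts goes out with the first send (the steps above do not say when it is sent), and the function names are illustrative.

```python
EMAIL_FALLBACK_AFTER_SECONDS = 5 * 60  # step 3: wait 5 minutes for push confirmation


def initial_channels(is_security):
    """Channels used on the first send of a critical notification."""
    channels = ["push", "in_app"]  # steps 1 and 2
    if is_security:
        channels.append("sms")     # step 4; immediate send is an assumption
    return channels


def needs_email_fallback(push_confirmed, seconds_since_send):
    """True once the push has gone unconfirmed for the full fallback window."""
    return (not push_confirmed) and seconds_since_send >= EMAIL_FALLBACK_AFTER_SECONDS


# Usage: a security alert whose push was never confirmed
channels = initial_channels(is_security=True)
send_email = needs_email_fallback(push_confirmed=False, seconds_since_send=310)
```

A scheduler would evaluate `needs_email_fallback` once per pending critical notification, for example from a delayed queue message set to fire at the 5-minute mark.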

Deep Dive 3: Notification Analytics and Feedback Loops

A notification system without analytics is flying blind. You need to know: did the notification reach the user? Did they see it? Did they act on it?

Key metrics:

1. Send rate: notifications enqueued per second (by type, priority, channel)
2. Delivery rate: notifications accepted by provider (APNs 200, SES accepted)
3. Open rate: notifications opened/tapped by user (tracked via deep link + callback)
4. Click-through rate: user took the intended action after opening
5. Unsubscribe rate: users who opted out after receiving (high = you are spamming)
6. Bounce rate (email): hard bounces (invalid address) vs soft bounces (full inbox)
7. DLQ rate: notifications that failed all retries

Feedback loop: If the unsubscribe rate for a notification type exceeds 2%, automatically reduce its frequency. If the open rate drops below 5%, the notification is not adding value -- consider deprecating it or changing the trigger.

Email sender reputation: SES and SendGrid track your bounce rate and complaint rate. If your complaint rate exceeds 0.1% (1 in 1,000 recipients mark your email as spam), SES may place your account under review. If it reaches 0.5%, SES may pause your account's ability to send. Monitor this daily.

Sender reputation health:
  Bounce rate < 5%:     Healthy
  Complaint rate < 0.1%: Healthy
  Bounce rate 5-10%:    Warning (clean your email list)
  Complaint rate > 0.1%: Critical (stop sending, investigate)
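
These thresholds are simple enough to encode directly in a monitoring job. The sketch below uses the numbers from this section; treating a bounce rate above 10% as critical is an assumption, since the table stops at the 5-10% warning band, and the function names and return labels are illustrative.

```python
def reputation_status(bounce_rate, complaint_rate):
    """Classify sender reputation from rates expressed as fractions (0.05 means 5%)."""
    if complaint_rate > 0.001:
        return "critical"  # stop sending, investigate
    if bounce_rate > 0.10:
        return "critical"  # assumption: the table above does not cover this band
    if bounce_rate >= 0.05:
        return "warning"   # clean your email list
    return "healthy"


def feedback_action(unsubscribe_rate, open_rate):
    """Suggested action for one notification type, per the feedback-loop rules."""
    if unsubscribe_rate > 0.02:
        return "reduce_frequency"
    if open_rate < 0.05:
        return "review_or_deprecate"
    return "keep"


# Usage: daily check for one sender and one notification type
status = reputation_status(bounce_rate=0.07, complaint_rate=0.0005)
action = feedback_action(unsubscribe_rate=0.03, open_rate=0.20)
```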

Alternative Designs

Alternative 1: Managed Services (SNS/SES)

Use AWS SNS for push/SMS, SES for email, and Pinpoint for campaign management. No custom notification infrastructure.

Alternative 2: Event-Driven (SNS + SQS)

Each notification event publishes to an SNS topic. Per-channel SQS queues subscribe to the topic. Channel-specific Lambda functions consume from each queue.

Alternative 3: Unified Push with Fallback

Use a service like OneSignal or Firebase for all push channels, with custom fallback to email/SMS for undelivered push.

Aspect	Custom (Kafka + Workers)	Managed (SNS/SES)	Event-Driven (SNS+SQS)	Unified Push Service
Aggregation	Full control	None (DIY)	None (DIY)	Limited
Per-user rate limiting	Full control	None	None	Basic
Quiet hours	Full control	None	None	Timezone-based scheduling
Throughput	100K+/sec	50K/sec (SES limit)	100K+/sec	10K/sec (API limits)
Cost at 10B/day	~$5K/mo (infra)	~$15K/mo (per-message)	~$10K/mo	~$20K/mo
Ops complexity	High	Low	Medium	Low
Template engine	Custom	SES templates	Custom	Built-in
Analytics	Custom	CloudWatch + custom	Custom	Built-in

Build custom if you need aggregation, per-user rate limiting, and quiet hours (most consumer apps at scale). Use managed services for B2B SaaS where notification volumes are lower and speed-to-market matters. Use event-driven for microservice architectures where each service manages its own notifications.

Scaling Math Verification

115K notifications/sec steady state:

Kafka: 3 priority topics * 12 partitions each = 36 partitions. At 115K msgs/sec, each partition handles 3.2K msgs/sec. Kafka handles 100K+ msgs/sec per partition. Enormous headroom.
Channel workers: push 57.5K/sec at 50ms per send = 2,875 workers. Email 34.5K/sec at 30ms per send = 1,035 workers. SMS 5.75K/sec at 100ms per send = 575 workers. In-app 17.25K/sec at 5ms per send = 86 workers. Total: ~4,571 workers.
Redis (rate limiting + dedup): 115K SETNX + 115K INCR = 230K ops/sec. Single Redis instance handles this. Add a replica for HA.
Notification log: 115K writes/sec * 200 bytes = 23 MB/sec. Kafka to S3 (via Kafka Connect) or ClickHouse for analytics.

Provider API headroom:

APNs: 57.5K pushes/sec if every push went to iOS (the worst case). APNs supports ~200K/sec per provider connection pool. 29% utilization.
FCM: carries the Android share of the same 57.5K/sec. FCM handles 100K+/sec.
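SES: 34.5K emails/sec. SES limit: 50K/sec (after warming). 69% utilization at steady state. Peak (3x) is ~103.5K/sec, about twice the limit, so raise the quota and pre-warm well ahead of peak events, or defer low-priority email.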
Twilio: 5.75K SMS/sec. Need ~58 phone numbers at 100 msgs/sec each. Or use short codes (1000 msgs/sec).
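
The worker counts and headroom figures in this section follow from two formulas: workers needed equals arrival rate times time per send (Little's law), and utilization equals load divided by limit. The sketch below recomputes them from the rates, latencies, and limits stated above; the names are illustrative.

```python
import math

# channel -> (notifications/sec, milliseconds per send), as stated in this section
CHANNELS = {
    "push": (57_500, 50),
    "email": (34_500, 30),
    "sms": (5_750, 100),
    "in_app": (17_250, 5),
}


def workers_needed(rate_per_sec, latency_ms):
    """Little's law: concurrent sends in flight = arrival rate * time per send."""
    return rate_per_sec * latency_ms / 1000


def utilization(load_per_sec, limit_per_sec):
    return load_per_sec / limit_per_sec


workers = {ch: workers_needed(rate, ms) for ch, (rate, ms) in CHANNELS.items()}
total_workers = sum(workers.values())            # 4571.25, quoted as ~4,571

apns_util = utilization(57_500, 200_000)         # worst case: all push on iOS
ses_util = utilization(34_500, 50_000)           # steady state
ses_peak_util = utilization(3 * 34_500, 50_000)  # 3x peak exceeds the limit
twilio_numbers = math.ceil(5_750 / 100)          # phone numbers at 100 msgs/sec each
```

A utilization above 1.0, as with SES at peak, means the load cannot be served without a higher quota, more regions, or deferring traffic.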

Failure Analysis

Component	Current capacity	At 10x (1.15M notifs/sec)	Breaks?	Fix
Kafka cluster	36 partitions	360 partitions needed	No	Add brokers, Kafka scales linearly
Push workers	2,875 workers	28,750 workers	Maybe	K8s auto-scaling, spot instances
Redis (rate limit)	230K ops/sec	2.3M ops/sec	Yes	Redis Cluster with 10 shards
APNs throughput	57.5K/sec	575K/sec	Yes	Multiple provider connections, HTTP/2 multiplexing
SES throughput	34.5K/sec	345K/sec	Yes	Multiple SES regions, dedicated IPs
Twilio throughput	5.75K/sec	57.5K/sec	Yes	Short codes, multiple Twilio sub-accounts
User preferences cache	250 GB	2.5 TB	Yes	Shard preferences across Redis Cluster
Notification log	2 TB/day	20 TB/day	Yes	Time-partitioned storage, S3 cold tier

The first bottleneck at 10x is third-party provider throughput. APNs and SES have limits that require multiple connections, multiple accounts, or multiple regions. This is operationally complex but technically straightforward.

The second bottleneck is the notification log at 20 TB/day. At this scale, you need a columnar store (ClickHouse or Redshift) with time-based partitioning and automatic archival to S3 after 7 days.

What's Expected at Each Level

Aspect	Mid-Level	Senior	Staff+
Channel routing	Push + email, hardcoded	4 channels with user preference check	Priority-based routing, fallback chains, multi-channel strategy
Rate limiting	Not mentioned	Global rate limit on notification volume	Per-user, per-type, per-channel rate limits with aggregation
Aggregation	Not mentioned	Mentions "batch similar notifications"	Time-window aggregation, threshold flush, "N others" pattern
Quiet hours	Not mentioned	"Don't send at night"	Timezone-aware with DST handling, priority bypass
Retry/DLQ	"Retry on failure"	Exponential backoff, max retries	Error classification (4xx vs 5xx), DLQ with replay, circuit breaker
Deduplication	Not mentioned	Idempotency key check	Redis SETNX with TTL, upstream event dedup vs delivery dedup
Provider management	Direct API calls	Async with message queue	Connection pooling, token lifecycle, sender reputation management
Template engine	Hardcoded strings	Templates with variables	Localization, A/B testing, fallback locale chain

The single most important signal at any level: do you understand that the notification system's job is as much about not sending as it is about sending? Rate limiting, aggregation, quiet hours, and deduplication are the features that make users keep an app installed. A system that sends every event as a separate push notification is a spam cannon, and users will disable notifications entirely.

