Design an Encrypted Messaging Service

TL;DR

You are designing WhatsApp. Not "a chat app" -- WhatsApp specifically, because the interviewer wants to hear about end-to-end encryption, store-and-forward offline delivery, and how roughly 35 engineers served 450 million users on Erlang. The server is a blind relay that cannot read a single message. That constraint shapes every architectural decision.

The core tension: deliver messages reliably to devices that are usually offline, encrypt them so even you cannot read them, and do it all with sub-second latency for 2 billion daily active users sending 100 billion messages a day.

The System

WhatsApp is a 1:1 and small-group encrypted messaging service. Users send text, images, video, and voice messages to individuals or groups of up to 1,024 members. Every message is end-to-end encrypted using the Signal Protocol. The server stores messages only until the recipient's device downloads them, then deletes them.

This is fundamentally different from Slack or Discord. There is no server-side search. There is no message history on the server. There are no channels. The server is a dumb pipe that happens to be very, very good at delivering encrypted blobs to the right phone.

Real-world reference: WhatsApp ran on ~550 servers with ~35 engineers at the time of the $19.3B Facebook acquisition in 2014, serving 450 million users. By 2025, it serves 3 billion monthly active users.

Requirements

Functional Requirements

1:1 messaging -- send and receive text messages between two users.
Group messaging -- send messages to groups of up to 1,024 members.
End-to-end encryption -- server cannot read message content.
Offline delivery -- messages are stored until the recipient comes online.
Delivery and read receipts -- single check (sent), double check (delivered), blue checks (read).
Media sharing -- images, videos, voice messages sent separately from text.
Online/last-seen status -- presence indicators.

Non-Functional Requirements

Latency: < 500ms message delivery when both users are online.
Availability: 99.99% uptime (< 53 minutes downtime/year).
Durability: Zero message loss -- every message must be delivered at least once.
Ordering: Per-conversation ordering. No global ordering needed.
Scale: 2B DAU, 100B messages/day, 500M-1B peak concurrent connections.

Back-of-Envelope Math

Metric	Value	Derivation
Messages/second	~1.15M	100B / 86,400
Avg text message size	~100 bytes	Content + encryption overhead + metadata
Text throughput	~115 MB/sec	1.15M msg/sec x 100 bytes
Peak concurrent connections	~500M-1B	~25-50% of DAU
Connection servers needed	250-500	At 2M connections/server (WhatsApp's benchmark)
Messages per server per second	~2,300-4,600	1.15M / 250-500
Text storage (30-day TTL)	~300 TB	10 TB/day x 30 days
User metadata	~3 TB	3B users x 1 KB each (fits in memory)

The throughput numbers are almost boring. 115 MB/sec of text is nothing. The hard problem is not throughput -- it is holding 500 million persistent connections open and routing messages between them.
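
The table's arithmetic can be reproduced in a few lines. This is only a sketch of the estimate: the inputs are the figures quoted above, and the constant and function names are made up for illustration.

```python
import math

SECONDS_PER_DAY = 86_400

# Inputs taken from the requirements and the table above.
MESSAGES_PER_DAY = 100e9        # 100B messages/day
AVG_MESSAGE_BYTES = 100         # content + encryption overhead + metadata
CONNECTIONS_PER_SERVER = 2e6    # WhatsApp's reported per-server benchmark

messages_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
text_mb_per_sec = messages_per_sec * AVG_MESSAGE_BYTES / 1e6
text_tb_per_day = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12
text_tb_30_days = text_tb_per_day * 30


def servers_needed(concurrent_connections: float) -> int:
    """Connection servers required at the benchmark capacity."""
    return math.ceil(concurrent_connections / CONNECTIONS_PER_SERVER)


print(f"messages/sec:      {messages_per_sec:,.0f}")
print(f"text MB/sec:       {text_mb_per_sec:,.1f}")
print(f"text TB (30 days): {text_tb_30_days:,.0f}")
print(f"servers (500M-1B): {servers_needed(500e6)}-{servers_needed(1e9)}")
```

The table rounds 100B / 86,400 down to 1.15M; the exact quotient is slightly higher, which changes nothing about the conclusion.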

Naive Design

Client A (sender)
  -> HTTP POST /send {to: B, message: "hello"}
    -> API Server
      -> Write to Messages DB
        -> HTTP response to A: "sent"
        -> Push notification to B
          -> B opens app, HTTP GET /messages
            -> Returns new messages

This is how you would build a prototype in a weekend hackathon. HTTP request-response for everything. Push notifications for alerts. Poll for new messages.

Why it works for a demo

Simple to implement.
Standard REST patterns.
Push notifications wake the app.

Where It Breaks

Latency is terrible. Push notifications take 1-10 seconds to arrive. The user opens the app, makes an HTTP request, waits for the response. Total time from send to display: 5-15 seconds. Real chat needs sub-second.

Polling destroys your servers. If 500M users poll every 5 seconds, that is 100M HTTP requests per second. Without connection reuse, each one pays for a full TCP handshake and TLS negotiation, and every one still requires HTTP header parsing, a database query, and a response. Your API servers catch fire.

No encryption. The server reads and stores plaintext messages. One database breach and every message ever sent is public.

No ordering guarantees. HTTP requests can arrive out of order. Two messages sent milliseconds apart might be stored with swapped timestamps.

No connection state. The server has no idea whether User B is online or offline. It cannot route messages intelligently.

Real Design

High-Level Architecture

Client A (encrypts with Signal Protocol)
  |
  | Persistent TCP (FunXMPP binary protocol)
  v
Connection Server (Erlang/BEAM, one Erlang process per connection, 2M connections each)
  |
  +-> Session Registry (user_id -> connection_server mapping)
  |
  +-> Message Store (write message + inbox entry)
  |     |
  |     +-> Messages Table (partition by chat_id)
  |     +-> Inbox Table (partition by user_id, sorted queue)
  |
  +-> Pub/Sub Layer (route to recipient's connection server)
  |
  v
Connection Server holding User B's connection
  |
  | Persistent TCP
  v
Client B (decrypts with Signal Protocol)

Connection Layer: Erlang/BEAM

This is the single most important technology decision in WhatsApp's architecture and it deserves a dedicated explanation.

WhatsApp chose Erlang -- a language most engineers have never touched -- because the BEAM virtual machine was purpose-built for exactly this workload: millions of mostly-idle concurrent connections with occasional small messages.

Why Erlang crushes this:

Lightweight processes: Each connection gets its own Erlang process. An Erlang process uses ~2-3 KB of memory at startup. That is not a typo. 2 million processes = ~6 GB of process memory. An OS thread uses ~1 MB. Two million OS threads would need 2 TB of RAM.
Preemptive scheduling: BEAM schedules processes using "reductions" (roughly function calls), not time slices. No single chatty connection can starve others. When 99% of your connections are idle, this matters enormously.
Per-process GC: Each process has its own tiny heap and gets garbage collected independently. No global GC pauses. Compare this to JVM-based systems where a full GC pause can freeze millions of connections simultaneously. (Discord hit exactly this problem with Cassandra on the JVM -- more on that in Lesson 3.)
Hot code loading: Deploy new code to a running server without dropping a single connection. Users never notice upgrades.
"Let it crash" philosophy: If one connection handler crashes, its supervisor restarts it. The other 1,999,999 connections are unaffected.

The benchmark: A single WhatsApp server handled 2 million simultaneous TCP connections in 2012. By 2014, they pushed past 2.5 million. This means WhatsApp's entire 500M concurrent user base could be served by ~250 connection servers.

WhatsApp also chose FreeBSD over Linux. The kqueue event notification system and FreeBSD's network stack tuning gave more predictable behavior at extreme connection counts.

The custom protocol: FunXMPP. WhatsApp started with ejabberd (an Erlang XMPP server) and progressively gutted it. They replaced XML with a compact binary encoding (XML is absurdly verbose for chat -- a "hello" message wraps in ~200 bytes of XML tags). They stripped out federation entirely. They added phone-number-based addressing ([phone]@s.whatsapp.net). What remained was XMPP's session model with everything else replaced.

Session Registry: Finding the Right Server

When User A sends a message to User B, the system needs to know which of the ~250-500 connection servers is holding User B's TCP connection. Three approaches:

Option 1: Consistent hashing. Hash user_id to determine which connection server they should connect to. The load balancer routes accordingly. Simple, but rebalancing on server failure disrupts connections.

Option 2: Centralized registry (Redis/ZooKeeper). Maintain a user_id -> connection_server mapping. On connect, register. On disconnect, deregister. Lookup is O(1). Adds a dependency but is straightforward.

Option 3: Pub/Sub per user. Each connection server subscribes to a Redis Pub/Sub channel for each connected user. To deliver a message to User B, publish to channel user:B. Whichever connection server has subscribed (because B is connected there) receives it. No explicit registry needed.

Option 3 is elegant for interviews. Redis Pub/Sub channels are nearly free -- they are in-memory pointers to subscriber connections with no message persistence. Creating a channel per user costs a few bytes each. At 500M users, that is a few GB of Redis memory. Compare this to Kafka, where each topic/partition has ~50 KB of overhead -- 500M topics would need 25 TB. Kafka is the wrong tool here.
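
To make Option 3 concrete, here is a minimal in-process sketch. It does not talk to Redis: InMemoryPubSub is a hypothetical stand-in that mimics the relevant behavior (channels are just lists of subscribers, nothing is persisted, and publish reports how many subscribers received the payload). ConnectionServer and its method names are also assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[bytes], None]


class InMemoryPubSub:
    """Stand-in for a pub/sub broker: subscriber callbacks only, no persistence."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def unsubscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].remove(handler)
        if not self._subscribers[channel]:
            del self._subscribers[channel]   # an empty channel costs nothing

    def publish(self, channel: str, payload: bytes) -> int:
        """Push to current subscribers; return how many received the payload."""
        handlers = list(self._subscribers.get(channel, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


class ConnectionServer:
    """Subscribes to user:<id> while that user's connection is held here."""

    def __init__(self, name: str, bus: InMemoryPubSub) -> None:
        self.name = name
        self.bus = bus
        self.delivered: Dict[str, List[bytes]] = {}   # what was pushed to each client
        self._handlers: Dict[str, Handler] = {}

    def on_connect(self, user_id: str) -> None:
        handler = self.delivered.setdefault(user_id, []).append
        self._handlers[user_id] = handler
        self.bus.subscribe(f"user:{user_id}", handler)

    def on_disconnect(self, user_id: str) -> None:
        self.bus.unsubscribe(f"user:{user_id}", self._handlers.pop(user_id))
```

A publish that reaches zero subscribers means the recipient is not connected anywhere, so no separate registry lookup is needed to learn that.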

Message Flow: 1:1 Chat

1. User A types message
2. A's device encrypts with Signal Protocol (Double Ratchet)
3. Encrypted blob sent over persistent TCP to A's connection server
4. Connection server writes to Messages table + creates Inbox entry for B
5. Server sends ACK to A (single check: "sent")
6. Connection server publishes to B's pub/sub channel
7. B's connection server receives, pushes over B's TCP connection
8. B's device decrypts with Signal Protocol
9. B's device sends ACK back to server
10. Server marks message as delivered (double check)
11. Server removes Inbox entry for B
12. When B opens conversation, B's device sends "read" notification
13. Server relays "read" to A (blue checks)

Critical detail: The message is written to durable storage BEFORE attempting real-time delivery (step 4 before step 6). If the pub/sub delivery fails, the message is safe in B's inbox and will be delivered when B reconnects.
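
The ordering of steps 4 to 6 can be sketched as a single handler. The function name, the dict-based stores, and the publish callback are hypothetical; the point is only that the durable write and the Inbox entry happen before the best-effort push.

```python
from typing import Callable, Dict, List


def handle_send(
    msg_id: str,
    recipient: str,
    ciphertext: bytes,
    messages: Dict[str, bytes],
    inbox: Dict[str, List[str]],
    publish: Callable[[str, bytes], None],
) -> str:
    """Persist first (step 4), ACK the sender (step 5), then try to push (step 6)."""
    messages[msg_id] = ciphertext                     # durable write of the encrypted blob
    inbox.setdefault(recipient, []).append(msg_id)    # pending-delivery reference for B
    ack = "sent"                                      # single check for the sender
    try:
        publish(f"user:{recipient}", ciphertext)      # real-time path, best-effort
    except ConnectionError:
        pass  # the Inbox entry survives and is delivered when B reconnects
    return ack
```

If the push raises, the sender still gets the single check and nothing is lost, because delivery is driven by the Inbox entry rather than by the publish call.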

Store-and-Forward: The Offline Queue

This is the feature that makes messaging fundamentally different from live chat. Users expect every message to arrive, even if the recipient's phone is off for a week.

The Inbox model:

Every message creates an entry in the recipient's Inbox (a per-user ordered queue of undelivered message references).
If the recipient is online, real-time delivery is attempted via the persistent connection.
If the recipient is offline, the Inbox entry persists.
When the recipient reconnects, the connection server fetches all pending Inbox entries and delivers them.
The client ACKs each message. ACK triggers Inbox entry deletion.
Messages have a 30-day TTL. Undelivered messages are deleted after 30 days.
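
A minimal in-memory sketch of the Inbox model follows. The Inbox class and its method names are assumptions, and a real deployment would back this with a database rather than a dict; timestamps are passed in explicitly so the TTL behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TTL_SECONDS = 30 * 24 * 3600   # the 30-day TTL from the text


@dataclass
class Inbox:
    # user_id -> {message_id: enqueue time}; dicts preserve insertion order
    pending: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def enqueue(self, user_id: str, message_id: str, now: float) -> None:
        self.pending.setdefault(user_id, {})[message_id] = now

    def fetch_pending(self, user_id: str, now: float) -> List[str]:
        """On reconnect: drop expired entries, return the rest in arrival order."""
        entries = self.pending.get(user_id, {})
        expired = [m for m, t in entries.items() if now - t > TTL_SECONDS]
        for message_id in expired:
            del entries[message_id]
        return list(entries)

    def ack(self, user_id: str, message_id: str) -> None:
        """A client ACK deletes the entry; a repeated ACK is a no-op."""
        self.pending.get(user_id, {}).pop(message_id, None)
```

Because an entry is removed only on ACK, a message that was pushed but never acknowledged is returned again by the next fetch, which is what gives at-least-once delivery.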

Bursty reconnection handling: After a network outage, millions of users reconnect simultaneously. Each one triggers an Inbox fetch. The Inbox service must handle this thundering herd without collapsing. Solution: exponential backoff with jitter on client reconnection, plus server-side connection rate limiting (accept new connections up to capacity, return others to the load balancer).
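
On the client side, exponential backoff with jitter might look like the sketch below. It uses the "full jitter" variant (a uniform delay between zero and the capped exponential), and the 1-second base and 60-second cap are assumed values, not WhatsApp's.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


# Example schedule for one client; actual values differ on every run.
for attempt in range(6):
    print(attempt, round(reconnect_delay(attempt), 2))
```

Randomizing over the whole interval spreads a million simultaneous reconnects across the window instead of synchronizing them at 1s, 2s, 4s, and so on.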

End-to-End Encryption: Signal Protocol

This is not optional seasoning. This is the defining architectural constraint. The server is a blind relay. It processes encrypted blobs it cannot read.

Three components:

1. X3DH (Extended Triple Diffie-Hellman) -- Initial Key Exchange

When Alice wants to message Bob for the first time:

Bob's device has pre-uploaded its identity key, a signed prekey, and a bundle of one-time prekeys to the server. These are all public keys, and each one-time prekey is handed out only once.
Alice downloads Bob's identity key + signed prekey + one-time prekey from the server.
Alice performs three Diffie-Hellman computations combining her identity key, an ephemeral key, and Bob's keys, plus a fourth when a one-time prekey is available.
The result is a shared secret. Alice uses it to encrypt her first message.
Bob can derive the same shared secret when he receives the message, even though he was offline during the key exchange.

This is the magic: asynchronous key exchange. Alice can start an encrypted conversation with Bob while Bob's phone is off. Before Signal, protocols like OTR required both parties to be online simultaneously.
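
The key agreement can be illustrated with a toy sketch. It is deliberately not secure: it uses plain modular exponentiation instead of X25519, SHA-256 instead of HKDF, and skips verifying the signature on the signed prekey. All names are illustrative. It performs the three core DH computations plus a fourth for the one-time prekey.

```python
import hashlib
import secrets

# Toy parameters for illustration only; real X3DH uses X25519 or X448.
P = 2**255 - 19
G = 2


def keypair():
    private = secrets.randbelow(P - 2) + 1
    return private, pow(G, private, P)


def dh(private: int, public: int) -> bytes:
    return pow(public, private, P).to_bytes(32, "big")


def kdf(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()   # stand-in for HKDF


# Bob uploads the public halves, then goes offline.
ik_b, IK_B = keypair()      # identity key
spk_b, SPK_B = keypair()    # signed prekey
opk_b, OPK_B = keypair()    # one-time prekey

# Alice fetches Bob's bundle and derives the secret on her own.
ik_a, IK_A = keypair()      # Alice's identity key
ek_a, EK_A = keypair()      # Alice's ephemeral key
alice_sk = kdf(dh(ik_a, SPK_B), dh(ek_a, IK_B), dh(ek_a, SPK_B), dh(ek_a, OPK_B))

# Bob comes online later; IK_A and EK_A arrive with Alice's first message.
bob_sk = kdf(dh(spk_b, IK_A), dh(ik_b, EK_A), dh(spk_b, EK_A), dh(opk_b, EK_A))

print(alice_sk == bob_sk)
```

Bob never had to be online while Alice computed her side: everything she needed was already sitting on the server as public keys.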

2. Double Ratchet -- Ongoing Message Encryption

After the initial X3DH handshake, every message uses a unique encryption key derived from a continuously evolving "ratchet."

Symmetric ratchet: After each message, the encryption key is advanced. Even if an attacker captures one key, they cannot decrypt previous messages (forward secrecy).
DH ratchet: Periodically, the parties exchange new DH public keys, deriving entirely new key material. This provides post-compromise security -- if an attacker compromises your keys, security is restored after the next DH exchange.

The result: every single message has a unique encryption key. Compromising one key reveals one message. The attacker gets nothing else.
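
The symmetric ratchet alone is small enough to sketch with the standard library. This follows the HMAC-based chain the Double Ratchet specification recommends (one constant byte yields the message key, another yields the next chain key); the function names are illustrative and the DH ratchet is left out.

```python
import hashlib
import hmac
from typing import List, Tuple


def ratchet_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive this message's key and the next chain key from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


def message_keys(initial_chain_key: bytes, count: int) -> List[bytes]:
    """Advance the chain `count` times, collecting one key per message."""
    keys, chain_key = [], initial_chain_key
    for _ in range(count):
        message_key, chain_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys


keys = message_keys(b"\x00" * 32, 5)   # all-zero start key is for the demo only
print(len(set(keys)))                  # number of distinct message keys
```

HMAC is one-way, so holding a later chain key does not let an attacker walk the chain backwards to earlier message keys. That is the forward secrecy property described above.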

3. Sender Key -- Group Optimization

For a group of N members, naive encryption would encrypt each message N times (once per recipient). Sender Key optimizes this:

The sender generates a random Sender Key.
The Sender Key is distributed to each group member via their pairwise encrypted channel (N encryptions, done once).
Subsequent group messages are encrypted once with the Sender Key (1 encryption instead of N).
Sender Key is rotated when members join or leave.

For a 1,024-member group, this reduces per-message encryption from 1,024 operations to 1.
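
A rough way to compare the two schemes is to count encryption operations over a run of messages. The helper names are made up, and the count treats every member as a recipient, as the text does.

```python
def pairwise_encryptions(members: int, messages: int) -> int:
    """Naive scheme: every message is encrypted once per member."""
    return members * messages


def sender_key_encryptions(members: int, messages: int, rotations: int = 1) -> int:
    """Sender Key: one pairwise distribution per rotation, then one encryption per message."""
    return members * rotations + messages


print(pairwise_encryptions(1024, 100))
print(sender_key_encryptions(1024, 100))
```

The distribution cost is paid again on every rotation, so groups with heavy membership churn see less of the benefit.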

Architectural implications of E2E encryption:

No server-side search. The server cannot read messages, so it cannot index them. Search is client-side only.
No server-side content moderation. Spam detection must work on metadata (send rate, account age, reported status) not content.
Media encryption is separate. The sender encrypts media with a random AES-256 key, uploads the encrypted blob to blob storage, and sends the key inside the E2E encrypted message. The server stores encrypted media it cannot decrypt.
Backup encryption is the user's problem. E2E encrypted cloud backups (opt-in since October 2021) require the user to manage a password or 64-digit key. Lose it, and the backup is permanently unrecoverable.

Deep Dives

Deep Dive 1: Group Messaging Fan-Out

When a message is sent to a 1,024-member group:

Sender encrypts once with the Sender Key and sends to the server.
Server looks up the group member list from the ChatParticipant table.
Server creates an Inbox entry for each of the 1,024 members (fan-out on write).
Server publishes to the pub/sub channel for each online member.

Write amplification math: A single group message to a 1,024-member group generates 1,024 Inbox writes. If the group has 100 messages/hour, that is 102,400 Inbox writes/hour for one group. Across all groups, this is substantial but manageable because most groups are small (2-10 members) and large groups are rare.
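
Fan-out on write, as described in the steps above, can be sketched like this. The function signature, the dict-based Inbox, and the publish callback are assumptions for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set


def fan_out(
    message_id: str,
    recipients: Iterable[str],
    inbox: Dict[str, List[str]],
    online: Set[str],
    publish: Callable[[str, str], None],
) -> int:
    """One Inbox write per recipient; pub/sub push only for those currently online."""
    writes = 0
    for user_id in recipients:
        inbox.setdefault(user_id, []).append(message_id)   # durable entry first
        writes += 1
        if user_id in online:
            publish(f"user:{user_id}", message_id)          # best-effort real-time push
    return writes


members = [f"user-{i}" for i in range(1024)]
print(fan_out("m1", members, {}, set(members[:10]), lambda channel, mid: None))
```

The write count grows linearly with group size regardless of how many members are online, which is the amplification the paragraph above quantifies.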

Why the group limit is 1,024, not unlimited: The cap exists to bound fan-out. Each group message creates N inbox entries and N pub/sub publishes. At 1,024, this is tractable. At 100K (Slack channel size), you need fundamentally different architecture -- which is exactly what Lesson 3 covers.

Delivery ordering in groups: WhatsApp uses server-side timestamps for per-conversation ordering. All group members see messages in the same order (the order the server received them). This is not global ordering -- messages across different groups may be ordered differently. For chat, this is fine. Users care about conversation-local ordering, not cross-conversation consistency.

Deep Dive 2: Multi-Device Support

Modern WhatsApp supports up to 4 linked devices per account. This adds significant complexity:

The problem: When User A sends a message to User B, it must be delivered to all of B's devices (phone, laptop, tablet, web). Each device has its own encryption keys.

Solution:

Each device registers its own identity key and prekeys with the server.
Server maintains a Clients table mapping user_id -> [device_id_1, device_id_2, ...].
When a message is sent to User B, the server creates a separate Inbox entry for each of B's devices.
Each device independently fetches and ACKs its Inbox entries.
A message is considered "delivered" when ANY of B's devices ACKs it.
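
The per-device Inbox and the "any device ACK means delivered" rule can be sketched as follows. DeviceInboxes and its fields are hypothetical names, not WhatsApp's schema.

```python
from typing import Dict, List, Set, Tuple


class DeviceInboxes:
    """Per-device pending queues plus a per-message delivered flag."""

    def __init__(self, devices_by_user: Dict[str, List[str]]) -> None:
        self.devices_by_user = devices_by_user                   # the "Clients table"
        self.pending: Dict[Tuple[str, str], List[str]] = {}      # (user, device) -> message ids
        self.delivered: Set[str] = set()                         # ACKed by at least one device

    def enqueue(self, user_id: str, message_id: str) -> None:
        for device_id in self.devices_by_user[user_id]:
            self.pending.setdefault((user_id, device_id), []).append(message_id)

    def ack(self, user_id: str, device_id: str, message_id: str) -> bool:
        """Remove this device's entry; True only on the first ACK for the message."""
        self.pending[(user_id, device_id)].remove(message_id)
        first_ack = message_id not in self.delivered
        self.delivered.add(message_id)
        return first_ack
```

Returning True only on the first ACK gives the server a natural trigger for sending the double check to the sender exactly once, while the other devices keep their own entries until they fetch.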

Encryption complexity: With Sender Key for groups, the Sender Key must be distributed to each device of each group member. For a 1,024-member group where each member has 4 devices, that is 4,096 pairwise key distributions when the Sender Key is first established. This is a one-time cost per key rotation.

History sync: When a new device is linked, it does NOT receive historical messages from the server (the server does not have them in plaintext). Instead, the primary device transfers encrypted history to the new device via a direct encrypted connection. This is why linking a new WhatsApp device takes a while and requires the phone to be nearby.

Deep Dive 3: Delivery Receipts at Scale

The check mark system seems simple but generates enormous metadata traffic:

Single check (sent): Server ACKs receipt of the message. This is a server -> sender response on the existing connection. Trivial.
Double check (delivered): Recipient's device ACKs download. This is a recipient -> server -> sender message flow. One additional message per original message.
Blue checks (read): Recipient opens the conversation. This generates a "read up to message X" notification. One additional message per conversation open, not per message read.

Scale math: If 100B messages/day each generate 2 receipt messages (delivered + read), that is 200B additional small messages/day -- tripling the total message volume. In practice, receipts are batched (one "read" receipt covers all messages up to a timestamp) and are best-effort (losing a receipt is annoying, not catastrophic).

Optimization: read receipts are not persisted. They flow through the real-time path only. If the sender is offline when the read receipt arrives, they see the blue checks on next app open (the client caches read state locally). This avoids doubling the Inbox write load.
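
On the sender's client, batched receipts reduce to two watermarks per conversation. The sketch below assumes message IDs are integers that increase within a conversation; the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ConversationReceipts:
    delivered_up_to: int = 0
    read_up_to: int = 0

    def on_receipt(self, kind: str, message_id: int) -> None:
        """Watermarks only move forward, so stale or duplicate receipts are harmless."""
        if kind == "delivered":
            self.delivered_up_to = max(self.delivered_up_to, message_id)
        elif kind == "read":
            self.read_up_to = max(self.read_up_to, message_id)
            # A message that was read must also have been delivered.
            self.delivered_up_to = max(self.delivered_up_to, message_id)

    def checks(self, message_id: int) -> str:
        if message_id <= self.read_up_to:
            return "blue"
        if message_id <= self.delivered_up_to:
            return "double"
        return "single"
```

A lost receipt is repaired by the next one that arrives, because each receipt covers everything up to its ID. That is why best-effort delivery is acceptable here.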

Alternative Designs

WebSocket Instead of Raw TCP

Most interview candidates (and most real systems built today) use WebSocket instead of raw TCP with a custom binary protocol. This is perfectly reasonable.

Trade-offs:

WebSocket adds ~2-6 bytes of framing overhead per message. At 100B messages/day, this is ~200-600 GB/day of extra overhead. Measurable but not a dealbreaker.
WebSocket starts as an HTTP upgrade request, usually on port 443, which means it passes through corporate firewalls and proxies more reliably than raw TCP on a custom port.
WebSocket has better library support in every language.
FunXMPP over raw TCP was the right call in 2009 when WhatsApp started. WebSocket would be the right call today.

Kafka Instead of Redis Pub/Sub

Some candidates propose Kafka for message routing. This is a reasonable instinct but a poor fit here.

Why Kafka fails:

One topic per user means billions of topics. Each Kafka partition has ~50 KB of overhead. 1 billion users x 50 KB = 50 TB of partition metadata alone.
Kafka's consumer groups assume fairly stable membership. Users constantly connect and disconnect, so membership would churn continuously, and every change triggers a rebalance. Kafka is not optimized for this.
Kafka's pull-based model adds latency (consumers poll for messages). Redis Pub/Sub pushes immediately.

Common Mistake: Kafka Topic Per User

A common interview mistake is proposing a Kafka topic per user for message delivery. Kafka's metadata overhead makes this impractical beyond ~10-20K topics -- each partition requires ZooKeeper/KRaft metadata, segment files on disk, and replication state across brokers. At 1 billion users, you would need 1 billion partitions minimum. The correct split: use Kafka for durable message storage and delivery guarantees (the Inbox write path, async fan-out, audit trail), but use Redis Pub/Sub or a custom connection registry for real-time push to connected clients. Redis Pub/Sub handles millions of channels as lightweight in-memory pointers to subscriber connections with negligible per-channel overhead -- creating a channel costs a few bytes, not 50 KB.

Where Kafka fits: Kafka is excellent for the async processing pipeline -- feeding messages to the spam detection service, analytics, push notification triggers. It is the wrong tool for the real-time delivery path.

DynamoDB Instead of Mnesia

WhatsApp used Mnesia (Erlang's built-in distributed database) for session data. For an interview, DynamoDB is a reasonable choice for the Messages and Inbox tables:

Messages table: Partition key = chat_id, sort key = message_id (Snowflake ID for chronological sorting).
Inbox table: Partition key = user_id, sort key = message_id. Supports efficient "fetch all pending messages for this user" queries.
TTL: DynamoDB supports per-item TTL, which handles the 30-day message expiry cleanly.
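
The sort key only works if message IDs sort chronologically. Here is a minimal sketch of a Snowflake-style generator using the usual 41-bit timestamp, 10-bit worker, 12-bit sequence layout. The epoch value and the class name are assumptions, and the clock is passed in so the behavior is deterministic.

```python
EPOCH_MS = 1_600_000_000_000   # hypothetical custom epoch, in milliseconds


class SnowflakeGenerator:
    """64-bit IDs: timestamp (ms since epoch) | 10-bit worker id | 12-bit sequence."""

    def __init__(self, worker_id: int) -> None:
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0

    def next_id(self, now_ms: int) -> int:
        if now_ms < self.last_ms:
            raise ValueError("clock moved backwards")
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:
                # Production generators usually wait for the next millisecond instead.
                raise OverflowError("sequence exhausted for this millisecond")
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return ((now_ms - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, sorting by message_id within a chat_id or user_id partition returns messages in roughly the order they were created, with worker ID and sequence breaking ties.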

Scaling Math

Connection Layer

Metric	Calculation	Result
Peak concurrent users	2B DAU x 50% concurrent	1B connections
Connections per server	WhatsApp benchmark	2M
Connection servers needed	1B / 2M	500 servers
Memory per connection	~3 KB (Erlang process)	--
Memory per server	2M x 3 KB	6 GB (just processes)

500 servers for the entire connection layer of the world's largest messaging app. This is why the "35 engineers" number is real.

Message Throughput

Metric	Calculation	Result
Messages per second	100B / 86,400	1.15M msg/sec
Per server throughput	1.15M / 500	2,300 msg/sec/server
Bandwidth (text only)	1.15M msg/sec x 100 bytes	115 MB/sec

2,300 messages per second per server is laughably easy. A single modern server can handle 100K+ messages per second. The bottleneck is never CPU or network -- it is holding the connections open.

Storage

Metric	Calculation	Result
Text data per day	100B x 100 bytes	10 TB/day
30-day retention	10 TB x 30	300 TB
Media per day	5B media msgs x 500 KB	2.5 PB/day
User metadata	3B x 1 KB	3 TB (fits in memory)

Text storage is trivial. Media dominates, but media is stored in blob storage (S3-equivalent), which is a solved problem.

Inbox Throughput

Metric	Calculation	Result
Inbox writes per second	1.15M msgs x ~1.5 avg recipients	~1.7M writes/sec
Inbox reads per second	~500M reconnections/day / 86,400	~5,800/sec (bursty)

The Inbox is write-heavy, read-bursty. DynamoDB or Cassandra handle this well.
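
The media and Inbox rows can be checked the same way as the earlier estimates. All inputs are the figures from the tables above; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Storage: 5B media messages/day at ~500 KB each.
media_pb_per_day = 5e9 * 500e3 / 1e15

# Inbox: every message writes one entry per recipient (~1.5 on average).
messages_per_sec = 100e9 / SECONDS_PER_DAY
inbox_writes_per_sec = messages_per_sec * 1.5

# Inbox reads: one fetch per reconnection, averaged over the day.
inbox_reads_per_sec = 500e6 / SECONDS_PER_DAY

print(f"media PB/day:     {media_pb_per_day:.1f}")
print(f"inbox writes/sec: {inbox_writes_per_sec:,.0f}")
print(f"inbox reads/sec:  {inbox_reads_per_sec:,.0f}")
```

The read figure is a daily average. Since reconnections cluster after outages, the peak the Inbox must absorb is much higher than this number suggests.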

Failure Analysis

Connection Server Dies

Impact: ~2M users lose their TCP connections.

Recovery:

1. Load balancer detects server failure via health check (5-10 seconds).
2. Clients detect connection loss (TCP keepalive timeout or missed heartbeat).
3. Clients reconnect with exponential backoff + jitter to other connection servers.
4. On reconnect, client fetches Inbox for missed messages.
5. No messages are lost because all messages are written to durable storage before real-time delivery.

Recovery time: 10-30 seconds for most users. Some experience longer delays due to jitter.

Redis Pub/Sub Node Dies

Impact: Messages sent during the outage are not delivered in real-time.

Recovery: Messages are still in the Inbox. When recipients reconnect or the pub/sub layer recovers, pending messages are delivered via Inbox fetch. At-least-once delivery is preserved.

This is why the Inbox exists. Real-time delivery (pub/sub) is best-effort. Durable delivery (Inbox) is guaranteed.

Inbox Database Overload

Impact: Messages cannot be persisted. Real-time delivery still works for online users, but offline users lose messages.

Mitigation:

Write-ahead to a durable queue (Kafka) before attempting Inbox write. Retry on failure.
Circuit breaker on Inbox writes to prevent cascading failures.

This is the one failure that can cause message loss. It is the most critical path to protect.
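
A circuit breaker for the Inbox write path might look like the sketch below. The threshold of 5 consecutive failures and the 30-second reset window are assumed values, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call once the reset window passes."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True                                    # closed: normal operation
        return now - self.opened_at >= self.reset_after    # half-open after the window

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                              # close the circuit again

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now                           # open, or re-open after a failed trial
```

While the breaker is open, writes should be diverted to the durable queue rather than dropped, so the database gets room to recover without losing messages.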

Network Partition Between Data Centers

Impact: Users in one data center cannot message users in another.

Mitigation:

Messages are queued locally and forwarded when the partition heals.
Users in the same data center can still message each other.
Cross-DC replication of user session data allows the "other" DC to accept connections for affected users, but message routing requires eventual consistency.

Prekey Exhaustion

Impact: If a user's device has not uploaded new one-time prekeys and all existing ones are consumed, new conversations cannot be established with forward secrecy.

Mitigation:

Fall back to using only the signed prekey (which is semi-static). This provides weaker forward secrecy but still allows encrypted communication.
Client periodically checks prekey count and uploads new batches when running low.
Server alerts the client when prekey supply is critically low.
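
The server-side bookkeeping for these mitigations can be sketched as follows. PrekeyStore, its fields, and the low-watermark value are hypothetical, and keys are represented as plain strings.

```python
from typing import Dict, List, Optional


class PrekeyStore:
    """Server-side view of one device's published prekeys."""

    def __init__(self, signed_prekey: str, one_time_prekeys: List[str],
                 low_watermark: int = 10) -> None:
        self.signed_prekey = signed_prekey
        self.one_time_prekeys = list(one_time_prekeys)
        self.low_watermark = low_watermark

    def fetch_bundle(self) -> Dict[str, Optional[str]]:
        """Hand out each one-time prekey once; fall back to the signed prekey alone."""
        one_time = self.one_time_prekeys.pop(0) if self.one_time_prekeys else None
        return {"signed_prekey": self.signed_prekey, "one_time_prekey": one_time}

    def needs_refill(self) -> bool:
        """True when the server should tell the client to upload a new batch."""
        return len(self.one_time_prekeys) < self.low_watermark

    def upload(self, new_keys: List[str]) -> None:
        self.one_time_prekeys.extend(new_keys)
```

A bundle with no one-time prekey still lets the initiator complete the handshake using the signed prekey, which is the degraded mode described above.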

Level Expectations

Level	What You Must Cover	What Sets You Apart
Mid-Level	Persistent connections (WebSocket), message DB with user-partitioned inbox, online/offline delivery paths, basic E2E encryption mention	Clean separation of concerns, mentions delivery receipts, discusses message ordering
Senior	Full store-and-forward with ACK-based cleanup, Redis Pub/Sub for routing (with reasoning against Kafka), Signal Protocol overview (X3DH + Double Ratchet), group messaging with fan-out analysis, multi-device support	Explains WHY the server cannot search messages (E2E constraint), discusses Sender Key for groups, calculates connection server count from benchmark numbers
Staff+	Erlang/BEAM architecture with specific numbers (2M connections/server, 35 engineers), FunXMPP protocol evolution from XMPP, Sender Key rotation on membership change, prekey exhaustion handling, CDC from message writes for fan-out reliability, cell-based architecture for blast radius isolation	Mentions FreeBSD over Linux tradeoff, discusses push notification timing (wait 60s for ACK before sending push), knows the group limit is 1,024 not 100, discusses media encryption as separate flow from text
