Messaging & Real-Time

TL;DR

Discord migrated trillions of messages from Cassandra to ScyllaDB and solved hot partitions with request coalescing in Rust. Slack uses Vitess to shard MySQL at 2.3M QPS. WhatsApp uses Erlang for lightweight connections and deletes messages from the server after delivery. Each chose a radically different architecture driven by their specific access patterns.

Discord — From MongoDB to ScyllaDB

The Journey

Discord's database evolution is one of the best-documented migrations in the industry:

Year	Database	Scale	Why They Changed
2015	MongoDB	Millions of messages	Worked fine at startup scale
2017	Cassandra	Billions of messages, 12 nodes	Needed horizontal scaling + time-ordered reads
2022	ScyllaDB	Trillions of messages, 177→72 nodes	Cassandra couldn't handle the latency requirements

Why Cassandra Made Sense (Initially)

Discord's access pattern is ideal for wide-column stores: - Write-heavy: Millions of messages per second - Time-ordered: Messages within a channel are naturally sorted by time - Partition-friendly: All messages in one channel can be on one partition

The shard key was (channel_id, bucket) — a channel's messages stored together, bucketed by time period to prevent unbounded partition growth.

Why Cassandra Broke

At 177 nodes and trillions of messages, three problems emerged:

1. Hot partitions: Popular channels (gaming servers with millions of members) generated disproportionate read traffic on their partition. A single hot partition could cause cascading latency across the entire cluster.

2. Java garbage collection: Cassandra runs on the JVM. At scale, GC pauses lasted seconds — sometimes cascading into multi-minute outages that required manual operator intervention. This became the primary source of on-call toil.

3. Compaction storms: Cassandra's LSM-tree storage requires periodic compaction (merging SSTables). On large partitions, compaction consumed so much I/O that read latency spiked unpredictably.

The Solution: ScyllaDB + Data Services

ScyllaDB: A C++ rewrite of Cassandra. Same data model, same CQL query language, but: - No garbage collector (C++ manages memory directly) - Shard-per-core architecture (each CPU core owns its own data, eliminating contention) - More predictable latency under load

Data services layer: A Rust intermediary service between the application and database, exposed via gRPC. Its killer feature: request coalescing.

class RequestCoalescer:
    def __init__(self):
        self.in_flight = {}  # key → Future

    async def get(self, channel_id):
        if channel_id in self.in_flight:
            return await self.in_flight[channel_id]     # wait for existing query

        future = asyncio.get_event_loop().create_future()
        self.in_flight[channel_id] = future
        try:
            result = await db.query(f"SELECT * FROM messages WHERE channel_id = '{channel_id}'")
            future.set_result(result)                    # fan out to all waiters
            return result
        finally:
            del self.in_flight[channel_id]

1,000 users open #general simultaneously → 1 database read, result fanned out to all 1,000. The database sees 1 query instead of 1,000.

Results

Metric	Cassandra	ScyllaDB
p99 read latency	40-125ms	15ms
p99 write latency	5-70ms	5ms
Cluster size	177 nodes	72 nodes
GC-related incidents	Regular	None

Migration: Trillions of messages migrated in 9 days using a Rust migrator processing 3.2 million records per second. The original Python migrator was too slow — rewriting it in Rust took about a day and made the migration feasible.

The Lesson

The data model (wide-column, partitioned by channel + time bucket) was right from the start. The problems were operational: GC pauses, hot partitions, compaction storms. Discord solved them by: 1. Switching to a GC-free implementation (ScyllaDB) 2. Adding an application-layer cache with request coalescing (data services)

Interview Tip

Discord is the perfect case study to cite when discussing wide-column stores, hot partition problems, or the difference between a data model problem and an implementation problem. "Discord's data model was right — messages partitioned by channel and time. But Cassandra's JVM overhead became the bottleneck at scale, so they moved to ScyllaDB (same model, C++ implementation) and added request coalescing to solve hot partitions."

Slack — MySQL Sharding with Vitess

The Architecture

Slack chose a fundamentally different approach: relational database with a sharding layer.

Vitess sits between Slack's application and MySQL: - Application sees one logical database - Vitess routes queries to the correct MySQL shard - Sharding, connection pooling, and query rewriting happen transparently

Sharding Strategy

Slack shards by workspace (team/company). All data for one workspace lives on the same shard: - Messages, channels, users, files — all co-located - "Get messages in channel X" → one shard, one query - No cross-shard joins for workspace-scoped operations (which is virtually all operations)

This is the simplest possible sharding strategy: tenant-based isolation. When your access patterns are naturally scoped to a tenant, shard by tenant.

Scale

2.3 million queries per second through Vitess
Survived the COVID remote-work surge without re-architecture
Online resharding for capacity planning: split a shard when a workspace grows too large

The Lesson

Slack's choice contradicts the common "SQL doesn't scale" narrative. By using Vitess, they get: - Full SQL with joins (within a workspace) - ACID transactions - Familiar MySQL tooling and operational expertise - Horizontal scaling through transparent sharding

No need to rewrite the application. No need to learn a new data model. Just add a sharding layer.

WhatsApp — Simplicity at Massive Scale

A Radically Different Approach

WhatsApp serves 2+ billion users with a remarkably small engineering team. Their secret: extreme simplicity in data modeling.

Connection handling: Erlang/BEAM virtual machine. Each user connection is a lightweight Erlang process (~2KB of memory). A single server handles millions of concurrent connections. Compare this to Java threads (~512KB each) — at 10 million connections, Erlang uses ~20GB vs Java's ~5TB.

User metadata: Mnesia, Erlang's built-in distributed database. Not a traditional database — it's tightly integrated with the Erlang runtime for ultra-fast lookups of online status, routing information, and session data.

Message storage: Messages are stored on the server only until delivered. Once both check marks appear (delivered to recipient's device), the message is deleted from the server. Long-term storage lives on the user's device (using SQLite).

End-to-End Encryption Changes the Data Model

Because WhatsApp uses end-to-end encryption, the server literally cannot read message content. This means: - No server-side search ("search messages" runs on your phone's local SQLite) - No server-side analytics on message content - No need for complex server-side message indexes - The server is essentially a message relay, not a message store

The Lesson

WhatsApp's data model is optimized for throughput and simplicity, not features. They don't need full-text search, message reactions databases, or rich media indexing on the server — the encrypted messages are opaque blobs that get relayed and deleted.

The architectural lesson: your data model should reflect what the server actually needs to do with the data. If the server is just a relay, the model is simple. If the server needs to search, rank, and analyze content, the model is complex.

Comparison

	Discord	Slack	WhatsApp
Primary DB	ScyllaDB (wide-column)	MySQL via Vitess (relational)	Mnesia + temporary storage
Shard key	channel_id + time_bucket	workspace_id	User routing info
Message retention	Forever (server-side)	Forever (server-side)	Until delivered (then deleted)
Search	Elasticsearch (server-side)	Elasticsearch (server-side)	SQLite (client-side)
Key innovation	Request coalescing	Transparent MySQL sharding	Erlang lightweight processes
Team size at scale	~100 engineers	~1000 engineers	~50 engineers

Quick Recap

Company	Key Insight
Discord	Wide-column model was right; solved operational issues by switching to ScyllaDB + request coalescing
Slack	MySQL + Vitess = SQL at scale. Shard by tenant for natural query isolation.
WhatsApp	Server is a relay, not a store. Extreme simplicity enables massive scale with a small team.