Messaging & Real-Time
TL;DR
Discord migrated trillions of messages from Cassandra to ScyllaDB and solved hot partitions with request coalescing in Rust. Slack uses Vitess to shard MySQL at 2.3M QPS. WhatsApp uses Erlang for lightweight connections and deletes messages from the server after delivery. Each chose a radically different architecture driven by their specific access patterns.
Discord — From MongoDB to ScyllaDB
The Journey
Discord's database evolution is one of the best-documented migrations in the industry:
| Year | Database | Scale | Why They Changed |
|---|---|---|---|
| 2015 | MongoDB | Millions of messages | Worked fine at startup scale |
| 2017 | Cassandra | Billions of messages, 12 nodes | Needed horizontal scaling + time-ordered reads |
| 2022 | ScyllaDB | Trillions of messages, 177→72 nodes | Cassandra couldn't handle the latency requirements |
Why Cassandra Made Sense (Initially)
Discord's access pattern is ideal for wide-column stores: - Write-heavy: Millions of messages per second - Time-ordered: Messages within a channel are naturally sorted by time - Partition-friendly: All messages in one channel can be on one partition
The shard key was (channel_id, bucket) — a channel's messages stored together, bucketed by time period to prevent unbounded partition growth.
Why Cassandra Broke
At 177 nodes and trillions of messages, three problems emerged:
1. Hot partitions: Popular channels (gaming servers with millions of members) generated disproportionate read traffic on their partition. A single hot partition could cause cascading latency across the entire cluster.
2. Java garbage collection: Cassandra runs on the JVM. At scale, GC pauses lasted seconds — sometimes cascading into multi-minute outages that required manual operator intervention. This became the primary source of on-call toil.
3. Compaction storms: Cassandra's LSM-tree storage requires periodic compaction (merging SSTables). On large partitions, compaction consumed so much I/O that read latency spiked unpredictably.
The Solution: ScyllaDB + Data Services
ScyllaDB: A C++ rewrite of Cassandra. Same data model, same CQL query language, but: - No garbage collector (C++ manages memory directly) - Shard-per-core architecture (each CPU core owns its own data, eliminating contention) - More predictable latency under load
Data services layer: A Rust intermediary service between the application and database, exposed via gRPC. Its killer feature: request coalescing.
class RequestCoalescer:
def __init__(self):
self.in_flight = {} # key → Future
async def get(self, channel_id):
if channel_id in self.in_flight:
return await self.in_flight[channel_id] # wait for existing query
future = asyncio.get_event_loop().create_future()
self.in_flight[channel_id] = future
try:
result = await db.query(f"SELECT * FROM messages WHERE channel_id = '{channel_id}'")
future.set_result(result) # fan out to all waiters
return result
finally:
del self.in_flight[channel_id]
1,000 users open #general simultaneously → 1 database read, result fanned out to all 1,000. The database sees 1 query instead of 1,000.
Results
| Metric | Cassandra | ScyllaDB |
|---|---|---|
| p99 read latency | 40-125ms | 15ms |
| p99 write latency | 5-70ms | 5ms |
| Cluster size | 177 nodes | 72 nodes |
| GC-related incidents | Regular | None |
Migration: Trillions of messages migrated in 9 days using a Rust migrator processing 3.2 million records per second. The original Python migrator was too slow — rewriting it in Rust took about a day and made the migration feasible.
The Lesson
The data model (wide-column, partitioned by channel + time bucket) was right from the start. The problems were operational: GC pauses, hot partitions, compaction storms. Discord solved them by: 1. Switching to a GC-free implementation (ScyllaDB) 2. Adding an application-layer cache with request coalescing (data services)
Interview Tip
Discord is the perfect case study to cite when discussing wide-column stores, hot partition problems, or the difference between a data model problem and an implementation problem. "Discord's data model was right — messages partitioned by channel and time. But Cassandra's JVM overhead became the bottleneck at scale, so they moved to ScyllaDB (same model, C++ implementation) and added request coalescing to solve hot partitions."
Slack — MySQL Sharding with Vitess
The Architecture
Slack chose a fundamentally different approach: relational database with a sharding layer.
Vitess sits between Slack's application and MySQL: - Application sees one logical database - Vitess routes queries to the correct MySQL shard - Sharding, connection pooling, and query rewriting happen transparently
Sharding Strategy
Slack shards by workspace (team/company). All data for one workspace lives on the same shard: - Messages, channels, users, files — all co-located - "Get messages in channel X" → one shard, one query - No cross-shard joins for workspace-scoped operations (which is virtually all operations)
This is the simplest possible sharding strategy: tenant-based isolation. When your access patterns are naturally scoped to a tenant, shard by tenant.
Scale
- 2.3 million queries per second through Vitess
- Survived the COVID remote-work surge without re-architecture
- Online resharding for capacity planning: split a shard when a workspace grows too large
The Lesson
Slack's choice contradicts the common "SQL doesn't scale" narrative. By using Vitess, they get: - Full SQL with joins (within a workspace) - ACID transactions - Familiar MySQL tooling and operational expertise - Horizontal scaling through transparent sharding
No need to rewrite the application. No need to learn a new data model. Just add a sharding layer.
WhatsApp — Simplicity at Massive Scale
A Radically Different Approach
WhatsApp serves 2+ billion users with a remarkably small engineering team. Their secret: extreme simplicity in data modeling.
Connection handling: Erlang/BEAM virtual machine. Each user connection is a lightweight Erlang process (~2KB of memory). A single server handles millions of concurrent connections. Compare this to Java threads (~512KB each) — at 10 million connections, Erlang uses ~20GB vs Java's ~5TB.
User metadata: Mnesia, Erlang's built-in distributed database. Not a traditional database — it's tightly integrated with the Erlang runtime for ultra-fast lookups of online status, routing information, and session data.
Message storage: Messages are stored on the server only until delivered. Once both check marks appear (delivered to recipient's device), the message is deleted from the server. Long-term storage lives on the user's device (using SQLite).
End-to-End Encryption Changes the Data Model
Because WhatsApp uses end-to-end encryption, the server literally cannot read message content. This means: - No server-side search ("search messages" runs on your phone's local SQLite) - No server-side analytics on message content - No need for complex server-side message indexes - The server is essentially a message relay, not a message store
The Lesson
WhatsApp's data model is optimized for throughput and simplicity, not features. They don't need full-text search, message reactions databases, or rich media indexing on the server — the encrypted messages are opaque blobs that get relayed and deleted.
The architectural lesson: your data model should reflect what the server actually needs to do with the data. If the server is just a relay, the model is simple. If the server needs to search, rank, and analyze content, the model is complex.
Comparison
| Discord | Slack | ||
|---|---|---|---|
| Primary DB | ScyllaDB (wide-column) | MySQL via Vitess (relational) | Mnesia + temporary storage |
| Shard key | channel_id + time_bucket | workspace_id | User routing info |
| Message retention | Forever (server-side) | Forever (server-side) | Until delivered (then deleted) |
| Search | Elasticsearch (server-side) | Elasticsearch (server-side) | SQLite (client-side) |
| Key innovation | Request coalescing | Transparent MySQL sharding | Erlang lightweight processes |
| Team size at scale | ~100 engineers | ~1000 engineers | ~50 engineers |
Quick Recap
| Company | Key Insight |
|---|---|
| Discord | Wide-column model was right; solved operational issues by switching to ScyllaDB + request coalescing |
| Slack | MySQL + Vitess = SQL at scale. Shard by tenant for natural query isolation. |
| Server is a relay, not a store. Extreme simplicity enables massive scale with a small team. |