Design a Top-K Trending System
TL;DR
The question is really about streaming algorithms. Exact counting of 4 billion distinct videos requires 64 GB of memory. A Count-Min Sketch does the same job in 109 KB. A Space-Saving algorithm tracks the top-1000 items in 120 KB. These are not approximations you tolerate -- they are approximations that perform better than exact solutions because they fit in L1 cache. The system architecture is straightforward: Kafka for event ingestion, Flink for stream processing, Redis for serving. The algorithm choice is the interview. Know Count-Min Sketch AND Space-Saving -- most candidates only know one.
The System
Think YouTube's "Trending" page. At any given moment, you can see the most-viewed videos from the last hour, last day, and all-time. YouTube processes 70 billion views per day across 4 billion distinct videos. The system must identify the top-1000 most-viewed videos in near-real-time (within 1-5 minutes of the actual ranking).
This is not a storage problem. It is not a consistency problem. It is a counting problem at extreme scale -- and the solution lies in streaming algorithms that use sub-linear space to track heavy hitters in a data stream.
The insight that separates strong candidates: you do NOT need to count every video. You only need to count the heavy hitters. A video with 3 views in the last hour will never be in the top-1000. Approximate counting algorithms exploit this fact, reducing memory from 64 GB to under 1 MB.
Requirements
Functional Requirements
| Requirement | Details |
|---|---|
| Top-K videos | Return the K most-viewed videos for configurable time windows |
| Time windows | Last 1 hour, last 24 hours, last 7 days, last 30 days, all-time |
| Near-real-time | Results reflect views from the last 1-5 minutes |
| API | GET /trending?window=1h&k=100 returns top-K videos with view counts |
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Freshness | Top-K reflects views from the last 1-5 minutes |
| Accuracy | Top-100 should be exact or within 1-2 positions of exact |
| Read latency | < 50 ms (pre-computed results, just a cache lookup) |
| Write throughput | 810K view events per second sustained |
| Availability | 99.9% (trending is not a critical feature -- slight staleness is acceptable) |
Back-of-Envelope Math
Views per day: 70 billion
Views per second: 70B / 86400 = ~810,000
Distinct videos (total): ~4 billion
Active videos per hour: ~100 million (estimate)
Event size: ~64 bytes (video_id: 8B, timestamp: 8B, metadata: 48B)
Event throughput: 810K * 64 = ~52 MB/sec
Exact counting memory:
All videos: 4B videos * 16 bytes (id + count) = 64 GB
Active per hour: 100M * 16 bytes = 1.6 GB
Approximate counting memory:
Count-Min Sketch: ~109 KB (epsilon=0.001, delta=0.01)
Space-Saving (5000): 5000 * 24 bytes = ~120 KB
The 600,000x space reduction from exact to approximate is the entire point of this design.
Naive Design
Approach: Hash map of video_id -> count per time window.
For a 1-hour window:
1. Maintain a HashMap<video_id, count> in memory.
2. For each view event, increment map[video_id].
3. At the end of the hour, sort by count, take the top K.
4. Clear the map, start fresh.
For all-time: Never clear the map. Just accumulate forever.
This works. At 100M active videos per hour, the map consumes 1.6 GB -- fits in a single server's memory. For the 1-hour window, you are done. Ship it.
The problem emerges with multiple windows. For "top-K in the last 24 hours," you need to merge 24 hourly maps. For "last 30 days," you need to merge 720 hourly maps. And the merge is where things get tricky.
Where It Breaks
Problem 1: Multi-Window Merging Loses Accuracy
To get the top-1000 for the last 24 hours, you merge 24 hourly top-1000 lists:
- Take the union of all video_ids across 24 hourly lists.
- Sum their counts.
- Sort and take the top 1000.
The failure case: Video A is #1001 (just outside top-1000) in every single hourly window. Its total daily count might be higher than any video that was in the top-1000 of a few hours. But since Video A never appeared in any hourly top-1000 list, it is invisible during the merge. You completely miss a video that should be in the daily top-1000.
Problem 2: Sliding Windows Are Expensive
A "last 24 hours" window that updates every minute means each event belongs to 1,440 overlapping windows (24 hours * 60 minutes). For a 30-day window with 1-minute granularity, each event belongs to 43,200 windows.
State per item with naive sliding windows:
1-month window, 1-minute slide:
43,200 * 8 bytes per video = 346 KB per video
100M active videos: 346 KB * 100M = 34.6 PETABYTES
This is absurd. Sliding windows at minute granularity are simply not feasible with exact counting at this scale.
Problem 3: All-Time Counts Grow Forever
The all-time top-K requires counts for all 4 billion videos. At 16 bytes each (8-byte ID + 8-byte count), that is 64 GB. This does not fit on a single machine. You need either sharding or approximate counting.
Problem 4: Flink is a Double-Edged Sword
Many candidates (and many online guides) jump straight to Apache Flink. Flink is the right tool for this job. But there is a catch: many interviewers are unfamiliar with Flink internals, and if you spend 10 minutes explaining Flink's windowing semantics, you consume valuable interview time without demonstrating system design skill. Have a Flink answer ready, but also have a non-Flink fallback.
Interview Tip
Many interviewers are unfamiliar with Flink. If you say "Flink," you may spend 10 minutes explaining its internals instead of demonstrating your system design skills. Safer framing: "a stream processing framework like Flink or Kafka Streams" -- this shows you know the tools exist without derailing into implementation details. Only go deep on Flink if the interviewer asks.
Real Design

Architecture
CDN / App Servers
│
▼
┌──────────┐
│ Kafka │ (partitioned by video_id)
│ Topic │
└────┬─────┘
│
▼
┌──────────────────┐
│ Stream Processor │ (Flink or custom)
│ │
│ ┌─────────────┐ │
│ │ 1-min window │ │ Count per video per minute
│ └──────┬──────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Top-K with │ │ Space-Saving or CMS + heap
│ │ safety factor│ │
│ └──────┬──────┘ │
└─────────┼─────────┘
│
▼
┌──────────────────┐
│ Aggregation │ Merge minute → hour → day → month
│ Service │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Redis │ Sorted sets per window
│ (serving cache) │
└──────────────────┘
│
▼
┌──────────────────┐
│ API Service │ GET /trending?window=1h&k=100
└──────────────────┘
Component 1: Kafka Ingestion
View events flow from CDN edge servers or application servers into a Kafka topic.
Partitioning: Partition by video_id. This ensures all views for the same video land on the same Kafka partition, enabling exact per-video counting on each partition without cross-partition coordination.
Partition sizing: At 810K events/sec and ~64 bytes/event, total throughput is ~52 MB/sec. A single Kafka partition handles ~150K events/sec for small messages. Minimum partitions: ceil(810K / 150K) = 6. In practice, use 50-100 partitions for parallelism.
Hot partition concern: The fastest YouTube video to reach 1 billion views (APT by Rose) averaged ~89 views/sec. Even with a 10x burst, that is 890 views/sec -- well within a single partition's capacity of 150K events/sec. Hot partitions are not a real problem for video view events.
Component 2: Tumbling Windows (The First Design Decision)
This is the first question to ask the interviewer: do you want tumbling or sliding windows?
Tumbling windows: Fixed-size, non-overlapping intervals. A 1-hour tumbling window runs from :00 to :00. Each event belongs to exactly one window. Simple state management.
Sliding windows: Fixed-size windows that advance by a "slide" interval. A 1-hour window with 1-minute slide produces a new result every minute. Each event belongs to 60 windows. State explodes.
Propose tumbling windows. If the interviewer objects ("but I want the top-K to update every minute"), explain the hierarchical aggregation approach: use 1-minute tumbling micro-windows, then merge them into hourly results. The "top-K for the last hour" is the merge of the last 60 one-minute micro-windows. This gives minute-level freshness without the state explosion of true sliding windows.
Component 3: Count-Min Sketch (The Algorithm Everyone Knows)
CMS is a 2D array of counters with dimensions w x d:
- Width
w = ceil(e / epsilon)-- controls accuracy - Depth
d = ceil(ln(1 / delta))-- controls failure probability
Update(video_id): For each row j, hash the video_id and increment counter[j][hash_j(video_id)].
Query(video_id): Return min over all j of counter[j][hash_j(video_id)].
Error guarantees: The estimate is always >= the true count (one-sided overestimate). With probability >= 1 - delta, the overestimate is at most epsilon * N where N is the total stream length.
YouTube sizing:
epsilon = 0.001 (allow 0.1% error)
delta = 0.01 (99% confidence)
w = ceil(e / 0.001) = 2,719
d = ceil(ln(100)) = 5
Total counters = 2,719 * 5 = 13,595
At 8 bytes per counter: 109 KB
109 KB to replace 64 GB. This is the number to say in the interview.
CMS alone is not enough: CMS answers point queries ("how many views does video X have?") but does NOT answer "what are the top-K videos?" You need to pair it with a min-heap of size K:
- For each view event for
video_id: CMS.update(video_id, 1)estimated_count = CMS.query(video_id)- If
estimated_count > heap.min()orheap.size < K: insert/update in heap - To query top-K: return heap contents, sorted.
CMS mergeability: Two CMS instances with the same hash functions merge by element-wise addition. This is essential for distributed counting -- each Kafka partition maintains its own CMS, and at window boundaries, all partition sketches merge into a global sketch with the same error guarantees.
Component 4: Space-Saving Algorithm
Space-Saving is purpose-built for top-K heavy hitters in data streams. Unlike CMS (which estimates frequencies but does not track identities), Space-Saving directly maintains a ranked list of the K most frequent items — no separate heap needed.
Paper: "Efficient Computation of Frequent and Top-k Elements in Data Streams" by Metwally, Agrawal, and El Abbadi (ICDT 2005). This is the original paper that introduced the algorithm and the Stream-Summary data structure.
Data structure — Stream-Summary: A fixed-size map of m monitored elements. Each entry stores three fields:
item_id: the video/URL/key being trackedcount: the estimated frequency (may overestimate, never underestimates)error: the maximum possible overestimation (equal to the count of the element that was evicted to make room)
Internally, the original paper implements this as a doubly-linked list of buckets ordered by count. Each bucket holds all elements with that count. This gives O(1) for finding the minimum (head of list) and O(1) for incrementing (move element to the next bucket, creating it if needed).
Algorithm — processing each incoming event:
process(item):
if item is already in the monitored set:
item.count += 1 // O(1): move to next bucket
elif monitored_set.size < m:
add item with count = 1, error = 0 // room available
else: // set is full — must evict
e_min = element with smallest count // O(1): head of bucket list
item.error = e_min.count // worst case: item had this many
// events before we started tracking it
item.count = e_min.count + 1 // inherit count + the new event
replace e_min with item in the set // O(1): swap in same bucket, move up
Worked example — monitoring top-3 (m=3) in the stream [A, B, C, D, A, B, D, D]:
| Event | Monitored Set | Action |
|---|---|---|
| A | {A:1} | Add (room available) |
| B | {A:1, B:1} | Add (room available) |
| C | {A:1, B:1, C:1} | Add (set full now) |
| D | {A:1, B:1, D:2 (err=1)} | Evict C (min count=1). D inherits count 1+1=2 |
| A | {A:2, B:1, D:2 (err=1)} | A already tracked, increment |
| B | {A:2, B:2, D:2 (err=1)} | B already tracked, increment |
| D | {A:2, B:2, D:3 (err=1)} | D already tracked, increment |
| D | {A:2, B:2, D:4 (err=1)} | D already tracked, increment |
Final: D is guaranteed top-1 (count=4, error=1, so true count is between 3 and 4 — either way, highest). The error field tells us exactly how uncertain each estimate is.
Guarantees (proven in the paper):
- Any item with true frequency > N/m is guaranteed to be in the monitored set (no false negatives above this threshold)
- The overestimation for any monitored item is at most N/m (where N = total events seen)
- For top-K with K < m: if the K-th item's
count - errorexceeds the (K+1)-th item'scount, the top-K ranking is exact
Sizing for top-1000: Monitor m = 5,000 items (5x safety factor). Memory: 5,000 × 24 bytes (id + count + error) = 120 KB. All operations O(1).
Why Space-Saving beats CMS for this specific problem:
| Aspect | Count-Min Sketch | Space-Saving |
|---|---|---|
| Tracks item identities | No (needs separate heap) | Yes (built-in) |
| Update time | O(d) hash functions per update | O(1) per update |
| Space for top-1000 | ~109 KB (sketch) + heap overhead | ~120 KB total |
| Error type | Overestimate only, no per-item bound | Bounded per-item via error field |
| Mergeability | Yes (element-wise addition) | No (must merge top-K lists) |
| Best for | General frequency estimation | Specifically top-K heavy hitters |
Redis implementation: Redis 4.0+ provides a built-in TOPK command (via the RedisBloom module) that implements a probabilistic variant of Space-Saving. TOPK.ADD key item processes events; TOPK.LIST key returns the current top-K. Note that TOPK.ADD does not support DECRBY — you cannot subtract counts for sliding window expiry. For time-windowed top-K, use separate TOPK keys per time bucket.
When to Mention in Interviews
After you explain CMS, say: "CMS is the general-purpose frequency estimator. If the problem is specifically top-K, the Space-Saving algorithm (Metwally et al., ICDT 2005) is more natural — it directly tracks the top items with per-item error bounds, all operations are O(1), and it uses roughly the same memory." This shows you know there are purpose-built algorithms, not just the one everyone memorizes.
Component 5: Hierarchical Aggregation with Safety Factor
The architecture for multi-window top-K:
Each level aggregates results from the level below.
The safety factor: When merging top-K lists from sub-windows, store K * safety_factor items per sub-window.
If you need top-1000 globally, store top-5000 per hourly window. When merging 24 hourly top-5000 lists, the true global top-1000 is overwhelmingly likely to be captured.
Why a 5x safety factor works: For Zipf-distributed data (which video views follow), the K-th most popular item has frequency proportional to 1/K. The ratio between the 1000th and 5000th items is 5x, providing a large gap. An item would need to fluctuate by more than 5x its typical rank to be missed -- extremely unlikely in aggregate.
Merge procedure: 1. Take the union of all video_ids across 24 hourly top-5000 lists. 2. Sum their hourly counts. 3. Sort by total count. 4. Take the top 1000.
With a 5x safety factor, items ranked #1001 through #5000 in each hourly window are included in the merge. This dramatically reduces the chance of missing a globally top-1000 video.
Component 6: Redis Serving Layer
Pre-computed top-K results are stored in Redis sorted sets:
ZADD trending:1h video_id_1 count_1
ZADD trending:1h video_id_2 count_2
...
ZADD trending:24h video_id_1 daily_count_1
...
ZADD trending:all_time video_id_1 total_count_1
API queries become a single Redis call: ZREVRANGE trending:1h 0 99 WITHSCORES returns the top 100 videos for the last hour.
Update frequency: every 1 minute for the 1-hour window, every 5 minutes for the 24-hour window, every 15 minutes for longer windows. This matches user expectations -- nobody expects the "top videos this month" to update every second.
Deep Dives

Deep Dive 1: The Non-Flink Fallback
Many interviewers are unfamiliar with Flink and will redirect the conversation. Here is the alternative:
Architecture: Custom Go/Java service consuming from Kafka.
Each consumer instance:
1. Reads from one or more Kafka partitions.
2. Maintains an in-memory HashMap<video_id, count> for the current 1-minute window.
3. At the end of each minute, extracts the local top-5000, publishes to an output Kafka topic.
4. Clears the map, starts the next window.
An aggregation service: 1. Reads from the output topic. 2. Merges top-5000 lists from all partitions (union + sum counts + re-sort). 3. Publishes the global top-1000 to Redis.
Why this works: Since Kafka is partitioned by video_id, each partition has the complete count for each video it owns. The local top-5000 from each partition is correct -- it is not a partial count that needs cross-partition merging. The aggregation is just combining disjoint top-K lists.
The advantage over Flink: Simpler to explain, simpler to operate, no Flink cluster to manage. The disadvantage: no built-in watermarking, no checkpoint/restart, manual window management. For a 45-minute interview, the simplicity wins.
Deep Dive 2: Exactly-Once vs At-Least-Once
For view counting in a top-K system, at-least-once is acceptable.
Why: Small overcounts (a view counted twice) do not meaningfully affect which videos are in the top-K. If a video has 10 million views and 500 are duplicates, the ranking is unchanged. The alternative -- exactly-once via Kafka transactions + Flink checkpoints -- adds significant latency and complexity.
When exact counts matter: For monetization (paying creators per view), exact deduplication is required. Use idempotency keys: each view event carries a view_event_id (hash of user_id + video_id + timestamp). The counting service deduplicates by checking if the view_event_id was already processed. This requires per-event state, which is much more expensive than the streaming algorithms discussed above.
The practical answer: Use at-least-once for the trending system. If the interviewer insists on exact counts, explain that exact counting requires per-event deduplication (a separate system), and the trending system can consume from the deduplicated stream.
Deep Dive 3: Late Events and Watermarks
Events arrive out of order. A view event from 10:01:15 might arrive at 10:01:45 (30 seconds late). If the 10:01 window already closed, this event is "late."
Watermark mechanism: The stream processor tracks watermark = max(event_time) - allowed_lateness. Events with timestamps before the watermark are considered late.
For this system: Allow 30 seconds of lateness. The 10:01 window stays open until the watermark advances past 10:01:30 (i.e., until we have seen events with timestamps past 10:02:00). Events arriving after that are either dropped or sent to a side output for separate late processing.
Why lateness tolerance is fine: A few hundred missed events out of 810K per second do not affect the top-K ranking. The top-1000 videos each have thousands of views per minute. Missing a handful of late events changes nothing.
Alternative Designs
Alternative 1: Database with Periodic Batch Query
Store raw view events in a time-series database (TimescaleDB, ClickHouse). Run a periodic query:
SELECT video_id, COUNT(*) as views
FROM view_events
WHERE event_time >= NOW() - INTERVAL '1 hour'
GROUP BY video_id
ORDER BY views DESC
LIMIT 1000;
Pros: Exact results. Simple to understand. SQL-based tooling.
Cons: At 70B events/day, the table grows by 4.5 TB/day (64 bytes/event). Even with TimescaleDB's compression, the hourly query scans ~200 GB of data. Query time: minutes, not seconds. Not suitable for 1-minute freshness.
When to use: For daily/monthly top-K where minutes of compute time are acceptable. For hourly or more frequent updates, streaming is required.
Alternative 2: Redis Top-K with Time-Series Counters
Redis has a built-in TOPK data structure (probabilistic, based on Space-Saving) and CMS commands.
TOPK.RESERVE trending_1h 1000 50 7 0.925
TOPK.ADD trending_1h video_123 video_456 ...
TOPK.LIST trending_1h
Pros: Zero custom code. Redis handles the probabilistic data structure.
Cons: No built-in windowing -- you must manage window rotation manually. Redis CMS does not support DECRBY, so sliding windows are not possible. Single-threaded Redis may bottleneck at 810K updates/sec (Redis handles ~100K operations/sec per instance).
When to use: For moderate scale (< 100K events/sec) where operational simplicity is paramount. Pre-aggregate in Flink/Kafka consumers, then feed the pre-aggregated counts to Redis TOPK.
Alternative 3: MapReduce Batch Pipeline
A two-phase batch job:
Phase 1 (Count): map(event) -> (video_id, 1), reduce(video_id, counts) -> (video_id, sum).
Phase 2 (Top-K): Each mapper reads a partition of counts, maintains a local min-heap of size K, emits local top-K. A single reducer merges all local top-K lists.
Why the local top-K optimization matters: Without it, the reducer processes 100M items. With 100 mappers and K=1000, the reducer processes 100,000 items -- a 1000x reduction.
When to use: For hourly/daily batch computation. Run alongside the streaming pipeline for exact "ground truth" comparison. Production systems often have both: streaming for real-time, batch for periodic reconciliation.
Scaling Math
Kafka Cluster
Event rate: 810K events/sec
Event size: 64 bytes
Throughput: 52 MB/sec
Partitions: 100 (each handles ~8,100 events/sec)
Broker count: 5-10 (each handles 10-20 partitions)
Retention: 72 hours (for replay on consumer failure)
Storage: 52 MB/sec * 72h * 3600 = 13.5 TB (with 3x replication: 40 TB)
Stream Processing
Flink task slots: 100 (one per Kafka partition)
Memory per slot: HashMap for 1-min window: ~10M active videos * 16 bytes = 160 MB
OR Space-Saving with 5000 monitors: 120 KB
Use HashMap for exact 1-min counts, Space-Saving for longer windows
Total Flink memory: 100 * 200 MB = 20 GB (fits on 5 machines with 16 GB heap each)
Redis Serving
Sorted sets:
trending:1h 1000 entries * 16 bytes = 16 KB
trending:24h 1000 entries * 16 bytes = 16 KB
trending:7d 1000 entries * 16 bytes = 16 KB
trending:30d 1000 entries * 16 bytes = 16 KB
trending:all 1000 entries * 16 bytes = 16 KB
Total: 80 KB (yes, the entire serving layer fits in 80 KB)
Read QPS: ~1K (trending page is not high-traffic)
Write QPS: ~1/minute (result updates)
Single Redis instance: more than sufficient
Failure Analysis
| Failure | Impact | Mitigation |
|---|---|---|
| Kafka broker goes down | Events reroute to other brokers (replication factor 3). No data loss. | Standard Kafka replication. ISR (In-Sync Replicas) ensure durability. |
| Flink task fails | Window state for that partition lost. Replay from Kafka. | Flink checkpoints to durable storage (S3/HDFS) every 30 seconds. On failure, restore checkpoint and replay from Kafka offset. |
| Consumer lag exceeds 5 minutes | Top-K results become stale. | Monitor consumer lag. Auto-scale Flink tasks. Alert on lag > 2 minutes. |
| Redis goes down | API returns stale or no results. | Redis Sentinel for failover. Results are recomputed every minute, so brief outages are self-healing. |
| Hot video (viral event) | Single Kafka partition receives disproportionate traffic. | Not a real problem. Even a 10x viral spike (8,900 events/sec for one video) is well within single-partition capacity (150K events/sec). |
| Clock skew between servers | Events have inconsistent timestamps, causing window misassignment. | Use event_time (from the originating server), not processing_time. Tolerate 30 seconds of out-of-orderness via watermarks. Synchronize clocks with NTP. |
| Space-Saving evicts a valid top-K candidate | A video that should be in the top-1000 is evicted from the 5000-monitor set. | Increase the safety factor from 5x to 10x (monitor 10,000 items, still only 240 KB). This makes false eviction statistically negligible. |
Level Expectations
| Level | What the Interviewer Expects |
|---|---|
| Mid (L4) | HashMap counting + periodic sort. Kafka for event ingestion. Tumbling windows. Redis sorted sets for serving. This passes. |
| Senior (L5) | Everything above plus: Count-Min Sketch with sizing math (109 KB). Tumbling vs sliding window discussion (propose tumbling first). Hierarchical aggregation (minute -> hour -> day). Safety factor for merge accuracy. Flink or custom stream processing. |
| Staff (L6) | Everything above plus: Space-Saving algorithm (not just CMS). Explicit comparison of CMS vs Space-Saving for this use case. Back-of-envelope showing 64 GB exact vs 109 KB approximate. Non-Flink fallback design. At-least-once vs exactly-once trade-off. Watermarks and late event handling. Quantified Kafka partition sizing showing hot partitions are a non-issue. MapReduce batch for ground-truth reconciliation alongside streaming. |
References from Our Courses
- HyperLogLog and Count-Min — approximate frequency counting for trending topics
- Kafka Partitions and Ordering — streaming event ingestion for real-time aggregation
- Flink and Stream Processing — windowed top-K computation over event streams
Red Team This Design
Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.