Dashboard for Monitoring Slow Queries
Level: L4-L6
Topics: System Design, Monitoring, Distributed Systems
Problem Statement
Design an internet-facing dashboard that displays the slowest search queries from the last 24 hours. The goal is to help engineers identify and debug performance problems.
Start by designing the system for a single server handling all search traffic. Then scale the design to handle the full production load across many servers.
This is a discussion-based system design problem. You should walk through requirements, architecture, data flow, and scaling considerations.
Background & Constraints
- A "search query" has at least: query string, timestamp, and latency (response time).
- The dashboard should show the top N slowest queries (e.g., top 100) from a rolling 24-hour window.
- The dashboard is read by engineers, so it does not need sub-second refresh, but it should be reasonably up to date (e.g., refreshed every few minutes).
- In production, there could be thousands of search servers handling millions of queries per second.
- Latency data is generated on the search servers at query completion time.
Examples
Dashboard output (conceptual):
Rank | Query                  | Latency (ms) | Timestamp
-----|------------------------|--------------|--------------------
1    | "rare antique clocks"  | 12,450       | 2024-03-15 14:22:03
2    | "translate pdf 500pg"  | 11,200       | 2024-03-15 09:18:45
3    | "flight NYC to Mars"   | 10,800       | 2024-03-15 22:01:12
...  | ...                    | ...          | ...
100  | "best pizza near me"   | 3,200        | 2024-03-15 16:44:01
Hints & Common Pitfalls
Single Server Design
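- In-memory min-heap: Maintain a min-heap of at most N entries keyed by latency. While the heap holds fewer than N entries, push each query; once it is full, replace the root only when a query's latency exceeds the heap's minimum. Each update costs O(log N), and queries faster than the current minimum are rejected in O(1).
- Rolling window: Entries older than 24 hours must expire. A single size-N heap is not enough on its own: once it discards a query, that query cannot be recovered later when older, slower entries expire. Consider periodically scanning the heap, or maintaining a separate time-ordered structure (for example, a top-N per time bucket) so that expired entries can be replaced.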
- Persistence: If the server restarts, the in-memory data is lost. Consider writing periodic snapshots to disk.
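The heap and rolling-window hints above can be combined in a minimal sketch, assuming one bounded min-heap per hourly bucket. The names `RollingTopN`, `record`, and `top`, and the bucket size, are illustrative choices and not part of the problem statement. The result is exact except in the oldest, partially expired bucket, where entries that have since left the window may have displaced newer ones.

```python
import heapq
from typing import Dict, List, Tuple

# (latency_ms, unix_timestamp, query); latency first so tuples order by latency.
Entry = Tuple[float, float, str]


class RollingTopN:
    """Top-N slowest queries over a rolling window, kept as per-bucket min-heaps."""

    def __init__(self, n: int, window_s: int = 24 * 3600, bucket_s: int = 3600) -> None:
        self.n = n
        self.window_s = window_s
        self.bucket_s = bucket_s
        self._buckets: Dict[int, List[Entry]] = {}

    def record(self, query: str, latency_ms: float, ts: float) -> None:
        """Offer one completed query to the bucket covering its timestamp."""
        heap = self._buckets.setdefault(int(ts // self.bucket_s), [])
        entry = (latency_ms, ts, query)
        if len(heap) < self.n:
            heapq.heappush(heap, entry)        # bucket not full yet: O(log N)
        elif latency_ms > heap[0][0]:
            heapq.heapreplace(heap, entry)     # evict the bucket's fastest entry

    def top(self, now: float) -> List[Entry]:
        """Return up to N slowest entries newer than now - window_s, slowest first."""
        cutoff = now - self.window_s
        # Drop whole buckets whose time range ended at or before the cutoff.
        for b in [b for b in self._buckets if (b + 1) * self.bucket_s <= cutoff]:
            del self._buckets[b]
        live = (e for heap in self._buckets.values() for e in heap if e[1] > cutoff)
        return heapq.nlargest(self.n, live)


# Usage with a tiny window (100 s) and bucket (10 s) to make expiry visible.
tracker = RollingTopN(n=2, window_s=100, bucket_s=10)
tracker.record("a", 500, ts=5)
tracker.record("b", 100, ts=50)
tracker.record("c", 300, ts=60)
tracker.record("d", 200, ts=61)
print([e[2] for e in tracker.top(now=70)])   # ['a', 'c']
print([e[2] for e in tracker.top(now=120)])  # ['c', 'd'] once "a" has expired
```

With a single size-2 heap, "d" would have been discarded on arrival and could not reappear after "a" expired; bucketing trades extra memory (up to N entries per bucket) for that ability.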
Scaling to Multiple Servers
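- Local aggregation: Each search server maintains its own local top-N. A central aggregator merges the local top-N lists periodically. This loses nothing, because every entry in the global top-N must also appear in the local top-N of the server that handled it.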
- Push vs. pull: Servers can push their top-N to the aggregator, or the aggregator can poll servers. Think about tradeoffs (staleness, load, failure handling).
- Storage tier: For durability and historical analysis, consider writing slow query logs to a time-series database or log storage system.
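The aggregator's merge step can be sketched as follows, assuming each server reports a list of `(latency_ms, timestamp, query)` tuples and the aggregator keeps the latest report per server. The function name `merge_top_n`, the `reports` mapping, and the server IDs are hypothetical; how reports arrive (push or pull) is left out.

```python
import heapq
from typing import Dict, List, Tuple

Entry = Tuple[float, float, str]          # (latency_ms, unix_timestamp, query)
Merged = Tuple[float, float, str, str]    # same, plus the reporting server's ID


def merge_top_n(reports: Dict[str, List[Entry]], n: int, cutoff: float) -> List[Merged]:
    """Merge per-server top-N lists into a global top-N, slowest first.

    Entries with a timestamp at or before `cutoff` are treated as expired.
    """
    candidates = (
        (latency_ms, ts, query, server)
        for server, entries in reports.items()
        for latency_ms, ts, query in entries
        if ts > cutoff
    )
    # nlargest keeps only n items in memory while scanning all candidates.
    return heapq.nlargest(n, candidates)


# Usage: two servers, global top-2, with one entry already outside the window.
reports = {
    "s1": [(900, 10, "x"), (100, 11, "y")],
    "s2": [(500, 12, "z"), (50, 1, "old")],
}
print(merge_top_n(reports, n=2, cutoff=5))
# [(900, 10, 'x', 's1'), (500, 12, 'z', 's2')]
```

Tagging each entry with its server makes it easier to see, on the dashboard, whether slowness is concentrated on a few machines.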
Common Discussion Points
- What threshold defines "slow"? Top-N is relative, but you might also want an absolute threshold (e.g., > 1 second).
- Deduplication: Should the same query text appearing multiple times count as separate entries or be grouped?
- Privacy: Query strings may contain sensitive data. Consider anonymization or access controls.
- Alerting: Beyond the dashboard, should the system trigger alerts when query latency spikes?
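One way to explore the threshold and deduplication points is a small sketch that keeps only queries above an absolute threshold and groups repeats of the same text. The normalization rule (lowercasing and collapsing whitespace), the 1-second default, and the name `group_slow_queries` are assumptions for illustration; a real system might group by query template or by a hash of the text to address the privacy concern.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[float, float, str]         # (latency_ms, unix_timestamp, query)
Group = Tuple[str, int, float, float]    # (query, count, worst_latency_ms, worst_ts)


def group_slow_queries(entries: Iterable[Entry], threshold_ms: float = 1000.0) -> List[Group]:
    """Group slow queries by normalized text, ordered by worst latency."""
    groups: Dict[str, Tuple[int, float, float]] = {}
    for latency_ms, ts, query in entries:
        if latency_ms <= threshold_ms:
            continue                                  # not "slow" in absolute terms
        key = " ".join(query.lower().split())         # assumed normalization
        count, worst, worst_ts = groups.get(key, (0, 0.0, 0.0))
        if latency_ms > worst:
            worst, worst_ts = latency_ms, ts
        groups[key] = (count + 1, worst, worst_ts)
    rows = [(q, c, w, t) for q, (c, w, t) in groups.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Usage: two spellings of one query collapse into a single row with count 2.
sample = [(1200, 1, "Best Pizza"), (3000, 2, "best  pizza"), (800, 3, "fast"), (1500, 4, "clocks")]
print(group_slow_queries(sample))
# [('best pizza', 2, 3000, 2), ('clocks', 1, 1500, 4)]
```

Grouping shows whether one query is consistently slow or a single request was an outlier, at the cost of hiding the individual occurrences.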
Follow-Up Questions
- Failure handling: What happens if one of the search servers goes down? How do you ensure the dashboard remains accurate with partial data?
- Historical trends: How would you extend the system to show slow query trends over weeks or months? What storage and aggregation strategy would you use?
- Real-time streaming: If the dashboard needs to update in near real-time (every few seconds), how does this change the architecture? Consider streaming frameworks.
- Cost optimization: With thousands of servers and millions of QPS, how do you minimize the overhead of collecting and aggregating latency data without missing important outliers?
- Multi-datacenter: If search traffic is served from multiple datacenters worldwide, how do you build a unified global dashboard?