Dashboard for Monitoring Slow Queries

Level: L4-L6
Topics: System Design, Monitoring, Distributed Systems

Problem Statement

Design an internet-facing dashboard that displays the slowest search queries from the last 24 hours. The goal is to help engineers identify and debug performance problems.

Start by designing the system for a single server handling all search traffic. Then scale the design to handle the full production load across many servers.

This is a discussion-based system design problem. You should walk through requirements, architecture, data flow, and scaling considerations.

Background & Constraints

A "search query" has at least: query string, timestamp, and latency (response time).
The dashboard should show the top N slowest queries (e.g., top 100) from a rolling 24-hour window.
The dashboard is read by engineers, so it does not need sub-second refresh, but it should be reasonably up to date (e.g., refreshed every few minutes).
In production, there could be thousands of search servers handling millions of queries per second.
Latency data is generated on the search servers at query completion time.
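
A quick back-of-envelope calculation helps size the problem. The sketch below uses assumed round numbers (1 million QPS, 1,000 servers), since the constraints only say "millions" and "thousands"; substitute real figures where they are known.

```python
# Assumed illustrative values; the problem statement gives only orders of magnitude.
QPS = 1_000_000                  # assumed total queries per second
SERVERS = 1_000                  # assumed number of search servers
TOP_N = 100                      # dashboard size from the constraints above
SECONDS_PER_DAY = 24 * 60 * 60

# Raw events produced in one 24-hour window.
queries_per_day = QPS * SECONDS_PER_DAY

# Entries a central merge sees if every server reports one top-N list.
entries_per_merge = SERVERS * TOP_N

print(f"queries per day:   {queries_per_day:,}")
print(f"entries per merge: {entries_per_merge:,}")
```

Under these assumptions the window covers about 86.4 billion queries, while a merge of per-server top-N lists touches only 100,000 entries. That gap is the main argument for filtering on the search servers rather than shipping every latency record to a central store.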

Examples

Dashboard output (conceptual):

Rank | Query                  | Latency (ms) | Timestamp
-----|------------------------|--------------|--------------------
1    | "rare antique clocks"  | 12,450       | 2024-03-15 14:22:03
2    | "translate pdf 500pg"  | 11,200       | 2024-03-15 09:18:45
3    | "flight NYC to Mars"   | 10,800       | 2024-03-15 22:01:12
...  | ...                    | ...          | ...
100  | "best pizza near me"   | 3,200        | 2024-03-15 16:44:01

Hints & Common Pitfalls

Single Server Design

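In-memory min-heap: Maintain a min-heap of at most N entries keyed on latency. For each completed query, push it if the heap holds fewer than N entries; otherwise, if its latency exceeds the heap's minimum, replace the minimum. This costs O(log N) per query in the worst case, and O(1) for the common case of a query faster than the current minimum.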
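Rolling window: You need to expire entries older than 24 hours. A single top-N heap over the whole window is not enough on its own: once an entry expires, a query that was discarded earlier may now belong in the top N. Periodically scanning the heap removes expired entries but cannot bring discarded ones back. Consider keeping a separate top-N heap per time bucket (e.g., per hour) and merging the live buckets at read time, or maintaining a separate time-ordered structure.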
Persistence: If the server restarts, the in-memory data is lost. Consider writing periodic snapshots to disk.
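
One way to combine the heap and the rolling window is to keep one small min-heap per time bucket and merge the live buckets when the dashboard is read. The following is a minimal sketch, not a definitive implementation: the class name `SlowQueryTracker`, the one-hour bucket width, and the `(latency_ms, timestamp, query)` tuple layout are all assumptions made for illustration, and timestamps are taken to be seconds since the epoch.

```python
import heapq
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60
BUCKET_SECONDS = 60 * 60  # assumed bucket width: one hour


class SlowQueryTracker:
    """Hypothetical single-server tracker: one top-N min-heap per time bucket."""

    def __init__(self, n=100):
        self.n = n
        # bucket start time -> min-heap of (latency_ms, timestamp, query)
        self.buckets = defaultdict(list)

    def record(self, query, latency_ms, timestamp):
        bucket = int(timestamp // BUCKET_SECONDS) * BUCKET_SECONDS
        heap = self.buckets[bucket]
        entry = (latency_ms, timestamp, query)
        if len(heap) < self.n:
            heapq.heappush(heap, entry)       # heap not full yet
        elif latency_ms > heap[0][0]:
            heapq.heapreplace(heap, entry)    # evict the bucket's fastest entry

    def top(self, now):
        cutoff = now - WINDOW_SECONDS
        # Drop buckets that lie entirely before the window.
        expired = [b for b in self.buckets if b + BUCKET_SECONDS <= cutoff]
        for bucket in expired:
            del self.buckets[bucket]
        # Merge the surviving buckets, skipping individually expired entries.
        live = (
            entry
            for heap in self.buckets.values()
            for entry in heap
            if entry[1] >= cutoff
        )
        return heapq.nlargest(self.n, live)


tracker = SlowQueryTracker(n=2)
tracker.record("a", 100, timestamp=0)
tracker.record("b", 300, timestamp=10)
tracker.record("c", 200, timestamp=20)
print(tracker.top(now=100))
```

Memory is bounded by N entries per bucket times the number of live buckets (about 25 with hourly buckets). The result is exact for fully live buckets; in the oldest, partially expired bucket, an entry may have been evicted by a slower query that has since expired, so narrower buckets trade memory for accuracy at the window edge.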

Scaling to Multiple Servers

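Local aggregation: Each search server maintains its own local top-N. A central aggregator periodically merges the local top-N lists. The merge loses nothing, because any query in the global top-N must also be in the top-N of the server that handled it.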
Push vs. pull: Servers can push their top-N to the aggregator, or the aggregator can poll servers. Think about tradeoffs (staleness, load, failure handling).
Storage tier: For durability and historical analysis, consider writing slow query logs to a time-series database or log storage system.
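
The merge step for a push-based design can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Aggregator` class, its method names, the 15-minute staleness threshold, and the `(latency_ms, timestamp, query)` entry layout are hypothetical. Each server's latest report replaces its previous one, and entries age out by their own timestamps, so a server that stops reporting still contributes its last known slow queries until they leave the window.

```python
import heapq

WINDOW_SECONDS = 24 * 60 * 60
STALE_AFTER_SECONDS = 15 * 60  # assumed: flag servers silent for 15 minutes


class Aggregator:
    """Hypothetical central aggregator for pushed per-server top-N lists."""

    def __init__(self, n=100):
        self.n = n
        self.reports = {}  # server_id -> (received_at, list of entries)

    def receive(self, server_id, entries, received_at):
        # The newest report from a server supersedes its previous one.
        self.reports[server_id] = (received_at, list(entries))

    def global_top(self, now):
        cutoff = now - WINDOW_SECONDS
        live = (
            entry
            for _, entries in self.reports.values()
            for entry in entries
            if entry[1] >= cutoff  # entry[1] is the query timestamp
        )
        return heapq.nlargest(self.n, live, key=lambda e: e[0])

    def stale_servers(self, now):
        # Servers whose last report is old; the dashboard can flag partial data.
        return sorted(
            server_id
            for server_id, (received_at, _) in self.reports.items()
            if now - received_at > STALE_AFTER_SECONDS
        )


agg = Aggregator(n=2)
agg.receive("s1", [(500, 1000, "x"), (100, 1001, "y")], received_at=1100)
agg.receive("s2", [(300, 1002, "z")], received_at=2000)
print(agg.global_top(now=2000))
print(agg.stale_servers(now=2100))
```

A pull-based variant would call the servers from inside the aggregator instead of exposing `receive`; the merge logic stays the same. Surfacing `stale_servers` next to the results is one way to make partial data visible rather than silently presenting an incomplete list.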

Common Discussion Points

What threshold defines "slow"? Top-N is relative, but you might also want an absolute threshold (e.g., > 1 second).
Deduplication: Should the same query text appearing multiple times count as separate entries or be grouped?
Privacy: Query strings may contain sensitive data. Consider anonymization or access controls.
Alerting: Beyond the dashboard, should the system trigger alerts when query latency spikes?
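
The threshold, deduplication, and privacy points can be made concrete with a small sketch. Everything named here is an assumption for illustration: the `normalize` rule (lowercase, collapse whitespace), the 1000 ms threshold taken from the example above, and the truncated SHA-256 fingerprint used by `redact`. Note that hashing alone is weak anonymization for short or common queries, since they can be guessed and re-hashed.

```python
import hashlib


def normalize(query):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(query.lower().split())


def group_slow_queries(entries, threshold_ms=1000):
    """Group (latency_ms, timestamp, query) entries by normalized query text.

    Entries at or below the absolute threshold are ignored. Groups are
    returned sorted by their worst latency, slowest first.
    """
    groups = {}
    for latency_ms, timestamp, query in entries:
        if latency_ms <= threshold_ms:
            continue
        key = normalize(query)
        group = groups.setdefault(
            key,
            {"query": key, "count": 0, "max_latency_ms": 0, "last_seen": timestamp},
        )
        group["count"] += 1
        group["max_latency_ms"] = max(group["max_latency_ms"], latency_ms)
        group["last_seen"] = max(group["last_seen"], timestamp)
    return sorted(groups.values(), key=lambda g: g["max_latency_ms"], reverse=True)


def redact(query):
    # Stable fingerprint so engineers can correlate repeats without seeing text.
    digest = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
    return digest[:12]


entries = [
    (1500, 1, "Best  Pizza"),
    (2500, 2, "best pizza"),
    (900, 3, "fast"),
    (1200, 4, "maps"),
]
for group in group_slow_queries(entries):
    print(redact(group["query"]), group["count"], group["max_latency_ms"])
```

Grouping keeps one hot query from filling the whole dashboard, at the cost of hiding individual occurrences; showing both the count and the worst latency is one compromise.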

Follow-Up Questions

Failure handling: What happens if one of the search servers goes down? How do you ensure the dashboard remains accurate with partial data?
Historical trends: How would you extend the system to show slow query trends over weeks or months? What storage and aggregation strategy would you use?
Real-time streaming: If the dashboard needs to update in near real-time (every few seconds), how does this change the architecture? Consider streaming frameworks.
Cost optimization: With thousands of servers and millions of QPS, how do you minimize the overhead of collecting and aggregating latency data without missing important outliers?
Multi-datacenter: If search traffic is served from multiple datacenters worldwide, how do you build a unified global dashboard?