Dashboard for Monitoring Slow Queries

Level: L4-L6
Topics: System Design, Monitoring, Distributed Systems

Problem Statement

Design an internet-facing dashboard that displays the slowest search queries from the last 24 hours. The goal is to help engineers identify and debug performance problems.

Start by designing the system for a single server handling all search traffic. Then scale the design to handle the full production load across many servers.

This is a discussion-based system design problem. You should walk through requirements, architecture, data flow, and scaling considerations.

Background & Constraints

  • A "search query" has at least: query string, timestamp, and latency (response time).
  • The dashboard should show the top N slowest queries (e.g., top 100) from a rolling 24-hour window.
  • The dashboard is read by engineers, so it does not need sub-second refresh, but it should be reasonably up to date (e.g., refreshed every few minutes).
  • In production, there could be thousands of search servers handling millions of queries per second.
  • Latency data is generated on the search servers at query completion time.

Examples

Dashboard output (conceptual):

Rank | Query                  | Latency (ms) | Timestamp
-----|------------------------|--------------|--------------------
1    | "rare antique clocks"  | 12,450       | 2024-03-15 14:22:03
2    | "translate pdf 500pg"  | 11,200       | 2024-03-15 09:18:45
3    | "flight NYC to Mars"   | 10,800       | 2024-03-15 22:01:12
...  | ...                    | ...          | ...
100  | "best pizza near me"   | 3,200        | 2024-03-15 16:44:01

Hints & Common Pitfalls

Single Server Design

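  • In-memory min-heap: Maintain a min-heap of at most N entries keyed by latency. For each completed query, push it if the heap holds fewer than N entries; otherwise, if its latency exceeds the heap's minimum, replace the minimum. This costs O(log N) per query in the worst case.
  • Rolling window: You need to expire entries older than 24 hours. Consider periodically scanning the heap, or maintaining a separate time-ordered structure. Note that once an entry expires, a query that was discarded earlier may now belong in the top N, so a single heap alone cannot give exact results. One option is to keep a small top-N per time bucket (e.g., per hour) and merge the live buckets on read.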
  • Persistence: If the server restarts, the in-memory data is lost. Consider writing periodic snapshots to disk.
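
One way to combine the min-heap and rolling-window hints is to keep a separate bounded heap per time bucket and merge the live buckets when the dashboard is read. The sketch below is illustrative only: the class name RollingTopN, the one-hour bucket size, and the (latency_ms, timestamp, query) tuple layout are assumptions, not part of the problem statement. The result is exact for fully live buckets and approximate for the oldest, partially expired bucket, where already-expired entries may have displaced entries that are still in the window.

```python
import heapq


class RollingTopN:
    """Hypothetical single-server tracker: one bounded min-heap per time bucket."""

    def __init__(self, n=100, window_s=24 * 3600, bucket_s=3600):
        self.n = n
        self.window_s = window_s
        self.bucket_s = bucket_s
        # bucket index -> min-heap of (latency_ms, timestamp, query)
        self.buckets = {}

    def record(self, query, latency_ms, ts):
        """Called at query completion time; O(log N) in the worst case."""
        heap = self.buckets.setdefault(int(ts // self.bucket_s), [])
        item = (latency_ms, ts, query)
        if len(heap) < self.n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Evict the fastest entry in this bucket to make room.
            heapq.heapreplace(heap, item)

    def top(self, now):
        """Return up to N slowest entries newer than now - window_s, slowest first."""
        cutoff = now - self.window_s
        oldest_live = int(cutoff // self.bucket_s)
        # Drop buckets that lie entirely before the window.
        for key in [k for k in self.buckets if k < oldest_live]:
            del self.buckets[key]
        live = (
            item
            for heap in self.buckets.values()
            for item in heap
            if item[1] > cutoff
        )
        return heapq.nlargest(self.n, live)


tracker = RollingTopN(n=2, window_s=100, bucket_s=10)
tracker.record("a", 50, ts=0)
tracker.record("b", 10, ts=95)
tracker.record("c", 30, ts=99)
tracker.record("d", 20, ts=105)
slowest = tracker.top(now=110)  # "a" has aged out of the 100-second window
```

Memory is bounded by N entries per bucket times the number of live buckets (about 25 buckets of 100 entries with the assumed defaults), and shrinking the bucket size reduces the approximation error at the window boundary.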

Scaling to Multiple Servers

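  • Local aggregation: Each search server maintains its own local top-N. A central aggregator merges the local top-N lists periodically. This is exact for a given window because any entry in the global top N must also be in the local top N of the server that produced it.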
  • Push vs. pull: Servers can push their top-N to the aggregator, or the aggregator can poll servers. Think about tradeoffs (staleness, load, failure handling).
  • Storage tier: For durability and historical analysis, consider writing slow query logs to a time-series database or log storage system.
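
The merge step on the aggregator can be sketched in a few lines. This is a minimal illustration, assuming each server reports a list of (latency_ms, timestamp, query, server_id) tuples; the function name merge_top_n and the tuple layout are hypothetical.

```python
import heapq
from itertools import chain


def merge_top_n(local_lists, n):
    """Merge per-server top-N lists into a global top-N, slowest first.

    local_lists: iterable of lists of (latency_ms, timestamp, query, server_id).
    """
    return heapq.nlargest(n, chain.from_iterable(local_lists), key=lambda e: e[0])


server_1 = [(900, 1, "q1", "s1"), (100, 2, "q2", "s1")]
server_2 = [(500, 3, "q3", "s2"), (950, 4, "q4", "s2")]
global_top = merge_top_n([server_1, server_2], n=3)
```

With, say, 1,000 servers and N = 100 (assumed figures), the aggregator handles at most 100,000 entries per merge round regardless of query volume, which is why local aggregation keeps collection overhead small.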

Common Discussion Points

  • What threshold defines "slow"? Top-N is relative, but you might also want an absolute threshold (e.g., > 1 second).
  • Deduplication: Should the same query text appearing multiple times count as separate entries or be grouped?
  • Privacy: Query strings may contain sensitive data. Consider anonymization or access controls.
  • Alerting: Beyond the dashboard, should the system trigger alerts when query latency spikes?
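
If the discussion settles on grouping repeated queries and applying an absolute threshold, the grouping could look roughly like the sketch below. The normalization rule (lowercasing and collapsing whitespace), the function name group_slow_queries, and the output fields are all assumptions chosen for illustration; a real system might normalize differently or hash the query text for privacy.

```python
def group_slow_queries(entries, min_latency_ms=0):
    """Group entries by normalized query text, keeping the worst latency per group.

    entries: iterable of (query, latency_ms, timestamp).
    Entries faster than min_latency_ms are ignored.
    """
    groups = {}
    for query, latency_ms, ts in entries:
        if latency_ms < min_latency_ms:
            continue
        # Assumed normalization: case-insensitive, whitespace-insensitive.
        key = " ".join(query.lower().split())
        group = groups.setdefault(
            key,
            {"query": key, "count": 0, "max_latency_ms": latency_ms, "worst_ts": ts},
        )
        group["count"] += 1
        if latency_ms > group["max_latency_ms"]:
            group["max_latency_ms"] = latency_ms
            group["worst_ts"] = ts
    # Slowest group first.
    return sorted(groups.values(), key=lambda g: g["max_latency_ms"], reverse=True)


sample = [
    ("Best Pizza", 3200, 1),
    ("best  pizza", 4000, 2),
    ("clocks", 5000, 3),
    ("fast", 200, 4),
]
grouped = group_slow_queries(sample, min_latency_ms=1000)
```

Grouping trades detail for readability: the dashboard shows one row per distinct query with a count, at the cost of hiding the individual occurrences.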

Follow-Up Questions

  1. Failure handling: What happens if one of the search servers goes down? How do you ensure the dashboard remains accurate with partial data?

  2. Historical trends: How would you extend the system to show slow query trends over weeks or months? What storage and aggregation strategy would you use?

  3. Real-time streaming: If the dashboard needs to update in near real-time (every few seconds), how does this change the architecture? Consider streaming frameworks.

  4. Cost optimization: With thousands of servers and millions of QPS, how do you minimize the overhead of collecting and aggregating latency data without missing important outliers?

  5. Multi-datacenter: If search traffic is served from multiple datacenters worldwide, how do you build a unified global dashboard?