Skip to content

Design a Metrics Monitoring Platform

TL;DR

A metrics monitoring platform collects time-series data from thousands of services (CPU usage, request latency, error counts), stores it efficiently, and triggers alerts when thresholds are violated. The hard parts are ingestion at massive throughput (Uber's M3 handles 500 million metrics per second), storage compression (Facebook's Gorilla achieves 12x compression at 1.37 bytes per data point), and alert evaluation that does not flap (a PENDING state between OK and FIRING prevents brief metric spikes from waking engineers at 3 AM). Cardinality explosion -- where a single metric with a high-cardinality label like user_id generates millions of unique time series -- is the most common way to kill a metrics system in production.

The System

A metrics monitoring platform ingests, stores, queries, and alerts on time-series data. Every service in your infrastructure emits metrics: request count, latency percentiles, error rate, CPU usage, memory consumption, queue depth. Operators build dashboards to visualize these metrics and set alerts to be notified when something goes wrong.

Uber's M3 system processes 500 million metrics per second from tens of thousands of microservices. Datadog monitors over 10 billion metrics per day across their customer base. Prometheus, originally built at SoundCloud, is the most widely adopted open-source monitoring system and is the de facto standard for Kubernetes environments. Facebook's Gorilla (now called Beringei) was purpose-built for storing time-series data in memory with extreme compression, serving their ODS (Operational Data Store) that monitors every Facebook service. VictoriaMetrics, a newer entrant, achieves better compression and query performance than Prometheus by separating the ingestion, storage, and query layers -- a key architectural insight for scaling.

Requirements

Functional

  • Metric ingestion: Accept metrics in the form (metric_name, labels, timestamp, value) -- e.g., http_requests_total{service="api", method="GET", status="200"} 1714000000 847293
  • Querying: Support PromQL-style queries: rate(http_requests_total{service="api"}[5m]), histogram_quantile(0.99, ...)
  • Dashboards: Render time-series graphs with configurable time ranges (last 15 minutes to last 1 year)
  • Alerting: Evaluate alert rules against live metrics and fire alerts via PagerDuty, Slack, email when conditions hold for a configured duration
  • Downsampling: Automatically reduce resolution of old data (1s resolution for last 24h, 1m for last 30 days, 1h for last 1 year)
  • Label-based filtering: All queries support filtering by label (key-value pairs attached to each metric)

Non-Functional

  • Ingestion throughput: 10 million metrics per second across the cluster
  • Ingestion latency: Metric available for querying within 15 seconds of emission
  • Query latency: 95th percentile dashboard query completes in under 500ms for a 1-hour time range
  • Storage efficiency: Less than 2 bytes per data point (timestamp + value pair)
  • Retention: Raw data for 15 days, 1-minute downsampled for 1 year, 1-hour downsampled for 5 years
  • Alert evaluation latency: Alerts fire within 30 seconds of threshold violation
  • Availability: 99.95% uptime for ingestion (losing metrics during an outage means you cannot diagnose the outage)

Back-of-Envelope Math

Ingestion:
  10M metrics/sec (aggregate across all services)
  Each data point: timestamp (8 bytes) + value (8 bytes) = 16 bytes uncompressed
  10M * 16 bytes = 160 MB/sec uncompressed ingestion
  With Gorilla compression (1.37 bytes/point): 13.7 MB/sec compressed

Storage (15-day raw retention):
  10M points/sec * 86,400 sec/day * 15 days = 12.96 trillion data points
  At 1.37 bytes/point: 17.8 TB
  At 16 bytes/point (uncompressed): 207 TB
  Compression saves 189 TB. This is why Gorilla compression matters.

Active time series:
  A "time series" is a unique combination of metric name + labels
  Typical: 50M active time series (series that received a data point in last 5 min)
  Each series needs: metadata (labels, ~200 bytes) + in-memory chunk (~2 KB for 2 hours)
  50M * 2.2 KB = 110 GB of hot data in memory

Downsampling:
  1-minute resolution: 10M/sec -> 10M/60 = 167K points/min -> 240M points/day
  1-hour resolution: 10M/3600 = 2,778 points/hour -> 66,667 points/day
  Storage at 1-year (1-min): 240M * 365 * 1.37 bytes = 120 GB. Trivial.
  Storage at 5-year (1-hour): negligible.

Alert evaluation:
  10,000 alert rules, each evaluated every 15 seconds
  10,000 / 15 = 667 rule evaluations/sec
  Each evaluation: query 1-5 time series over a 5-minute window
  667 * 3 series * 300 data points = 600K data point reads/sec
  Easily handled by the in-memory hot data.

The Naive Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Services    │────>│  Collector   │────>│  PostgreSQL  │
│  (emit       │     │  (single)    │     │  (time-series│
│   metrics)   │     │              │     │   table)     │
└──────────────┘     └──────────────┘     └──────────────┘

CREATE TABLE metrics (
    metric_name TEXT,
    labels JSONB,
    timestamp TIMESTAMPTZ,
    value DOUBLE PRECISION,
    PRIMARY KEY (metric_name, labels, timestamp)
);

-- Dashboard query:
SELECT date_trunc('minute', timestamp), avg(value)
FROM metrics
WHERE metric_name = 'http_request_duration_seconds'
AND labels->>'service' = 'api'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;

Row-based storage, full SQL queries, one collector. For 100 metrics/sec from 5 services, this works for months.

Where Does This Break First?

At 10M metrics/sec, PostgreSQL cannot keep up with writes. But even at lower volumes, the query performance degrades because row-based storage reads entire rows when it only needs timestamp and value columns. A 1-hour query across 1 million data points reads ~200 MB of row data instead of ~16 MB of column data.

Where It Breaks

Problem 1: Row-based storage is 10x less efficient for time-series. Time-series queries always read two columns: timestamp and value. Row-based databases read all columns per row. Column-oriented storage reads only the columns needed and compresses them dramatically (timestamps are monotonically increasing, values often change slowly).

Problem 2: Cardinality explosion. A metric like http_requests_total{user_id="..."} with 10 million unique user IDs creates 10 million separate time series. Each needs its own chunk in memory and its own index entry. Your "50M active series" budget is consumed by a single metric. This is the most common production incident in metrics systems -- an engineer adds a high-cardinality label and the metrics system runs out of memory.

Problem 3: Pull vs. push model at scale. Prometheus uses pull: it scrapes each service every 15 seconds. With 10,000 service instances, Prometheus makes 10,000 HTTP requests every 15 seconds = 667 scrapes/sec. At 100,000 instances, it is 6,667 scrapes/sec, and the scrape itself takes 100-500ms. The Prometheus instance becomes CPU-bound on HTTP client overhead.

Problem 4: Single-node Prometheus cannot scale. Prometheus is explicitly designed as a single-node system. It stores data on local disk and queries from local memory. There is no built-in clustering, sharding, or replication. For small deployments (< 1M active series), this is fine and operationally simple. Beyond that, you need Thanos, Cortex, or VictoriaMetrics.

The Real Design

                    ┌────────────────────────────────────────┐
                    │           Services (emit metrics)       │
                    │  service_a -> prometheus_client         │
                    │  service_b -> statsd_client             │
                    └──────────────┬─────────────────────────┘
              ┌────────────────────┼────────────────────┐
              │                    │                    │
     ┌────────v──────┐    ┌────────v──────┐    ┌────────v──────┐
     │  Ingestion    │    │  Ingestion    │    │  Ingestion    │
     │  Node 1       │    │  Node 2       │    │  Node N       │
     │  (vminsert)   │    │  (vminsert)   │    │  (vminsert)   │
     └────────┬──────┘    └────────┬──────┘    └────────┬──────┘
              │                    │                    │
              └────────────────────┼────────────────────┘
                    ┌──────────────v─────────────────────┐
                    │          Storage Layer              │
                    │  (vmstorage nodes, time-partitioned)│
                    │  ┌────────┐  ┌────────┐  ┌────────┐│
                    │  │ Day 1  │  │ Day 2  │  │ Day N  ││
                    │  └────────┘  └────────┘  └────────┘│
                    └──────────────┬─────────────────────┘
              ┌────────────────────┼────────────────────┐
              │                    │                    │
     ┌────────v──────┐    ┌────────v──────┐    ┌────────v──────┐
     │  Query Node 1 │    │  Query Node 2 │    │  Alert        │
     │  (vmselect)   │    │  (vmselect)   │    │  Evaluator    │
     └───────────────┘    └───────────────┘    └───────────────┘

VictoriaMetrics Three-Tier Separation

VictoriaMetrics separates the metrics platform into three independently scalable components. This is the key architectural insight.

vminsert (ingestion): Stateless nodes that accept metrics via Prometheus remote_write, InfluxDB line protocol, or Graphite format. They parse, validate, and forward to vmstorage. Scale horizontally by adding more vminsert nodes.

vmstorage (storage): Stores time-series data on disk with Gorilla compression. Data is time-partitioned (one directory per day). Each vmstorage node handles a shard of the time-series space (sharded by metric name hash). Scale by adding more vmstorage nodes.

vmselect (query): Stateless nodes that receive PromQL queries, fan out to relevant vmstorage nodes, merge results, and return to the client. Scale horizontally by adding more vmselect nodes.

Why this separation matters: Ingestion spikes (deploy of a new service generating 1M new metrics) do not affect query performance. Query storms (100 engineers opening dashboards during an incident) do not affect ingestion. Storage growth does not affect either.

Gorilla Compression (12x, 1.37 Bytes/Point)

Facebook's Gorilla paper (2015) introduced a compression scheme specifically for time-series data that achieves 1.37 bytes per data point (down from 16 bytes uncompressed, a 12x improvement).

Timestamp compression (delta-of-delta):

Timestamps come at regular intervals (e.g., every 15 seconds):
  T0 = 1714000000
  T1 = 1714000015 (delta = 15)
  T2 = 1714000030 (delta = 15, delta-of-delta = 0)
  T3 = 1714000045 (delta = 15, delta-of-delta = 0)
  T4 = 1714000061 (delta = 16, delta-of-delta = 1)

Encoding:
  delta-of-delta = 0: encode as 1 bit (just '0')
  delta-of-delta in [-63, 64]: encode as 2 + 7 = 9 bits
  delta-of-delta in [-255, 256]: encode as 3 + 9 = 12 bits
  Larger: encode as 4 + 32 = 36 bits

For regular 15-second intervals, 95%+ of timestamps encode as 1 bit.
Average: ~2 bits per timestamp (0.25 bytes).

Value compression (XOR encoding):

CPU usage values change slowly:
  V0 = 72.3%
  V1 = 72.5% (XOR with V0: most bits are the same)
  V2 = 72.4% (XOR with V1: most bits are the same)

IEEE 754 double (64 bits):
  XOR(V0, V1) has very few bits set (leading zeros + trailing zeros)
  Encode: number_of_leading_zeros + number_of_meaningful_bits + meaningful_bits

For slowly changing values, XOR encoding uses 1-10 bits per value.
Average: ~9 bits per value (1.12 bytes).

Combined: 0.25 bytes (timestamp) + 1.12 bytes (value) = 1.37 bytes per data point. This is how you store 12.96 trillion data points in 17.8 TB instead of 207 TB.

Pull vs. Push Model

Pull (Prometheus model): The monitoring system scrapes each service's /metrics endpoint at a configured interval (15 seconds). The service exposes its current metric values.

Pros:
  - Monitoring system controls scrape rate (no need to trust services)
  - Easy to detect if a service is down (scrape fails)
  - Service does not need to know about the monitoring system

Cons:
  - O(N) scrape connections (one per service instance)
  - NAT/firewall issues (monitoring must reach services)
  - Short-lived jobs are missed (job finishes before next scrape)

Push (StatsD, Datadog agent model): Services push metrics to a collector endpoint.

Pros:
  - Services behind NAT/firewalls can push out
  - Short-lived jobs push their metrics before exiting
  - Lower connection overhead (services push to a local agent)

Cons:
  - Cannot detect service death by "absence of push" (need a separate health check)
  - Services can overwhelm the collector (no backpressure)
  - Harder to ensure all services are actually emitting metrics

My recommendation: Pull for infrastructure and long-running services (the Prometheus model). Push for short-lived jobs, serverless functions, and services behind NAT. Most production systems support both.

Alert PENDING State (Prevents Flapping)

A CPU spike to 95% for 3 seconds should not wake someone at 3 AM. But if CPU stays at 95% for 5 minutes, it should.

Three-state alert model:

OK -> PENDING -> FIRING -> OK

Alert rule: cpu_usage > 90% for 5 minutes

12:00:00  cpu = 85%  -> OK
12:00:15  cpu = 92%  -> PENDING (threshold crossed, start timer)
12:00:30  cpu = 91%  -> PENDING (still above threshold, 30s elapsed)
12:01:00  cpu = 88%  -> OK (dropped below threshold, reset timer)
12:02:00  cpu = 93%  -> PENDING (threshold crossed again, new timer)
12:02:15  cpu = 95%  -> PENDING (1m 15s elapsed)
...
12:07:00  cpu = 94%  -> FIRING (5 minutes elapsed above threshold)
           -> Send PagerDuty alert
12:07:15  cpu = 94%  -> FIRING (still above threshold, do NOT re-alert)
12:08:00  cpu = 85%  -> OK (resolved)
           -> Send PagerDuty resolve

The PENDING state is the crucial difference between a usable alerting system and an alert cannon. Without it, every brief CPU spike sends an alert, engineers start ignoring alerts (alert fatigue), and when a real incident happens, the alert is lost in the noise.

Hysteresis: Some alert systems add a different threshold for resolving (e.g., alert at 90%, resolve at 80%). This prevents flapping when the metric oscillates around the threshold: 91%, 89%, 91%, 89% would cause continuous OK->FIRING->OK->FIRING transitions without hysteresis.

Deep Dives

Metrics Monitoring — Metrics Monitoring High-Level Design

Deep Dive 1: Cardinality Explosion and Prevention

Cardinality is the number of unique time series. Each unique combination of metric name + label values = one time series.

http_requests_total{service="api", method="GET", status="200"}     -> 1 series
http_requests_total{service="api", method="GET", status="404"}     -> 1 series
http_requests_total{service="api", method="POST", status="200"}    -> 1 series

With 3 services * 5 methods * 10 statuses = 150 series. Manageable.

But add user_id as a label:
http_requests_total{service="api", method="GET", user_id="user_123"}
With 10M users: 10M * 3 * 5 * 10 = 1.5 billion series. 

Memory: 1.5B * 2.2 KB = 3.3 TB. Your metrics system is dead.

Prevention strategies:

  1. Label validation at ingestion: Reject metrics with labels known to be high-cardinality (user_id, request_id, trace_id). Configure a blocklist.

  2. Series count limits per metric: If a metric generates > 100K unique series in a 5-minute window, drop new series and alert the metric owner.

  3. Relabeling: Prometheus relabel_configs can drop or modify labels at scrape time. Drop the pod_name label if it creates too many series (use deployment instead).

  4. Pre-aggregation: Instead of recording per-user request counts, aggregate to per-service counts at the application level and export the pre-aggregated metric.

Monitoring cardinality: Track scrape_series_added (Prometheus metric) or vm_new_timeseries_created_total (VictoriaMetrics). Alert if the rate exceeds a threshold. Datadog charges per active time series, so cardinality explosion is directly visible on your bill.

Deep Dive 2: Downsampling and Retention Tiers

Raw 15-second data is essential for debugging a recent incident but wasteful for "what was CPU usage last month?"

Downsampling pipeline:

Tier 1: Raw (15-second resolution)
  Retention: 15 days
  Storage: 17.8 TB
  Query use case: "What happened in the last hour?"

Tier 2: 1-minute aggregates (min, max, avg, count per minute)
  Retention: 1 year
  Generated: background job processes raw data, writes aggregates
  Storage: 120 GB
  Query use case: "What was the trend this month?"

Tier 3: 1-hour aggregates
  Retention: 5 years
  Storage: ~2 GB
  Query use case: "Year-over-year comparison"

Automatic tier selection: When a user queries a 1-hour time range, use Tier 1 (raw). For a 7-day range, use Tier 2 (1-minute). For a 6-month range, use Tier 3 (1-hour). The query engine automatically selects the appropriate tier based on the requested time range and desired graph resolution.

Thanos approach: Thanos stores downsampled data in object storage (S3). Three copies at different resolutions: raw, 5-minute, and 1-hour. A compactor periodically reads raw blocks and produces downsampled blocks. Older raw blocks are deleted per retention policy.

Deep Dive 3: High Availability for the Alert Evaluator

If the alert evaluator is down, alerts do not fire. During a production incident, this is catastrophic -- the very system that should notify you is the one that is broken.

Dual-evaluation with deduplication:

Run two independent alert evaluator instances, both evaluating every rule. Each sends alerts to the alert manager (e.g., Alertmanager). Alertmanager deduplicates by alert fingerprint (hash of alert name + labels).

Evaluator A: cpu_alert{service="api"} FIRING at 12:07:00
Evaluator B: cpu_alert{service="api"} FIRING at 12:07:15

Alertmanager: receives both, groups by fingerprint, sends ONE notification.

Prometheus Alertmanager supports this natively with its clustering mode. Multiple Alertmanager instances form a gossip cluster and coordinate to ensure each alert is routed to exactly one notification (PagerDuty, Slack, etc.).

Separate failure domain: The alert evaluator should NOT run on the same infrastructure being monitored. If your Kubernetes cluster is down, and the alert evaluator runs on Kubernetes, the alert evaluator is also down. Run the alert evaluator on a separate set of machines, in a separate availability zone, with independent monitoring.

Alternative Designs

Alternative 1: Managed SaaS (Datadog, New Relic)

Send all metrics to a managed platform. No infrastructure to operate.

Alternative 2: Prometheus + Thanos (Federated Open Source)

Multiple Prometheus instances (one per cluster), with Thanos for global querying, long-term storage (S3), and downsampling.

Alternative 3: InfluxDB Cluster

Purpose-built time-series database with SQL-like query language (Flux/InfluxQL), built-in retention policies, and continuous queries for downsampling.

Aspect VictoriaMetrics Datadog (Managed) Prometheus + Thanos InfluxDB
Ingestion throughput 10M+ metrics/sec Unlimited (managed) ~1M/sec per Prometheus ~500K metrics/sec
Storage efficiency 0.4-1.5 bytes/point Managed 1.5-2 bytes/point 2-3 bytes/point
Query language PromQL + MetricsQL Proprietary PromQL Flux / InfluxQL
Long-term storage Built-in, local disk Managed S3 via Thanos Built-in or S3
Operational complexity Low (single binary) None High (many components) Medium
Cost at 10M metrics/sec ~$5K/mo (infra) ~$50K-100K/mo ~$8K/mo (infra + S3) ~$10K/mo
Alerting Compatible with AM Built-in Prometheus Alertmanager Built-in
Cardinality handling High (optimized engine) Limits per plan Medium (single node) Medium

VictoriaMetrics for high-volume self-hosted deployments where cost matters. Datadog for organizations where operational simplicity is worth the premium (engineering time saved > SaaS cost). Prometheus + Thanos for Kubernetes-native organizations already running Prometheus. InfluxDB for smaller deployments or teams that prefer SQL-like querying.

Scaling Math Verification

Ingestion (10M metrics/sec):

  • vminsert nodes: 10M / 1M per node = 10 nodes (VictoriaMetrics benchmarks show ~1M metrics/sec per vminsert)
  • Network: 10M * 200 bytes (metric with labels) = 2 GB/sec ingestion bandwidth
  • Per vminsert: 200 MB/sec. 10 Gbps NIC handles this.

Storage (15-day raw retention):

  • Gorilla compression: 1.37 bytes/point * 10M points/sec * 86,400 sec/day * 15 days = 17.8 TB
  • vmstorage nodes: 4 nodes with 5 TB NVMe each = 20 TB. Fine with headroom.
  • Write throughput per node: 10M / 4 = 2.5M points/sec per node. Sequential writes (append-only) at ~3.4 MB/sec compressed. NVMe handles this easily.

Query performance:

  • 1-hour query on 1 time series: 240 data points * 1.37 bytes = 329 bytes. Microseconds to read.
  • 1-hour query across 1000 time series: 329 * 1000 = 329 KB. Sub-millisecond.
  • Dashboard with 20 panels, each querying 100 series over 6 hours: 20 * 100 * 1440 * 1.37 bytes = 3.9 MB. Under 100ms on SSD.

Alert evaluation (667 rules/sec):

  • Each rule queries 5 series over 5 minutes: 5 * 20 points = 100 points per rule
  • 667 * 100 = 66,700 points/sec from hot storage. Trivial.

Failure Analysis

Component Current capacity At 10x (100M metrics/sec) Breaks? Fix
vminsert (10 nodes) 1M metrics/node/sec 10M needed per node Yes Scale to 100 vminsert nodes
vmstorage (4 nodes) 17.8 TB / 15 days 178 TB / 15 days Yes Scale to 40 vmstorage nodes, use larger disks
Network ingestion 2 GB/sec 20 GB/sec Yes Multiple ingestion endpoints, load balancing
Active series memory 110 GB 1.1 TB Yes More vmstorage nodes, each holding a shard
vmselect (query) Sub-100ms for 1-hour Same query complexity No --
Alert evaluator 667 rules/sec Same (rules don't scale with data) No --
Downsampling 17.8 TB raw -> 120 GB 1min 178 TB -> 1.2 TB Maybe Parallelize compaction across nodes

The first bottleneck at 10x is everything ingestion-related: vminsert count, storage capacity, and network bandwidth. All scale linearly. The more interesting challenge is active series memory: 1.1 TB across 40 vmstorage nodes means each node holds ~27.5 GB of active series. This is feasible with 64 GB RAM per node, dedicating ~50% to the active series cache.

At 100x (1B metrics/sec), you are Uber M3 scale. At that point, the primary challenge shifts from storage and ingestion to the query layer: aggregating results from 400 storage nodes within a 500ms latency budget requires careful query planning, parallel fan-out, and early pruning of irrelevant nodes.

What's Expected at Each Level

Aspect Mid-Level Senior Staff+
Data model Metric name + value + timestamp Labels/tags, time series as unique name+labels Cardinality implications, label design principles
Storage "Use a time-series database" Gorilla compression mentioned, column-oriented Delta-of-delta timestamps, XOR values, 1.37 bytes/point math
Ingestion "Services push metrics" Pull vs push trade-offs Three-tier separation (ingest/store/query), agent vs pushgateway
Alerting "Set threshold, send alert" PENDING state, for duration clause Hysteresis, dual-evaluation HA, separate failure domain
Cardinality Not mentioned "Don't use high-cardinality labels" Quantifies the explosion, prevention strategies, monitoring
Downsampling Not mentioned "Keep less data for old time ranges" Tiered retention, automatic tier selection, Thanos compactor
Scaling Single Prometheus Federation or sharding VictoriaMetrics 3-tier, independent scaling per component
Real-world reference "Like Prometheus" Mentions Gorilla compression Uber M3 (500M/sec), VictoriaMetrics architecture, Datadog cost

The single most important signal at any level: do you understand that the monitoring system must be more reliable than the systems it monitors? If your monitoring runs on the same infrastructure and that infrastructure fails, you are blind during the outage. The alert evaluator, in particular, must run in a separate failure domain.


References from Our Courses


Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Metrics Monitoring Platform →