Metrics — Prometheus, Pull Model, and PromQL

TL;DR

Prometheus scrapes your services for metrics on a schedule, stores them in a time-series database, and lets you query and alert on them with PromQL — and its pull model is why it's the default for Kubernetes-native monitoring.

What Metrics Are

A metric is a numeric measurement collected over time. CPU usage at 14:32:05. Request count at 14:32:10. Error rate at 14:32:15. Each data point is a (metric_name, labels, value, timestamp) tuple.

http_requests_total{method="GET", handler="/api/users", status="200"} 14523 1705276800000
│                   │                                                  │     │
metric name         labels (dimensions)                                value timestamp (ms)

Metrics answer "what is happening?" in aggregate. Not "why is it happening" (that's logs) or "where in the call chain is it happening" (that's traces). Metrics are cheap to store, fast to query, and the first thing you check during an incident.
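To make the tuple concrete, here is a minimal Python sketch that splits one exposition line into its four parts. The name parse_sample and both regexes are illustrative assumptions. They cover only the simple shape shown above, not the full exposition format (no comment lines, no exemplars).

```python
import re

# Simplified grammar: name, optional {labels}, value, optional integer timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One key="value" pair, followed by an optional comma.
LABEL_RE = re.compile(r'\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"\s*,?')


def parse_sample(line):
    """Return (metric_name, labels, value, timestamp) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    timestamp = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), timestamp
```

A line without labels or timestamp, such as go_goroutines 47, comes back with an empty label dict and a timestamp of None. In that case Prometheus assigns the scrape time itself.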

SoundCloud built Prometheus in 2012 because they needed metrics for their microservices migration and existing tools couldn't keep up. It became the second project to graduate from the CNCF (after Kubernetes). That pedigree matters — Prometheus and Kubernetes were designed for the same world.

The Four Metric Types

Prometheus defines four metric types. Knowing which one to use is half the battle.

Counter

A monotonically increasing number. It only goes up (or resets to zero on process restart).

# Total HTTP requests served (ever)
http_requests_total{method="GET", status="200"} 142367

# Total errors
errors_total{service="payment", type="timeout"} 89

Use for: request counts, error counts, bytes transferred, tasks completed. Anything that accumulates.

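Never do: subtract raw counter values yourself (counter - previous_counter). A process restart resets the counter to zero and the difference goes negative. Use rate() in PromQL, which handles counter resets.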

# Request rate over the last 5 minutes
rate(http_requests_total{status="200"}[5m])
# Returns: requests per second (e.g., 42.5)
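The reset handling can be sketched in a few lines of Python. This is a simplification under stated assumptions: the function names are made up, and the real rate() also extrapolates to the edges of the range window, which this sketch skips.

```python
def increase(samples):
    """Total increase over (timestamp, value) samples.

    Any drop in value is treated as a counter reset: the counter is assumed
    to have restarted from zero, so the new value counts in full.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second rate between the first and last sample, or None if undefined."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase(samples) / elapsed
```

Naive subtraction across a restart would produce a negative number. Here the drop is absorbed and the rate stays positive.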

Gauge

A value that goes up and down. A snapshot of something right now.

# Current CPU usage
node_cpu_usage_percent{instance="server-1"} 73.2

# Current queue depth
queue_depth{queue="orders"} 1547

# Temperature of server room
room_temperature_celsius{room="dc-east-1"} 22.4

Use for: CPU usage, memory usage, queue depth, active connections, temperature. Anything with a current value.

Histogram

Records the distribution of values by putting them into configured buckets.

# Request duration histogram
http_request_duration_seconds_bucket{le="0.01"} 2400   # ≤10ms
http_request_duration_seconds_bucket{le="0.05"} 8900   # ≤50ms
http_request_duration_seconds_bucket{le="0.1"}  12300  # ≤100ms
http_request_duration_seconds_bucket{le="0.5"}  14100  # ≤500ms
http_request_duration_seconds_bucket{le="1.0"}  14500  # ≤1s
http_request_duration_seconds_bucket{le="+Inf"} 14523  # all
http_request_duration_seconds_sum 4521.3               # total seconds
http_request_duration_seconds_count 14523              # total requests

Use for: request latency, response sizes, batch processing times. Anything where the distribution matters more than the average.

The key feature: you can compute percentiles server-side using histogram_quantile().

# p99 latency over the last 5 minutes
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Spicy opinion: always use histograms over summaries for latency. Histograms are aggregatable across instances, so you can compute the p99 of your entire fleet by summing the bucket rates by le before taking the quantile: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))). Summaries compute percentiles on each instance and can't be meaningfully aggregated. A summary p99 from 10 servers tells you nothing useful about the global p99.
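A rough Python sketch of what histogram_quantile() does with cumulative buckets, assuming linear interpolation inside the bucket that contains the target rank (the behavior Prometheus documents for classic histograms). The function names are hypothetical. merge_buckets illustrates why histograms aggregate: counts for the same le bound simply add up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from [(upper_bound, cumulative_count), ...].

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan


def merge_buckets(per_instance):
    """Sum cumulative counts across instances, keyed by upper bound."""
    merged = {}
    for buckets in per_instance:
        for upper_bound, count in buckets:
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return sorted(merged.items())
```

Fed the bucket counts from the example above, the sketch puts p99 at roughly 0.847 seconds: the target rank lands in the 0.5 to 1.0 bucket and is interpolated from there. The estimate is only as fine as the bucket boundaries, which is the price of aggregatability.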

Summary

Similar to histogram but computes percentiles client-side (in the application).

# Summary with pre-computed quantiles
http_request_duration_seconds{quantile="0.5"}  0.042
http_request_duration_seconds{quantile="0.9"}  0.087
http_request_duration_seconds{quantile="0.99"} 0.235

Use for: when you need exact percentiles per-instance and don't need to aggregate across instances. In practice, histograms are almost always better.
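A small Python illustration of why per-instance percentiles don't combine. The data is invented for the example: one busy, fast instance and one quiet, slow one. percentile is a plain nearest-rank helper, not a Prometheus function.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is given on a 0-100 scale."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds.
busy = [0.01] * 990 + [0.02] * 10   # 1,000 requests, almost all at 10 ms
quiet = [1.0] * 10                  # 10 requests, all at 1 s

# What you get by averaging two summary p99 values.
avg_of_p99 = (percentile(busy, 99) + percentile(quiet, 99)) / 2

# What the p99 actually is across all 1,010 requests.
true_p99 = percentile(busy + quiet, 99)
```

The averaged figure comes out near half a second, while the real fleet-wide p99 is 20 ms. The quiet instance gets the same vote as the busy one even though it served 1% of the traffic.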

Pull Model: Prometheus Scrapes You

Prometheus uses a pull model. Instead of services pushing metrics to a collector, Prometheus periodically scrapes an HTTP endpoint on each service.

Every 15 seconds:

Prometheus ──GET /metrics──▶ Service A
Prometheus ──GET /metrics──▶ Service B
Prometheus ──GET /metrics──▶ Service C

Service exposes:
  GET /metrics →
    http_requests_total{method="GET"} 1423
    http_requests_total{method="POST"} 89
    process_cpu_seconds_total 342.7
    go_goroutines 47

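Service discovery tells Prometheus what to scrape. In Kubernetes, Prometheus watches the Kubernetes API for pods. A common convention, implemented with relabeling rules in the scrape config rather than built into Prometheus, is to scrape only pods that carry specific annotations: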

# Kubernetes pod annotation
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

New pods appear? Prometheus finds them automatically. Pods die? Prometheus stops scraping them. No configuration changes needed.
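On the service side, the contract is just an HTTP endpoint that returns text. Below is a minimal sketch using only the Python standard library. The Counter class, REQUESTS, and serve are hypothetical names for illustration. A real service would normally use an official client library instead of hand-rolling the format.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A tiny labelled counter that renders itself in the text exposition format."""

    def __init__(self, name):
        self.name = name
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, **labels):
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + 1

    def render(self):
        lines = [f"# TYPE {self.name} counter"]
        with self._lock:
            for key, value in sorted(self._values.items()):
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = f"{{{label_str}}}" if label_str else ""
                lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


REQUESTS = Counter("http_requests_total")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Call REQUESTS.inc(method="GET") from your request path and serve() from a background thread. After that, curl localhost:8080/metrics shows exactly what Prometheus will see.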

Pull vs Push

Aspect	Pull (Prometheus)	Push (Datadog, StatsD)
Debugging	`curl /metrics` on any service	Need the collector to be running
Service discovery	Prometheus needs to find targets	Targets push to known endpoint
Firewalls	Prometheus must reach services	Services must reach collector
Short-lived jobs	Misses jobs that start and finish between scrapes	Captures everything
Backpressure	Prometheus controls scrape rate	Services can overwhelm collector
Health detection	Failed scrape = service might be down	Silence = ambiguous

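When pull doesn't work: serverless functions (Lambda, Cloud Functions), batch jobs that run for seconds, services behind strict firewalls. For these, Prometheus provides the Pushgateway, a bridge that accepts pushed metrics and exposes them for scraping. The Prometheus documentation recommends it mainly for service-level batch jobs, because pushed metrics stay in the gateway until someone deletes them.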

# Push a metric to Pushgateway after a batch job completes
echo "batch_job_duration_seconds 42.3" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/nightly_etl
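The same push from inside a Python batch job might look like the sketch below. build_push is a hypothetical helper. The URL layout mirrors the curl command above, and the gateway address is whatever your environment uses.

```python
import urllib.parse
import urllib.request


def build_push(gateway_url, job, metrics):
    """Build a POST request carrying {metric_name: value} in text exposition format."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    job_path = urllib.parse.quote(job, safe="")
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job_path}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")
```

Sending it is one more call at the end of the job: urllib.request.urlopen(build_push("http://pushgateway:9091", "nightly_etl", {"batch_job_duration_seconds": 42.3})).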

PromQL: The Query Language

PromQL is what makes Prometheus useful. Without it, you'd just have a pile of numbers.

Essential Queries

# Current value of a metric
http_requests_total{method="GET", status="200"}

# Rate of change (requests per second over 5m window)
rate(http_requests_total[5m])

# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# Average over time (useful for gauges)
avg_over_time(cpu_usage_percent[1h])

# Sum across all instances, grouped by handler
sum by (handler) (rate(http_requests_total[5m]))

# Top 5 endpoints by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))
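If PromQL aggregation feels opaque, it can help to see the same operations over plain data. The sketch below assumes rate() has already turned each series into a (labels, per_second_rate) pair. The helper names mirror the PromQL operators but are otherwise made up.

```python
import re
from collections import defaultdict


def sum_by(series, label):
    """Like `sum by (label) (...)`: add up values that share a label value."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


def topk(k, totals):
    """Like `topk(k, ...)`: the k largest groups, biggest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


def error_percent(series):
    """Share of traffic whose status matches the regex 5.. , as a percentage."""
    total = sum(value for _, value in series)
    errors = sum(
        value for labels, value in series
        if re.fullmatch(r"5..", labels.get("status", ""))
    )
    return 100.0 * errors / total if total else float("nan")
```

Note that the error percentage divides two sums. Dividing per-series first and summing afterwards would weight a quiet endpoint the same as a busy one.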

Alerting Rules

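PromQL powers alerting. An alert becomes active when its PromQL expression returns one or more series. A comparison such as > 0.05 filters out every series that doesn't meet the condition, so an empty result means no alert.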

# prometheus_rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m    # must be true for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1 second for 10 minutes"

The for clause is critical. Without it, a 30-second spike would page someone at 3 AM. With for: 5m, the condition must persist for 5 minutes. This prevents false alarms from transient spikes.
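The for behavior is a small state machine: inactive, pending, firing. A minimal Python sketch, assuming one boolean result per rule evaluation (alert_states is a hypothetical name, and real Prometheus tracks this per label set):

```python
def alert_states(evaluations, for_seconds):
    """Map [(timestamp, condition_true), ...] to a list of alert states.

    The alert is 'pending' from the first true evaluation and becomes 'firing'
    only once the condition has held continuously for for_seconds. Any false
    evaluation resets the clock.
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states
```

With one-minute evaluations and for_seconds=300, a two-minute spike never gets past pending. Only a condition that holds for five straight minutes reaches firing.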

TSDB Internals: How Prometheus Stores Data

Prometheus stores metrics in its own custom time-series database (TSDB). Understanding the storage model explains both its strengths and its limits.

Storage Architecture

Prometheus TSDB:
┌──────────────────────────────────────────┐
│ Head Block (last 2 hours, in-memory)     │
│  ├── WAL (Write-Ahead Log on disk)       │
│  └── In-memory chunks (compressed)       │
├──────────────────────────────────────────┤
│ Block 1 (hours 2-4, on disk)             │
│  ├── chunks/ (compressed time series)    │
│  ├── index (inverted index on labels)    │
│  └── meta.json                           │
├──────────────────────────────────────────┤
│ Block 2 (hours 4-6, on disk)             │
│  ├── chunks/                             │
│  ├── index                               │
│  └── meta.json                           │
└──────────────────────────────────────────┘

Head block: the last 2 hours of data, in memory. Writes go to the WAL first (crash recovery), then into in-memory chunks.
Persistent blocks: every 2 hours, the head block is flushed to disk as an immutable block.
Compaction: background process merges smaller blocks into larger ones. Reduces file count and improves query performance.

Gorilla Compression

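Prometheus's chunk encoding is based on Facebook's Gorilla compression algorithm (from their 2015 paper). It exploits the fact that consecutive samples of a series usually arrive at regular intervals and carry similar values.

Timestamps (delta-of-delta):
  raw:            1000, 1015, 1030, 1045
  deltas:         15, 15, 15
  delta-of-delta: 0, 0          (a regular scrape interval costs about 1 bit per timestamp)

Values (XOR of the float64 bit patterns, not arithmetic subtraction):
  raw:  73.2, 73.2, 73.5
  73.2 XOR 73.2 = 0             (an unchanged value costs 1 bit)
  73.2 XOR 73.5 = nonzero       (only the block of differing bits is stored)

Result: ~1.37 bytes per data point on average in the paper (vs 16 bytes raw).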

That's a 12x compression ratio. A Prometheus server with 1 million active time series, scraped every 15 seconds, uses roughly 2-3 GB of RAM for the head block.
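The two ideas behind Gorilla encoding are easy to poke at in Python. This sketch only computes the intermediate quantities (XOR of float bit patterns, delta-of-delta of timestamps). It does not implement the variable-length bit packing, and all function names are illustrative.

```python
import struct


def float_bits(x):
    """The 64-bit IEEE 754 pattern of a float, as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR each value's bit pattern with the previous value's."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


def meaningful_bits(x):
    """Width of the span between the leading and trailing zeros of x."""
    if x == 0:
        return 0
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


def delta_of_delta(timestamps):
    """Second-order differences; all zeros when the scrape interval is steady."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]
```

Try meaningful_bits on the XOR of a gauge that flips between whole numbers versus one with long decimal fractions. Repeated or round values leave few meaningful bits, which is why counters and slowly changing gauges compress so well.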

Retention

Prometheus defaults to 15 days of local retention. For longer retention, use:

Thanos: reads from Prometheus, uploads blocks to S3, provides global querying across multiple Prometheus instances.
Cortex / Mimir: horizontally scalable metrics backends (Cortex is a CNCF project, Mimir is Grafana Labs' fork of it). They receive data via remote write and store it in object storage such as S3.
VictoriaMetrics: Prometheus-compatible, better compression, long-term storage.

Cardinality: The #1 Prometheus Killer

Cardinality is the number of unique time series. Each unique combination of metric name + label values creates one time series.

# 3 methods × 5 handlers × 4 status codes = 60 time series
http_requests_total{method="GET",  handler="/users",  status="200"}
http_requests_total{method="POST", handler="/users",  status="201"}
http_requests_total{method="GET",  handler="/orders", status="404"}
...

# Now add a user_id label (1 million users):
http_requests_total{method="GET", handler="/users", status="200", user_id="12345"}
# 3 × 5 × 4 × 1,000,000 = 60 MILLION time series
# Prometheus dies. OOM. Game over.

Spicy opinion: cardinality explosion is responsible for more Prometheus outages than any other cause. It's always the same mistake: someone adds a high-cardinality label (user ID, request ID, IP address, trace ID) to a metric. Don't do it. Labels should have bounded, low cardinality — typically under 100 distinct values.

Rules for labels:

Status codes, HTTP methods, service names, regions: good (bounded)
User IDs, IP addresses, request IDs, email addresses: bad (unbounded)
If the label's distinct values grow with traffic or users, don't use it as a label
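The multiplication is worth running before you ship a new label. A minimal Python sketch, where the cardinality numbers are the ones from the example above and the 100-value threshold is the rule of thumb from this section, not a Prometheus limit:

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


def risky_labels(label_cardinalities, limit=100):
    """Labels whose distinct-value count exceeds the rule-of-thumb limit."""
    return sorted(name for name, n in label_cardinalities.items() if n > limit)


labels = {"method": 3, "handler": 5, "status": 4}
before = series_count(labels)

labels["user_id"] = 1_000_000
after = series_count(labels)
```

Here before is 60 and after is 60,000,000. The product is an upper bound, since not every combination occurs in practice, but an unbounded label pushes the real number toward it over time.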

Scaling Prometheus

A single Prometheus server handles 1-10 million time series comfortably. Beyond that, you need to scale.

Federation

┌─────────────────────────────────────────┐
│ Global Prometheus (aggregated metrics)  │
│  scrapes /federate from regional        │
└─────────┬──────────────┬────────────────┘
          │              │
    ┌─────▼─────┐  ┌─────▼─────┐
    │ Regional  │  │ Regional  │
    │ Prometheus│  │ Prometheus│
    │ (US-East) │  │ (EU-West) │
    └───────────┘  └───────────┘

Each regional Prometheus scrapes its own targets. A global Prometheus scrapes pre-aggregated metrics from the regional instances.

Thanos / Mimir

For true horizontal scaling and long-term storage:

Prometheus (region A) ──remote-write──▶ Thanos Receive ──▶ S3
Prometheus (region B) ──remote-write──▶ Thanos Receive ──▶ S3

Thanos Query ──reads from──▶ Thanos Store (S3) + Prometheus (live)

Thanos provides a unified query interface across multiple Prometheus instances and S3-backed historical data. Grafana Labs' Mimir does the same thing with better multi-tenancy.

VictoriaMetrics

An alternative to the Prometheus+Thanos stack. Prometheus-compatible (same query language, same scrape format) but built as a clustered system from the start.

vminsert (receives data) → vmstorage (stores data) → vmselect (queries data)

VictoriaMetrics is reported to achieve better compression than Prometheus (up to 10x, depending on workload) and to handle higher cardinality more gracefully. It's a compelling choice for teams that outgrow a single Prometheus server but don't want the complexity of Thanos.

Patterns for System Design Interviews

Pattern 1: Mention metrics for any system. "We expose metrics on each service — request count, latency histogram, error rate. Prometheus scrapes them every 15 seconds." Shows you think about production readiness.

Pattern 2: The RED method. For services: Rate (requests/second), Errors (error rate), Duration (latency distribution). Mention this and you look like you've actually operated a production service.

Pattern 3: The USE method. For infrastructure: Utilization (CPU%, memory%), Saturation (queue depth, thread pool exhaustion), Errors. Coined by Brendan Gregg.

Pattern 4: SLOs from metrics. "Our SLO is 99.9% of requests under 200ms. Prometheus tracks this via histogram_quantile(0.999, ...). Alertmanager fires when the SLO is at risk."
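An SLO of this shape can also be checked directly from bucket counts, without interpolation, as long as the threshold is one of the configured bucket bounds. The sketch below assumes cumulative counts keyed by upper bound. The function names are hypothetical, and the sample counts are the ones from the histogram example earlier, which has no 0.2 bucket, so the check uses 0.5 seconds instead.

```python
import math


def fraction_within(buckets, threshold):
    """Share of requests at or under threshold; threshold must be a bucket bound."""
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    return buckets[threshold] / total


def slo_met(buckets, threshold, target):
    """True when the share of fast-enough requests reaches the target."""
    return fraction_within(buckets, threshold) >= target


buckets = {0.01: 2400, 0.05: 8900, 0.1: 12300, 0.5: 14100, 1.0: 14500, math.inf: 14523}
within_half_second = fraction_within(buckets, 0.5)
```

For these counts about 97% of requests finish within 0.5 seconds, so a 99.9% target would be missed. If you want to track a 200ms SLO this way, 0.2 needs to be a bucket boundary in the histogram.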

Trade-offs Table

Dimension	Advantage	Disadvantage
Pull model	Easy debugging, natural health detection	Doesn't work for serverless/short-lived
PromQL	Powerful, standard for Kubernetes	Steep learning curve
Local storage	Fast, no external dependency	Limited retention, single-node
Label model	Flexible, multi-dimensional	Cardinality explosion risk
Ecosystem	Grafana, Alertmanager, exporters	Can't replace commercial APM entirely
Compression	1.37 bytes/point (Gorilla)	Still memory-bound for head block
Community	Massive, CNCF graduated	Fragmented scaling solutions

Interview Gotchas

Gotcha 1: "How do you monitor Prometheus itself?" Another Prometheus instance (cross-monitoring), or a separate system like VictoriaMetrics. Never leave your monitoring unmonitored. This sounds recursive, but it's standard practice.

Gotcha 2: "Why not use Datadog instead of Prometheus?" Datadog is SaaS (push model, hosted storage, managed dashboards). Prometheus is open-source (self-hosted, pull model). Datadog costs $15-23 per host per month — at 1,000 hosts, that's $15-23K/month. Prometheus costs infrastructure only. The trade-off is money vs operational burden.

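Gotcha 3: "What happens if a scrape fails?" Prometheus marks the target as down (up{instance="..."} = 0). The next scrape retries. You can alert on up == 0. A failed scrape stores no samples for that target, so its series have a gap for that interval and are marked stale until scraping resumes. Functions like rate() still work across a short gap as long as the range window contains at least two samples.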

Gotcha 4: "How do you handle metrics in a multi-region deployment?" One Prometheus per region, scraping local targets. Either federation or Thanos/Mimir for a global view. Don't try to scrape across regions — the latency makes scrape timeouts unreliable.

Gotcha 5: "What's the difference between monitoring and observability?" Monitoring tells you when something is broken (dashboards, alerts). Observability lets you ask why it's broken without deploying new code (metrics, traces, and logs together). Metrics alone are monitoring. All three pillars together are observability.