Structured Logging and Alerting Pipelines

TL;DR

Structured logs are JSON lines that machines can parse and humans can filter — and paired with a proper alerting pipeline that routes by severity and prevents flapping, they turn your observability stack from "we can see dashboards" into "we can actually debug production."

Why Structured Logging

Unstructured logs look like this:

2024-01-15 14:32:05 ERROR PaymentService - Failed to charge user 42: Stripe timeout after 2400ms

A human can read it. A machine can't do much with it. Want to find all Stripe timeouts in the last hour? Regex. Want to count errors by user? More regex. Want to correlate with a trace? Hope the developer remembered to log the trace ID.

Structured logs look like this:

{
  "timestamp": "2024-01-15T14:32:05.123Z",
  "level": "error",
  "service": "payment-service",
  "instance": "payment-7b4d8f-x2k9p",
  "trace_id": "abc123def456",
  "span_id": "span-003",
  "user_id": "user-42",
  "message": "Failed to charge user",
  "error": "Stripe timeout",
  "stripe_request_id": "req_xyz789",
  "duration_ms": 2400,
  "retry_count": 2
}

Every field is a key. Every value is typed. You can filter (error == "Stripe timeout"), aggregate (COUNT(*) WHERE service = "payment-service" GROUP BY error), and correlate (trace_id = "abc123def456" across all services).
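As a rough illustration of how an application might emit lines like the one above, here is a minimal sketch using only Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `extra={"context": ...}` convention are assumptions of this sketch, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # service name supplied by the caller

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Context fields arrive via extra={"context": {...}} (a convention of this sketch).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()  # writes to stderr by default
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("payment-service")
logger.error(
    "Failed to charge user",
    extra={"context": {"user_id": "user-42", "error": "Stripe timeout", "duration_ms": 2400}},
)
```

Keeping the message constant ("Failed to charge user") and moving the variable parts into fields is what makes the later filtering and aggregation possible.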

Spicy opinion: if your logs aren't structured JSON, they're basically decoration. You'll only read them when grepping on a single server, and once you have more than 10 servers, that approach is dead. Structured logging is non-negotiable for any production system.

What Every Log Line Needs

At minimum, every log entry should include:

Field	Why
`timestamp`	ISO 8601 with timezone. The "when."
`level`	ERROR, WARN, INFO, DEBUG. The severity.
`service`	Which service emitted it.
`instance`	Which pod/host. For debugging specific instances.
`trace_id`	Links to distributed trace. The cross-service thread.
`message`	Human-readable description.
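One way to enforce this minimum is a small check in CI or in the log pipeline. The sketch below is an assumption-laden example: the function name `validate_log_line` and the exact set of accepted levels are illustrative choices, not an established standard.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "level", "service", "instance", "trace_id", "message")
VALID_LEVELS = {"error", "warn", "info", "debug"}


def validate_log_line(line: str) -> list[str]:
    """Return a list of problems; an empty list means the line passes."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(entry, dict):
        return ["not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if "level" in entry and str(entry["level"]).lower() not in VALID_LEVELS:
        problems.append(f"unknown level: {entry['level']}")
    if "timestamp" in entry:
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
            parsed = datetime.fromisoformat(str(entry["timestamp"]).replace("Z", "+00:00"))
            if parsed.tzinfo is None:
                problems.append("timestamp has no timezone")
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems
```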

Beyond the basics, add context fields relevant to the operation:

{
  "timestamp": "2024-01-15T14:32:05.123Z",
  "level": "info",
  "service": "order-service",
  "trace_id": "abc123def456",
  "message": "Order created",
  "order_id": "order-9981",
  "user_id": "user-42",
  "total_amount": 59.99,
  "items_count": 3,
  "payment_method": "credit_card",
  "duration_ms": 145
}

Log levels matter. Use them consistently:

ERROR: something broke that needs human attention. Alert-worthy.
WARN: something unexpected happened but the system recovered. Worth investigating if frequent.
INFO: normal operations you'd want to audit. Order created. User logged in. Deployment started.
DEBUG: detailed internals for troubleshooting. Off by default in production. Turn on per-service when debugging.

Don't log at ERROR for things that aren't errors. A user entering a wrong password is not an ERROR — it's an INFO or WARN. Reserve ERROR for genuine system failures.

The Log Pipeline

You can't query JSON files scattered across 500 servers. You need a pipeline that collects, ships, stores, and indexes logs centrally.

Collection Layer

Application → stdout/stderr (container logs)
    │
    ▼
Log Agent (runs on each node):
  ├── Fluentd / Fluent Bit (CNCF, Kubernetes-native)
  ├── Filebeat (Elastic ecosystem)
  ├── Vector (Datadog's open-source agent)
  └── Promtail (Grafana Loki ecosystem)

In Kubernetes, applications log to stdout. The container runtime captures stdout and writes it to files. A log agent (DaemonSet on each node) reads those files, parses the JSON, adds metadata (pod name, namespace, node), and forwards to the aggregation layer.
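The parse-and-enrich step is conceptually simple. The sketch below shows roughly what an agent does per line; it is not how Fluent Bit or any other agent is actually implemented, and the `enrich` function and the `kubernetes` and `parse_error` field names are hypothetical.

```python
import json


def enrich(raw_line: str, pod: str, namespace: str, node: str) -> dict:
    """Parse one container log line and attach Kubernetes metadata."""
    try:
        entry = json.loads(raw_line)
        if not isinstance(entry, dict):
            raise ValueError("JSON value is not an object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Non-JSON output (stack traces, third-party libraries) is kept, not dropped.
        entry = {"message": raw_line.rstrip("\n"), "parse_error": True}
    entry["kubernetes"] = {"pod": pod, "namespace": namespace, "node": node}
    return entry
```

Wrapping unparseable lines instead of discarding them matters in practice, because crash output is often the least structured and the most important.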

Fluent Bit is the lightweight choice for Kubernetes. It uses ~15 MB of memory and handles thousands of log lines per second. Fluentd is its bigger sibling with more plugins.

Aggregation and Buffering

Log Agents ──▶ Kafka ──▶ Log Storage
               (buffer)

Kafka between agents and storage serves two purposes:

Buffering: if the storage backend (Elasticsearch) goes down or is slow, Kafka absorbs the backlog. Logs survive the outage as long as it ends before Kafka's retention window expires.
Fan-out: multiple consumers can read the same log stream. Send to Elasticsearch for search AND to S3 for archival AND to a fraud detection pipeline.

This isn't always necessary. For smaller deployments, agents can ship directly to storage.
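Both properties come from the same mechanism: an append-only log with an independent read offset per consumer group. The toy model below illustrates that idea in memory. It does not use a Kafka client, and `LogTopic` and the consumer group names are made up for the example.

```python
class LogTopic:
    """In-memory stand-in for a single-partition topic with per-group offsets."""

    def __init__(self) -> None:
        self._records: list[str] = []
        self._offsets: dict[str, int] = {}

    def produce(self, record: str) -> None:
        self._records.append(record)

    def poll(self, group: str, max_records: int = 100) -> list[str]:
        """Return the next unread records for `group` and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def lag(self, group: str) -> int:
        """Number of records the group has not read yet."""
        return len(self._records) - self._offsets.get(group, 0)


topic = LogTopic()
for i in range(5):
    topic.produce(f'{{"message": "event {i}"}}')

# Fan-out: the archiver reads everything without affecting other groups.
archived = topic.poll("s3-archiver")
# Buffering: the indexer is "down" and has read nothing, so its backlog is all 5 records.
backlog = topic.lag("elasticsearch-indexer")
```

When the indexer comes back it resumes from its own offset, which is why a slow consumer does not cost the fast ones anything.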

Storage and Search

Two dominant options with very different philosophies.

ELK Stack: The Full-Text Search Approach

Elasticsearch + Logstash + Kibana — the classic log management stack.

Applications → Filebeat → Logstash → Elasticsearch → Kibana
              (collect)   (parse,    (index,         (search,
                          enrich)    store)          visualize)

Elasticsearch indexes every field in every log line. This means you can search for any value in any field, with full-text search, fuzzy matching, and aggregations.

# Kibana query (KQL)
service: "payment-service" AND level: "error" AND message: "timeout"

# Elasticsearch query
GET logs-2024.01.15/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "payment-service"}},
        {"term": {"level": "error"}},
        {"match": {"message": "timeout"}}
      ]
    }
  }
}

The problem: Elasticsearch is expensive. It indexes everything, which means every log line consumes 5-10x its original size in storage. At 100 GB/day of raw logs, you need 500 GB - 1 TB/day of Elasticsearch storage. With 30-day retention, that's 15-30 TB.
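The arithmetic is worth making explicit so you can plug in your own numbers. This sketch simply reproduces the estimate above; the expansion factor is the article's rule of thumb, not a measured value, and `index_storage_tb` is a hypothetical helper.

```python
def index_storage_tb(raw_gb_per_day: float, expansion: float, retention_days: int) -> float:
    """Total stored terabytes, using decimal units (1 TB = 1000 GB)."""
    return raw_gb_per_day * expansion * retention_days / 1000


low = index_storage_tb(100, 5, 30)    # 5x expansion
high = index_storage_tb(100, 10, 30)  # 10x expansion
print(f"{low:.0f}-{high:.0f} TB for 30-day retention")
```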

Elasticsearch also needs tuning: shard sizing, replica count, JVM heap, merge policies, ILM (Index Lifecycle Management). Running it well is a full-time job.

Grafana Loki: The Label-Only Approach

Grafana Labs built Loki as "like Prometheus, but for logs." Instead of indexing log content, Loki indexes only the labels (service, level, namespace). The log content itself is compressed, stored in object storage (such as S3), and only scanned at query time.

Applications → Promtail → Loki → Grafana
              (collect,    (index labels,
               add labels) store chunks in S3)

# LogQL (Loki's query language)
{service="payment-service", level="error"} |= "timeout"

# This:
# 1. Finds chunks with matching labels (indexed, fast)
# 2. Scans log content for "timeout" (grep, slower)

Why Loki is cheaper: no full-text index. Log content is compressed and stored in S3 (about $0.023 per GB per month at standard-tier list price). Only labels are indexed. The trade-off: content searches are slower because Loki must scan through compressed chunks. But label-based filtering narrows the scan dramatically.
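The two-step query can be modelled in a few lines. This is a toy illustration of the idea, not Loki's actual storage engine: real Loki keeps compressed chunks in object storage, while the hypothetical `TinyLabelStore` below keeps plain lists in memory.

```python
from collections import defaultdict


class TinyLabelStore:
    """Toy model: labels are indexed, log content is only scanned at query time."""

    def __init__(self) -> None:
        # One "stream" per unique label set, like one series per label set in Prometheus.
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str) -> list[str]:
        wanted = set(selector.items())
        hits: list[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # step 1: label match, cheap
                # step 2: scan the content of matching streams only, like grep
                hits.extend(line for line in lines if contains in line)
        return hits
```

The model also shows why high-cardinality labels hurt: every distinct label set creates a new stream, so putting something like a user ID in a label multiplies the number of streams.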

Loki vs Elasticsearch:

Aspect	Elasticsearch	Grafana Loki
Storage cost	5-10x raw size (indexes)	~1.5x raw size (compressed)
Full-text search	Milliseconds	Seconds (scans chunks)
Label filtering	Milliseconds	Milliseconds
Operations	Complex (JVM tuning, shards)	Simple (S3 backend)
Ecosystem	Kibana, mature plugins	Grafana (same as metrics)
Best for	Security/compliance (need fast search)	Cost-sensitive, Grafana users

Spicy opinion: most teams should default to Loki over Elasticsearch for application logs. The cost difference is 5-10x, and the query speed difference only matters for security incident investigation — which most teams do rarely. Save Elasticsearch for security teams and audit logs. Use Loki for everything else.

Alerting Pipeline

Metrics fire the alerts. But the pipeline between "alert fires" and "human gets paged" is its own architecture.

Prometheus → Alertmanager → Notification Channel

Prometheus evaluates rules every 15s:
  HighErrorRate expr = true?
    │
    ├── First time true → State: PENDING
    │   (wait for `for: 5m`)
    │
    ├── Still true after 5 minutes → State: FIRING
    │   Send to Alertmanager
    │
    └── Becomes false → State: RESOLVED
        Alertmanager sends recovery notification

Alertmanager:
  ├── Deduplicate (same alert from multiple Prometheus replicas)
  ├── Group (batch related alerts into one notification)
  ├── Silence (suppress during maintenance)
  ├── Inhibit (suppress lower-severity if higher fires)
  └── Route to notification channel
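To make the Alertmanager stages concrete, here is a simplified sketch of deduplication, inhibition, and grouping over alerts represented as label dictionaries. It assumes replicas send identical label sets and hard-codes one inhibition rule (critical mutes warning for the same service). The `process` function is hypothetical and far simpler than Alertmanager's real logic.

```python
from collections import defaultdict


def process(alerts: list[dict], group_by: tuple[str, ...] = ("alertname", "service")) -> dict:
    """Deduplicate, inhibit, then group alerts. Each alert is a dict of labels."""
    # Deduplicate: identical label sets collapse to a single alert.
    unique = list({frozenset(a.items()): a for a in alerts}.values())

    # Inhibit: a critical alert for a service mutes that service's warnings.
    critical_services = {a["service"] for a in unique if a["severity"] == "critical"}
    kept = [
        a for a in unique
        if not (a["severity"] == "warning" and a["service"] in critical_services)
    ]

    # Group: one notification per distinct combination of the group_by labels.
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in kept:
        groups[tuple(alert[key] for key in group_by)].append(alert)
    return dict(groups)
```

Each key in the returned dictionary corresponds to one notification, so five raw alerts can easily shrink to one or two messages.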

Alert States

INACTIVE ──(condition true)──▶ PENDING ──(true for `for` duration)──▶ FIRING
    ▲                              │                                     │
    └──────(condition false)───────┘                                     │
    ▲                                                                    │
    └─────────────────(condition false)──── RESOLVED ◀───────────────────┘

The PENDING state is what prevents noisy paging. Without it, every 30-second blip would fire an alert. With a `for: 5m` clause, the condition must be true on every evaluation for 5 minutes before anyone gets paged. This suppresses most flapping (alerts that fire and resolve repeatedly).

# Example: only alert if error rate stays high for 5 minutes
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
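The state machine behind the `for` clause fits in a short class. This is a sketch of the behaviour described above, not Prometheus source code. `AlertRule` and its method names are illustrative, and the FIRING to INACTIVE transition is the point where a resolved notification would be sent.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"


class AlertRule:
    def __init__(self, for_seconds: float) -> None:
        self.for_seconds = for_seconds
        self.state = State.INACTIVE
        self._active_since: Optional[float] = None

    def evaluate(self, condition: bool, now: float) -> State:
        """Feed one rule evaluation; `now` is a timestamp in seconds."""
        if not condition:
            # Any false evaluation resets the timer, so short blips never reach FIRING.
            self.state = State.INACTIVE
            self._active_since = None
        else:
            if self._active_since is None:
                self._active_since = now
            held = now - self._active_since
            self.state = State.FIRING if held >= self.for_seconds else State.PENDING
        return self.state
```

Feeding it a condition that is true at t=0 and t=30 but false at t=45 leaves the rule INACTIVE, and the five-minute clock starts over on the next true evaluation.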

Alert Routing: Severity to Channel

Not all alerts deserve the same response.

# alertmanager.yml
route:
  receiver: default-slack
  routes:
    # `matchers` is the current syntax; the older `match` field is deprecated
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
      # P1: pages the on-call engineer immediately

    - matchers:
        - severity="warning"
      receiver: slack-engineering
      # P2: Slack channel, needs attention today

    - matchers:
        - severity="info"
      receiver: email-weekly
      # P3: informational, batch into weekly digest

Severity	Response Time	Channel	Example
P1 / Critical	Minutes	PagerDuty (pages on-call)	Database is down, error rate > 10%
P2 / Warning	Hours	Slack channel	Disk at 80%, p99 latency elevated
P3 / Info	Days	Email digest	Certificate expiring in 30 days
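The routing logic amounts to "first matching route wins, otherwise fall back to the default receiver." A minimal sketch of that rule, mirroring the receiver names from the config above (the `route` function itself is hypothetical and ignores Alertmanager features such as nested routes and `continue`):

```python
ROUTES = [  # (label matchers, receiver), checked top to bottom
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "warning"}, "slack-engineering"),
    ({"severity": "info"}, "email-weekly"),
]
DEFAULT_RECEIVER = "default-slack"


def route(labels: dict[str, str]) -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER
```

The fallback is worth noticing: an alert with a missing or misspelled severity label quietly lands in the default channel, so it pays to make that channel one somebody reads.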

Routing rules to live by:

P1 alerts must be actionable right now. If you can't do anything about it at 3 AM, it's not a P1.
P2 alerts must be actionable today. If it can wait until next sprint, it's not a P2.
P3 alerts are FYI. If nobody reads them, delete them.

Alert Fatigue: The Real Enemy

Alert fatigue kills on-call teams. When engineers get 50 alerts per day, they start ignoring all of them — including the real ones. PagerDuty's own research shows that teams with more than 10 alerts per on-call shift have significantly higher incident response times.

Causes of alert fatigue:

Too many alerts: every metric has an alert. Most are noise.
Flapping alerts: alert fires, resolves, fires, resolves. Use a `for` clause.
Non-actionable alerts: "CPU is 85%" — what am I supposed to do about it? Replace it with an actionable alert: "Request latency exceeding SLO."
Duplicate alerts: same problem triggers 5 different alerts. Use Alertmanager grouping.
Bad thresholds: set during calm periods, fires constantly under normal load.

Fixes:

Delete noisy alerts. Seriously. If an alert fires weekly and nobody investigates, delete it. It's worse than useless — it trains the team to ignore alerts.
SLO-based alerting. Instead of "CPU > 80%", alert on "error budget burn rate too high." This ties alerts to user impact, not infrastructure metrics.
Weekly alert review. Every week, review which alerts fired. Delete the noisy ones. Tune the thresholds. This is a cultural practice, not a technical one.
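A weekly review is easier with a shortlist of candidates. The sketch below assumes you can export alert history as records with an `alertname` and an `acted_on` flag. Both field names and the thresholds are hypothetical, so adapt them to whatever your paging tool exports.

```python
from collections import Counter


def noisy_alerts(history: list[dict], min_fires: int = 3, max_action_ratio: float = 0.2) -> list[str]:
    """Names of alerts that fired often but were rarely acted on."""
    fires = Counter(event["alertname"] for event in history)
    acted = Counter(event["alertname"] for event in history if event["acted_on"])
    return sorted(
        name for name, count in fires.items()
        if count >= min_fires and acted[name] / count <= max_action_ratio
    )
```

The output is a list of deletion or tuning candidates for the meeting, not an automatic verdict: a rare alert that nobody acted on may still be worth keeping.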

Spotify runs a weekly "alert hygiene" meeting where teams review and prune alerts. The result: their on-call engineers get fewer than 5 actionable alerts per week.

SLO-Based Alerting

This is the modern approach. Instead of alerting on symptoms ("CPU high," "memory usage above 90%"), alert on user-facing objectives.

SLO: 99.9% of requests complete under 200ms

Measurement:
  error_budget = 1 - 0.999 = 0.001 (0.1% of requests can miss the target)
  30-day budget = 0.001 × 30 × 24 × 60 = 43.2 minutes, if spent as one full outage

Alert when:
  burn_rate > 14x → will exhaust 30-day budget in ~2 days  → P1
  burn_rate > 6x  → will exhaust 30-day budget in ~5 days  → P1
  burn_rate > 3x  → will exhaust 30-day budget in ~10 days → P2
  burn_rate > 1x  → will exhaust 30-day budget in 30 days  → P3

Google's Site Reliability Workbook (chapter 5, "Alerting on SLOs") describes this multi-window burn rate alerting. It's the gold standard.

Why this is better: a brief CPU spike doesn't fire an alert unless it actually affects users. And a slow degradation that users barely notice but will eventually exhaust the error budget does fire an alert — before it becomes a problem.
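The burn rate arithmetic from this section can be sketched as follows. This is a single-window simplification: a production setup would typically evaluate a long and a short window together, and the function names here are illustrative.

```python
ERROR_BUDGET = 0.001  # 1 - 0.999, the fraction of requests allowed to miss the SLO
WINDOW_DAYS = 30


def burn_rate(bad_requests: int, total_requests: int) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / ERROR_BUDGET


def days_to_exhaustion(rate: float) -> float:
    """Days until the 30-day budget is gone if the current rate continues."""
    return float("inf") if rate <= 0 else WINDOW_DAYS / rate


def severity(rate: float) -> str:
    """Map a burn rate to the thresholds listed above."""
    if rate > 6:
        return "P1"
    if rate > 3:
        return "P2"
    if rate > 1:
        return "P3"
    return "ok"
```

For example, 20 slow requests out of 1,000 is a 2% bad-request ratio, which is a burn rate of about 20 against a 0.1% budget. At that pace the 30-day budget lasts roughly a day and a half.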

The Three Pillars of Observability

Metrics, traces, and logs are called the three pillars. Each answers a different question.

METRICS: "What is happening?"
  → Error rate is 5%. p99 latency is 2.3 seconds.
  → USE: dashboard, alerting

TRACES: "Where is it happening?"
  → The payment service call to Stripe takes 2.4 seconds.
  → USE: latency debugging, dependency mapping

LOGS: "Why is it happening?"
  → Stripe returned 429 Too Many Requests. Retry limit reached.
  → USE: root cause analysis, audit trail

They're connected via trace IDs:

Alert fires (metric): error rate > 5%
  → Look at Grafana dashboard (metric): error rate spiked at 14:32
  → Find exemplar trace_id from metric: abc123def456
  → Open trace (Jaeger): payment-service span shows 2.4s timeout
  → Search logs (Loki): {service="payment-service"} | json | trace_id="abc123def456" → Stripe 429 error
  → Root cause found: Stripe rate limit. Solution: add retry with backoff.

This flow — from alert to metric to trace to log to root cause — is the observability workflow that every on-call engineer should know. It only works when all three pillars share the trace ID.
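The join on trace ID is the whole trick, and it is simple enough to sketch. The example below works on in-memory lists rather than real Jaeger or Loki APIs. The `investigate` function and the `self_time_ms` field (time spent in a span excluding its children) are assumptions made for illustration.

```python
def investigate(trace_id: str, spans: list[dict], logs: list[dict]) -> dict:
    """Join trace spans and log lines on a shared trace_id."""
    trace_spans = [span for span in spans if span["trace_id"] == trace_id]
    # The span with the most self time is where the request actually waited.
    slowest = max(trace_spans, key=lambda span: span["self_time_ms"], default=None)
    errors = [
        entry for entry in logs
        if entry.get("trace_id") == trace_id and entry.get("level") == "error"
    ]
    return {"slowest_span": slowest, "error_logs": errors}
```

If any service drops the trace ID from its logs, the `errors` list comes back empty for that service and the investigation stalls at the trace.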

Patterns for System Design Interviews

Pattern 1: "How do you monitor this system?" "Metrics for alerting (Prometheus + Grafana), traces for debugging latency (OpenTelemetry + Jaeger), structured logs for root cause analysis (Loki). All linked by trace IDs."

Pattern 2: "How do you handle alerting?" "Prometheus evaluates alert rules. Alertmanager routes by severity — P1 pages on-call via PagerDuty, P2 goes to Slack, P3 goes to email. We use SLO-based burn rate alerts to focus on user impact."

Pattern 3: "How do you handle log volume at scale?" "Structured JSON logs. Collected by Fluent Bit, buffered through Kafka, stored in Loki (labels indexed, content in S3). 30-day retention. Full-text search for security-critical logs in Elasticsearch."

Pattern 4: "How do you know when something is wrong?" "We define SLOs (99.9% of requests under 200ms). We measure error budget burn rate. When the burn rate exceeds thresholds, Alertmanager pages the on-call team with the affected service and suggested runbook."

Trade-offs Table

Dimension	ELK Stack	Grafana Loki	Datadog Logs
Cost	High (storage + ops)	Low (S3 backend)	Very high (per GB ingested)
Full-text search	Fast (indexed)	Slow (scans)	Fast (indexed)
Operational burden	High (JVM, shards, ILM)	Low (S3 + single binary)	None (SaaS)
Ecosystem	Kibana, Beats, Logstash	Grafana (unified with metrics)	Datadog platform
Best for	Security, compliance	Cost-sensitive, Grafana users	Teams that buy over build

Interview Gotchas

Gotcha 1: "Why not just use console.log everywhere?" Unstructured text is unqueryable at scale. You can't aggregate, filter, or alert on console.log output across 500 pods. Structured logging with proper fields is what enables observability.

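Gotcha 2: "How do you handle sensitive data in logs?" Scrub PII (emails, passwords, credit cards) before logs leave the application. Use a log pipeline filter to redact patterns as a second line of defense. Never log passwords or API keys, even in debug mode. Regulations such as GDPR and HIPAA restrict how personal and health data is stored and who can access it, and logs are not exempt.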
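A minimal sketch of in-application scrubbing, using a standard-library `logging.Filter`. The two regular expressions are deliberately crude assumptions: the card pattern matches any run of 13 to 16 digits, so it will also redact things like millisecond timestamps. Treat it as a starting point rather than a compliance control.

```python
import logging
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13-16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return CARD.sub("[REDACTED_CARD]", text)


class RedactingFilter(logging.Filter):
    """Rewrite the record in place so every handler sees the redacted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already fully formatted
        return True
```

Attaching it with `logger.addFilter(RedactingFilter())` applies the redaction before any handler formats or ships the record.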

Gotcha 3: "What's the difference between monitoring and alerting?" Monitoring is passive observation (dashboards, metrics). Alerting is active notification (pages, Slack messages). Monitoring tells you something is wrong when you look. Alerting tells you something is wrong whether you're looking or not.

Gotcha 4: "How do you prevent alert fatigue?" Use a `for` clause to prevent flapping. Route by severity. Delete noisy alerts weekly. Switch to SLO-based alerting. An on-call engineer should get no more than 5-10 actionable alerts per shift.

Gotcha 5: "Do you need all three observability pillars?" For a monolith, metrics and logs are usually enough. For microservices, you need traces too — otherwise you can't debug cross-service latency. The more distributed your system, the more you need all three pillars working together.