Structured Logging and Alerting Pipelines
TL;DR
Structured logs are JSON lines that machines can parse and humans can filter — and paired with a proper alerting pipeline that routes by severity and prevents flapping, they turn your observability stack from "we can see dashboards" into "we can actually debug production."
Why Structured Logging

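Unstructured logs look like this (the same event as the structured example below, flattened into one free-form line):
2024-01-15 14:32:05 ERROR Failed to charge user 42: Stripe timeout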
A human can read it. A machine can't do much with it. Want to find all Stripe timeouts in the last hour? Regex. Want to count errors by user? More regex. Want to correlate with a trace? Hope the developer remembered to log the trace ID.
Structured logs look like this:
{
  "timestamp": "2024-01-15T14:32:05.123Z",
  "level": "error",
  "service": "payment-service",
  "instance": "payment-7b4d8f-x2k9p",
  "trace_id": "abc123def456",
  "span_id": "span-003",
  "user_id": "user-42",
  "message": "Failed to charge user",
  "error": "Stripe timeout",
  "stripe_request_id": "req_xyz789",
  "duration_ms": 2400,
  "retry_count": 2
}
Every field is a key. Every value is typed. You can filter (`error == "Stripe timeout"`), aggregate (`COUNT(*) WHERE service = "payment-service" GROUP BY error`), and correlate (`trace_id = "abc123def456"` across all services).
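One way to emit entries like this from application code is a custom formatter on Python's standard `logging` module. The sketch below is illustrative only: `JsonFormatter`, `build_logger`, and the convention of passing extra fields under a `context` key are assumptions made for this example, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str, instance: str) -> None:
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "instance": self.instance,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}
        # and those keys are merged in as typed top-level fields.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service: str, instance: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()  # stderr by default, captured by the container runtime
    handler.setFormatter(JsonFormatter(service, instance))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payment-service", "payment-7b4d8f-x2k9p")
logger.error(
    "Failed to charge user",
    extra={"context": {"trace_id": "abc123def456", "user_id": "user-42",
                       "error": "Stripe timeout", "duration_ms": 2400}},
)
```

Numbers stay numbers (`duration_ms` is not quoted), which is what makes aggregations on the storage side possible.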
Spicy opinion: if your logs aren't structured JSON, they're basically decoration. You'll only read them when grepping on a single server, and once you have more than 10 servers, that approach is dead. Structured logging is non-negotiable for any production system.
What Every Log Line Needs
At minimum, every log entry should include:
| Field | Why |
|---|---|
| timestamp | ISO 8601 with timezone. The "when." |
| level | ERROR, WARN, INFO, DEBUG. The severity. |
| service | Which service emitted it. |
| instance | Which pod/host. For debugging specific instances. |
| trace_id | Links to distributed trace. The cross-service thread. |
| message | Human-readable description. |
Beyond the basics, add context fields relevant to the operation:
{
  "timestamp": "2024-01-15T14:32:05.123Z",
  "level": "info",
  "service": "order-service",
  "trace_id": "abc123def456",
  "message": "Order created",
  "order_id": "order-9981",
  "user_id": "user-42",
  "total_amount": 59.99,
  "items_count": 3,
  "payment_method": "credit_card",
  "duration_ms": 145
}
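A minimum field set is easier to keep consistent when it is checked mechanically, for instance in a unit test or a CI lint step. The following is a minimal sketch under that assumption; `check_entry`, `REQUIRED_FIELDS`, and `VALID_LEVELS` are hypothetical names, and the field list simply mirrors the table above.

```python
import json

REQUIRED_FIELDS = ("timestamp", "level", "service", "instance", "trace_id", "message")
VALID_LEVELS = {"error", "warn", "info", "debug"}


def check_entry(line: str) -> list[str]:
    """Return a list of problems found in one JSON log line (empty means OK)."""
    entry = json.loads(line)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if "level" in entry and str(entry["level"]).lower() not in VALID_LEVELS:
        problems.append(f"unknown level: {entry['level']}")
    return problems


sample = json.dumps({
    "timestamp": "2024-01-15T14:32:05.123Z",
    "level": "fatal",
    "service": "order-service",
    "message": "Order created",
})
print(check_entry(sample))
```

Running a check like this against a sample of each service's output catches drift early, before a missing `trace_id` is discovered in the middle of an incident.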
Log levels matter. Use them consistently:
- ERROR: something broke that needs human attention. Alert-worthy.
- WARN: something unexpected happened but the system recovered. Worth investigating if frequent.
- INFO: normal operations you'd want to audit. Order created. User logged in. Deployment started.
- DEBUG: detailed internals for troubleshooting. Off by default in production. Turn on per-service when debugging.
Don't log at ERROR for things that aren't errors. A user entering a wrong password is not an ERROR — it's an INFO or WARN. Reserve ERROR for genuine system failures.
The Log Pipeline
You can't query JSON files scattered across 500 servers. You need a pipeline that collects, ships, stores, and indexes logs centrally.
Collection Layer
Application → stdout/stderr (container logs)
│
▼
Log Agent (runs on each node):
├── Fluentd / Fluent Bit (CNCF, Kubernetes-native)
├── Filebeat (Elastic ecosystem)
├── Vector (Datadog's open-source agent)
└── Promtail (Grafana Loki ecosystem)
In Kubernetes, applications log to stdout. The container runtime captures stdout and writes it to files. A log agent (DaemonSet on each node) reads those files, parses the JSON, adds metadata (pod name, namespace, node), and forwards to the aggregation layer.
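The parse-and-enrich step an agent performs can be sketched in a few lines. This is not how Fluent Bit or any other agent is implemented; `enrich` and the `kubernetes` and `parse_error` keys are illustrative assumptions that show the shape of the transformation.

```python
import json


def enrich(raw_line: str, pod: str, namespace: str, node: str) -> dict:
    """Parse one container log line and attach Kubernetes metadata."""
    try:
        entry = json.loads(raw_line)
        if not isinstance(entry, dict):
            raise ValueError("not a JSON object")
    except ValueError:
        # Non-JSON output (stack traces, third-party libraries) is kept as plain text
        # rather than dropped, and flagged so it can be found later.
        entry = {"message": raw_line.rstrip("\n"), "parse_error": True}
    entry["kubernetes"] = {"pod": pod, "namespace": namespace, "node": node}
    return entry


print(enrich('{"level": "info", "message": "Order created"}',
             pod="order-5c9f-abcde", namespace="shop", node="node-3"))
```

Keeping unparseable lines instead of discarding them matters in practice: the line that fails to parse is often the stack trace you are looking for.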
Fluent Bit is the lightweight choice for Kubernetes. It uses ~15 MB of memory and handles thousands of log lines per second. Fluentd is its bigger sibling with more plugins.
Aggregation and Buffering
Kafka between agents and storage serves two purposes:
- Buffering: if the storage backend (Elasticsearch) goes down or is slow, Kafka absorbs the backlog. No logs are lost as long as the outage is shorter than the topic's retention period.
- Fan-out: multiple consumers can read the same log stream. Send to Elasticsearch for search AND to S3 for archival AND to a fraud detection pipeline.
This isn't always necessary. For smaller deployments, agents can ship directly to storage.
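Both properties follow from one design: an append-only log where each consumer tracks its own read offset. The toy model below is an in-memory stand-in, not the Kafka client API; `LogTopic`, `poll`, and `lag` are hypothetical names chosen for this sketch.

```python
class LogTopic:
    """In-memory stand-in for a topic: append-only, with per-consumer offsets."""

    def __init__(self) -> None:
        self._records: list[str] = []
        self._offsets: dict[str, int] = {}

    def append(self, record: str) -> None:
        self._records.append(record)

    def poll(self, consumer: str, max_records: int = 100) -> list[str]:
        """Return the next unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer: str) -> int:
        """Number of records this consumer has not read yet."""
        return len(self._records) - self._offsets.get(consumer, 0)


topic = LogTopic()
for i in range(3):
    topic.append(f'{{"message": "line {i}"}}')

archived = topic.poll("s3-archiver")      # fan-out: this consumer keeps up
backlog = topic.lag("elasticsearch")      # buffering: this one is down, its backlog waits
print(len(archived), backlog)
```

When the slow consumer recovers, it resumes from its own offset and reads the backlog without affecting the other consumers.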
Storage and Search
Two dominant options with very different philosophies.
ELK Stack: The Full-Text Search Approach
Elasticsearch + Logstash + Kibana — the classic log management stack.
Applications → Filebeat (collect) → Logstash (parse, enrich) → Elasticsearch (index, store) → Kibana (search, visualize)
Elasticsearch indexes every field in every log line. This means you can search for any value in any field, with full-text search, fuzzy matching, and aggregations.
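# Kibana query (KQL)
service: "payment-service" AND level: "error" AND message: "timeout"

# Elasticsearch query
# (the term clauses assume service and level are mapped as keyword fields;
#  on analyzed text fields an exact term match on "payment-service" would miss)
GET logs-2024.01.15/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "payment-service"}},
        {"term": {"level": "error"}},
        {"match": {"message": "timeout"}}
      ]
    }
  }
}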
The problem: Elasticsearch is expensive. It indexes everything, which means every log line consumes 5-10x its original size in storage. At 100 GB/day of raw logs, you need 500 GB - 1 TB/day of Elasticsearch storage. With 30-day retention, that's 15-30 TB.
Elasticsearch also needs tuning: shard sizing, replica count, JVM heap, merge policies, ILM (Index Lifecycle Management). Running it well is a full-time job.
Grafana Loki: The Label-Only Approach
Grafana built Loki as "Prometheus, but for logs." Instead of indexing log content, Loki indexes only the labels (service, level, namespace). The actual log content is stored compressed in object storage (S3) and only searched when you query it.
# LogQL (Loki's query language)
{service="payment-service", level="error"} |= "timeout"
# This:
# 1. Finds chunks with matching labels (indexed, fast)
# 2. Scans log content for "timeout" (grep, slower)
Why Loki is cheaper: no full-text index. Log content is compressed and stored in object storage such as S3 (about $0.023 per GB per month at S3 Standard list pricing). Only labels are indexed. The trade-off: content searches are slower because Loki must scan through compressed chunks. But label-based filtering narrows the scan dramatically.
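The two-stage lookup can be modelled in a few lines. This is a conceptual sketch, not Loki's actual storage format: `make_chunk` and `query` are hypothetical helpers, and gzip stands in for whatever compression the real chunks use.

```python
import gzip


def make_chunk(labels: dict, lines: list[str]) -> dict:
    """A chunk is a label set plus compressed log lines (stand-in for an object in S3)."""
    return {"labels": labels, "data": gzip.compress("\n".join(lines).encode())}


def query(chunks: list[dict], selector: dict, needle: str) -> list[str]:
    """Mimic a label selector followed by a |= line filter."""
    matches = []
    for chunk in chunks:
        # Stage 1: cheap label comparison, no decompression needed.
        if any(chunk["labels"].get(k) != v for k, v in selector.items()):
            continue
        # Stage 2: decompress and scan line by line (the slow part).
        for line in gzip.decompress(chunk["data"]).decode().split("\n"):
            if needle in line:
                matches.append(line)
    return matches


chunks = [
    make_chunk({"service": "payment-service", "level": "error"},
               ["Stripe timeout after 2400ms", "card declined"]),
    make_chunk({"service": "order-service", "level": "error"},
               ["inventory timeout"]),
]
print(query(chunks, {"service": "payment-service", "level": "error"}, "timeout"))
```

The second chunk is never decompressed, which is why a selective label set matters more for query speed than the text filter does.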
Loki vs Elasticsearch:
| Aspect | Elasticsearch | Grafana Loki |
|---|---|---|
| Storage cost | 5-10x raw size (indexes) | ~1.5x raw size (compressed) |
| Full-text search | Milliseconds | Seconds (scans chunks) |
| Label filtering | Milliseconds | Milliseconds |
| Operations | Complex (JVM tuning, shards) | Simple (S3 backend) |
| Ecosystem | Kibana, mature plugins | Grafana (same as metrics) |
| Best for | Security/compliance (need fast search) | Cost-sensitive, Grafana users |
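
The storage row is easiest to appreciate with the article's own numbers plugged in. The sketch below only does the arithmetic, using the 5-10x and ~1.5x expansion factors from the table and the 100 GB/day, 30-day scenario; `storage_tb` is a hypothetical helper, and 1 TB is treated as 1000 GB to match the figures in the text.

```python
def storage_tb(raw_gb_per_day: float, expansion: float, retention_days: int) -> float:
    """Total stored volume in TB for a raw ingest rate and an expansion factor."""
    return raw_gb_per_day * expansion * retention_days / 1000


RAW_GB_PER_DAY = 100
RETENTION_DAYS = 30
S3_USD_PER_GB_MONTH = 0.023  # the S3 price quoted in the text

es_low = storage_tb(RAW_GB_PER_DAY, 5, RETENTION_DAYS)     # 15.0 TB
es_high = storage_tb(RAW_GB_PER_DAY, 10, RETENTION_DAYS)   # 30.0 TB
loki = storage_tb(RAW_GB_PER_DAY, 1.5, RETENTION_DAYS)     # 4.5 TB

# Object storage bill for the Loki volume at steady state, about 103.5 USD per month.
loki_monthly_usd = loki * 1000 * S3_USD_PER_GB_MONTH

print(es_low, es_high, loki, round(loki_monthly_usd, 2))
```

This covers storage volume only. Compute, replicas, and operator time are not modelled, and for Elasticsearch they tend to be a significant part of the total.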
Spicy opinion: most teams should default to Loki over Elasticsearch for application logs. The cost difference is 5-10x, and the query speed difference only matters for security incident investigation — which most teams do rarely. Save Elasticsearch for security teams and audit logs. Use Loki for everything else.
Alerting Pipeline
Metrics fire the alerts. But the pipeline between "alert fires" and "human gets paged" is its own architecture.
Prometheus → Alertmanager → Notification Channel
Prometheus evaluates rules every 15s:
HighErrorRate expr = true?
│
├── First time true → State: PENDING
│ (wait for `for: 5m`)
│
├── Still true after 5 minutes → State: FIRING
│ Send to Alertmanager
│
└── Becomes false → State: RESOLVED
Alertmanager sends recovery notification
Alertmanager:
├── Deduplicate (same alert from multiple Prometheus)
├── Group (batch related alerts into one notification)
├── Silence (suppress during maintenance)
├── Inhibit (suppress lower-severity if higher fires)
└── Route to notification channel
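Deduplication and grouping are the two steps that most directly cut notification volume. The sketch below shows the idea only; it is not Alertmanager's implementation, and `group_alerts` plus the label-tuple fingerprint are assumptions made for illustration.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: tuple[str, ...]) -> dict[tuple, list[dict]]:
    """Drop alerts with identical label sets, then bucket the rest by the chosen labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue  # same alert reported again, e.g. by a second Prometheus replica
        seen.add(fingerprint)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"labels": {"alertname": "HighErrorRate", "service": "payment", "severity": "critical"}},
    {"labels": {"alertname": "HighErrorRate", "service": "payment", "severity": "critical"}},
    {"labels": {"alertname": "HighLatency", "service": "payment", "severity": "warning"}},
    {"labels": {"alertname": "HighErrorRate", "service": "orders", "severity": "critical"}},
]
for key, members in group_alerts(incoming, ("service",)).items():
    print(key, [a["labels"]["alertname"] for a in members])
```

Four incoming alerts become two notifications, one per service, each listing everything that is wrong with that service.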
Alert States
INACTIVE ──(condition true)──▶ PENDING ──(true for `for` duration)──▶ FIRING
PENDING ──(condition false)──▶ INACTIVE
FIRING ──(condition false)──▶ RESOLVED ──▶ INACTIVE
The PENDING state is the key innovation. Without it, every 30-second blip would fire an alert. With a `for: 5m` clause, the condition must be true at every evaluation for 5 minutes before anyone gets paged. This suppresses most flapping: alerts that fire and resolve repeatedly.
# Example: only alert if error rate stays high for 5 minutes
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
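The effect of the `for` clause is easy to see in a small simulation. This is a simplified model of the state transitions described above, not Prometheus code; `AlertRule` and `evaluate` are hypothetical names, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Tracks one alert through inactive -> pending -> firing."""
    for_seconds: float
    state: str = "inactive"
    active_since: Optional[float] = None

    def evaluate(self, condition: bool, now: float) -> str:
        if not condition:
            # Any false evaluation resets the timer.
            self.state = "inactive"
            self.active_since = None
        elif self.active_since is None:
            self.state = "pending"
            self.active_since = now
        if condition and now - self.active_since >= self.for_seconds:
            self.state = "firing"
        return self.state


rule = AlertRule(for_seconds=300)

# A 30-second blip: true for three evaluations, then false. Nobody is paged.
print([rule.evaluate(c, t) for t, c in [(0, True), (15, True), (30, True), (45, False)]])

# A sustained problem: still true 5 minutes after it started, so the alert fires.
print([rule.evaluate(True, t) for t in (60, 200, 360)])
```

The cost of the clause is detection delay: a real outage is also reported 5 minutes late, which is why the duration is usually shorter for critical alerts than for warnings.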
Alert Routing: Severity to Channel
Not all alerts deserve the same response.
# alertmanager.yml
route:
  receiver: default-slack
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      # P1: pages the on-call engineer immediately
    - match:
        severity: warning
      receiver: slack-engineering
      # P2: Slack channel, needs attention today
    - match:
        severity: info
      receiver: email-weekly
      # P3: informational, batch into weekly digest
| Severity | Response Time | Channel | Example |
|---|---|---|---|
| P1 / Critical | Minutes | PagerDuty (pages on-call) | Database is down, error rate > 10% |
| P2 / Warning | Hours | Slack channel | Disk at 80%, p99 latency elevated |
| P3 / Info | Days | Email digest | Certificate expiring in 30 days |
Routing rules to live by:
- P1 alerts must be actionable right now. If you can't do anything about it at 3 AM, it's not a P1.
- P2 alerts must be actionable today. If it can wait until next sprint, it's not a P2.
- P3 alerts are FYI. If nobody reads them, delete them.
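The routing tree from the alertmanager.yml example reduces to a first-match lookup with a default. The sketch below mirrors that configuration in plain Python to make the matching order explicit; `ROUTES` and `route` are illustrative names, and real Alertmanager routing supports more (nested routes, regex matchers, `continue`).

```python
# Receivers taken from the alertmanager.yml example above.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "warning"}, "slack-engineering"),
    ({"severity": "info"}, "email-weekly"),
]
DEFAULT_RECEIVER = "default-slack"


def route(labels: dict) -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


print(route({"alertname": "HighErrorRate", "severity": "critical"}))
print(route({"alertname": "CertExpiry", "severity": "info"}))
print(route({"alertname": "Unlabelled"}))
```

The default receiver is worth checking occasionally: alerts that land there are usually missing a severity label, which means nobody decided how urgent they are.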
Alert Fatigue: The Real Enemy
Alert fatigue kills on-call teams. When engineers get 50 alerts per day, they start ignoring all of them — including the real ones. PagerDuty's own research shows that teams with more than 10 alerts per on-call shift have significantly higher incident response times.
Causes of alert fatigue:
- Too many alerts: every metric has an alert. Most are noise.
- Flapping alerts: alert fires, resolves, fires, resolves. Use a `for` clause.
- Non-actionable alerts: "CPU is 85%". What am I supposed to do about it? Replace it with an actionable alert: "Request latency exceeding SLO."
- Duplicate alerts: same problem triggers 5 different alerts. Use Alertmanager grouping.
- Bad thresholds: set during calm periods, fires constantly under normal load.
Fixes:
- Delete noisy alerts. Seriously. If an alert fires weekly and nobody investigates, delete it. It's worse than useless — it trains the team to ignore alerts.
- SLO-based alerting. Instead of "CPU > 80%", alert on "error budget burn rate too high." This ties alerts to user impact, not infrastructure metrics.
- Weekly alert review. Every week, review which alerts fired. Delete the noisy ones. Tune the thresholds. This is a cultural practice, not a technical one.
Spotify runs a weekly "alert hygiene" meeting where teams review and prune alerts. The result: their on-call engineers get fewer than 5 actionable alerts per week.
SLO-Based Alerting
This is the modern approach. Instead of alerting on resource-level causes ("CPU high," "memory usage above 90%"), alert on user-facing objectives.
SLO: 99.9% of requests complete under 200ms
Measurement:
  error_budget = 1 - 0.999 = 0.001 (0.1% of requests can be slow)
  30-day budget = 0.001 × 30 × 24 × 60 = 43.2 minutes (equivalent to 43.2 minutes of total outage)
Alert when:
  burn_rate > 14x → will exhaust 30-day budget in ~2 days → P1
  burn_rate > 6x → will exhaust 30-day budget in ~5 days → P1
  burn_rate > 3x → will exhaust 30-day budget in ~10 days → P2
  burn_rate > 1x → will exhaust 30-day budget in 30 days → P3
Google's Site Reliability Workbook (chapter 5, "Alerting on SLOs") describes this multi-window burn rate alerting. It's the gold standard.
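The burn-rate arithmetic is short enough to write out. The sketch below uses the thresholds listed above and a single measurement window for simplicity (a multi-window setup would evaluate the same formula over a long and a short window and require both to exceed the threshold); `burn_rate`, `days_to_exhaustion`, and `severity` are hypothetical helper names.

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    return bad_fraction / (1 - slo_target)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is gone if the current burn rate continues."""
    return window_days / rate


def severity(rate: float) -> str:
    """Map a burn rate to the priorities used in the list above."""
    if rate > 6:
        return "P1"
    if rate > 3:
        return "P2"
    if rate > 1:
        return "P3"
    return "ok"


# 0.8% of requests are currently slower than 200ms against a 99.9% SLO.
rate = burn_rate(bad_fraction=0.008, slo_target=0.999)
print(round(rate, 1), round(days_to_exhaustion(rate), 2), severity(rate))
```

A burn rate of 1 means the budget lasts exactly the 30-day window, so anything above 1 sustained for long enough will eventually violate the SLO.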
Why this is better: a brief CPU spike doesn't fire an alert unless it actually affects users. And a slow degradation that users barely notice but will eventually exhaust the error budget does fire an alert — before it becomes a problem.
The Three Pillars of Observability
Metrics, traces, and logs are called the three pillars. Each answers a different question.
METRICS: "What is happening?"
→ Error rate is 5%. p99 latency is 2.3 seconds.
→ USE: dashboard, alerting
TRACES: "Where is it happening?"
→ The payment service call to Stripe takes 2.4 seconds.
→ USE: latency debugging, dependency mapping
LOGS: "Why is it happening?"
→ Stripe returned 429 Too Many Requests. Retry limit reached.
→ USE: root cause analysis, audit trail
They're connected via trace IDs:
Alert fires (metric): error rate > 5%
→ Look at Grafana dashboard (metric): error rate spiked at 14:32
→ Find exemplar trace_id from metric: abc123def456
→ Open trace (Jaeger): payment-service span shows 2.4s timeout
→ Search logs (Loki): {service="payment-service"} | json | trace_id="abc123def456" → Stripe 429 error
→ Root cause found: Stripe rate limit. Solution: add retry with backoff.
This flow — from alert to metric to trace to log to root cause — is the observability workflow that every on-call engineer should know. It only works when all three pillars share the trace ID.
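The last hop of that workflow, from trace ID to log lines, is a plain filter once every service logs the same field. The sketch below runs over in-memory JSON lines rather than a real log store; `logs_for_trace` is a hypothetical helper, and sorting by the timestamp string assumes all entries use the same ISO 8601 format and timezone.

```python
import json


def logs_for_trace(lines: list[str], trace_id: str) -> list[dict]:
    """Collect every entry carrying the trace ID, across services, in time order."""
    entries = []
    for line in lines:
        entry = json.loads(line)
        if entry.get("trace_id") == trace_id:
            entries.append(entry)
    return sorted(entries, key=lambda e: e["timestamp"])


lines = [
    '{"timestamp": "2024-01-15T14:32:05.123Z", "service": "payment-service", '
    '"trace_id": "abc123def456", "message": "Failed to charge user"}',
    '{"timestamp": "2024-01-15T14:32:02.700Z", "service": "order-service", '
    '"trace_id": "abc123def456", "message": "Order created"}',
    '{"timestamp": "2024-01-15T14:32:03.000Z", "service": "order-service", '
    '"trace_id": "zzz999", "message": "Order created"}',
]
for entry in logs_for_trace(lines, "abc123def456"):
    print(entry["timestamp"], entry["service"], entry["message"])
```

An entry without a `trace_id` simply never shows up in this view, which is the practical reason the field belongs in the minimum set.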
Patterns for System Design Interviews
Pattern 1: "How do you monitor this system?" "Metrics for alerting (Prometheus + Grafana), traces for debugging latency (OpenTelemetry + Jaeger), structured logs for root cause analysis (Loki). All linked by trace IDs."
Pattern 2: "How do you handle alerting?" "Prometheus evaluates alert rules. Alertmanager routes by severity — P1 pages on-call via PagerDuty, P2 goes to Slack, P3 goes to email. We use SLO-based burn rate alerts to focus on user impact."
Pattern 3: "How do you handle log volume at scale?" "Structured JSON logs. Collected by Fluent Bit, buffered through Kafka, stored in Loki (labels indexed, content in S3). 30-day retention. Full-text search for security-critical logs in Elasticsearch."
Pattern 4: "How do you know when something is wrong?" "We define SLOs (99.9% of requests under 200ms). We measure error budget burn rate. When the burn rate exceeds thresholds, Alertmanager pages the on-call team with the affected service and suggested runbook."
Trade-offs Table
| Dimension | ELK Stack | Grafana Loki | Datadog Logs |
|---|---|---|---|
| Cost | High (storage + ops) | Low (S3 backend) | Very high (per GB ingested) |
| Full-text search | Fast (indexed) | Slow (scans) | Fast (indexed) |
| Operational burden | High (JVM, shards, ILM) | Low (S3 + single binary) | None (SaaS) |
| Ecosystem | Kibana, Beats, Logstash | Grafana (unified with metrics) | Datadog platform |
| Best for | Security, compliance | Cost-sensitive, Grafana users | Teams that buy over build |

Interview Gotchas
Gotcha 1: "Why not just use console.log everywhere?"
Unstructured text is unqueryable at scale. You can't aggregate, filter, or alert on console.log output across 500 pods. Structured logging with proper fields is what enables observability.
Gotcha 2: "How do you handle sensitive data in logs?" Scrub PII (emails, passwords, credit cards) before logs leave the application. Use a log pipeline filter to redact patterns as a second line of defense. Never log passwords or API keys, even in debug mode. Regulations such as GDPR and HIPAA restrict where personal and health data may be stored, and logs are in scope.
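An application-side scrubber might look like the sketch below. It is a heuristic, not a compliance tool: the `EMAIL` and `CARD` patterns, the `SENSITIVE_KEYS` set, and the `redact` function are all assumptions made for illustration, and the card pattern will also match other long digit runs.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
SENSITIVE_KEYS = {"password", "api_key", "authorization"}


def redact(entry: dict) -> dict:
    """Return a copy of a log entry with sensitive keys and patterns masked."""
    clean = {}
    for key, value in entry.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"  # never emit the value, whatever it contains
        elif isinstance(value, str):
            value = EMAIL.sub("[EMAIL]", value)
            clean[key] = CARD.sub("[CARD]", value)
        else:
            clean[key] = value  # numbers and booleans pass through unchanged
    return clean


print(redact({
    "message": "charge failed for alice@example.com card 4111 1111 1111 1111 declined",
    "password": "hunter2",
    "duration_ms": 2400,
}))
```

Running this inside the application, before the entry is serialized, means the raw value never reaches stdout, the agent, or any buffer in between.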
Gotcha 3: "What's the difference between monitoring and alerting?" Monitoring is passive observation (dashboards, metrics). Alerting is active notification (pages, Slack messages). Monitoring tells you something is wrong when you look. Alerting tells you something is wrong whether you're looking or not.
Gotcha 4: "How do you prevent alert fatigue?"
Use a `for` clause to prevent flapping. Route by severity. Delete noisy alerts weekly. Switch to SLO-based alerting. An on-call engineer should get no more than 5 to 10 actionable alerts per shift.
Gotcha 5: "Do you need all three observability pillars?" For a monolith, metrics and logs are usually enough. For microservices, you need traces too — otherwise you can't debug cross-service latency. The more distributed your system, the more you need all three pillars working together.