Distributed Tracing — Jaeger, Zipkin, and the Dapper Model

TL;DR

When a request touches ten services and is slow, metrics tell you something is wrong, logs tell you what each service saw, but traces tell you where the latency actually lives — and that's why every serious microservices deployment needs distributed tracing.

The Problem Traces Solve

A user clicks "Place Order." The request hits the API gateway, routes to the order service, which calls the payment service, which calls Stripe, while the order service also calls the inventory service, which queries DynamoDB, and then the notification service sends a confirmation email.

The request takes 3 seconds. Where's the delay?

With metrics alone, you know the order service's p99 is high. But you don't know if this request is slow because of the payment service, the inventory service, or the notification service. You'd have to check each service's metrics individually and guess.

With logs alone, you'd grep through logs across all services for this request's ID — assuming every service logs the same request ID, which they probably don't.

With a trace, you see one picture:

[API Gateway] ─── 5ms ────┐
                           │
[Order Service] ─ 2850ms ──┤
    │                      │
    ├── [Payment Service] ─ 2500ms ───┐
    │       └── [Stripe API] ─ 2400ms │  ← THE BOTTLENECK
    │                                  │
    ├── [Inventory Service] ─ 45ms     │
    │       └── [DynamoDB] ─ 12ms      │
    │                                  │
    └── [Notification Service] ─ 200ms │
            └── [SES] ─ 180ms          │
                                       │
Total: 3100ms. Stripe API is the problem.

That's a trace. A tree of timed operations across services.

Traces, Spans, and Context

Three concepts. That's all you need.

Trace

A trace represents one end-to-end request. It has a unique trace_id, typically a random 128-bit value written as 32 hex characters (the JSON examples here use shortened IDs for readability). Every service involved in handling that request shares this trace ID.

Span

A span represents one operation within one service. It has:

span_id — unique identifier for this span
trace_id — which trace it belongs to
parent_span_id — which span called this one (null for the root span)
operation_name — what the span represents ("POST /orders", "query DynamoDB")
start_time and duration — when it started and how long it took
tags — key-value metadata (http.status_code, db.statement, error=true)
logs — timestamped events within the span ("retry attempt 2", "cache miss")

{
  "trace_id": "abc123def456",
  "span_id": "span-001",
  "parent_span_id": null,
  "operation": "POST /api/orders",
  "service": "api-gateway",
  "start_time": "2024-01-15T14:32:05.123Z",
  "duration_ms": 3100,
  "tags": {
    "http.method": "POST",
    "http.status_code": 201,
    "user.id": "user-42"
  }
}
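
To make the parent/child relationship concrete, here is a minimal sketch that rebuilds the span tree from parent_span_id links and finds the span with the most "self time" (duration not explained by its direct children). The Span class, the helper names, and the operation names are illustrative assumptions; the durations mirror the order example from earlier in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    service: str
    operation: str
    duration_ms: int


def build_children(spans):
    """Group spans by parent. Returns (roots, orphans, children_by_span_id)."""
    by_id = {s.span_id: s for s in spans}
    children = {s.span_id: [] for s in spans}
    roots, orphans = [], []
    for s in spans:
        if s.parent_span_id is None:
            roots.append(s)
        elif s.parent_span_id in by_id:
            children[s.parent_span_id].append(s)
        else:
            orphans.append(s)  # parent never arrived: broken propagation
    return roots, orphans, children


def self_time_ms(span, children):
    """Duration minus direct children, floored at 0 because children may overlap."""
    child_total = sum(c.duration_ms for c in children[span.span_id])
    return max(0, span.duration_ms - child_total)


# Durations taken from the "Place Order" example; operation names are made up.
spans = [
    Span("s1", None, "api-gateway", "POST /api/orders", 3100),
    Span("s2", "s1", "order-service", "create order", 2850),
    Span("s3", "s2", "payment-service", "charge", 2500),
    Span("s4", "s3", "stripe", "POST /v1/charges", 2400),
    Span("s5", "s2", "inventory-service", "reserve", 45),
    Span("s6", "s5", "dynamodb", "query", 12),
    Span("s7", "s2", "notification-service", "send email", 200),
    Span("s8", "s7", "ses", "SendEmail", 180),
]

roots, orphans, children = build_children(spans)
slowest = max(spans, key=lambda s: self_time_ms(s, children))
print(slowest.service, self_time_ms(slowest, children))
```

Self time is a rough heuristic, since it ignores parallel children, but it is usually enough to point at the span worth opening first.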

Trace Context Propagation

This is the mechanism that ties it all together. When Service A calls Service B, it must pass the trace ID and parent span ID along with the request. Otherwise, Service B creates an orphan span with no connection to the trace.

The W3C Trace Context standard defines two HTTP headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  trace_id (32 hex chars)          │                trace flags (01 = sampled)
             version                             parent span_id (16 hex chars)

tracestate: vendor-specific data (optional)
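
A real traceparent value uses fixed-width lowercase hex fields: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags. The sketch below parses and re-emits that layout for version 00 only. The function and class names are hypothetical; in production you would let a tracing library handle this.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_V00.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the W3C spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return TraceContext(trace_id, parent_id, sampled)


def format_traceparent(ctx: TraceContext, span_id: str) -> str:
    """Build the header for an outgoing call made from the span `span_id`."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = format_traceparent(incoming, "b7ad6b7169203331")
```

Note that the trace ID and the sampled flag pass through unchanged, while the span ID field is replaced with the ID of the span making the outgoing call.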

Every HTTP client and server in your stack must:

Extract the trace context from incoming request headers
Create a new child span
Inject the trace context into outgoing request headers

# Simplified: instrumentation libraries normally do this for you
import requests
from opentelemetry import propagate, trace

tracer = trace.get_tracer(__name__)

def handle_request(request, order_id):
    # Extract the parent context from the incoming headers
    parent_ctx = propagate.extract(request.headers)

    # Create a new span as a child of the extracted parent
    with tracer.start_as_current_span("process-order", context=parent_ctx) as span:
        span.set_attribute("order.id", order_id)

        # Call the downstream service with the current context injected into the headers
        headers = {}
        propagate.inject(headers)
        return requests.post(
            "http://payment-service/charge",
            headers=headers
        )

In practice, you rarely write this code by hand. OpenTelemetry instrumentation libraries hook into HTTP clients, gRPC, database drivers, and message queues and do it for you.

The Dapper Model

Google published the Dapper paper in 2010. It established the tracing model that every modern tracing system follows.

Key ideas from Dapper:

1. Sampling. You don't trace every request. Google traced about 1 in 1,024 requests. The overhead of tracing every request (creating spans, serializing data, transmitting to collectors) is too high for production.

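2. Low overhead. Tracing must be invisible to users. At production sampling rates, the paper reports latency and throughput penalties within experimental error, and trace data made up less than 0.01% of production network traffic.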

3. Application-level transparency. Developers don't instrument every function. The tracing library hooks into the RPC framework (Stubby at Google, gRPC in the open-source world) and auto-instruments every cross-service call.

4. Offline analysis. Spans are collected asynchronously. The tracing backend processes and indexes them in bulk. You query traces after the fact, not in real-time.

Spicy opinion: Google's Dapper paper is the single most influential paper in the observability space. Jaeger, Zipkin, AWS X-Ray, and Datadog APM are all Dapper clones with different storage backends. Read the paper — it's 14 pages and surprisingly readable.

Jaeger: The Kubernetes-Native Tracer

Uber built Jaeger in 2015 to debug latency issues in their microservices architecture. It's now a CNCF graduated project, the same maturity level as Kubernetes and Prometheus.

Architecture

┌──────────────────────────────────────────────┐
│ Application (instrumented with OpenTelemetry)│
│  └── Spans sent via OTLP to...              │
├──────────────────────────────────────────────┤
│ Jaeger Collector                             │
│  ├── Receives spans                          │
│  ├── Validates and indexes                   │
│  └── Writes to storage backend              │
├──────────────────────────────────────────────┤
│ Storage Backend                              │
│  ├── Cassandra (scalable, battle-tested)     │
│  ├── Elasticsearch (full-text search on tags)│
│  ├── Kafka (as buffer before storage)        │
│  └── Badger (embedded, for local dev)        │
├──────────────────────────────────────────────┤
│ Jaeger Query (UI + API)                      │
│  └── Search traces by service, operation,    │
│      duration, tags                          │
└──────────────────────────────────────────────┘

Key Jaeger features:

Adaptive sampling: automatically adjusts sampling rates based on traffic volume per endpoint.
Service dependency graph: auto-generated from traces, shows which services call which.
Comparison: compare two traces side-by-side to spot differences.
Kubernetes-native: Helm charts, operators, and auto-injection via sidecars.
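
The dependency graph falls out of the span data almost for free: every parent/child link that crosses a service boundary is an edge. The following is a rough sketch of that idea, not Jaeger's actual implementation; the span dictionaries and the service_edges name are assumptions for illustration.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee service pairs from parent/child span links."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent_service = service_of.get(s["parent_span_id"])
        # Skip root spans, unknown parents, and spans internal to one service.
        if parent_service and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges


spans = [
    {"span_id": "s1", "parent_span_id": None, "service": "api-gateway"},
    {"span_id": "s2", "parent_span_id": "s1", "service": "order-service"},
    {"span_id": "s3", "parent_span_id": "s2", "service": "payment-service"},
    {"span_id": "s4", "parent_span_id": "s2", "service": "inventory-service"},
    {"span_id": "s5", "parent_span_id": "s2", "service": "order-service"},
]

edges = service_edges(spans)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

Aggregating these counts over many traces gives call volumes per edge, which is what a dependency view typically displays.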

When Jaeger

Kubernetes environment
Already using Cassandra or Elasticsearch
Need CNCF-grade project with active community
Want to self-host your tracing infrastructure

Zipkin: The Original Open-Source Tracer

Twitter built Zipkin in 2012, inspired directly by the Dapper paper. It's the oldest open-source distributed tracer and the simplest to get started with.

Application ──HTTP/Kafka──▶ Zipkin Collector ──▶ Storage ──▶ Zipkin UI
                                                   │
                                          MySQL / Cassandra /
                                          Elasticsearch / In-memory

Zipkin vs Jaeger:

Feature	Zipkin	Jaeger
Origin	Twitter (2012)	Uber (2015)
Language	Java	Go
CNCF status	Not CNCF	Graduated
Sampling	Fixed rate	Adaptive
Storage	MySQL, Cassandra, ES, in-memory	Cassandra, ES, Kafka, Badger
Kubernetes support	Basic	Native (operators, Helm)
Complexity	Simpler	More features, more setup

When Zipkin: smaller deployments, teams that want simplicity, Java-heavy stacks.

When Jaeger: Kubernetes-native, larger scale, need adaptive sampling.

OpenTelemetry: The Standard That Matters

OpenTelemetry (OTel) is the convergence of OpenTracing and OpenCensus — two competing instrumentation standards that merged in 2019. It provides vendor-neutral SDKs for metrics, traces, and logs.

Why OpenTelemetry matters: instrument once, send to any backend.

Application + OpenTelemetry SDK
        │
        │ OTLP (OpenTelemetry Protocol)
        ▼
  OTel Collector
    ├── Export to Jaeger
    ├── Export to Zipkin
    ├── Export to Datadog
    ├── Export to Grafana Tempo
    └── Export to AWS X-Ray

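# Python: auto-instrumentation with OpenTelemetry
# pip install flask requests opentelemetry-sdk
# pip install opentelemetry-instrumentation-flask
# pip install opentelemetry-instrumentation-requests
# pip install opentelemetry-exporter-otlp

from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup (usually in app startup); "order-service" is an example service name
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(
    # insecure=True assumes the collector exposes a plaintext gRPC listener
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# Turn on auto-instrumentation for Flask routes and outgoing requests calls
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# No manual span creation needed for standard operations.
# Database drivers need their own instrumentation package.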

The OTel Collector is a standalone process that receives, processes, and exports telemetry data. It can:

Batch spans for efficient export
Add resource attributes (Kubernetes pod name, region)
Sample (head or tail)
Convert between formats (Zipkin ↔ Jaeger ↔ OTLP)
Route to multiple backends simultaneously

Spicy opinion: if you're starting a new project, use OpenTelemetry from day one. Don't start with a vendor SDK (Datadog, New Relic), because you'll be locked in. OTel gives you freedom to switch backends without re-instrumenting your code. The tracing SDKs are stable for the major languages; metrics and logs support is less even, so check the status for your language.

Sampling Strategies

Tracing every request is expensive. Storage adds up fast — a moderately busy service might generate 100 GB of trace data per day at 100% sampling. You need to sample.

Head-Based Sampling

Decide whether to trace a request at the beginning (the "head").

Request arrives at API Gateway:
  random() < 0.01?  → Yes: trace this request (set sampled=true)
                     → No: don't trace (set sampled=false)

The sampling decision propagates through all services via
the trace context header. All services honor the decision.

Pros: simple, predictable, low overhead. Cons: you might miss rare but interesting requests (errors, slow ones).
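
The pseudocode above rolls a random number. A common variation derives the decision from the trace ID itself, so any service computing it for the same trace reaches the same answer; this is similar in spirit to trace-ID-ratio samplers in tracing SDKs. A minimal sketch, where should_sample and new_trace_id are hypothetical names:

```python
import secrets


def new_trace_id() -> str:
    """Random 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def should_sample(trace_id_hex: str, rate: float) -> bool:
    """Deterministic head sampling: same trace ID, same decision."""
    # Treat the low 64 bits of the trace ID as a uniform number in [0, 2**64).
    value = int(trace_id_hex[-16:], 16)
    return value < int(rate * 2**64)


trace_id = new_trace_id()
sampled = should_sample(trace_id, 0.01)
# The flag that travels downstream in the traceparent header
flags = "01" if sampled else "00"
```

This only works if trace IDs are uniformly random, which is an assumption of the sketch. The decision is still made once at the head and carried downstream in the sampled flag.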

Tail-Based Sampling

Decide whether to keep a trace after all spans are collected. You buffer all spans temporarily, then keep only the interesting ones.

All spans collected → Tail sampling evaluator:
  - Duration > 2 seconds? → KEEP
  - Contains error? → KEEP
  - Status code 5xx? → KEEP
  - Random 1%? → KEEP
  - Otherwise → DROP

Pros: captures every error and slow request, regardless of sampling rate. Cons: requires buffering all spans temporarily (memory/storage cost), adds latency to trace availability, more complex infrastructure.

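The OpenTelemetry Collector supports tail-based sampling through the tail_sampling processor in its contrib distribution. All spans of a trace have to reach the same Collector instance for the decision to see the whole trace.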
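
The evaluator logic can be sketched in a few lines. This is an illustrative toy, not how any particular collector implements it: the TailSampler class, the span dictionary fields, and the idea that the caller knows when a trace is complete are all assumptions. Real systems usually wait a fixed time window after the first span arrives.

```python
import random
from collections import defaultdict


class TailSampler:
    """Buffers spans per trace, then keeps or drops the whole trace."""

    def __init__(self, slow_ms=2000, baseline_rate=0.01, rng=None):
        self.slow_ms = slow_ms
        self.baseline_rate = baseline_rate
        self._rng = rng or random.Random()
        self._buffer = defaultdict(list)

    def add_span(self, span):
        self._buffer[span["trace_id"]].append(span)

    def decide(self, trace_id):
        """Call once the trace is believed complete. Returns the spans to keep."""
        spans = self._buffer.pop(trace_id, [])
        if not spans:
            return []
        interesting = (
            max(s["duration_ms"] for s in spans) > self.slow_ms
            or any(s.get("error") for s in spans)
            or any(s.get("status_code", 0) >= 500 for s in spans)
        )
        # Keep every interesting trace, plus a random baseline of the rest.
        if interesting or self._rng.random() < self.baseline_rate:
            return spans
        return []


sampler = TailSampler(baseline_rate=0.0)  # baseline off to make the demo deterministic
sampler.add_span({"trace_id": "t1", "duration_ms": 2400})
sampler.add_span({"trace_id": "t2", "duration_ms": 30})
sampler.add_span({"trace_id": "t3", "duration_ms": 30, "error": True})
sampler.add_span({"trace_id": "t4", "duration_ms": 30, "status_code": 503})
kept = {t: len(sampler.decide(t)) for t in ("t1", "t2", "t3", "t4")}
```

The buffer is where the memory cost comes from: every span of every in-flight trace sits in it until the decision is made.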

Practical Sampling Rates

Environment	Rate	Reasoning
Dev/staging	100%	Low traffic, want to see everything
Production (low traffic)	10-50%	Enough traces to debug issues
Production (high traffic)	0.1-1%	At 100K req/s, 1% = 1,000 traces/s
Production (tail-based)	100% errors, 1% baseline	Best of both worlds

Google's Dapper traced ~0.1% of requests. Even at Google scale, that's millions of traces per day — more than enough for debugging.

Cost of Tracing

Tracing isn't free. Here's what it actually costs:

Compute overhead: creating spans, serializing them, and sending them to the collector. With OpenTelemetry, this is typically 1-3% CPU overhead at 100% sampling. At 1% sampling, it's negligible.

Network: each span is roughly 200-500 bytes. At 1,000 spans/second, that's 200-500 KB/s. Not a concern for most deployments.

Storage: this is the real cost. A span stored in Elasticsearch or Cassandra takes 1-3 KB (with indexing). At 1 million spans/hour, that's 1-3 GB/hour or 24-72 GB/day. With 30-day retention, you're looking at roughly 0.7-2.2 TB.
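
The storage arithmetic is easy to rerun with your own numbers. A small sketch, using decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes); the function name is made up for illustration:

```python
def storage_estimate_gb(spans_per_hour, bytes_per_span, retention_days):
    """Return (GB per day, GB retained) for a steady span rate."""
    gb_per_day = spans_per_hour * 24 * bytes_per_span / 1e9
    return gb_per_day, gb_per_day * retention_days


# 1 million spans/hour, 1-3 KB per stored span, 30-day retention
low_per_day, low_total = storage_estimate_gb(1_000_000, 1_000, 30)
high_per_day, high_total = storage_estimate_gb(1_000_000, 3_000, 30)

print(f"{low_per_day:.0f}-{high_per_day:.0f} GB/day")
print(f"{low_total:.0f}-{high_total:.0f} GB retained")
```

The estimate assumes a flat span rate and ignores replication, which multiplies the total by your replication factor.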

Grafana Tempo takes a different approach: it stores traces in object storage (S3) and keeps indexing to a minimum. It was designed around finding traces by ID (from logs or metric exemplars) rather than searching a full index, although newer releases add search through TraceQL. This can reduce storage costs 10-100x compared to Elasticsearch-backed solutions.

Connecting Traces to Metrics and Logs

Traces are most powerful when connected to metrics and logs.

Trace-to-Logs

Include the trace_id in every log line. When you find a slow trace, click through to the logs for that trace ID.
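
One way to get the IDs into every log line without touching each log call is a logging filter. The sketch below uses only the standard library; the current_trace context variable is a hypothetical stand-in for whatever your tracing SDK uses to hold the active span, and the hard-coded service name is an assumption.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace context; a tracing SDK would manage this.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copies the active trace and span IDs onto every log record."""

    def filter(self, record):
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id")
        record.span_id = ctx.get("span_id")
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "service": "payment-service",
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace.set({"trace_id": "abc123def456", "span_id": "span-003"})
logger.error("Stripe API timeout after 2400ms")
entry = json.loads(stream.getvalue())
```

Each log call then emits one JSON object that carries trace_id and span_id alongside the message.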

{"timestamp": "2024-01-15T14:32:05.123Z",
 "level": "error",
 "service": "payment-service",
 "trace_id": "abc123def456",
 "span_id": "span-003",
 "message": "Stripe API timeout after 2400ms",
 "stripe_request_id": "req_xyz789"}

Metrics-to-Traces

When a metric spikes (p99 latency jumps), drill down to exemplars — specific trace IDs associated with that spike.

Prometheus metric:
  http_request_duration_seconds{handler="/orders"} = 3.1
  → exemplar trace_id = "abc123def456"
  → click to open trace in Jaeger

Prometheus has supported exemplars since v2.26 (behind the exemplar-storage feature flag). Grafana renders them as dots on metric graphs: click a dot, jump to the trace.
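
Conceptually, an exemplar is just a sample trace ID remembered next to a histogram bucket. The toy histogram below illustrates the idea; it is not the Prometheus client API, the class and method names are invented, and its buckets hold per-bucket counts rather than the cumulative counts Prometheus exposes.

```python
import bisect


class ExemplarHistogram:
    """Histogram that remembers the latest trace ID observed in each bucket."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)
        self.exemplars = [None] * len(self.bounds)

    def observe(self, value, trace_id=None):
        # First bucket whose upper bound is >= value ("le" semantics)
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i] = (trace_id, value)

    def slow_trace_ids(self, threshold):
        """Trace IDs of exemplars whose observed value is at least `threshold`."""
        return [ex[0] for ex in self.exemplars if ex and ex[1] >= threshold]


hist = ExemplarHistogram([0.1, 0.5, 1.0, 5.0])
hist.observe(0.04, trace_id="trace-fast")
hist.observe(3.1, trace_id="abc123def456")
slow = hist.slow_trace_ids(1.0)
```

Storing one exemplar per bucket keeps the overhead constant no matter how many requests are observed, at the price of only ever seeing a handful of sample traces.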

Patterns for System Design Interviews

Pattern 1: "How do you debug latency in a microservices system?" "Distributed tracing. Each request gets a trace ID that propagates through all services. We can see the full request lifecycle and identify which service is the bottleneck."

Pattern 2: "How do you handle observability at scale?" "Metrics for alerting (Prometheus), traces for debugging latency (Jaeger/OTel), logs for root cause analysis. Trace IDs link all three. We sample at 1% to control costs."

Pattern 3: "How does trace context propagation work?" "Each service extracts the trace ID from incoming request headers, creates a child span, and injects the trace ID into outgoing requests. The W3C Trace Context standard defines the header format."

Trade-offs Table

Dimension	Advantage	Disadvantage
Debugging	See full request path across services	Only sampled requests are visible
Overhead	1-3% CPU at full sampling	Storage grows fast at high volume
Head sampling	Simple, predictable	Misses rare events (errors)
Tail sampling	Captures all errors/slow requests	Buffering cost, added complexity
Auto-instrumentation	No code changes for HTTP/gRPC	Misses custom business logic spans
OpenTelemetry	Vendor-neutral, future-proof	Still maturing in some languages
Grafana Tempo	Cheap (S3 storage, minimal indexing)	Search is limited compared with a fully indexed backend; lookup by ID is the primary path

Interview Gotchas

Gotcha 1: "Do you trace every request?" No. At scale, 1% sampling is typical. For errors and slow requests, use tail-based sampling to capture 100% of interesting traces while dropping routine ones.

Gotcha 2: "What's the difference between tracing and logging?" Logs are per-service, unstructured or semi-structured text. Traces are structured, cross-service, and show the causal relationship between operations. Logs tell you what happened within a service. Traces tell you what happened across services.

Gotcha 3: "What if a service doesn't propagate the trace context?" The trace breaks. Downstream spans become orphans with no parent. This is the #1 operational problem with tracing. Every service, proxy, and queue must propagate headers. Proxies like Envoy can generate spans and add trace headers for their own hop, but each application still has to copy the incoming trace headers onto its outgoing requests.

Gotcha 4: "Can tracing replace logging?" No. Traces show the structure and timing of a request. Logs capture detailed context (error messages, stack traces, business logic decisions). You need both. Link them with trace IDs.

Gotcha 5: "What about tracing through message queues?" The producer injects the trace context into the message headers. The consumer extracts it and creates a child span. Kafka and RabbitMQ both support this via OpenTelemetry instrumentations. The resulting trace shows the async gap between produce and consume.
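
The queue case from Gotcha 5 can be sketched with an in-process queue standing in for Kafka or RabbitMQ. The message shape, the produce and consume helpers, and the span dictionary are assumptions for illustration; real instrumentations write the same traceparent value into the broker's native message headers.

```python
import queue
import secrets


def new_span_id() -> str:
    """Random 64-bit span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def produce(q, payload, trace_id, span_id):
    """Attach the producer's trace context to the message headers."""
    headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    q.put({"headers": headers, "payload": payload})


def consume(q):
    """Read one message and start a child span under the producer's span."""
    msg = q.get()
    _version, trace_id, parent_id, _flags = msg["headers"]["traceparent"].split("-")
    return {
        "trace_id": trace_id,          # same trace as the producer
        "parent_span_id": parent_id,   # the producer's span
        "span_id": new_span_id(),
        "operation": f"process {msg['payload']}",
    }


q = queue.Queue()
produce(q, "order-created", "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
consumer_span = consume(q)
```

Because the consumer span may start long after the producer span ended, the gap between them in the trace view is the time the message spent waiting in the queue.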