Distributed Tracing — Jaeger, Zipkin, and the Dapper Model
TL;DR
When a request touches ten services and is slow, metrics tell you something is wrong, logs tell you what each service saw, but traces tell you where the latency actually lives — and that's why every serious microservices deployment needs distributed tracing.
The Problem Traces Solve

A user clicks "Place Order." The request hits the API gateway, routes to the order service, which calls the payment service, which calls Stripe, while the order service also calls the inventory service, which queries DynamoDB, and then the notification service sends a confirmation email.
The request takes 3 seconds. Where's the delay?
With metrics alone, you know the order service's p99 is high. But you don't know if this request is slow because of the payment service, the inventory service, or the notification service. You'd have to check each service's metrics individually and guess.
With logs alone, you'd grep through logs across all services for this request's ID — assuming every service logs the same request ID, which they probably don't.
With a trace, you see one picture:
[API Gateway] 5ms
[Order Service] 2850ms
├── [Payment Service] 2500ms
│   └── [Stripe API] 2400ms        ← THE BOTTLENECK
├── [Inventory Service] 45ms
│   └── [DynamoDB] 12ms
└── [Notification Service] 200ms
    └── [SES] 180ms
Total: 3100ms. Stripe API is the problem.
That's a trace. A tree of timed operations across services.
Traces, Spans, and Context
Three concepts. That's all you need.
Trace
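A trace represents one end-to-end request. It has a unique trace_id (in W3C Trace Context and OpenTelemetry, a random 128-bit value written as 32 hex characters; the short IDs in the examples below are abbreviated for readability). Every service involved in handling that request shares this trace ID.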
Span
A span represents one operation within one service. It has:
- span_id: unique identifier for this span
- trace_id: which trace it belongs to
- parent_span_id: which span called this one (null for the root span)
- operation_name: what the span represents ("POST /orders", "query DynamoDB")
- start_time and duration: when it started and how long it took
- tags: key-value metadata (http.status_code, db.statement, error=true)
- logs: timestamped events within the span ("retry attempt 2", "cache miss")
{
"trace_id": "abc123def456",
"span_id": "span-001",
"parent_span_id": null,
"operation": "POST /api/orders",
"service": "api-gateway",
"start_time": "2024-01-15T14:32:05.123Z",
"duration_ms": 3100,
"tags": {
"http.method": "POST",
"http.status_code": 201,
"user.id": "user-42"
}
}
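To make the span fields concrete, here is a minimal sketch of how a backend might turn a flat list of spans into the tree shown earlier and pick out the bottleneck. The span list, the IDs, and the helper names (`children_by_parent`, `self_time_ms`, `slowest_by_self_time`) are illustrative assumptions, and the self-time calculation assumes child spans run one after another.

```python
from collections import defaultdict

# Hypothetical spans modelled on the "Place Order" example; IDs are made up.
SPANS = [
    {"span_id": "s1", "parent_span_id": None, "operation": "POST /api/orders", "duration_ms": 3100},
    {"span_id": "s2", "parent_span_id": "s1", "operation": "order-service", "duration_ms": 2850},
    {"span_id": "s3", "parent_span_id": "s2", "operation": "payment-service", "duration_ms": 2500},
    {"span_id": "s4", "parent_span_id": "s3", "operation": "stripe-api", "duration_ms": 2400},
    {"span_id": "s5", "parent_span_id": "s2", "operation": "inventory-service", "duration_ms": 45},
    {"span_id": "s6", "parent_span_id": "s5", "operation": "dynamodb", "duration_ms": 12},
]


def children_by_parent(spans):
    """Index spans by parent_span_id so the tree can be walked top-down."""
    index = defaultdict(list)
    for span in spans:
        index[span["parent_span_id"]].append(span)
    return index


def self_time_ms(span, index):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run sequentially; overlapping children would need
    interval arithmetic on start times instead of a simple sum.
    """
    child_total = sum(child["duration_ms"] for child in index[span["span_id"]])
    return max(span["duration_ms"] - child_total, 0)


def slowest_by_self_time(spans):
    """Return the span where the most un-delegated time was spent."""
    index = children_by_parent(spans)
    return max(spans, key=lambda span: self_time_ms(span, index))


bottleneck = slowest_by_self_time(SPANS)
print(bottleneck["operation"], bottleneck["duration_ms"])
```

Ranking by self time rather than raw duration matters: the root span is always the longest, but most of its time belongs to its children.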
Trace Context Propagation
This is the mechanism that ties it all together. When Service A calls Service B, it must pass the trace ID and parent span ID along with the request. Otherwise, Service B creates an orphan span with no connection to the trace.
The W3C Trace Context standard defines two HTTP headers:
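traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  00                                  version
  4bf92f3577b34da6a3ce929d0e0e4736    trace_id (16 bytes, 32 hex characters)
  00f067aa0ba902b7                    parent span_id (8 bytes, 16 hex characters)
  01                                  trace flags (01 = sampled)
tracestate: vendor-specific data (optional)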
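As a rough illustration of the header layout, the sketch below parses a `traceparent` value in its real wire format: four dash-separated lowercase hex fields of 2, 32, 16, and 2 characters. The names `TraceParent` and `parse_traceparent` are made up for this example, and the parser is deliberately strict (it treats every version like version 00), which is a simplification.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace_id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    if match["version"] == "ff":  # reserved as an invalid version
        return None
    # All-zero IDs are not valid identifiers.
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)  # lowest bit is the sampled flag
    return TraceParent(match["version"], match["trace_id"], match["parent_id"], sampled)


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

Returning `None` on a malformed header mirrors the usual behaviour of starting a fresh trace rather than failing the request.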
Every HTTP client and server in your stack must:
- Extract the trace context from incoming request headers
- Create a new child span
- Inject the trace context into outgoing request headers
# Simplified: instrumentation libraries normally do this for you
import requests
from opentelemetry import propagate, trace

tracer = trace.get_tracer(__name__)

def handle_request(request, order_id):
    # Extract the parent context from the incoming headers
    parent_ctx = propagate.extract(request.headers)
    # Create a new span as a child of the extracted parent
    with tracer.start_as_current_span("process-order", context=parent_ctx) as span:
        span.set_attribute("order.id", order_id)
        # Inject the current span's context into the outgoing headers
        headers = {}
        propagate.inject(headers)
        return requests.post(
            "http://payment-service/charge",
            headers=headers,
            timeout=5,
        )
In practice, you rarely write this code by hand. OpenTelemetry instrumentation libraries hook into HTTP clients and servers, gRPC, database drivers, and message queue clients, and perform the extract and inject steps for you.
The Dapper Model
Google published the Dapper paper in 2010. It established the tracing model that every modern tracing system follows.
Key ideas from Dapper:
1. Sampling. You don't trace every request. Google traced about 1 in 1,024 requests. The overhead of tracing every request (creating spans, serializing data, transmitting to collectors) is too high for production.
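2. Low overhead. The Dapper paper reports that trace collection accounted for less than 0.01% of network traffic in Google's production environment, and that with sampling enabled the latency impact on traced services was within measurement noise. Tracing must be invisible to users.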
3. Application-level transparency. Developers don't instrument every function. The tracing library hooks into the RPC framework (Stubby at Google, gRPC in the open-source world) and auto-instruments every cross-service call.
4. Offline analysis. Spans are collected asynchronously. The tracing backend processes and indexes them in bulk. You query traces after the fact, not in real-time.
Spicy opinion: Google's Dapper paper is the single most influential paper in the observability space. Jaeger, Zipkin, AWS X-Ray, and Datadog APM are all Dapper clones with different storage backends. Read the paper — it's 14 pages and surprisingly readable.
Jaeger: The Kubernetes-Native Tracer
Uber built Jaeger in 2015 to debug latency issues in their microservices architecture. It's now a CNCF graduated project, the same maturity level as Kubernetes and Prometheus.
Architecture
┌──────────────────────────────────────────────┐
│ Application (instrumented with OpenTelemetry)│
│ └── Spans sent via OTLP to... │
├──────────────────────────────────────────────┤
│ Jaeger Collector │
│ ├── Receives spans │
│ ├── Validates and indexes │
│ └── Writes to storage backend │
├──────────────────────────────────────────────┤
│ Storage Backend │
│ ├── Cassandra (scalable, battle-tested) │
│ ├── Elasticsearch (full-text search on tags)│
│ ├── Kafka (as buffer before storage) │
│ └── Badger (embedded, for local dev) │
├──────────────────────────────────────────────┤
│ Jaeger Query (UI + API) │
│ └── Search traces by service, operation, │
│ duration, tags │
└──────────────────────────────────────────────┘
Key Jaeger features:
- Adaptive sampling: automatically adjusts sampling rates based on traffic volume per endpoint.
- Service dependency graph: auto-generated from traces, shows which services call which.
- Comparison: compare two traces side-by-side to spot differences.
- Kubernetes-native: Helm charts, operators, and auto-injection via sidecars.
When Jaeger
- Kubernetes environment
- Already using Cassandra or Elasticsearch
- Need CNCF-grade project with active community
- Want to self-host your tracing infrastructure
Zipkin: The Original Open-Source Tracer
Twitter built Zipkin in 2012, inspired directly by the Dapper paper. It's the oldest open-source distributed tracer and the simplest to get started with.
Application ──HTTP/Kafka──▶ Zipkin Collector ──▶ Storage ──▶ Zipkin UI
│
MySQL / Cassandra /
Elasticsearch / In-memory
Zipkin vs Jaeger:
| Feature | Zipkin | Jaeger |
|---|---|---|
| Origin | Twitter (2012) | Uber (2015) |
| Language | Java | Go |
| CNCF status | Not CNCF | Graduated |
| Sampling | Fixed rate | Adaptive |
| Storage | MySQL, Cassandra, ES, in-memory | Cassandra, ES, Kafka, Badger |
| Kubernetes support | Basic | Native (operators, Helm) |
| Complexity | Simpler | More features, more setup |
When Zipkin: smaller deployments, teams that want simplicity, Java-heavy stacks.
When Jaeger: Kubernetes-native, larger scale, need adaptive sampling.
OpenTelemetry: The Standard That Matters
OpenTelemetry (OTel) is the convergence of OpenTracing and OpenCensus — two competing instrumentation standards that merged in 2019. It provides vendor-neutral SDKs for metrics, traces, and logs.
Why OpenTelemetry matters: instrument once, send to any backend.
Application + OpenTelemetry SDK
│
│ OTLP (OpenTelemetry Protocol)
▼
OTel Collector
├── Export to Jaeger
├── Export to Zipkin
├── Export to Datadog
├── Export to Grafana Tempo
└── Export to AWS X-Ray
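# Python: auto-instrumentation with OpenTelemetry
# pip install flask requests opentelemetry-sdk
# pip install opentelemetry-instrumentation-flask
# pip install opentelemetry-instrumentation-requests
# pip install opentelemetry-exporter-otlp
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup (usually in app startup); the service name here is an example value
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(
    # insecure=True assumes a plaintext gRPC collector inside the cluster
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # server spans for Flask routes
RequestsInstrumentor().instrument()      # client spans and header injection for requests
# Database drivers need their own instrumentation package.
# No manual span creation needed for these standard operations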
The OTel Collector is a standalone process that receives, processes, and exports telemetry data. It can:
- Batch spans for efficient export
- Add resource attributes (Kubernetes pod name, region)
- Sample (head or tail)
- Convert between formats (Zipkin ↔ Jaeger ↔ OTLP)
- Route to multiple backends simultaneously
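The batching step is easy to picture in code. The sketch below is a hypothetical `SpanBatcher`, not the Collector's implementation: it buffers spans and hands them to an export callback once a size threshold is reached. A real batcher would also flush on a timer and bound its memory use, which this example leaves out.

```python
from typing import Any, Callable, Dict, List

Span = Dict[str, Any]


class SpanBatcher:
    """Hypothetical batcher: buffers spans and exports them in groups."""

    def __init__(self, export: Callable[[List[Span]], None], max_batch_size: int = 512):
        self._export = export
        self._max_batch_size = max_batch_size
        self._buffer: List[Span] = []

    def add(self, span: Span) -> None:
        """Buffer one span; export automatically when the batch is full."""
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Export whatever is buffered (call this on shutdown as well)."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._export(batch)


exported_batches: List[List[Span]] = []
batcher = SpanBatcher(exported_batches.append, max_batch_size=3)
for i in range(7):
    batcher.add({"span_id": f"span-{i}"})
batcher.flush()  # push out the final partial batch
print([len(batch) for batch in exported_batches])
```

Sending one request per batch instead of one per span is where most of the export efficiency comes from.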
Spicy opinion: if you're starting a new project, use OpenTelemetry from day one. Don't start with a vendor SDK (Datadog, New Relic), because you'll be locked in. OTel gives you freedom to switch backends without re-instrumenting your code. The tracing SDKs are stable for the major languages; check the per-language status for metrics and logs, which have matured at different rates.
Sampling Strategies
Tracing every request is expensive. Storage adds up fast — a moderately busy service might generate 100 GB of trace data per day at 100% sampling. You need to sample.
Head-Based Sampling
Decide whether to trace a request at the beginning (the "head").
Request arrives at API Gateway:
random() < 0.01? → Yes: trace this request (set sampled=true)
→ No: don't trace (set sampled=false)
The sampling decision propagates through all services via
the trace context header. All services honor the decision.
Pros: simple, predictable, low overhead. Cons: you might miss rare but interesting requests (errors, slow ones).
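A minimal sketch of the head-based decision follows, assuming the decision is derived from the trace ID itself so that any service looking at the same ID reaches the same answer. The function names are hypothetical. The approach is similar in spirit to ratio-based samplers in tracing SDKs, but it is not a copy of any particular one.

```python
import secrets


def new_trace_id() -> str:
    """Random 128-bit trace ID as 32 hex characters."""
    return secrets.token_hex(16)


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic decision: compare the low 64 bits of the ID to a threshold."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    low_64_bits = int(trace_id, 16) & ((1 << 64) - 1)
    return low_64_bits < int(rate * (1 << 64))


def start_trace(rate: float) -> str:
    """Make the decision once at the edge and encode it in the traceparent flags."""
    trace_id = new_trace_id()
    span_id = secrets.token_hex(8)
    flags = "01" if should_sample(trace_id, rate) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


print(start_trace(0.01))
```

Downstream services read the flag from the header instead of rolling their own dice, which is what keeps a sampled trace complete.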
Tail-Based Sampling
Decide whether to keep a trace after all spans are collected. You buffer all spans temporarily, then keep only the interesting ones.
All spans collected → Tail sampling evaluator:
- Duration > 2 seconds? → KEEP
- Contains error? → KEEP
- Status code 5xx? → KEEP
- Random 1%? → KEEP
- Otherwise → DROP
Pros: captures every error and slow request, regardless of sampling rate. Cons: requires buffering all spans temporarily (memory/storage cost), adds latency to trace availability, more complex infrastructure.
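The OpenTelemetry Collector supports tail-based sampling through the tail_sampling processor, which ships in the contrib distribution. All spans of a trace have to reach the same Collector instance for the decision to be made, so multi-instance deployments need trace-ID-aware load balancing in front of it.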
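The evaluator rules above can be sketched in a few lines. `TailSampler` and its span fields (`duration_ms`, `error`, `http_status`) are assumptions for illustration. A production implementation also has to decide when a trace is "complete" (usually a timeout) and cap the buffer size, both of which are skipped here.

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, List

Span = Dict[str, Any]


class TailSampler:
    """Hypothetical tail sampler: buffer spans per trace, decide at the end."""

    def __init__(self, slow_ms: float = 2000, baseline_rate: float = 0.01,
                 rng: Callable[[], float] = random.random):
        self.slow_ms = slow_ms
        self.baseline_rate = baseline_rate
        self._rng = rng  # injectable so the random baseline can be tested
        self._pending: Dict[str, List[Span]] = defaultdict(list)

    def add_span(self, span: Span) -> None:
        self._pending[span["trace_id"]].append(span)

    def finish_trace(self, trace_id: str) -> List[Span]:
        """Return the spans to export, or an empty list if the trace is dropped."""
        spans = self._pending.pop(trace_id, [])
        keep = (
            any(s["duration_ms"] > self.slow_ms for s in spans)      # slow
            or any(s.get("error") for s in spans)                    # error
            or any(s.get("http_status", 0) >= 500 for s in spans)    # 5xx
            or self._rng() < self.baseline_rate                      # random baseline
        )
        return spans if keep else []


sampler = TailSampler(rng=lambda: 0.5)  # 0.5 never passes the 1% baseline
sampler.add_span({"trace_id": "t1", "duration_ms": 2400})
sampler.add_span({"trace_id": "t2", "duration_ms": 40})
sampler.add_span({"trace_id": "t3", "duration_ms": 40, "error": True})
kept = {tid: len(sampler.finish_trace(tid)) for tid in ("t1", "t2", "t3")}
print(kept)
```

The memory cost mentioned above is visible in `_pending`: every span of every in-flight trace sits there until the decision is made.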
Practical Sampling Rates
| Environment | Rate | Reasoning |
|---|---|---|
| Dev/staging | 100% | Low traffic, want to see everything |
| Production (low traffic) | 10-50% | Enough traces to debug issues |
| Production (high traffic) | 0.1-1% | At 100K req/s, 1% = 1,000 traces/s |
| Production (tail-based) | 100% errors, 1% baseline | Best of both worlds |
Google's Dapper traced ~0.1% of requests. Even at Google scale, that's millions of traces per day — more than enough for debugging.
Cost of Tracing
Tracing isn't free. Here's what it actually costs:
Compute overhead: creating spans, serializing them, and sending them to the collector. With OpenTelemetry, this is typically 1-3% CPU overhead at 100% sampling. At 1% sampling, it's negligible.
Network: each span is roughly 200-500 bytes. At 1,000 spans/second, that's 200-500 KB/s. Not a concern for most deployments.
Storage: this is the real cost. A span stored in Elasticsearch or Cassandra takes 1-3 KB (with indexing). At 1 million spans/hour, that's 1-3 GB/hour or 24-72 GB/day. With 30-day retention, you're looking at roughly 0.7-2.2 TB.
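A quick way to sanity-check storage numbers for your own traffic is to run the arithmetic directly. The helper below is a hypothetical back-of-the-envelope calculator. It uses decimal units (1 KB = 1,000 bytes) and ignores replication and compression, either of which can move the result considerably.

```python
def storage_estimate_gb(spans_per_hour: int, bytes_per_span: int, retention_days: int):
    """Return (GB per day, GB held at steady state for the retention window)."""
    per_day_gb = spans_per_hour * 24 * bytes_per_span / 1e9
    return per_day_gb, per_day_gb * retention_days


# 1 million spans/hour at 1 KB and 3 KB per stored span, kept for 30 days
low = storage_estimate_gb(1_000_000, 1_000, 30)
high = storage_estimate_gb(1_000_000, 3_000, 30)
print(low, high)
```

For the figures used in this section, that comes to 24-72 GB per day and about 720-2,160 GB at 30-day retention.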
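Grafana Tempo takes a different approach: it stores traces in object storage (S3) and keeps no per-attribute index. The original workflow is to find traces by ID (from logs or metric exemplars) rather than searching through all traces; newer releases also support search through TraceQL. Skipping the index can reduce storage costs substantially, by something like 10-100x compared to Elasticsearch-backed solutions, depending on the workload.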
Connecting Traces to Metrics and Logs
Traces are most powerful when connected to metrics and logs.
Trace-to-Logs
Include the trace_id in every log line. When you find a slow trace, click through to the logs for that trace ID.
{"timestamp": "2024-01-15T14:32:05.123Z",
"level": "error",
"service": "payment-service",
"trace_id": "abc123def456",
"span_id": "span-003",
"message": "Stripe API timeout after 2400ms",
"stripe_request_id": "req_xyz789"}
Metrics-to-Traces
When a metric spikes (p99 latency jumps), drill down to exemplars — specific trace IDs associated with that spike.
Prometheus metric:
http_request_duration_seconds{handler="/orders"} = 3.1
→ exemplar trace_id = "abc123def456"
→ click to open trace in Jaeger
Prometheus has supported exemplars since v2.26, behind the exemplar-storage feature flag. Grafana renders them as dots on metric graphs: click a dot, jump to the trace.
Patterns for System Design Interviews
Pattern 1: "How do you debug latency in a microservices system?" "Distributed tracing. Each request gets a trace ID that propagates through all services. We can see the full request lifecycle and identify which service is the bottleneck."
Pattern 2: "How do you handle observability at scale?" "Metrics for alerting (Prometheus), traces for debugging latency (Jaeger/OTel), logs for root cause analysis. Trace IDs link all three. We sample at 1% to control costs."
Pattern 3: "How does trace context propagation work?" "Each service extracts the trace ID from incoming request headers, creates a child span, and injects the trace ID into outgoing requests. The W3C Trace Context standard defines the header format."
Trade-offs Table
| Dimension | Advantage | Disadvantage |
|---|---|---|
| Debugging | See full request path across services | Only sampled requests are visible |
| Overhead | 1-3% CPU at full sampling | Storage grows fast at high volume |
| Head sampling | Simple, predictable | Misses rare events (errors) |
| Tail sampling | Captures all errors/slow requests | Buffering cost, added complexity |
| Auto-instrumentation | No code changes for HTTP/gRPC | Misses custom business logic spans |
| OpenTelemetry | Vendor-neutral, future-proof | Still maturing in some languages |
| Grafana Tempo | Cheap (S3 storage, no indexing) | Built around lookup by trace ID; search (TraceQL) arrived later |

Interview Gotchas
Gotcha 1: "Do you trace every request?" No. At scale, 1% sampling is typical. For errors and slow requests, use tail-based sampling to capture 100% of interesting traces while dropping routine ones.
Gotcha 2: "What's the difference between tracing and logging?" Logs are per-service, unstructured or semi-structured text. Traces are structured, cross-service, and show the causal relationship between operations. Logs tell you what happened within a service. Traces tell you what happened across services.
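Gotcha 3: "What if a service doesn't propagate the trace context?" The trace breaks. Downstream spans become orphans with no parent. This is the #1 operational problem with tracing. Every service, proxy, and queue must propagate headers. Proxies like Envoy can generate and forward trace headers on the requests they handle, but they cannot link an inbound request to the outbound calls your code makes, so the application still has to copy the context across.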
Gotcha 4: "Can tracing replace logging?" No. Traces show the structure and timing of a request. Logs capture detailed context (error messages, stack traces, business logic decisions). You need both. Link them with trace IDs.
Gotcha 5: "What about tracing through message queues?" The producer injects the trace context into the message headers. The consumer extracts it and creates a child span. Kafka and RabbitMQ both support this via OpenTelemetry instrumentations. The resulting trace shows the async gap between produce and consume.
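The queue case from Gotcha 5 can be sketched with an in-memory queue standing in for Kafka or RabbitMQ. The message shape (`headers` plus `payload`) and the `produce` and `consume` helpers are assumptions for illustration; real client libraries expose message headers through their own APIs, and instrumentation packages normally do this step for you.

```python
import queue
import secrets


def produce(q: queue.Queue, payload: dict, trace_id: str, producer_span_id: str) -> None:
    """Inject the trace context into the message headers before publishing."""
    headers = {"traceparent": f"00-{trace_id}-{producer_span_id}-01"}
    q.put({"headers": headers, "payload": payload})


def consume(q: queue.Queue) -> dict:
    """Extract the context from the message and start a child span record."""
    message = q.get()
    _version, trace_id, parent_span_id, _flags = message["headers"]["traceparent"].split("-")
    return {
        "trace_id": trace_id,                # same trace as the producer
        "span_id": secrets.token_hex(8),     # new ID for the consumer's span
        "parent_span_id": parent_span_id,    # links back to the producing span
        "operation": "process order.created",
        "payload": message["payload"],
    }


q = queue.Queue()
produce(q, {"order_id": 42}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
consumer_span = consume(q)
print(consumer_span["trace_id"], consumer_span["parent_span_id"])
```

The time between the producer span ending and the consumer span starting is the async gap that shows up in the trace view.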