Distributed Tracing Internals — The Dapper Model
TL;DR
Distributed tracing tracks a request as it flows through multiple services. Google's Dapper paper (2010) established the model: a trace is a tree of spans, context propagates across every service boundary (today usually via HTTP headers), spans are collected asynchronously, and only a small fraction of traces (for example 1%) is sampled to keep overhead manageable. The W3C traceparent header is the standard for context propagation. Head-based sampling decides at the start; tail-based sampling decides after the trace is complete (capturing rare errors but requiring more infrastructure). The ecosystem has converged on OpenTelemetry for instrumentation. Jaeger and Zipkin are the major open-source backends. The cost equation: 100% tracing is too expensive, 1% misses rare bugs, and tail-based sampling is often the best compromise.
The Problem
A user's request to "load the home page" touches 15 microservices: API gateway, auth, user profile, recommendations, content, ads, A/B testing, feature flags, search, cache, three databases, a message queue, and a CDN. The request takes 3 seconds instead of the expected 200ms. Which service is slow?
Without tracing: check each service's logs individually. Correlate timestamps (which are unreliable across services). Hope that someone logged a request ID. Spend 2 hours finding that the recommendations service made a slow database query.
With tracing: look up the trace by request ID. See a tree of spans showing exactly which service took how long. Within 30 seconds, find the 2.5-second span in the recommendations service's database call.
This is not a luxury. It is a prerequisite for operating a microservices architecture. Without distributed tracing, debugging production issues is guesswork.
The Algorithm: The Dapper Model
Core Concepts
Trace: A single end-to-end request. Identified by a unique trace_id (typically a 128-bit random ID).
Span: A single unit of work within a trace. Each span has:
| Field | Purpose |
|---|---|
| trace_id | Links this span to its trace |
| span_id | Unique identifier for this span |
| parent_span_id | ID of the parent span (null for root span) |
| operation_name | What this span represents (e.g., "GET /api/users") |
| start_time | When the span started (microsecond precision) |
| duration | How long the span took |
| tags | Key-value metadata (e.g., http.status_code=200) |
| logs | Timestamped events within the span |
Trace tree: Spans form a tree. The root span is the initial request. Each downstream service call creates a child span.
Worked Example
User → API Gateway, which calls the Auth Service and then the User Service → Database
Trace ID: abc123
Span 1: API Gateway (root)
span_id: s1, parent: null
operation: "GET /home"
start: 0ms, duration: 250ms
Span 2: Auth Service (child of s1)
span_id: s2, parent: s1
operation: "verify_token"
start: 5ms, duration: 15ms
Span 3: User Service (child of s1)
span_id: s3, parent: s1
operation: "get_user_profile"
start: 25ms, duration: 200ms
Span 4: Database (child of s3)
span_id: s4, parent: s3
operation: "SELECT * FROM users WHERE id = ?"
start: 30ms, duration: 180ms ← bottleneck!
The trace tree immediately reveals that the database query in the User Service is consuming 180ms out of the 250ms total. Without tracing, you would only know "the home page is slow."
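The worked example can be reproduced in a few lines. This is a minimal sketch, not the data model of any particular tracing library: the `Span` dataclass and the `self_time_ms` helper are hypothetical names, and self time is computed naively by subtracting child durations, which assumes children do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    operation: str
    start_ms: int
    duration_ms: int


# The four spans from the worked example (trace abc123).
spans = [
    Span("s1", None, "GET /home", 0, 250),
    Span("s2", "s1", "verify_token", 5, 15),
    Span("s3", "s1", "get_user_profile", 25, 200),
    Span("s4", "s3", "SELECT * FROM users WHERE id = ?", 30, 180),
]

# Index children by parent so the tree can be walked from the root.
children = {}
for s in spans:
    children.setdefault(s.parent_id, []).append(s)


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its direct children."""
    return span.duration_ms - sum(c.duration_ms for c in children.get(span.span_id, []))


def print_tree(span: Span, depth: int = 0) -> None:
    indent = "  " * depth
    print(f"{indent}{span.operation}: {span.duration_ms}ms (self {self_time_ms(span)}ms)")
    for child in children.get(span.span_id, []):
        print_tree(child, depth + 1)


root = children[None][0]
print_tree(root)

# The span with the largest self time is the first place to look.
bottleneck = max(spans, key=self_time_ms)
print("bottleneck:", bottleneck.span_id)
```

Ranking by self time rather than raw duration avoids blaming a parent span for time that was actually spent in its children.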
Context Propagation
For spans to form a tree, each service must pass the trace context to the next service. For HTTP calls, this happens via request headers.
W3C Trace Context standard (traceparent header):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Format: version-trace_id-parent_span_id-flags
00: version
4bf92f3577b34da6a3ce929d0e0e4736: trace_id (32 hex chars = 128 bits)
00f067aa0ba902b7: parent_span_id (16 hex chars = 64 bits)
01: flags (01 = sampled)
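A small parser makes the field widths concrete. This is a sketch that handles only the version-00 layout described here; `TraceContext` and `parse_traceparent` are hypothetical names, and a production service would normally rely on its tracing SDK's propagator instead.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace_id (32 hex) - parent_span_id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid in W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit of flags = sampled
    return TraceContext(version, trace_id, parent_span_id, sampled)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

Returning `None` on a malformed header lets the receiving service start a fresh trace instead of failing the request.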
When Service A calls Service B:
Service A:
1. Create span for the outgoing call.
2. Set traceparent header: "00-{trace_id}-{new_span_id}-01"
3. Send HTTP request to Service B with the header.
Service B:
1. Extract traceparent from incoming request.
2. Create a new span with parent_span_id = extracted span_id.
3. Process the request.
4. When calling Service C, propagate the same trace_id with a new span_id.
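The hand-off between Service A and Service B can be sketched with plain dictionaries standing in for HTTP headers. The `inject` and `extract` helpers are hypothetical names chosen for illustration, and the sketch skips validation of the incoming header.

```python
import secrets


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex chars


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex chars


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> None:
    """Write the caller's context into the outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read (trace_id, parent_span_id, sampled) from incoming headers."""
    _version, trace_id, parent_span_id, flags = headers["traceparent"].split("-")
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


# Service A: create a span for the outgoing call and send its ID along.
trace_id = new_trace_id()
a_span_id = new_span_id()
request_to_b = {}
inject(request_to_b, trace_id, a_span_id, sampled=True)

# Service B: the extracted span ID becomes the parent of B's own span.
b_trace_id, b_parent_id, b_sampled = extract(request_to_b)
b_span_id = new_span_id()

# Service B calling Service C: same trace_id, B's span ID as the new parent.
request_to_c = {}
inject(request_to_c, b_trace_id, b_span_id, b_sampled)
print(request_to_b["traceparent"])
print(request_to_c["traceparent"])
```

The two printed headers share the trace_id and sampled flag and differ only in the span ID field, which is what lets the collector rebuild the parent-child links.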
The tracestate header: carries vendor-specific key-value pairs alongside traceparent. This allows multiple tracing systems to coexist without losing each other's context.
For non-HTTP protocols (gRPC, message queues), the trace context is propagated in metadata fields or message headers. The concept is the same: pass trace_id and parent_span_id through every inter-service boundary.
Span Collection
Spans are collected asynchronously to minimize performance impact:
Service generates span
→ writes to in-memory buffer (batch)
→ periodically flushes to collector agent (local sidecar or daemon)
→ agent forwards to central collector (Jaeger Collector, Zipkin)
→ collector writes to storage (Elasticsearch, Cassandra, ClickHouse)
→ query frontend reads from storage for UI
This asynchronous pipeline means:
- Span creation adds ~1-5 microseconds of latency to the request (just setting fields).
- Span export happens in a background thread.
- If the collector is down, spans are dropped (not the request).
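The first two stages of the pipeline (in-memory buffer, periodic flush) can be sketched with a bounded queue and a background thread. `BatchExporter` and the `send_batch` callback are hypothetical names, and the default sizes are arbitrary; real SDKs ship their own batching exporters with more careful shutdown and retry behavior.

```python
import queue
import threading


class BatchExporter:
    """Buffers finished spans and flushes them from a background thread."""

    def __init__(self, send_batch, max_queue=2048, batch_size=512, flush_interval_s=1.0):
        self._send_batch = send_batch  # e.g. a function that posts to the local agent
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self._flush_interval_s = flush_interval_s
        self._stop = threading.Event()
        self.dropped = 0  # approximate counter, not synchronized across threads
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_span_end(self, span) -> None:
        """Called on the request path, so it must never block."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1  # drop the span, not the request

    def _flush(self) -> None:
        while True:
            batch = []
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return
            try:
                self._send_batch(batch)
            except Exception:
                self.dropped += len(batch)  # collector unreachable: give up on the batch

    def _run(self) -> None:
        # Wake up every flush interval until shutdown is requested.
        while not self._stop.wait(self._flush_interval_s):
            self._flush()
        self._flush()  # final flush on shutdown

    def shutdown(self) -> None:
        self._stop.set()
        self._worker.join()


sent = []
exporter = BatchExporter(send_batch=sent.extend)
exporter.on_span_end({"operation": "GET /home", "duration_ms": 250})
exporter.shutdown()
print(len(sent), "span(s) exported,", exporter.dropped, "dropped")
```

The bounded queue is the design choice that matters: when the collector falls behind, memory use stays capped and the request path keeps its constant cost.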
Sampling
The Cost Problem
At 100,000 requests per second, with an average of 10 spans per request, you generate 1 million spans per second. Each span is ~500 bytes with tags and logs. That is 500 MB/s of trace data, or 43 TB/day.
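The arithmetic is easy to rerun with different inputs. The figures below are the ones from the paragraph above; the 1% line at the end is added only for comparison.

```python
requests_per_second = 100_000
spans_per_request = 10
bytes_per_span = 500
seconds_per_day = 86_400

spans_per_second = requests_per_second * spans_per_request
bytes_per_second = spans_per_second * bytes_per_span
bytes_per_day = bytes_per_second * seconds_per_day

print(f"{spans_per_second:,} spans/s")
print(f"{bytes_per_second / 1e6:.0f} MB/s")
print(f"{bytes_per_day / 1e12:.1f} TB/day at 100% collection")
print(f"{bytes_per_day * 0.01 / 1e9:.0f} GB/day at 1% sampling")
```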
Storing and indexing 43 TB/day of trace data is expensive. And most traces are "normal" -- you do not need them for debugging. You need the interesting traces: errors, slow requests, anomalies.
Head-Based Sampling
Decide whether to sample a trace at the very beginning (the "head"), before any spans are created.
At the entry point (API Gateway):
    if random() < sampling_rate:  # e.g., 0.01 for 1%
        flags = SAMPLED (01)
    else:
        flags = NOT_SAMPLED (00)
    Set traceparent with these flags.
All downstream services respect the flags.
Pros: Simple. Low overhead. All spans in a sampled trace are collected (complete trace).
Cons: Blind to outcome. The decision is made before the request runs, so errors and slow requests are no more likely to be kept than normal ones. 1% sampling discards 99% of traces, so each occurrence of a rare error has only a 1% chance of being captured. If an error affects 0.01% of requests at 100K requests/sec, that is 10 errors/sec, and with 1% sampling you capture about 1 error trace every 10 seconds -- workable at high QPS, but there is a 99% chance of missing the first occurrence of a novel bug.
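The capture-rate numbers can be checked directly, and a short simulation confirms that roughly 1% of error occurrences survive head sampling. The simulation assumes each request is sampled independently; the seed and the number of simulated errors are arbitrary.

```python
import random

requests_per_second = 100_000
error_fraction = 0.0001  # 0.01% of requests hit the rare error
sampling_rate = 0.01     # 1% head-based sampling

errors_per_second = requests_per_second * error_fraction
sampled_errors_per_second = errors_per_second * sampling_rate
seconds_between_sampled_errors = 1 / sampled_errors_per_second
p_miss_first_occurrence = 1 - sampling_rate

print(f"errors/s: {errors_per_second:.1f}")
print(f"sampled error traces/s: {sampled_errors_per_second:.2f}")
print(f"seconds between sampled error traces: {seconds_between_sampled_errors:.0f}")
print(f"chance of missing the first occurrence: {p_miss_first_occurrence:.0%}")

# Simulate the head-sampling coin flip for many error occurrences.
rng = random.Random(42)
n_errors = 100_000
captured = sum(rng.random() < sampling_rate for _ in range(n_errors))
print(f"captured {captured} of {n_errors} simulated errors")
```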
Tail-Based Sampling
Collect ALL spans, buffer them, and decide which traces to keep AFTER the trace is complete.
1. All services send all spans to the collector.
2. Collector buffers spans grouped by trace_id.
3. After a trace is "complete" (no new spans for X seconds):
- If any span has an error tag: keep the trace.
- If total duration > threshold: keep the trace.
- If a specific tag matches (e.g., user_id = VIP): keep the trace.
- Otherwise: drop the trace (or keep 1%).
Pros: Captures ALL interesting traces (errors, slow requests). Sampling decision is informed by the actual trace data.
Cons: Must collect and buffer ALL spans before deciding. Higher network and compute costs. Requires a stateful collector that can group spans by trace_id (which needs consistent routing or shared state).
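The buffering and decision steps can be sketched as a single in-process class. `TailSampler`, its thresholds, and the dictionary span layout are all assumptions made for illustration; a real deployment also needs every span of a trace routed to the same collector instance, which this sketch ignores.

```python
import random
from collections import defaultdict


class TailSampler:
    """Buffers spans per trace and decides once the trace has gone quiet."""

    def __init__(self, latency_threshold_ms=1000, idle_timeout_s=10.0,
                 baseline_rate=0.01, rng=None):
        self.latency_threshold_ms = latency_threshold_ms
        self.idle_timeout_s = idle_timeout_s
        self.baseline_rate = baseline_rate
        self._rng = rng or random.Random()
        self._spans = defaultdict(list)  # trace_id -> buffered spans
        self._last_seen = {}             # trace_id -> arrival time of newest span

    def add_span(self, span: dict, now: float) -> None:
        self._spans[span["trace_id"]].append(span)
        self._last_seen[span["trace_id"]] = now

    def _keep(self, spans: list) -> bool:
        if any(s.get("error") for s in spans):
            return True  # any span has an error tag
        start = min(s["start_ms"] for s in spans)
        end = max(s["start_ms"] + s["duration_ms"] for s in spans)
        if end - start > self.latency_threshold_ms:
            return True  # total duration above threshold
        if any(s.get("tags", {}).get("user_id") == "VIP" for s in spans):
            return True  # a specific tag matches
        return self._rng.random() < self.baseline_rate  # keep a small baseline

    def flush_complete(self, now: float) -> dict:
        """Return {trace_id: spans} for kept traces; drop the rest."""
        done = [t for t, seen in self._last_seen.items()
                if now - seen >= self.idle_timeout_s]
        kept = {}
        for trace_id in done:
            spans = self._spans.pop(trace_id)
            del self._last_seen[trace_id]
            if self._keep(spans):
                kept[trace_id] = spans
        return kept


sampler = TailSampler(baseline_rate=0.0)
sampler.add_span({"trace_id": "t1", "start_ms": 0, "duration_ms": 40, "error": True}, now=0.0)
sampler.add_span({"trace_id": "t2", "start_ms": 0, "duration_ms": 40}, now=0.0)
print(sorted(sampler.flush_complete(now=30.0)))
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic and easy to test.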
Practical Sampling Strategies
| Strategy | Implementation | Use Case |
|---|---|---|
| Fixed-rate head sampling | 1% of traces | Default, low-cost |
| Adaptive sampling | Sample more of rare endpoints | Balanced coverage |
| Error-driven tail sampling | Keep all traces with errors | Error debugging |
| Latency-driven tail sampling | Keep traces above p99 latency | Performance debugging |
| Debug override | Force sampling via header | On-demand debugging |
The debug override is worth highlighting. If a developer is debugging an issue, they can add a special header (e.g., X-Debug-Trace: true) that forces 100% sampling for their requests. This enables targeted debugging without increasing the sampling rate globally.
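Combined with head-based sampling, the override is a single extra check at the entry point. The function name `sampling_flags` is hypothetical, the header name follows the example above, and the headers are modeled as a plain dict with exact-case keys; a real gateway would treat header names case-insensitively and would likely restrict who may send the header.

```python
import random


def sampling_flags(headers: dict, sampling_rate: float, rng=random) -> str:
    """Return traceparent flags: '01' (sampled) or '00' (not sampled)."""
    # Debug override: force sampling for this request only.
    if headers.get("X-Debug-Trace", "").lower() == "true":
        return "01"
    # Otherwise fall back to fixed-rate head sampling.
    return "01" if rng.random() < sampling_rate else "00"


print(sampling_flags({"X-Debug-Trace": "true"}, sampling_rate=0.01))
print(sampling_flags({}, sampling_rate=0.01))
```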
Proof/Correctness Intuition
Why Tree Structure Works
A microservice request naturally forms a tree: the initial request is the root, each downstream call is a child, and nested calls create deeper levels. The parent-child relationship is established by context propagation: the child's parent_span_id points to the parent's span_id.
This tree structure enables:
- Waterfall visualization: shows the timeline of all spans, making it obvious which span is the bottleneck.
- Critical path analysis: the critical path is the chain of spans from root to leaf that determines end-to-end latency. Optimizing any span not on the critical path does not reduce total latency.
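One simple way to approximate the critical path is to start at the root and repeatedly follow the child that finishes last. This is a heuristic sketch, not a full critical-path algorithm: it ignores gaps between children and assumes the last-finishing child is the one the parent was waiting on. The `Span` class is a hypothetical minimal layout, and the data is the four-span example from earlier in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    start_ms: int
    duration_ms: int

    @property
    def end_ms(self) -> int:
        return self.start_ms + self.duration_ms


def critical_path(spans):
    """Walk from the root, always descending into the last-finishing child."""
    children = {}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children.setdefault(s.parent_id, []).append(s)

    path = []
    node = root
    while node is not None:
        path.append(node.span_id)
        kids = children.get(node.span_id, [])
        node = max(kids, key=lambda c: c.end_ms) if kids else None
    return path


spans = [
    Span("s1", None, 0, 250),   # API Gateway (root)
    Span("s2", "s1", 5, 15),    # Auth Service
    Span("s3", "s1", 25, 200),  # User Service
    Span("s4", "s3", 30, 180),  # Database
]
print(" -> ".join(critical_path(spans)))  # s1 -> s3 -> s4
```

In this example the Auth Service span (s2) is off the critical path, so speeding it up alone would not make the request finish sooner under this heuristic.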
Why Sampling Is Necessary
The alternative to sampling is 100% collection. For a large organization (Google, Uber, Netflix), this means tens of terabytes of trace data per day. The storage cost alone is significant, but the real problem is the query cost: searching through billions of spans to find one trace is slow.
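Sampling reduces the data volume to a manageable level. The statistical argument: with 1% sampling, any pattern that occurs in more than 100 requests per second is captured, on average, in at least one trace per second. This is an expectation rather than a guarantee for any given second, but over minutes of traffic the pattern is captured many times. For most debugging purposes, this is sufficient.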
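The "one trace per second" figure is an average, and a short calculation shows how likely it is that a given window contains at least one sampled trace. The sketch assumes each request is sampled independently with the same probability; `p_at_least_one` is a hypothetical helper name.

```python
def p_at_least_one(n_requests: int, sampling_rate: float) -> float:
    """Probability that at least one of n independent requests is sampled."""
    return 1 - (1 - sampling_rate) ** n_requests


for n in (100, 300, 500, 1000):
    print(f"{n:>5} matching requests -> {p_at_least_one(n, 0.01):.3f}")
```

With 100 matching requests the chance of catching at least one is about 63%, so a pattern at exactly 100 requests per second is seen in most seconds but not all; at 500 matching requests the chance is above 99%.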
Real-World Usage
| System | Based On | Key Feature |
|---|---|---|
| Jaeger | Dapper model | CNCF project, Kubernetes-native |
| Zipkin | Dapper model | Twitter origin, mature ecosystem |
| AWS X-Ray | Dapper model | AWS-integrated, sampling built-in |
| Google Cloud Trace | Dapper | Google Cloud, low-overhead |
| Datadog APM | Dapper + proprietary | Commercial, tail-based sampling |
| OpenTelemetry | Convergence standard | Vendor-neutral, replaces OpenTracing + OpenCensus |
OpenTelemetry is the convergence point. It merges OpenTracing (tracing API) and OpenCensus (tracing + metrics) into a single standard. New projects should use OpenTelemetry. It provides:
- Language SDKs for instrumentation (Go, Java, Python, JS, etc.)
- OTLP (OpenTelemetry Protocol) for span export
- Collector for span processing and routing
- Auto-instrumentation for common frameworks
Interview Application
When to mention distributed tracing:
- "How would you debug a slow request in a microservices architecture?" -- Distributed tracing with trace ID correlation.
- "How do you monitor inter-service latency?" -- Trace spans with timing data, aggregate for p50/p99 per edge.
- "How would you implement end-to-end visibility?" -- OpenTelemetry for traces, propagate context via W3C headers.
- "How do you handle the cost of tracing?" -- Head-based sampling (1%) for cost control, tail-based for error capture.
What interviewers want to hear:
- You understand the trace/span/parent model.
- You know context propagation via headers (W3C traceparent).
- You can discuss sampling trade-offs (head vs tail, cost vs coverage).
- You know OpenTelemetry is the current standard.
- You understand that tracing is asynchronous and does not add significant latency to requests.
Trade-offs

| Advantage | Disadvantage |
|---|---|
| Pinpoints latency bottlenecks | Sampling misses rare events |
| Shows inter-service dependencies | Storage and query costs at scale |
| Enables root cause analysis | Requires instrumentation in every service |
| Context propagation is standardized | Async collection means traces may be incomplete |
| Can be combined with metrics + logs | Tail-based sampling requires stateful collector |
The Three Pillars: Traces, Metrics, Logs
Distributed tracing is one of three observability signals:
- Traces: Request-level, shows the path through services. "What happened to this specific request?"
- Metrics: Aggregated, shows system-level health. "What is the p99 latency of the auth service?"
- Logs: Event-level, shows detailed context. "What was the exact error message?"
The strongest setup correlates all three: a trace ID in log messages links logs to traces. Span metadata feeds into metrics dashboards. OpenTelemetry provides all three signals through a unified SDK.
Common Mistakes
Distributed tracing replaces logging
Traces show structure (which service called which, how long). Logs show content (error messages, request bodies, stack traces). You need both. The best practice is to include the trace ID in log messages so you can jump from a trace to the relevant logs.
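Attaching the trace ID to every log line can be done once, in the logging setup, rather than in each log call. The sketch below uses Python's standard `logging` and `contextvars` modules; `current_trace_id` and `TraceIdFilter` are hypothetical names, and in a real service the context variable would be set by the tracing middleware when a request arrives.

```python
import io
import logging
from contextvars import ContextVar

# Holds the trace ID of the request currently being handled.
current_trace_id = ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never suppress the record


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("recommendations")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Normally done by middleware after extracting the traceparent header.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("slow query: %d ms", 2500)
print(stream.getvalue(), end="")
```

Searching the logs for that trace ID then returns exactly the lines written while handling the traced request.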
100% sampling is necessary to catch all errors
100% sampling catches all errors but generates massive data volumes. Tail-based sampling catches all errors with much lower storage cost. If tail-based is too complex, head-based sampling at 1% combined with error-rate alerting from metrics covers most needs.
Tracing adds significant latency to requests
Span creation is a few microseconds (set a struct's fields). Span export happens asynchronously in a background thread. The W3C traceparent header adds ~80 bytes to each HTTP request. Total overhead is typically < 0.1% of request latency.
Just use request IDs instead of tracing
Request IDs (a single ID passed through all services) enable log correlation but not latency analysis. Without spans with start/end times and parent-child relationships, you cannot build a waterfall view or identify the critical path. Tracing is a superset of request ID correlation.
Tracing works automatically in all architectures
Tracing requires context propagation through every inter-service boundary. If a service does not propagate the traceparent header (because it uses a custom HTTP client, or communicates via a message queue without propagation support), the trace is broken into disconnected fragments.
Sampling rate should be the same for all services
Critical-path services (auth, payment) should have higher sampling rates than background services (analytics, batch processing). Adaptive sampling adjusts the rate per endpoint based on traffic volume and error rate.