Distributed Tracing Internals — The Dapper Model

TL;DR

Distributed tracing tracks a request as it flows through multiple services. Google's Dapper paper (2010) established the model: a trace is a tree of spans, context propagates with each call (today usually via HTTP headers), spans are collected asynchronously, and only a small fraction of traces (1% in the examples here) is sampled to keep overhead manageable. The W3C traceparent header is the standard for context propagation. Head-based sampling decides at the start of a trace; tail-based sampling decides after the trace is complete, which captures rare errors but requires more infrastructure. OpenTelemetry has converged the ecosystem. Jaeger and Zipkin are the major open-source implementations. The cost equation: 100% tracing is too expensive, 1% head-based sampling misses rare bugs, and tail-based sampling is the best compromise.

The Problem

A user's request to "load the home page" touches 15 microservices: API gateway, auth, user profile, recommendations, content, ads, A/B testing, feature flags, search, cache, three databases, a message queue, and a CDN. The request takes 3 seconds instead of the expected 200ms. Which service is slow?

Without tracing: check each service's logs individually. Correlate timestamps (which are unreliable across services). Hope that someone logged a request ID. Spend 2 hours finding that the recommendations service made a slow database query.

With tracing: look up the trace by request ID. See a tree of spans showing exactly which service took how long. Within 30 seconds, find the 2.5-second span for the recommendations service's database call.

This is not a luxury. It is a prerequisite for operating a microservices architecture. Without distributed tracing, debugging production issues is guesswork.

The Algorithm: The Dapper Model

Core Concepts

Trace: A single end-to-end request. Identified by a unique trace_id (typically a 128-bit random ID).

Span: A single unit of work within a trace. Each span has:

Field	Purpose
`trace_id`	Links this span to its trace
`span_id`	Unique identifier for this span
`parent_span_id`	ID of the parent span (null for root span)
`operation_name`	What this span represents (e.g., "GET /api/users")
`start_time`	When the span started (microsecond precision)
`duration`	How long the span took
`tags`	Key-value metadata (e.g., http.status_code=200)
`logs`	Timestamped events within the span

Trace tree: Spans form a tree. The root span is the initial request. Each downstream service call creates a child span.

Worked Example

User → API Gateway → Auth Service → User Service → Database

Trace ID: abc123

Span 1: API Gateway (root)
  span_id: s1, parent: null
  operation: "GET /home"
  start: 0ms, duration: 250ms

  Span 2: Auth Service (child of s1)
    span_id: s2, parent: s1
    operation: "verify_token"
    start: 5ms, duration: 15ms

  Span 3: User Service (child of s1)
    span_id: s3, parent: s1
    operation: "get_user_profile"
    start: 25ms, duration: 200ms

    Span 4: Database (child of s3)
      span_id: s4, parent: s3
      operation: "SELECT * FROM users WHERE id = ?"
      start: 30ms, duration: 180ms   ← bottleneck!

The trace tree immediately reveals that the database query in the User Service is consuming 180ms out of the 250ms total. Without tracing, you would only know "the home page is slow."
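The worked example can be reproduced in a few lines of Python. This is a minimal sketch, not a real tracing library: the `Span` dataclass mirrors the field table above, and `children_by_parent` and `self_time_ms` are hypothetical helper names. It assumes a span's children run one after another, which holds for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    start_ms: int
    duration_ms: int
    tags: dict = field(default_factory=dict)


def children_by_parent(spans):
    """Group spans by parent_span_id so the tree can be walked top-down."""
    index = {}
    for span in spans:
        index.setdefault(span.parent_span_id, []).append(span)
    return index


def self_time_ms(span, index):
    """Time spent in the span itself, excluding its direct children.

    Assumes the children do not overlap in time (true for this example).
    """
    child_total = sum(c.duration_ms for c in index.get(span.span_id, []))
    return span.duration_ms - child_total


spans = [
    Span("abc123", "s1", None, "GET /home", 0, 250),
    Span("abc123", "s2", "s1", "verify_token", 5, 15),
    Span("abc123", "s3", "s1", "get_user_profile", 25, 200),
    Span("abc123", "s4", "s3", "SELECT * FROM users WHERE id = ?", 30, 180),
]

index = children_by_parent(spans)
slowest = max(spans, key=lambda s: self_time_ms(s, index))
print(slowest.span_id, self_time_ms(slowest, index))  # s4 180
```

Ranking by self time rather than raw duration keeps the root span, which always has the longest duration, from being reported as the bottleneck.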

Context Propagation

For spans to form a tree, each service must pass the trace context to the next service. This happens via HTTP headers.

W3C Trace Context standard (traceparent header):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Format: version-trace_id-parent_span_id-flags
  00:                               version
  4bf92f3577b34da6a3ce929d0e0e4736: trace_id (32 hex chars = 128 bits)
  00f067aa0ba902b7:                 parent_span_id (16 hex chars = 64 bits)
  01:                               flags (01 = sampled)
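A header in this format can be validated and split with a small parser. The sketch below is illustrative: `TraceContext`, `parse_traceparent`, and `format_traceparent` are hypothetical names, and the regular expression accepts only the fixed four-field layout of version 00, with lowercase hex digits.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace_id (32 hex) - parent_span_id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_span_id, flags = match.groups()
    # Version ff and all-zero IDs are defined as invalid by the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # bit 0 of the flags byte
    return TraceContext(version, trace_id, parent_span_id, sampled)


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a context back into header form."""
    flags = "01" if ctx.sampled else "00"
    return f"{ctx.version}-{ctx.trace_id}-{ctx.parent_span_id}-{flags}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.parent_span_id, ctx.sampled)
```

Returning `None` for a malformed header lets the receiving service fall back to starting a fresh trace instead of failing the request.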

When Service A calls Service B:

Service A:
  1. Create span for the outgoing call.
  2. Set traceparent header: "00-{trace_id}-{new_span_id}-01"
  3. Send HTTP request to Service B with the header.

Service B:
  1. Extract traceparent from incoming request.
  2. Create a new span with parent_span_id = extracted span_id.
  3. Process the request.
  4. When calling Service C, propagate the same trace_id with a new span_id.
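The hand-off between services can be sketched with plain dictionaries standing in for HTTP headers. The helper names `inject` and `extract` are assumptions chosen for illustration, and error handling for malformed headers is omitted to keep the flow visible.

```python
import secrets


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 32 hex chars = 128 bits


def new_span_id() -> str:
    return secrets.token_hex(8)  # 16 hex chars = 64 bits


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> None:
    """Write the caller's context into the outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read (trace_id, parent_span_id, sampled) from incoming headers."""
    _version, trace_id, parent_span_id, flags = headers["traceparent"].split("-")
    return trace_id, parent_span_id, flags == "01"


# Service A: create a span for the outgoing call and send its ID along.
trace_id = new_trace_id()
a_span_id = new_span_id()
request_to_b = {}
inject(request_to_b, trace_id, a_span_id, sampled=True)

# Service B: the extracted span ID becomes the parent of B's own span.
b_trace_id, b_parent_id, b_sampled = extract(request_to_b)
b_span_id = new_span_id()

# Service B calling Service C: same trace_id, B's span ID as the new parent.
request_to_c = {}
inject(request_to_c, b_trace_id, b_span_id, b_sampled)
print(request_to_b["traceparent"])
print(request_to_c["traceparent"])
```

The trace_id stays constant across both hops while the span ID field changes at every boundary, which is what lets the collector reassemble the tree later.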

The tracestate header: Vendor-specific key-value pairs. Allows multiple tracing systems to coexist without losing each other's context.

For other transports (gRPC, message queues), the trace context is propagated in call metadata or message headers. The concept is the same: pass trace_id and parent_span_id through every inter-service boundary.

Span Collection

Spans are collected asynchronously to minimize performance impact:

Service generates span
  → writes to in-memory buffer (batch)
  → periodically flushes to collector agent (local sidecar or daemon)
  → agent forwards to central collector (Jaeger Collector, Zipkin)
  → collector writes to storage (Elasticsearch, Cassandra, ClickHouse)
  → query frontend reads from storage for UI

This asynchronous pipeline means:
- Span creation adds ~1-5 microseconds of latency to the request (just setting fields).
- Span export happens in a background thread.
- If the collector is down, spans are dropped (not the request).
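The in-process half of this pipeline (buffer, background flush, drop on failure) can be sketched as follows. `BatchExporter` and the `send_batch` callable are hypothetical names, and the queue size, batch size, and flush interval are arbitrary defaults rather than values from any particular tracer.

```python
import queue
import threading


class BatchExporter:
    """Buffers finished spans and ships them in batches from a background thread."""

    def __init__(self, send_batch, max_queue=2048, batch_size=512, interval_s=1.0):
        self._send_batch = send_batch  # hypothetical callable: list of spans -> None
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.dropped = 0  # approximate counter; not synchronized across threads
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def on_span_end(self, span) -> None:
        """Called on the request path: never blocks, drops when the buffer is full."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush(self) -> None:
        while True:
            batch = self._drain()
            if not batch:
                return
            try:
                self._send_batch(batch)
            except Exception:
                # Collector unreachable: lose the spans, never the request.
                self.dropped += len(batch)

    def _run(self) -> None:
        # Wake up every interval_s seconds until shutdown is requested.
        while not self._stop.wait(self._interval_s):
            self._flush()
        self._flush()  # final flush on shutdown

    def shutdown(self) -> None:
        self._stop.set()
        self._thread.join()
```

The request thread only ever calls `on_span_end`, which is a non-blocking queue insert; everything that can be slow or fail happens on the exporter thread.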

Sampling

The Cost Problem

At 100,000 requests per second, with an average of 10 spans per request, you generate 1 million spans per second. Each span is ~500 bytes with tags and logs. That is 500 MB/s of trace data, or 43 TB/day.
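The arithmetic is easy to re-run for other traffic levels. The snippet below uses only the figures from the paragraph above; the variable names are illustrative.

```python
requests_per_second = 100_000
spans_per_request = 10
bytes_per_span = 500
seconds_per_day = 86_400

spans_per_second = requests_per_second * spans_per_request
bytes_per_second = spans_per_second * bytes_per_span
tb_per_day = bytes_per_second * seconds_per_day / 1e12

print(spans_per_second)        # 1000000
print(bytes_per_second / 1e6)  # 500.0 (MB/s)
print(round(tb_per_day, 1))    # 43.2 (TB/day)

# The same volume at a 1% sampling rate.
print(round(tb_per_day * 0.01, 3))  # 0.432 (TB/day)
```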

Storing and indexing 43 TB/day of trace data is expensive. And most traces are "normal" -- you do not need them for debugging. You need the interesting traces: errors, slow requests, anomalies.

Head-Based Sampling

Decide whether to sample a trace at the very beginning (the "head"), before any spans are created.

At the entry point (API Gateway):
  if random() < sampling_rate:   # e.g., 0.01 for 1%
    flags = SAMPLED (01)
  else:
    flags = NOT_SAMPLED (00)

  Set traceparent with these flags.
  All downstream services respect the flags.
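In Python, the same decision might look like the sketch below. `head_sample_flags` and `should_record` are hypothetical names, and the random source is injectable only so the behavior can be tested deterministically.

```python
import random

SAMPLED = "01"
NOT_SAMPLED = "00"


def head_sample_flags(sampling_rate: float, rng=random.random) -> str:
    """Entry point only: roll once per trace and encode the result as flags."""
    return SAMPLED if rng() < sampling_rate else NOT_SAMPLED


def should_record(incoming_traceparent: str) -> bool:
    """Downstream services do not roll again; they honor the sampled bit."""
    flags = incoming_traceparent.rsplit("-", 1)[-1]
    return bool(int(flags, 16) & 0x01)


flags = head_sample_flags(0.01)
header = f"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-{flags}"
print(header, should_record(header))
```

Because only the entry point rolls the dice, every service in a trace reaches the same decision, which is what keeps sampled traces complete.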

Pros: Simple. Low overhead. All spans in a sampled trace are collected (complete trace).

Cons: Blind. 1% sampling means you miss 99% of traces. A rare error occurring in 0.01% of requests still has only a 1% chance of being captured per occurrence (the sampling rate). At 100K requests/sec, that is 10 errors/sec, and with 1% sampling you capture about 1 error trace every 10 seconds -- workable at high QPS, but you may miss the first occurrence of a novel bug entirely.

Tail-Based Sampling

Collect ALL spans, buffer them, and decide which traces to keep AFTER the trace is complete.

1. All services send all spans to the collector.
2. Collector buffers spans grouped by trace_id.
3. After a trace is "complete" (no new spans for X seconds):
   - If any span has an error tag: keep the trace.
   - If total duration > threshold: keep the trace.
   - If a specific tag matches (e.g., user_id = VIP): keep the trace.
   - Otherwise: drop the trace (or keep 1%).
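The steps above can be sketched as a buffer plus a decision function. This is a simplified single-process illustration: spans are plain dicts, the `error` and `user_tier` tag names, the 1000 ms threshold, and the 10-second idle timeout are all assumptions, and time is passed in explicitly instead of being read from a clock.

```python
import random


def keep_trace(spans, latency_threshold_ms=1000, baseline_rate=0.01, rng=random.random):
    """Decide whether a completed trace is worth keeping."""
    if any(s["tags"].get("error") for s in spans):
        return True
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if end - start > latency_threshold_ms:
        return True
    if any(s["tags"].get("user_tier") == "vip" for s in spans):  # hypothetical tag
        return True
    return rng() < baseline_rate  # keep a small share of normal traces


class TailSamplingBuffer:
    """Groups spans by trace_id and decides once a trace has gone quiet."""

    def __init__(self, idle_timeout_s=10.0, decide=keep_trace):
        self._idle_timeout_s = idle_timeout_s
        self._decide = decide
        self._traces = {}     # trace_id -> list of spans
        self._last_seen = {}  # trace_id -> arrival time of the latest span

    def add(self, span, now):
        self._traces.setdefault(span["trace_id"], []).append(span)
        self._last_seen[span["trace_id"]] = now

    def flush_idle(self, now):
        """Return the kept traces among those idle for at least the timeout."""
        idle = [t for t, seen in self._last_seen.items()
                if now - seen >= self._idle_timeout_s]
        kept = []
        for trace_id in idle:
            spans = self._traces.pop(trace_id)
            del self._last_seen[trace_id]
            if self._decide(spans):
                kept.append(spans)
        return kept
```

All spans of a trace must reach the same `TailSamplingBuffer` instance for the decision to see the whole trace, which is why a multi-node deployment needs routing by trace_id.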

Pros: Captures ALL interesting traces (errors, slow requests). Sampling decision is informed by the actual trace data.

Cons: Must collect and buffer ALL spans before deciding. Higher network and compute costs. Requires a stateful collector that can group spans by trace_id (which needs consistent routing or shared state).

Practical Sampling Strategies

Strategy	Implementation	Use Case
Fixed-rate head sampling	1% of traces	Default, low-cost
Adaptive sampling	Sample more of rare endpoints	Balanced coverage
Error-driven tail sampling	Keep all traces with errors	Error debugging
Latency-driven tail sampling	Keep traces above p99 latency	Performance debugging
Debug override	Force sampling via header	On-demand debugging

The debug override is worth highlighting. If a developer is debugging an issue, they can add a special header (e.g., X-Debug-Trace: true) that forces 100% sampling for their requests. This enables targeted debugging without increasing the sampling rate globally.
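A debug override fits in front of the normal sampling decision. In this sketch the header name follows the example in the text, `sampling_decision` is a hypothetical function, and headers are a plain dict with case-sensitive keys for brevity.

```python
import random


def sampling_decision(headers: dict, sampling_rate: float = 0.01, rng=random.random) -> bool:
    """Force sampling when the debug header is present, else sample at the normal rate."""
    if headers.get("X-Debug-Trace", "").lower() == "true":
        return True
    return rng() < sampling_rate


print(sampling_decision({"X-Debug-Trace": "true"}))  # True
```

A real deployment would presumably restrict who may set this header, since anyone able to send it can force full sampling of their own traffic.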

Proof/Correctness Intuition

Why Tree Structure Works

A microservice request naturally forms a tree: the initial request is the root, each downstream call is a child, and recursive calls create deeper levels. The parent-child relationship is established by context propagation: the child's parent_span_id points to the parent's span_id.

This tree structure enables:
- Waterfall visualization: shows the timeline of all spans, making it obvious which span is the bottleneck.
- Critical path analysis: the critical path is the chain of spans from root to leaf that determines the total latency. Optimizing a span that is not on the critical path does not reduce total latency.
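One simple way to approximate the critical path is to start at the root and, at each level, follow the child that finishes last. This is a deliberate simplification that assumes the parent waits for its last-finishing child; production tools handle overlapping and asynchronous children more carefully. The function name and dict-based span layout are illustrative.

```python
def critical_path(spans):
    """Return span IDs from the root down, following the last-finishing child."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_span_id"], []).append(span)

    (root,) = children[None]  # expects exactly one root span
    path = [root]
    while children.get(path[-1]["span_id"]):
        kids = children[path[-1]["span_id"]]
        path.append(max(kids, key=lambda s: s["start_ms"] + s["duration_ms"]))
    return [s["span_id"] for s in path]


spans = [
    {"span_id": "s1", "parent_span_id": None, "start_ms": 0, "duration_ms": 250},
    {"span_id": "s2", "parent_span_id": "s1", "start_ms": 5, "duration_ms": 15},
    {"span_id": "s3", "parent_span_id": "s1", "start_ms": 25, "duration_ms": 200},
    {"span_id": "s4", "parent_span_id": "s3", "start_ms": 30, "duration_ms": 180},
]
print(critical_path(spans))  # ['s1', 's3', 's4']
```

For the worked example, the auth span s2 is off the path: speeding it up would save at most its own 15 ms, while the database span s4 accounts for most of the total.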

Why Sampling Is Necessary

The alternative to sampling is 100% collection. For a large organization (Google, Uber, Netflix), this means tens of terabytes of trace data per day. The storage cost alone is significant, but the real problem is the query cost: searching through billions of spans to find one trace is slow.

Sampling reduces the data volume to a manageable level. The statistical argument: with 1% sampling, a pattern that occurs in 100 requests per second is captured about once per second on average (in any given second, the chance of at least one capture is roughly 63%), and more frequent patterns are captured proportionally more often. For most debugging purposes, this is sufficient.
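The expected number of captures and the probability of capturing at least one occurrence are different quantities, and the gap matters for rare events. A short calculation, assuming each request is sampled independently (the function names are illustrative):

```python
def expected_captures(occurrences: int, sampling_rate: float) -> float:
    """Average number of sampled traces among the given occurrences."""
    return occurrences * sampling_rate


def p_at_least_one(occurrences: int, sampling_rate: float) -> float:
    """Probability that at least one of the occurrences is sampled."""
    return 1 - (1 - sampling_rate) ** occurrences


print(expected_captures(100, 0.01))         # 1.0
print(round(p_at_least_one(100, 0.01), 3))  # 0.634
print(round(p_at_least_one(1, 0.01), 3))    # 0.01
```

A single occurrence of a novel bug is captured only 1% of the time, which is the gap that tail-based sampling closes.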

Real-World Usage

System	Based On	Key Feature
Jaeger	Dapper model	CNCF project, Kubernetes-native
Zipkin	Dapper model	Twitter origin, mature ecosystem
AWS X-Ray	Dapper model	AWS-integrated, sampling built-in
Google Cloud Trace	Dapper	Google Cloud, low-overhead
Datadog APM	Dapper + proprietary	Commercial, tail-based sampling
OpenTelemetry	Convergence standard	Vendor-neutral, replaces OpenTracing + OpenCensus

OpenTelemetry is the convergence point. It merges OpenTracing (tracing API) and OpenCensus (tracing + metrics) into a single standard. New projects should use OpenTelemetry. It provides:

Language SDKs for instrumentation (Go, Java, Python, JS, etc.)
OTLP (OpenTelemetry Protocol) for span export
Collector for span processing and routing
Auto-instrumentation for common frameworks

Interview Application

When to mention distributed tracing:

"How would you debug a slow request in a microservices architecture?" -- Distributed tracing with trace ID correlation.
"How do you monitor inter-service latency?" -- Trace spans with timing data, aggregate for p50/p99 per edge.
"How would you implement end-to-end visibility?" -- OpenTelemetry for traces, propagate context via W3C headers.
"How do you handle the cost of tracing?" -- Head-based sampling (1%) for cost control, tail-based for error capture.

What interviewers want to hear:

You understand the trace/span/parent model.
You know context propagation via headers (W3C traceparent).
You can discuss sampling trade-offs (head vs tail, cost vs coverage).
You know OpenTelemetry is the current standard.
You understand that tracing is asynchronous and does not add significant latency to requests.

Trade-offs

Distributed Tracing

Advantage	Disadvantage
Pinpoints latency bottlenecks	Sampling misses rare events
Shows inter-service dependencies	Storage and query costs at scale
Enables root cause analysis	Requires instrumentation in every service
Context propagation is standardized	Async collection means traces may be incomplete
Can be combined with metrics + logs	Tail-based sampling requires stateful collector

The Three Pillars: Traces, Metrics, Logs

Distributed tracing is one of three observability signals:

Traces: Request-level, shows the path through services. "What happened to this specific request?"
Metrics: Aggregated, shows system-level health. "What is the p99 latency of the auth service?"
Logs: Event-level, shows detailed context. "What was the exact error message?"

The strongest setup correlates all three: a trace ID in log messages links logs to traces. Span metadata feeds into metrics dashboards. OpenTelemetry provides all three signals through a unified SDK.

Common Mistakes

Distributed tracing replaces logging

Traces show structure (which service called which, how long). Logs show content (error messages, request bodies, stack traces). You need both. The best practice is to include the trace ID in log messages so you can jump from a trace to the relevant logs.
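With Python's standard logging module, one way to stamp every log line with the current trace ID is a logging filter backed by a context variable. The names `current_trace_id` and `TraceIdFilter`, the logger name, and the log format are assumptions for illustration; an instrumentation library would normally set the context variable when it extracts the incoming trace context.

```python
import contextvars
import logging

# Holds the trace ID of the request being handled; "-" when there is none.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never suppress the record


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("recommendations")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Set once per request, for example right after extracting traceparent.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("slow query detected")
```

Searching the log store for the trace ID then returns every line emitted while handling that request, across any service that follows the same convention.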

100% sampling is necessary to catch all errors

100% sampling catches all errors but generates massive data volumes. Tail-based sampling catches all errors with much lower storage cost. If tail-based is too complex, head-based sampling at 1% combined with error-rate alerting from metrics covers most needs.

Tracing adds significant latency to requests

Span creation is a few microseconds (set a struct's fields). Span export happens asynchronously in a background thread. The W3C traceparent header value is 55 characters, so it adds roughly 70 bytes to each HTTP request including the header name. Total overhead is typically < 0.1% of request latency.

Just use request IDs instead of tracing

Request IDs (a single ID passed through all services) enable log correlation but not latency analysis. Without spans with start/end times and parent-child relationships, you cannot build a waterfall view or identify the critical path. Tracing is a superset of request ID correlation.

Tracing works automatically in all architectures

Tracing requires context propagation through every inter-service boundary. If a service does not propagate the traceparent header (because it uses a custom HTTP client, or communicates via a message queue without propagation support), the trace is broken into disconnected fragments.

Sampling rate should be the same for all services

Critical-path services (auth, payment) should have higher sampling rates than background services (analytics, batch processing). Adaptive sampling adjusts the rate per endpoint based on traffic volume and error rate.
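One simple form of adaptive sampling targets a fixed number of sampled traces per second for each endpoint, so low-traffic endpoints are sampled at a higher rate than busy ones. The sketch below considers traffic volume only (not error rate), and the function name and the target of one trace per second are assumptions.

```python
def adaptive_rate(observed_rps: float, target_traces_per_s: float = 1.0) -> float:
    """Sampling rate that yields about target_traces_per_s for this endpoint."""
    if observed_rps <= 0:
        return 1.0  # no recent traffic: sample whatever arrives
    return min(1.0, target_traces_per_s / observed_rps)


# Hypothetical per-endpoint request rates.
for endpoint, rps in [("/checkout", 50.0), ("/home", 100_000.0), ("/admin/export", 0.2)]:
    print(endpoint, adaptive_rate(rps))
```

Under this scheme a busy endpoint such as the home page is sampled very sparsely, while an endpoint that sees only a few requests per minute is traced every time.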