Distributed Tracing for Test Debugging


A test fails with "timeout on POST /orders" after 15 seconds. The Order Service logs show it called the User Service. The User Service logs show it called the Auth Service. The Auth Service logs show nothing unusual. Without distributed tracing, you will spend an hour grepping through three log files trying to correlate timestamps, praying that all three services have their clocks synchronised. With it, you open one screen and see every service, every span, every millisecond — laid out as a waterfall from the moment the request entered the system to the moment it timed out. Distributed tracing is not just an operations tool; it is one of the most powerful debugging instruments available to a QA engineer working in a microservices architecture.

What distributed tracing is

A distributed trace represents the complete journey of a single request as it flows through multiple services. Three core concepts underpin everything:

  • Trace: the entire request journey from entry point to final response, identified by a unique trace ID. In the W3C format this is 32 hexadecimal characters, such as d4cda95b4b724c87b5f7a1e9c2f3d0e1. Every span produced anywhere in the system for this request shares this ID.
  • Span: one unit of work within the trace, such as a service handling part of the request, a database query, or an outbound HTTP call. Each span has a start time, a duration, a status (OK or ERROR, or unset when nothing was recorded), and optional attributes like HTTP method or SQL query text.
  • Trace context: a set of headers passed from service to service so all spans can be linked to the same trace. The W3C standard uses the traceparent header; Zipkin's older B3 format uses headers such as X-B3-TraceId. Modern frameworks read and write these headers automatically when instrumented with OpenTelemetry.

How trace context propagates

Every time a service makes a downstream HTTP call, it includes the trace context in the request headers:

# First service generates trace context and passes it downstream:
POST /orders
traceparent: 00-d4cda95b4b724c87b5f7a1e9c2f3d0e1-a2fb4a1d31d9135a-01

# Format: version-traceId-parentSpanId-flags
# Every downstream service reads this header and creates a child span

GET /users/42 (called by Order Service)
traceparent: 00-d4cda95b4b724c87b5f7a1e9c2f3d0e1-b3fc8d2e45a87f9c-01
                              ↑ same trace ID                  ↑ new parent span

The trace ID never changes. The parent span ID changes with each hop, which lets the tracing backend (Jaeger, in this lesson's setup) reconstruct the parent-child relationships behind the waterfall view. If a service doesn't forward these headers, or a proxy strips them, the trace breaks into disconnected fragments.
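The hop-by-hop behaviour described above can be sketched in a few lines of Java (17 or later). TraceParent is a hypothetical helper written for this lesson, not part of OpenTelemetry or any other library, and it checks only field lengths rather than applying the full W3C validation rules.

```java
import java.security.SecureRandom;
import java.util.HexFormat;

/** Hypothetical helper that models a W3C traceparent header value. */
record TraceParent(String version, String traceId, String parentSpanId, String flags) {

    private static final SecureRandom RANDOM = new SecureRandom();

    /** Parses "version-traceId-parentSpanId-flags" and checks the field lengths. */
    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4
                || parts[0].length() != 2
                || parts[1].length() != 32
                || parts[2].length() != 16
                || parts[3].length() != 2) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        return new TraceParent(parts[0], parts[1], parts[2], parts[3]);
    }

    /** Keeps the trace ID and swaps in the caller's own span ID for the next hop. */
    TraceParent withParentSpan(String newSpanId) {
        return new TraceParent(version, traceId, newSpanId, flags);
    }

    /** Generates a random span ID of 16 hexadecimal characters. */
    static String newSpanId() {
        byte[] bytes = new byte[8];
        RANDOM.nextBytes(bytes);
        return HexFormat.of().formatHex(bytes);
    }

    /** Renders the value back into header form. */
    String toHeader() {
        return String.join("-", version, traceId, parentSpanId, flags);
    }
}
```

Parsing the header from the POST /orders example and calling withParentSpan("b3fc8d2e45a87f9c") yields the header shown on the GET /users/42 call: same trace ID, new parent span. In an instrumented service you would not write this yourself; the sketch only makes the mechanism visible.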

OpenTelemetry — the instrumentation standard

OpenTelemetry (OTel) is the CNCF standard for producing traces, metrics, and logs in a vendor-neutral way. For Spring Boot services, a single Java agent instruments HTTP calls, database queries, and messaging in the libraries it supports, without any code changes:

java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=order-service \
     -Dotel.exporter.otlp.protocol=grpc \
     -Dotel.exporter.otlp.endpoint=http://jaeger:4317 \
     -jar order-service.jar

The agent intercepts HTTP client calls, JDBC queries, and Kafka producer/consumer calls at the bytecode level. Spans are created and exported to your tracing backend automatically. Once instrumented, traces flow to Jaeger, Zipkin, or Datadog without any application code changes — you flip a JVM flag and the traces appear.

Setting up Jaeger in test environments

Add Jaeger's all-in-one image to your docker-compose.test.yml:

jaeger:
  image: jaegertracing/all-in-one:1.52
  ports:
    - "16686:16686"   # Jaeger UI
    - "4317:4317"     # OTLP gRPC receiver
    - "4318:4318"     # OTLP HTTP receiver
  environment:
    COLLECTOR_OTLP_ENABLED: "true"

Configure your services to export to http://jaeger:4317 and query the assembled traces via the Jaeger UI at http://localhost:16686. The all-in-one image bundles the collector, query service, and UI in a single container — no production-grade deployment needed for test environments.

Using traces to debug test failures

Three concrete QA use cases demonstrate the practical value of having traces in your test environment:

1. Finding which service is slow: a test asserts POST /orders responds in under 500ms but consistently takes 1.8 seconds. You open the trace in Jaeger and find the Auth Service span is 1.4 seconds. The bug is in Auth, not Order Service. Without the trace you would have started debugging the service that owns the endpoint — the wrong place entirely.

2. Verifying service interaction chains: a test for a complex checkout flow should touch exactly four services. You query the trace and count spans — only three services appear. Something upstream short-circuited the flow, either returning a cached response or encountering a silent error before the fourth service was called.

3. Diagnosing flaky tests: a test intermittently fails with a timeout. You pull traces from the failed runs only, filtering in Jaeger by error=true. In every failure, the Payment Service span shows a retry annotation — a downstream dependency is occasionally slow and the retry pushes the total call time over the test timeout. The test isn't flaky; the dependency is.
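The first two use cases can also be checked programmatically once the spans are in hand. The sketch below is illustrative only: SpanSummary and TraceAnalysis are hypothetical names, and the record holds just the fields needed here rather than Jaeger's actual span model. It ranks spans by self time (a span's own duration minus that of its direct children), because the root span is always the longest span in a trace and so says little on its own. That calculation assumes the children run one after another; parallel children would need a more careful treatment.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical, simplified span: only the fields this sketch needs. */
record SpanSummary(String spanId, String parentSpanId, String serviceName, long durationMillis) {}

final class TraceAnalysis {

    private TraceAnalysis() {}

    /** Use case 1: the span with the largest self time (duration minus direct children). */
    static Optional<SpanSummary> largestSelfTime(List<SpanSummary> spans) {
        // Total duration of the direct children of each span, keyed by parent span ID.
        Map<String, Long> childTime = spans.stream()
            .filter(s -> s.parentSpanId() != null)
            .collect(Collectors.groupingBy(
                SpanSummary::parentSpanId,
                Collectors.summingLong(SpanSummary::durationMillis)));

        return spans.stream().max(Comparator.comparingLong(
            (SpanSummary s) -> s.durationMillis() - childTime.getOrDefault(s.spanId(), 0L)));
    }

    /** Use case 2: expected services that produced no span in the trace. */
    static Set<String> missingServices(List<SpanSummary> spans, Set<String> expected) {
        Set<String> missing = new TreeSet<>(expected);
        spans.forEach(s -> missing.remove(s.serviceName()));
        return missing;
    }
}
```

Fed a trace where Order Service calls User Service, which calls a slow Auth Service, largestSelfTime points at the Auth span even though the Order Service root span has the longest total duration.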

Walkthrough, step 1 of 5 (request enters the system): the API Gateway generates a trace ID and a span ID, injects the traceparent header into the downstream request, and records its own span.

Asserting on traces in tests

You can go further than visual inspection and write test assertions against trace data. This pattern is most useful for verifying that a complex request flow invokes exactly the right set of services:

@Test
void shouldInvokeAllRequiredServicesForOrderPlacement() {
    // A UUID without dashes is 32 hex characters, the length of a W3C trace ID
    String traceId = UUID.randomUUID().toString().replace("-", "");

    given()
        .header("traceparent", "00-" + traceId + "-0000000000000001-01")
        .contentType(ContentType.JSON)
        .body(orderRequest)
        .post("/orders");

    // Span export is asynchronous, so poll until every expected service has reported.
    // Waiting on a raw span count could succeed on four spans from a single service.
    await().atMost(10, SECONDS).untilAsserted(() -> {
        List<String> serviceNames = jaegerClient.getTrace(traceId)
            .getSpans().stream()
            .map(Span::getProcess)
            .map(Process::getServiceName)
            .collect(Collectors.toList());

        assertThat(serviceNames).contains(
            "order-service", "user-service", "inventory-service", "payment-service"
        );
    });
}

The test injects a known trace ID via the traceparent header, then queries Jaeger by that ID after the request completes. The await polling is essential — span export is asynchronous, and asserting immediately after the HTTP response returns will find an incomplete trace. Assert on the presence of specific service names rather than an exact span count, since the number of spans can vary with caching and retry behaviour.
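The jaegerClient in the test is a helper you supply yourself. As one possible starting point, here is a minimal sketch built on the JDK's own HTTP client and Jaeger's /api/traces/{traceId} endpoint. JaegerTraceReader is a hypothetical name, and the sketch assumes the response JSON lists each service under a "serviceName" field inside the trace's processes section. The regular expression is a shortcut to keep the example dependency-free; a real suite would more likely parse the response with a JSON library.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical minimal reader for Jaeger's query API; not a published library. */
final class JaegerTraceReader {

    // Assumption: every service appears as "serviceName": "<name>" in the response.
    private static final Pattern SERVICE_NAME =
        Pattern.compile("\"serviceName\"\\s*:\\s*\"([^\"]+)\"");

    private final HttpClient http = HttpClient.newHttpClient();
    private final String baseUrl;

    /** baseUrl is the Jaeger query address, e.g. http://localhost:16686 */
    JaegerTraceReader(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Fetches the raw JSON for one trace, failing on any non-200 response. */
    String fetchTraceJson(String traceId) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest
            .newBuilder(URI.create(baseUrl + "/api/traces/" + traceId))
            .GET()
            .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Trace " + traceId + " not available: HTTP " + response.statusCode());
        }
        return response.body();
    }

    /** Collects the distinct service names mentioned in a trace response. */
    static Set<String> serviceNames(String traceJson) {
        Set<String> names = new TreeSet<>();
        Matcher matcher = SERVICE_NAME.matcher(traceJson);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }
}
```

Because fetchTraceJson throws until the trace is queryable, a polling loop around it needs to tolerate that exception, for example with Awaitility's ignoreExceptions().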

⚠️ Common mistakes

  • Not propagating trace context through message queues. When a request triggers a Kafka event, the trace context must be serialised into the message headers and deserialised by the consumer. Without this, asynchronous flows appear as disconnected traces rather than a single end-to-end view.
  • Running the test suite without a tracing backend in CI. If Jaeger isn't in the docker-compose.test.yml, traces are silently dropped and no span data is available when tests fail. Even if you don't assert on traces, having them available in CI dramatically cuts debugging time.
  • Asserting on exact span counts when service logic can vary. A service may create a different number of spans depending on caching, retry behaviour, or feature flags. Assert on the presence of specific service names rather than an exact total count.
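To make the first mistake concrete, here is a rough sketch of carrying trace context across a message queue. MessageTraceContext is a hypothetical helper, and a plain Map stands in for the message's headers so the example runs without a Kafka dependency. When the OpenTelemetry Java agent instruments both producer and consumer, its Kafka instrumentation typically does this for you; the manual version matters for custom or uninstrumented clients.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper; the Map stands in for a message's header collection. */
final class MessageTraceContext {

    static final String HEADER = "traceparent";

    private MessageTraceContext() {}

    /** Producer side: serialise the current traceparent value into the message headers. */
    static void inject(Map<String, byte[]> headers, String traceparent) {
        headers.put(HEADER, traceparent.getBytes(StandardCharsets.UTF_8));
    }

    /** Consumer side: read the value back, or empty if the producer never set it. */
    static Optional<String> extract(Map<String, byte[]> headers) {
        return Optional.ofNullable(headers.get(HEADER))
            .map(bytes -> new String(bytes, StandardCharsets.UTF_8));
    }
}
```

An empty result from extract is the signal that the producer dropped the context, which is exactly the case that shows up in Jaeger as two disconnected traces.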

🎯 Practice task

  1. Add Jaeger (all-in-one image) to your docker-compose.test.yml. Configure one of your services to export traces to it using the OpenTelemetry Java agent. Start the stack and open the Jaeger UI at http://localhost:16686.
  2. Make a request to your service through a test or curl. Find the trace in Jaeger. Identify the root span (API Gateway or the first service). Count how many child spans exist.
  3. Deliberately slow down a downstream dependency (use WireMock's withFixedDelay). Make the same request again. Find the new trace. Which span is longest? Is it the span you expected?
  4. Write a test that injects a custom trace ID via the traceparent header. After the request completes, query Jaeger for that trace ID using its HTTP API (/api/traces/{traceId}). Assert that spans from at least two services are present.
  5. Look at a recent intermittently-failing test in your suite. Retrieve the trace from a failed run (if available) and from a passing run. Compare the two traces span by span. What differs? Use that observation to form a hypothesis about the root cause.

With tracing and chaos engineering in your toolkit, the next step is putting both to work in a production-like environment — the following lesson covers how to compose a full observability stack in your CI pipeline so every test run produces the signals you need to debug failures fast.
