Testing Challenges — Distributed State, Network Failures, Versioning

It's your first week on a new team. You inherit a microservices e-commerce test suite — 300 tests, CI is green, all passing. By end of week you've seen three production incidents that the test suite never caught: an order that charged the customer but created no shipment record, a product listed as in-stock that had already sold out, and a payment failure that returned a success response to the frontend. The suite is green because it tests each service in isolation. Production fails because the services run together. This is the central problem of microservices testing: the unit of deployment is the service, but the unit of failure is the system.

Challenge 1: Distributed state

Placing an order in a typical e-commerce system touches four services in sequence — User Service validates the customer, Product Service checks inventory, Order Service creates the record, and Payment Service charges the card. Each service has its own database. None of them share a data store with the others.

This creates an assertion problem. When your test calls POST /orders and gets back a 201, what does "order placed successfully" actually mean? It means the Order Service created a row. It does not mean the Product Service decremented inventory. It does not mean the Payment Service received the charge request. It does not mean the User Service updated the customer's order history.

A test that only asserts on the Order Service response is testing one quarter of the operation. The other three quarters can be broken invisibly.

Pattern: Assert across all affected services using their own APIs — never by querying another service's database directly. After POST /orders succeeds, call GET /users/{id}/orders, GET /products/{id}/stock, and GET /payments/{orderId} to confirm all four services reached the expected state. Because microservices don't update atomically, build polling assertions that retry until consistency is confirmed or a timeout expires.
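One way to sketch the polling assertion described above, using only the JDK. The class name `CrossServiceAssert` and its methods are hypothetical. Each named check is assumed to wrap one follow-up API call, such as GET /products/{id}/stock, and return true once that service shows the expected state.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls named per-service checks until all pass or time runs out. */
final class CrossServiceAssert {

    /** Returns the names of the checks that currently fail, in the map's iteration order. */
    static List<String> failing(Map<String, BooleanSupplier> checks) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> check : checks.entrySet()) {
            if (!check.getValue().getAsBoolean()) {
                names.add(check.getKey());
            }
        }
        return names;
    }

    /** Throws AssertionError naming the lagging services if they do not converge in time. */
    static void assertEventuallyConsistent(Map<String, BooleanSupplier> checks,
                                           Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        List<String> pending = failing(checks);
        while (!pending.isEmpty()) {
            if (System.nanoTime() >= deadline) {
                throw new AssertionError("Not consistent after " + timeout.toMillis()
                        + " ms; still failing: " + pending);
            }
            try {
                // Short pause between polls; the loop exits as soon as every check passes.
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for: " + pending);
            }
            pending = failing(checks);
        }
    }
}
```

A test would register one entry per service ("users", "products", "payments") in a LinkedHashMap and call `assertEventuallyConsistent` right after POST /orders returns. When the assertion fails, the message names the services that never reached the expected state, so it points at the broken part of the operation.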

Challenge 2: Network failures and partial responses

Services communicate over real networks. 503s, timeouts, connection drops, and slow responses are not edge cases — they are normal operating conditions in a distributed system. Every service that calls another service must handle them correctly: retry transient failures, apply circuit breakers when a downstream is persistently down, and return meaningful fallbacks to the caller.

The problem: unit tests mock all network calls. A mocked HTTP client never times out, never drops a connection mid-response, and never returns a 503 unless a test explicitly scripts it. Your circuit breaker, retry logic, and fallback code paths may never have executed in a test at all, even though the suite is green.

Pattern: Use WireMock to inject specific failure conditions in component and integration tests. Simulate repeated 503s from the Payment Service and assert the Order Service retries, opens its circuit breaker once the failures persist, and returns a useful error to the caller. Simulate a 5-second response delay and verify the timeout fires and the retry sequence executes. Simulate a connection drop mid-response and confirm the caller doesn't hang indefinitely. These paths need explicit test coverage because happy-path mocks will never reveal they're broken.
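The behavior such a test exercises can be sketched without any network. In the sketch below, a hypothetical in-memory `ScriptedStub` stands in for the WireMock-stubbed Payment Service and replays scripted status codes. `PaymentClient` is an assumed, deliberately simple caller with bounded retries and a consecutive-failure circuit breaker. A real breaker would also need a half-open state, which is left out here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.IntSupplier;

/** Hypothetical stand-in for a stubbed Payment Service: replays scripted status codes. */
final class ScriptedStub implements IntSupplier {
    private final Deque<Integer> script;
    int calls = 0;

    ScriptedStub(List<Integer> statuses) {
        if (statuses.isEmpty()) {
            throw new IllegalArgumentException("script needs at least one status");
        }
        this.script = new ArrayDeque<>(statuses);
    }

    @Override
    public int getAsInt() {
        calls++;
        // Once the script is down to its last entry, keep returning that status.
        return script.size() > 1 ? script.removeFirst() : script.peekFirst();
    }
}

/** Hypothetical caller: retries within one request, opens the circuit after repeated failures. */
final class PaymentClient {
    private final IntSupplier paymentService;
    private final int maxAttempts;
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    PaymentClient(IntSupplier paymentService, int maxAttempts, int failureThreshold) {
        this.paymentService = paymentService;
        this.maxAttempts = maxAttempts;
        this.failureThreshold = failureThreshold;
    }

    boolean circuitOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    String charge() {
        if (circuitOpen()) {
            return "PAYMENT_UNAVAILABLE"; // fail fast without calling downstream
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (paymentService.getAsInt() == 200) {
                consecutiveFailures = 0;
                return "PAID";
            }
            // Any other status (for example 503) is treated as transient and retried.
        }
        consecutiveFailures++;
        return "PAYMENT_UNAVAILABLE";
    }
}
```

The assertion that matters is on the stub's call count as well as the return value. After the circuit opens, a further `charge()` must return the fallback without the stub seeing another request.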

Challenge 3: Independent deployment and versioning

Team A deploys Order Service v2 on Tuesday. They renamed a JSON field — payment_id became paymentReference — because their internal style guide changed. Team B's Payment Service still sends payment_id. Both services have green test suites in CI. Neither team knew the other's API contract.

On Wednesday morning, every order that requires payment processing fails silently. The field name mismatch means the Order Service receives a null reference it doesn't validate, and the error appears nowhere in either service's own logs.

This is not a careless mistake — it is a structural property of independent deployment. There is no single deployment gate where both services are tested together before either goes live.

Pattern: Consumer-driven contract tests with Pact. The Payment Service (consumer) publishes a contract that says "when I call Order Service, I send a body with payment_id and expect a response with these specific fields." The Order Service (provider) verifies that contract in its own CI pipeline before any deployment. If Team A's rename breaks the contract, the Order Service CI fails before the deployment reaches staging. The contract is generated from the consumer's tests and becomes a gate on the provider's deploy.
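To make the mechanism concrete, here is a hand-rolled illustration of what provider verification checks. This is not Pact's API. The recorded request is replayed against the provider and the response is compared with what the consumer expects. The handlers `orderServiceV1` and `orderServiceV2`, and the `orderStatus` response field, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical, simplified contract check: replay the consumer's request, compare responses. */
final class ContractCheck {

    /** Returns one message per expected response field the provider did not satisfy. */
    static List<String> verify(Map<String, String> request,
                               Map<String, String> expectedResponse,
                               Function<Map<String, String>, Map<String, String>> provider) {
        Map<String, String> actual = provider.apply(request);
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, String> expected : expectedResponse.entrySet()) {
            String got = actual.get(expected.getKey());
            if (!expected.getValue().equals(got)) {
                mismatches.add(expected.getKey() + ": expected " + expected.getValue()
                        + " but was " + got);
            }
        }
        return mismatches;
    }

    /** Assumed provider before the rename: reads payment_id from the body. */
    static Map<String, String> orderServiceV1(Map<String, String> body) {
        return Map.of("orderStatus", body.containsKey("payment_id") ? "PAID" : "PAYMENT_MISSING");
    }

    /** Assumed provider after the rename: reads paymentReference instead. */
    static Map<String, String> orderServiceV2(Map<String, String> body) {
        return Map.of("orderStatus",
                body.containsKey("paymentReference") ? "PAID" : "PAYMENT_MISSING");
    }
}
```

With a recorded request of `{payment_id: "pay-1"}` and an expected `orderStatus` of `PAID`, verification against `orderServiceV1` returns no mismatches. Against `orderServiceV2` it reports the status mismatch, and that is the failure which would stop the provider's pipeline.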

Challenge 4: Test data across multiple services

A test that verifies checkout needs three pieces of state before it can run: a valid user account, a product in the Product Service catalogue, and sufficient inventory. These live in three separate services with three separate databases.

The naive solution is to insert rows directly into each service's database — create the user row in the users table, the product row in the products table, and the inventory row in the inventory table. This works until the schema changes, the database is migrated, or the service adds validation logic that runs only through the API. Direct DB inserts skip every business rule the service enforces, which means your test data can represent states that are impossible in production.

Pattern: Test data builders that create prerequisite state by calling each service's own API. A builder calls POST /users to create the test user, POST /products to create the test product, and POST /inventory to set the stock level. The test then runs against data that was created through the same code paths production uses. Teardown calls the corresponding delete endpoints or uses a dedicated test-data cleanup endpoint if the service exposes one.
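A minimal sketch of such a builder follows. The `ServiceApi` interface is a hypothetical facade over an HTTP client, and `RecordingApi` is an in-memory fake so the sketch runs on its own. The POST paths come from the text above. The request bodies and the DELETE paths (`/users/{id}`, `/products/{id}`, `/inventory/{productId}`) are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Hypothetical facade; a real implementation would issue HTTP requests. */
interface ServiceApi {
    String post(String path, Map<String, Object> body); // returns the created resource id
    void delete(String path);
}

/** In-memory fake that records calls, so the builder can be exercised without services. */
final class RecordingApi implements ServiceApi {
    final List<String> calls = new ArrayList<>();
    private int nextId = 1;

    @Override
    public String post(String path, Map<String, Object> body) {
        calls.add("POST " + path);
        return "id-" + nextId++;
    }

    @Override
    public void delete(String path) {
        calls.add("DELETE " + path);
    }
}

/** Creates checkout prerequisites through service APIs and removes them on close. */
final class CheckoutFixture implements AutoCloseable {
    private final ServiceApi api;
    private final Deque<String> cleanupPaths = new ArrayDeque<>();
    String userId;
    String productId;

    CheckoutFixture(ServiceApi api) {
        this.api = api;
    }

    CheckoutFixture withUser(String email) {
        userId = api.post("/users", Map.of("email", email));
        cleanupPaths.push("/users/" + userId);
        return this;
    }

    CheckoutFixture withProduct(String name, int stock) {
        productId = api.post("/products", Map.of("name", name));
        cleanupPaths.push("/products/" + productId);
        api.post("/inventory", Map.of("productId", productId, "quantity", stock));
        cleanupPaths.push("/inventory/" + productId); // assumed delete path
        return this;
    }

    @Override
    public void close() {
        // Delete in reverse creation order so dependents go before what they depend on.
        while (!cleanupPaths.isEmpty()) {
            api.delete(cleanupPaths.pop());
        }
    }
}
```

Used in a try-with-resources block, for example `try (CheckoutFixture f = new CheckoutFixture(api).withUser("buyer@example.test").withProduct("mug", 5)) { ... }`, the fixture tears down its data even when the test body throws.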

Challenge 5: Asynchronous communication

A user taps "place order." The Order Service publishes an OrderCreated event to Kafka. The Payment Service is a Kafka consumer: it reads that event, processes the payment, and publishes a PaymentProcessed event. The Order Service listens for that second event and updates the order status to confirmed.

Your test calls POST /orders, gets a 202 Accepted, and immediately calls GET /orders/{id} to check the status. The status is PENDING because the Payment Service consumer hasn't processed the event yet. The test fails. You add a Thread.sleep(2000). The test sometimes passes in CI. On a slow Friday afternoon it fails again because the consumer was more than two seconds behind. You increase the sleep to five seconds. Now your 10-test async suite spends 50 seconds purely waiting.

Pattern: Polling assertions with explicit timeouts using Awaitility or an equivalent. Define the condition you're waiting for — order status == CONFIRMED — and the maximum time to wait. The library polls until the condition is true or the timeout expires. The test is fast when the system is fast and fails clearly when a genuine bug prevents the condition from being met. Never use Thread.sleep for asynchronous assertions.
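To show what such a library does under the hood, here is a hand-rolled equivalent using only the JDK. The `Eventually` class and its `until` method are hypothetical names, not Awaitility's API. In a real suite Awaitility's own `await().atMost(...).until(...)` would replace it.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a polling-assertion library. */
final class Eventually {

    /**
     * Polls the condition until it holds, returning immediately when it does.
     * Throws AssertionError naming the condition if the timeout expires first.
     */
    static void until(String description, Duration timeout, Duration pollInterval,
                      BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                throw new AssertionError("Timed out after " + timeout.toMillis()
                        + " ms waiting for: " + description);
            }
            try {
                // A short pause between polls, not a fixed wait: the loop ends
                // as soon as the condition becomes true.
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for: " + description);
            }
        }
    }
}
```

The timeout is an upper bound, not a cost. If the consumer confirms the order after 150 ms, the test moves on after roughly that long, even with a 5-second limit. If the order never confirms, the failure message names the condition that was never met.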

Challenge 6: Test environment cost and complexity

Running 15 services together is expensive. A complete environment with all dependencies means 15 service containers, 4-6 databases, a message broker, a cache layer, and an API gateway. Multiply that by the environments your team needs — developer local, per-PR integration, shared staging, performance — and the infrastructure bill and maintenance overhead become a real constraint on how much testing you can afford to run.

Pattern: Match environment complexity to the test layer that needs it. Developers run Docker Compose locally with a subset of services relevant to their current work. Per-PR ephemeral environments spin up automatically when a pull request opens and tear down when it closes — these run integration tests against a fresh, isolated stack. A shared staging environment exists for exploratory testing and manual QA. A separate performance environment handles load testing with production-like data volumes. E2E tests run only against the shared staging environment — that is the only layer that actually needs all 15 services running simultaneously.

⚠️ Common mistakes

  • Asserting only on the service that received the request. A green assertion on the Order Service response proves the Order Service is working. It says nothing about whether the downstream services that were triggered by that request did their jobs. Assertions in microservices tests must follow the data, not just the response.
  • Using Thread.sleep for async assertions. Fixed-duration sleeps make tests slow on fast hardware and brittle on slow hardware. They encode a specific wait time that becomes wrong whenever the system's performance characteristics change. Polling with a timeout is always the correct tool.
  • Querying service databases directly in tests. It feels faster and simpler than calling the API, but it couples your tests to the service's internal schema, which is a private implementation detail. Schema changes and ORM migrations will break your tests without changing any public interface. Always assert through the service's own API.
  • Sharing test data between tests. In a multi-service environment, shared test data creates invisible coupling. One test modifies the user record that another test depends on, and failures appear in tests that have no obvious relationship to the change. Each test should create its own isolated state and clean up after itself.

🎯 Practice task

Audit an existing integration test for the six challenges — 45 minutes.

  1. Pick one integration test from an existing microservices project or a public sample repo. If you don't have one, set up the Spring PetClinic microservices project, which has multiple services.
  2. Check distributed state. List every service that the test's scenario touches. For each service, check whether the test makes an assertion against that service's API. Note which services are asserted and which are assumed.
  3. Check network failure coverage. Identify every inter-service HTTP call in the happy path. For each one, ask: is there a test that injects a 503 or a timeout on this call? Is the fallback path tested?
  4. Check test data creation. Find where the test creates its prerequisite data. Is it calling APIs or inserting directly into a database? If the latter, identify which service owns each table being written to.
  5. Check async assertions. Find any Thread.sleep calls. Replace one with an Awaitility await().atMost(10, SECONDS).until(() -> ...) call. Verify the test still passes.
  6. Write a gap list. Document the specific scenarios not covered — missing service assertions, untested failure paths, direct DB inserts. This gap list is your starting point for improving the suite's real-world coverage.

Next lesson: how to design a test strategy that decides which tests to write at which layer across a microservices system.
