
Walk through how you'd test eventual consistency in a distributed system.

API testing · Senior · api, consistency, distributed, polling, senior

Short answer

Tests must wait for convergence, not assume it. Don't assert immediately after a write; poll for the expected state with a timeout derived from the SLA. Test the read-after-write SLA explicitly. Add chaos: simulate replication lag and network partitions, then verify the system still converges within bounds.

Detail

In an eventually consistent system, a write to one node isn't guaranteed to be visible immediately on every read replica. Tests that assume immediate visibility flake under load and break against real production behaviour.

The fundamental shift: replace "assert immediately after write" with "wait for convergence."

// ❌ Wrong — assumes synchronous propagation
test('user appears in list after create', async () => {
  const user = await createUser({ email: 'a@x.com' });
  const list = await listUsers();
  expect(list).toContainEqual(user);     // flake on replication lag
});

// ✅ Right — wait for convergence
test('user appears in list after create', async () => {
  const user = await createUser({ email: 'a@x.com' });
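  // waitFor here is assumed to be a polling helper that retries until the callback
  // returns true (Testing Library's waitFor instead retries until the callback stops throwing).
  await waitFor(async () => {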
    const list = await listUsers();
    return list.some((u) => u.id === user.id);
  }, { timeout: 5000 });
});
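
The waitFor used above isn't defined in the snippet. A minimal sketch of what such a helper could look like, assuming a boolean predicate, a total budget in milliseconds, and a fixed poll interval (the option names and the 50 ms default are illustrative, not taken from any particular library):

```typescript
// Hypothetical polling helper; option names and defaults are assumptions.
interface WaitForOptions {
  timeout: number;   // total budget in milliseconds
  interval?: number; // delay between polls in milliseconds
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls `predicate` until it returns true and resolves with the elapsed time.
// Rejects once `timeout` ms have passed without the predicate holding.
// An exception thrown by the predicate is not retried; it propagates to the caller.
async function waitFor(
  predicate: () => Promise<boolean> | boolean,
  { timeout, interval = 50 }: WaitForOptions,
): Promise<number> {
  const start = Date.now();
  for (;;) {
    if (await predicate()) return Date.now() - start;
    const elapsed = Date.now() - start;
    if (elapsed >= timeout) {
      throw new Error(`Not converged after ${elapsed} ms (budget ${timeout} ms)`);
    }
    // Never sleep past the deadline, so the last poll lands close to the budget.
    await sleep(Math.min(interval, timeout - elapsed));
  }
}
```

Returning the elapsed time lets a test log or assert how long convergence actually took, instead of only knowing that it happened before the deadline.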

The SLA conversation: every eventually consistent system has a real (or claimed) convergence window — milliseconds for in-region replicas, seconds for cross-region, minutes for some search index pipelines. Tests should encode the SLA:

// Read-your-writes SLA: 1 second
await waitFor(predicate, { timeout: 1000 });

A failing test now means "the SLA was violated," not "tests are flaky."

What to test:

1. Read-your-writes. After a write, a read from the same client should see it (often guaranteed by routing to the primary). Verify the SLA holds.

2. Cross-replica visibility. Write to one region; read from another — measure how long convergence takes. Bonus: assert it's within target.

3. Index lag. After creating a record, query a secondary index (search, materialised view). May take longer; SLA is wider.

4. Convergence under load. Send 1000 writes; assert that within N seconds, all reads return the full set.

5. Conflict resolution (CRDTs, last-write-wins). Concurrent writes from two regions: which wins? Document and test the resolution policy.

6. Failure cases. Replication lag spikes, replica down — does the system surface stale data with a marker, or refuse the read, or retry?
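
Items 2 and 4 can be sketched together: write a batch, poll a replica until the full set is readable, and report how long it took. The Primary and Replica classes below are in-memory stand-ins invented for this sketch, and the 40 ms lag and 1 second budget are arbitrary; a real test would call the actual write and read endpoints instead.

```typescript
// Hypothetical in-memory stand-ins; they exist only to make the sketch runnable.
class Replica {
  private ids = new Set<string>();
  apply(id: string): void { this.ids.add(id); }
  has(id: string): boolean { return this.ids.has(id); }
}

class Primary {
  private replica: Replica;
  private lagMs: number;
  constructor(replica: Replica, lagMs: number) {
    this.replica = replica;
    this.lagMs = lagMs;
  }
  write(id: string): void {
    // Asynchronous replication: the replica only sees the write after lagMs.
    setTimeout(() => this.replica.apply(id), this.lagMs);
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Resolves with the time (ms) until every id is readable from the replica.
// Throws with a count of missing writes if the budget runs out first.
async function measureConvergence(
  replica: Replica,
  ids: string[],
  budgetMs: number,
  intervalMs = 10,
): Promise<number> {
  const start = Date.now();
  while (!ids.every((id) => replica.has(id))) {
    if (Date.now() - start >= budgetMs) {
      const missing = ids.filter((id) => !replica.has(id)).length;
      throw new Error(
        `SLA violated: ${missing} of ${ids.length} writes not visible after ${budgetMs} ms`,
      );
    }
    await sleep(intervalMs);
  }
  return Date.now() - start;
}

// Convergence under load: 1000 writes, full set expected on the replica within 1 s.
async function convergenceUnderLoad(): Promise<number> {
  const replica = new Replica();
  const primary = new Primary(replica, 40);
  const ids = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
  ids.forEach((id) => primary.write(id));
  return measureConvergence(replica, ids, 1000);
}
```

Reporting the number of missing writes in the failure message makes it easier to tell a slow tail (a few stragglers) from replication that has stalled entirely.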

Tooling:

  • Polling helpers (waitFor, eventually, Awaitility in JVM) — the standard pattern.
  • Chaos tools (Toxiproxy, Chaos Mesh) — inject latency between replicas to test convergence under stress.
  • Test environments with replication — many staging environments are single-node, hiding bugs the production multi-node will surface. Push for at-least-two-replica staging.
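
As an illustration of the chaos bullet, here is a rough sketch of adding a latency toxic through Toxiproxy's HTTP API. It assumes a Toxiproxy server is already running with its API on localhost:8474 and that a proxy for the replication link has been created beforehand; the proxy name passed in and the toxic name are hypothetical.

```typescript
// JSON body for Toxiproxy's "create toxic" endpoint (POST /proxies/{proxy}/toxics).
interface LatencyToxic {
  name: string;
  type: "latency";
  stream: "upstream" | "downstream";
  toxicity: number; // probability (0 to 1) that the toxic applies to a connection
  attributes: { latency: number; jitter: number }; // both in milliseconds
}

// Builds the payload for a latency toxic. The naming scheme is an assumption.
function latencyToxic(latencyMs: number, jitterMs = 0): LatencyToxic {
  return {
    name: `replication_lag_${latencyMs}ms`,
    type: "latency",
    stream: "downstream",
    toxicity: 1.0,
    attributes: { latency: latencyMs, jitter: jitterMs },
  };
}

// Adds the toxic to an existing proxy. `proxyName` must match a proxy that was
// already configured in Toxiproxy; the default apiUrl is an assumption.
async function injectReplicationLag(
  proxyName: string,
  latencyMs: number,
  apiUrl = "http://localhost:8474",
): Promise<void> {
  const res = await fetch(`${apiUrl}/proxies/${proxyName}/toxics`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(latencyToxic(latencyMs)),
  });
  if (!res.ok) {
    throw new Error(`Toxiproxy returned HTTP ${res.status}`);
  }
}
```

A test would call injectReplicationLag before the write, run the same polling assertion as usual, and remove the toxic afterwards so later tests start from a clean link.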

The honest test design:

  • Slowest expected convergence drives the timeout.
  • The faster path is asserted separately.
  • Failures point at SLA violation, not test flake.
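
One way to encode these three points is to give each consistency path its own named budget and poll them concurrently, so the slowest budget bounds the run and a failure names the SLA that was missed. A sketch under those assumptions; the check names and budgets in the usage comment are made up:

```typescript
interface ConvergenceCheck {
  name: string;     // which SLA this check represents
  budgetMs: number; // the convergence window for that path
  isVisible: () => boolean | Promise<boolean>;
}

interface CheckResult {
  name: string;
  elapsedMs: number;
  withinBudget: boolean;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls a single check against its own budget and records the outcome
// instead of throwing, so every path gets reported.
async function pollWithin(check: ConvergenceCheck, intervalMs: number): Promise<CheckResult> {
  const start = Date.now();
  for (;;) {
    if (await check.isVisible()) {
      return { name: check.name, elapsedMs: Date.now() - start, withinBudget: true };
    }
    const elapsedMs = Date.now() - start;
    if (elapsedMs >= check.budgetMs) {
      return { name: check.name, elapsedMs, withinBudget: false };
    }
    await sleep(Math.min(intervalMs, check.budgetMs - elapsedMs));
  }
}

// Runs all checks concurrently; total runtime is bounded by the largest budget.
async function runChecks(checks: ConvergenceCheck[], intervalMs = 10): Promise<CheckResult[]> {
  return Promise.all(checks.map((check) => pollWithin(check, intervalMs)));
}

// Turns results into failure messages that name the violated SLA.
function slaViolations(results: CheckResult[]): string[] {
  return results
    .filter((r) => !r.withinBudget)
    .map((r) => `${r.name}: SLA violated, not visible after ${r.elapsedMs} ms`);
}

// Usage (hypothetical budgets and read functions):
//   const results = await runChecks([
//     { name: "read-your-writes", budgetMs: 1000, isVisible: () => readFromPrimary(id) },
//     { name: "search index", budgetMs: 30000, isVisible: () => searchFinds(id) },
//   ]);
//   expect(slaViolations(results)).toEqual([]);
```

Because the fast path has its own tight budget, a regression in read-your-writes still fails the test even when the slower index path passes comfortably.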

Anti-patterns:

  • Thread.sleep(2000) — too short under load, too slow when convergence is fast.
  • Disabling tests during "known replication delays" — these are real bugs the test should surface.
  • Testing only on a single-node dev environment — production is the multi-node case; bugs hide otherwise.

The senior signal: testing convergence as a property with a measurable SLA, using polling with timeouts, and treating slow convergence as the test target rather than a nuisance to wait through.

// WHAT INTERVIEWERS LOOK FOR

Polling-with-timeout pattern, SLA-driven timeouts, awareness of read-your-writes vs cross-replica vs index lag, and chaos testing for convergence under stress.

// COMMON PITFALL

Adding fixed sleeps to 'wait long enough.' Either the sleep is too short and the test fails intermittently (5s sleep, real convergence is 8s), or it is padded until the suite is unbearably slow.