Q32 of 37 · API testing

How do you reproduce and write a regression test for a production-only race condition?

API testingSeniorapirace-conditionsconcurrencyregressionsenior

Short answer

Short answer: Reproduce locally with controlled concurrency: identical concurrent requests, fast iteration. If it won't reproduce locally, instrument to capture the production timing, then reproduce in a staging environment under matching load. The regression test runs N parallel calls and asserts on the invariant that the bug violated.

Detail

Race conditions are the classic "works on my machine" bug — they need concurrency to manifest, and tests usually run sequentially. The discipline:

Step 1 — Understand the invariant the bug violated. The bug is "two users created with the same email." The invariant: "email is unique across users." That's what the regression test asserts.

Step 2 — Reproduce locally. Try the simplest first:

const calls = Array.from({ length: 50 }, () =>
  request.post('/users', { data: { email: 'race@test.com' } })
);
const results = await Promise.all(calls);
const successes = results.filter((r) => r.status() === 201);

50 concurrent identical requests; if more than one succeeds, the race is reproduced.

If it doesn't reproduce, increase pressure:

  • More concurrent requests (200, 1000).
  • Run inside the database's slowest operation (a deliberately heavy query in another transaction holding locks).
  • Reduce CPU available (taskset to one core).

Step 3 — Add timing precision. If it still won't reproduce, the race needs sub-millisecond alignment. Tools:

  • Network-level timingtcpdump traces of production requests, then replay.
  • Application-level instrumentation — add temporary timing logs around the suspected critical section.
  • Chaos / latency injection — Toxiproxy adds artificial delay to slow the suspected path, widening the race window.

Step 4 — Reproduce in staging. Match production conditions: same database engine and version, similar concurrency profile. Run the test under load (k6, locust) against staging.

Step 5 — Write the regression test. Once you've reproduced, the test should:

  • Use the same concurrency pattern that triggers the bug.
  • Run N times in CI (10-100) — races aren't 100% reproducible.
  • Assert on the invariant, not the symptom.
test.describe.parallel('race regression', () => {
  for (let attempt = 0; attempt < 20; attempt++) {
    test(`attempt ${attempt}: concurrent creates produce one user`, async ({ request }) => {
      const email = `race-${attempt}-${Date.now()}@test`;
      const calls = Array.from({ length: 25 }, () =>
        request.post('/users', { data: { email } })
      );
      const results = await Promise.all(calls);
      const ok = results.filter((r) => r.status() === 201);
      expect(ok.length).toBe(1);

      const list = await (await request.get(`/users?email=${email}`)).json();
      expect(list.data.length).toBe(1);
    });
  }
});

Step 6 — Fix verification. Run the regression test against the fix. It must pass 50+ times consecutively without flake. Races aren't fixed if the test is "usually" green.

Long-term:

  • Retain the regression test forever — races love to come back as code changes.
  • Tag it (@race, @concurrency) so future engineers know it's load-sensitive.
  • Don't let it become flake-quarantined. If it fails intermittently, the underlying fix didn't hold; investigate.

The senior signal: invariant-first thinking, structured reproduction (local → instrument → staging-under-load), and refusing to call the bug fixed without a stable regression test.

// WHAT INTERVIEWERS LOOK FOR

Invariant-based assertion (not just 'no error'), reproduction techniques in priority order, retention of the race test, and refusal to accept 'works most of the time' as fixed.

// COMMON PITFALL

Patching the symptom (return 409 sometimes) without the regression test, and patting yourself on the back when the bug 'doesn't reproduce' for a week. Races recur; the test is the receipt.