Q32 of 37 · API testing
How do you reproduce and write a regression test for a production-only race condition?
Short answer
Short answer: Reproduce locally with controlled concurrency: identical concurrent requests, fast iteration. If it won't reproduce locally, instrument to capture the production timing, then reproduce in a staging environment under matching load. The regression test runs N parallel calls and asserts on the invariant that the bug violated.
Detail
Race conditions are the classic "works on my machine" bug — they need concurrency to manifest, and tests usually run sequentially. The discipline:
Step 1 — Understand the invariant the bug violated. The bug is "two users created with the same email." The invariant: "email is unique across users." That's what the regression test asserts.
Step 2 — Reproduce locally. Try the simplest first:
const calls = Array.from({ length: 50 }, () =>
request.post('/users', { data: { email: 'race@test.com' } })
);
const results = await Promise.all(calls);
const successes = results.filter((r) => r.status() === 201);
50 concurrent identical requests; if more than one succeeds, the race is reproduced.
If it doesn't reproduce, increase pressure:
- More concurrent requests (200, 1000).
- Run inside the database's slowest operation (a deliberately heavy query in another transaction holding locks).
- Reduce CPU available (
tasksetto one core).
Step 3 — Add timing precision. If it still won't reproduce, the race needs sub-millisecond alignment. Tools:
- Network-level timing —
tcpdumptraces of production requests, then replay. - Application-level instrumentation — add temporary timing logs around the suspected critical section.
- Chaos / latency injection — Toxiproxy adds artificial delay to slow the suspected path, widening the race window.
Step 4 — Reproduce in staging. Match production conditions: same database engine and version, similar concurrency profile. Run the test under load (k6, locust) against staging.
Step 5 — Write the regression test. Once you've reproduced, the test should:
- Use the same concurrency pattern that triggers the bug.
- Run N times in CI (10-100) — races aren't 100% reproducible.
- Assert on the invariant, not the symptom.
test.describe.parallel('race regression', () => {
for (let attempt = 0; attempt < 20; attempt++) {
test(`attempt ${attempt}: concurrent creates produce one user`, async ({ request }) => {
const email = `race-${attempt}-${Date.now()}@test`;
const calls = Array.from({ length: 25 }, () =>
request.post('/users', { data: { email } })
);
const results = await Promise.all(calls);
const ok = results.filter((r) => r.status() === 201);
expect(ok.length).toBe(1);
const list = await (await request.get(`/users?email=${email}`)).json();
expect(list.data.length).toBe(1);
});
}
});
Step 6 — Fix verification. Run the regression test against the fix. It must pass 50+ times consecutively without flake. Races aren't fixed if the test is "usually" green.
Long-term:
- Retain the regression test forever — races love to come back as code changes.
- Tag it (
@race,@concurrency) so future engineers know it's load-sensitive. - Don't let it become flake-quarantined. If it fails intermittently, the underlying fix didn't hold; investigate.
The senior signal: invariant-first thinking, structured reproduction (local → instrument → staging-under-load), and refusing to call the bug fixed without a stable regression test.