What's your approach to soak testing memory leaks over 8+ hour runs?
Short answer
Run sustained moderate load (30-50% of peak) for 8-24 hours while capturing heap, RSS, file descriptors, DB connection counts, and GC frequency at regular intervals. Plot the trends: flat is healthy, a monotonic rise is a leak. Take heap dumps at intervals for diff analysis.
Detail
Why soak tests find what load tests can't: a leak of 1MB per 1000 requests is invisible in a 30-minute load test (< 50MB total) but kills the process after 24 hours (about 1.4GB at roughly 16 requests/s). A soak test runs long enough to make the slope visible.
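The arithmetic behind that claim can be checked with a small sketch. The request rate of about 16 requests/s is an assumption, chosen so the totals line up with the 1.4GB figure; the function name is illustrative.

```python
def leaked_mb(requests_per_sec: float, duration_s: float,
              mb_per_1000_requests: float = 1.0) -> float:
    """Total memory leaked over a run, in MB."""
    total_requests = requests_per_sec * duration_s
    return total_requests / 1000 * mb_per_1000_requests


RATE = 16.2  # assumed steady request rate (requests/s), not from a real system

load_test = leaked_mb(RATE, 30 * 60)       # 30-minute load test
soak_test = leaked_mb(RATE, 24 * 60 * 60)  # 24-hour soak

print(f"30 min: {load_test:.0f} MB, 24 h: {soak_test:.0f} MB")
# prints "30 min: 29 MB, 24 h: 1400 MB"
```

The same leak rate gives a number that disappears into normal heap noise in the short run and one that exhausts a typical container limit in the long run.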
Test setup:
- Load: 30-50% of peak. Goal isn't to stress, it's to exercise every code path repeatedly.
- Duration: 8 hours minimum, 24 hours ideal. Some leaks (DB-backed caches, external session stores) only manifest after a daily cycle.
- Variety: rotate through the realistic transaction mix. A single endpoint loop won't exercise the leaking path.
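One way to rotate through a realistic mix is weighted random selection. This is a minimal sketch: the endpoint names and weights are hypothetical, and a real load tool would issue the requests rather than just pick their names.

```python
import random
from collections import Counter

# Hypothetical transaction mix: endpoint -> share of production traffic.
TRANSACTION_MIX = {
    "GET /search": 0.50,
    "GET /item": 0.30,
    "POST /cart": 0.15,
    "POST /checkout": 0.05,
}


def pick_transactions(n: int, rng: random.Random) -> list[str]:
    """Draw n transactions, weighted by their share of the mix."""
    names = list(TRANSACTION_MIX)
    weights = list(TRANSACTION_MIX.values())
    return rng.choices(names, weights=weights, k=n)


rng = random.Random(42)  # fixed seed so a run can be replayed
sample = pick_transactions(10_000, rng)
counts = Counter(sample)
for name in TRANSACTION_MIX:
    print(f"{name:15s} {counts[name] / len(sample):.1%}")
```

Over an 8-hour run even the 5% path gets exercised many thousands of times, which is what exposes a leak that lives on a rare transaction.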
What to monitor (every minute or finer):
- Heap used — JVM, Node, .NET. Grows monotonically? Leak.
- RSS (resident set size) — OS-level memory. Diverges from heap? Off-heap allocation leak (DirectByteBuffer, native libs).
- File descriptor count — lsof -p <pid> | wc -l, or count the entries in /proc/<pid>/fd. Climbing? Unclosed sockets/files.
- DB connections — pool checkout count. Climbing? Connections not being returned.
- GC stats — frequency, duration, full-GC count. GC working harder over time = heap pressure.
- Thread count — leaking threads is rarer but devastating.
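The OS-level metrics in this list (RSS, file descriptors, threads) can be sampled from /proc on Linux. The sketch below assumes a Linux host and uses illustrative function names. Heap, GC, and pool metrics come from the runtime's own tooling and are not covered here.

```python
import os
import time


def parse_status(text: str) -> dict[str, int]:
    """Extract RSS (kB) and thread count from /proc/<pid>/status content."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "VmRSS":
            fields["rss_kb"] = int(value.split()[0])  # value looks like "  204800 kB"
        elif key == "Threads":
            fields["threads"] = int(value.strip())
    return fields


def sample(pid: int) -> dict[str, int]:
    """Take one sample for a process. Linux only: relies on /proc."""
    with open(f"/proc/{pid}/status") as f:
        row = parse_status(f.read())
    row["open_fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    row["ts"] = int(time.time())
    return row


if __name__ == "__main__":
    # Demo on this process. In a soak run, pass the service's pid and
    # append one row per minute to a CSV file or metrics store.
    print(sample(os.getpid()))
```

Keeping the parser separate from the file read makes it easy to test against captured status output without a live process.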
Analysis:
- Plot each metric vs. time. Visual inspection beats statistics for slope detection.
- Take heap dumps at hour 0, hour 4, hour 8 — diff with Eclipse MAT or VisualVM. Objects that grow disproportionately are suspects.
- Correlate with logs: which transactions ran during the slope? That's where the leak path lives.
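The heap dump diff boils down to comparing per-class byte totals between two points in time. A minimal sketch of that comparison follows. The class names and sizes are hypothetical, and in practice the totals would be exported from whichever heap analysis tool is in use.

```python
def top_growers(before: dict[str, int], after: dict[str, int],
                n: int = 3) -> list[tuple[str, int]]:
    """Return the n classes whose byte totals grew most between two dumps."""
    growth = {
        cls: after[cls] - before.get(cls, 0)  # classes absent earlier count from 0
        for cls in after
    }
    ranked = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Hypothetical per-class byte totals at hour 0 and hour 8.
hour_0 = {
    "byte[]": 40_000_000,
    "java.lang.String": 25_000_000,
    "com.example.SessionEntry": 1_000_000,
}
hour_8 = {
    "byte[]": 42_000_000,
    "java.lang.String": 26_000_000,
    "com.example.SessionEntry": 310_000_000,
}

for cls, delta in top_growers(hour_0, hour_8):
    print(f"{cls:30s} {delta / 1e6:+.0f} MB")
```

This ranks by absolute growth. Ranking by growth ratio instead would surface small classes that multiplied many times over, which may suit some leaks better.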
Handling false positives:
- Caches that grow to a steady state (LRU) are not leaks — they level off.
- Connection pools that ramp to max and stay are not leaks.
- A genuine leak grows without bound.
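One way to separate a warm-up plateau from a leak automatically is to fit the slope over only the later part of the run. This is a sketch that complements the plots rather than replacing them. The sample data is synthetic, and the 1 MB/hour threshold and 50% tail window are assumptions to tune per service.

```python
def slope_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (hours, MB) samples, in MB/hour."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


def looks_like_leak(samples: list[tuple[float, float]],
                    tail_fraction: float = 0.5,
                    threshold_mb_per_hour: float = 1.0) -> bool:
    """True if memory is still climbing over the last part of the run."""
    tail = samples[int(len(samples) * (1 - tail_fraction)):]
    return slope_per_hour(tail) > threshold_mb_per_hour


# Synthetic hourly samples over 8 hours (not real measurements).
cache = [(h, min(500, 200 + 150 * h)) for h in range(9)]  # warms up, then flat at 500 MB
leak = [(h, 200 + 20 * h) for h in range(9)]              # climbs 20 MB every hour

print(looks_like_leak(cache))  # prints False
print(looks_like_leak(leak))   # prints True
```

Fitting the whole run would flag the cache too, because its warm-up ramp dominates the slope. Restricting the fit to the tail is what lets a levelled-off cache pass.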
CI integration: weekly job, posts results to a dashboard, alerts on slope > X MB/hour. Don't gate PRs on soak — too slow, too noisy.