Root cause analysis
5 Whys, fishbone, timeline of events, contributing factors, action items.
Root Cause Analysis — Incident ID / Title
Severity: SEV-1 / SEV-2 / SEV-3 | Author: Name | Review date: YYYY-MM-DD | Status: Draft / In review / Final
1. Incident Summary
| Field | Detail |
|---|---|
| What happened | One or two sentences describing the failure mode |
| When it started | YYYY-MM-DD HH:MM UTC (real start — not when it was detected) |
| When it was detected | YYYY-MM-DD HH:MM UTC |
| Who detected it | Person or monitoring system |
| When it was resolved | YYYY-MM-DD HH:MM UTC |
| Total duration | X h Y min |
| User impact | Number of users or % affected; what they could not do |
2. Timeline of Events
| Time (UTC) | Event | Source / owner |
|---|---|---|
| HH:MM | Incident start — what happened in the system | Monitoring / log |
| HH:MM | Alert fired | PagerDuty / alerting tool |
| HH:MM | On-call acknowledged | Name |
| HH:MM | Severity declared and incident channel opened | Name |
| HH:MM | First hypothesis formed | Name |
| HH:MM | Mitigation action attempted (describe) | Name |
| HH:MM | Mitigation successful / unsuccessful — describe | Name |
| HH:MM | Incident resolved | Name |
| HH:MM | Comms sent to customers / status page updated | Name |
3. Root Cause Analysis
3.1 Five Whys
Problem statement: Describe the observable failure in one sentence.
- Why did [the failure] occur? Answer.
- Why did [answer 1] happen? Answer.
- Why did [answer 2] happen? Answer.
- Why did [answer 3] happen? Answer.
- Why did [answer 4] happen? Answer — this is typically the root cause.
Root cause: State the root cause in one clear sentence.
3.2 Contributing Factors
Technical
- e.g. Lack of connection pool monitoring alert
- e.g. No circuit breaker on the downstream dependency
Process
- e.g. Configuration change was not peer-reviewed
- e.g. Runbook did not cover this failure mode
Organisational
- e.g. On-call rotation had a single point of failure — no trained backup
Human
- e.g. Engineer was not aware that the setting had a production impact
4. What Worked Well
- e.g. The alerting fired within 2 minutes of the error rate exceeding the threshold
- e.g. The team assembled quickly and communication was clear throughout
- e.g. The rollback procedure worked as documented
5. What Did Not Work Well
- e.g. Detection relied on a customer complaint rather than an alert — 18-minute gap
- e.g. The runbook did not describe how to verify recovery, causing confusion
- e.g. Status page was not updated until 45 minutes after the incident was confirmed
6. Action Items
| Action | Owner | Priority | Target date | Status |
|---|---|---|---|---|
| Fix: describe technical fix | Name | P1 / P2 / P3 | Date | Open |
| Add monitoring alert for [X] | Name | P1 / P2 / P3 | Date | Open |
| Update runbook to cover [scenario] | Name | P1 / P2 / P3 | Date | Open |
| Process improvement: [describe] | Name | P1 / P2 / P3 | Date | Open |
7. Lessons Learned
- Generalised lesson that applies beyond this incident — e.g. "Configuration changes to shared infrastructure should always be peer-reviewed and applied to a staging environment first."
- Lesson 2
- Lesson 3
Root Cause Analysis — INC-2024-047 Payment Processing Degradation
Severity: SEV-2 | Author: Jordan Osei | Review date: 2024-05-20 | Status: Final
1. Incident Summary
| Field | Detail |
|---|---|
| What happened | Checkout error rate rose from < 0.1% to 2.3% due to RDS PostgreSQL connection pool exhaustion, causing HTTP 503 responses for approximately 8% of payment attempts |
| When it started | 2024-05-17 13:52 UTC (confirmed from DB connection metric) |
| When it was detected | 2024-05-17 14:04 UTC (alert fired when error rate exceeded 1% for 5 consecutive minutes) |
| Who detected it | Datadog — CheckoutErrorRate monitor |
| When it was resolved | 2024-05-17 15:38 UTC (connection pool limit increased; error rate returned to < 0.1%) |
| Total duration | 1 h 46 min (12 min detection lag; 1 h 34 min active response) |
| User impact | ~1 800 failed payment attempts; no data loss or incorrect charges; affected users received an HTTP 503 and could retry successfully after 15:38 UTC |
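The duration figures in the summary can be cross-checked against the timestamps. The following is a minimal sketch using the three times recorded above; the helper names `parse_utc` and `fmt` are illustrative and not part of any incident tooling.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Render a timedelta in the 'X h Y min' style used in the summary table."""
    minutes = int(delta.total_seconds() // 60)
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins} min" if hours else f"{mins} min"


# Timestamps taken from the Incident Summary table.
started = parse_utc("2024-05-17 13:52")
detected = parse_utc("2024-05-17 14:04")
resolved = parse_utc("2024-05-17 15:38")

detection_lag = detected - started
active_response = resolved - detected
total_duration = resolved - started

print("Detection lag:  ", fmt(detection_lag))
print("Active response:", fmt(active_response))
print("Total duration: ", fmt(total_duration))
```

Measuring total duration from the real start rather than from detection keeps the detection lag visible as its own number.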
2. Timeline of Events
| Time (UTC) | Event | Source / owner |
|---|---|---|
| 13:52 | RDS DatabaseConnections reached the max_connections ceiling (100); new checkout requests began returning HTTP 503 | Datadog RDS dashboard |
| 14:04 | CheckoutErrorRate Datadog monitor fired (5-min evaluation window elapsed) | PagerDuty → Jordan Osei |
| 14:07 | Jordan acknowledged PagerDuty alert; opened #inc-2024-05-17-payments-degraded | Jordan Osei |
| 14:09 | SEV-2 declared; #platform-oncall notified; Engineering Director Fatima Yusuf notified via Slack DM | Jordan Osei |
| 14:15 | Ruled out recent deploy — last deploy was 2024-05-15; no feature flag changes in the last 24 h | Jordan Osei |
| 14:22 | Identified DatabaseConnections = 100/100 in Datadog; suspected connection pool exhaustion | Jordan Osei |
| 14:35 | DBA on-call (Sam Reid) joined the incident channel | Sam Reid |
| 14:40 | Status page updated: "Investigating checkout issues" | Jordan Osei |
| 14:55 | Confirmed root cause: v2.4 async payment orchestration layer holds connections longer — pool exhausts under sustained 500 RPS load | Sam Reid |
| 15:20 | max_connections increased to 200 on RDS (parameter group updated, instance restarted) | Sam Reid |
| 15:38 | Error rate returned to < 0.1% for 5 consecutive minutes; recovery confirmed | Jordan Osei |
| 15:40 | Incident declared resolved; status page updated to "All Systems Operational" | Jordan Osei |
| 15:45 | Resolution comms sent to Customer Success; stakeholders notified | Jordan Osei |
| 16:00 | RCA session scheduled for 2024-05-20 | Jordan Osei |
3. Root Cause Analysis
3.1 Five Whys
Problem statement: Checkout error rate rose to 2.3% on 2024-05-17, causing ~1 800 failed payment attempts over 1 h 46 min.
- Why did the checkout error rate rise to 2.3%? Because the API was returning HTTP 503 for a significant proportion of payment requests.
- Why was the API returning HTTP 503? Because the RDS PostgreSQL connection pool was fully exhausted — no connections were available for new requests.
- Why was the connection pool exhausted? Because the v2.4 async payment orchestration layer holds database connections open for longer (awaiting async callback confirmation) compared to the synchronous v2.3 path, consuming more connections per request at the same throughput level.
- Why was the connection pool not sized to account for this? Because the max_connections value was not reviewed or updated when the async orchestration design was introduced in v2.4.
- Why was the connection lifecycle impact not caught before production? Because the performance test soak run (PTR-2024-011) identified the connection pool exhaustion, but the release was approved before the remediation action was completed. The approval decision was made on the strength of the load and stress test results, which passed.
Root cause: The v2.4 release was approved and deployed before the connection pool exhaustion identified in the pre-release soak test was remediated, because the release approval process did not require soak test sign-off for a conditional pass verdict.
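The link between connection hold time and pool exhaustion can be made concrete with Little's Law (average connections in use = request rate × average hold time). The sketch below uses the 500 RPS load and the 100 and 200 connection ceilings from this incident. The two hold times are hypothetical values chosen only for illustration, because the report does not record measured hold times for v2.3 or v2.4.

```python
def connections_in_use(rps: int, hold_ms: int) -> float:
    """Little's Law: average connections held = arrival rate x average hold time."""
    return rps * hold_ms / 1000


def max_hold_ms(pool_size: int, rps: int) -> float:
    """Longest average hold time (ms) a pool can sustain at the given request rate."""
    return pool_size * 1000 / rps


RPS = 500          # sustained load from the timeline
POOL_BEFORE = 100  # max_connections during the incident
POOL_AFTER = 200   # max_connections after mitigation

# ASSUMPTION: hypothetical average hold times, not measurements from the incident.
SYNC_HOLD_MS = 150   # illustrative v2.3 synchronous path
ASYNC_HOLD_MS = 250  # illustrative v2.4 path awaiting async callback confirmation

print("v2.3 connections needed:", connections_in_use(RPS, SYNC_HOLD_MS))
print("v2.4 connections needed:", connections_in_use(RPS, ASYNC_HOLD_MS))
print("Hold-time budget at 100 connections (ms):", max_hold_ms(POOL_BEFORE, RPS))
print("Hold-time budget at 200 connections (ms):", max_hold_ms(POOL_AFTER, RPS))
```

At 500 RPS a pool of 100 connections saturates once the average hold time exceeds 200 ms, so even a modest increase in hold time can exhaust the pool with no change in throughput. Doubling the ceiling doubles that budget but does not shorten the hold time itself.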
3.2 Contributing Factors
Technical
- No alert existed for DatabaseConnections approaching the ceiling; the only alerting was on the downstream error rate, which introduced a 12-minute detection lag.
- max_connections was set to a default of 100 and had not been reviewed as throughput grew over the past two years.
Process
- The release approval gate required load and stress test sign-off but not soak test sign-off, allowing a known conditional pass to ship.
- The performance test report (PTR-2024-011) recommended a hold, but the release manager was not explicitly required to review and sign off on performance test reports before approving releases.
Organisational
- Performance testing is owned by QA; release approval is owned by Engineering management. There was no formal handshake between the two, so the "hold" recommendation did not block the release.
Human
- The release manager reviewed the executive summary of the performance report (which said "conditional pass") but did not read far enough to see the soak test failure or the "hold release" recommendation.
4. What Worked Well
- Monitoring detected the issue within 12 minutes of it starting — fast relative to previous incidents.
- The incident channel was opened and stakeholders were notified within 5 minutes of the alert firing (17 minutes after the incident started).
- Sam Reid (DBA on-call) joined quickly and diagnosed the root cause within 20 minutes of joining.
- The mitigation (increasing max_connections) was low-risk and effective, with no service disruption during the RDS parameter change.
- The resolution message and status page update were accurate and published promptly.
5. What Did Not Work Well
- The 12-minute detection lag was caused by the 5-minute evaluation window on the error rate alert — a connection saturation alert would have detected the issue at 13:52 rather than 14:04.
- The release approval process allowed a known conditional-pass performance result to reach production without an explicit sign-off from the performance tester.
- The executive summary of the performance report was ambiguous — "conditional pass" did not clearly communicate that the recommendation was to hold the release.
- The status page was updated at 14:40, 36 minutes after the alert fired and 31 minutes after SEV-2 was declared, which is slower than the 30-minute target in the runbook.
6. Action Items
| Action | Owner | Priority | Target date | Status |
|---|---|---|---|---|
| Add Datadog alert: DatabaseConnections > 80% of max_connections for 2 consecutive minutes → PagerDuty | Sam Reid | P1 | 2024-05-24 | Open |
| Increase max_connections to 200 on all environments (already done in production; replicate to staging and perf) | Sam Reid | P1 | 2024-05-24 | Open |
| Update release approval process: soak test sign-off required if performance test verdict is not a clean pass | Engineering Director | P1 | 2024-05-31 | Open |
| Update performance test report template: use PASS / FAIL / HOLD — remove "conditional pass" wording | Jordan Osei | P2 | 2024-05-28 | Open |
| Review max_connections for all other services; audit for any that have not been reviewed in > 12 months | DBA team | P2 | 2024-06-14 | Open |
| Add a "connection pool saturation" scenario to the incident runbook | Jordan Osei | P3 | 2024-06-07 | Open |
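The saturation alert in the first action item (above 80% of max_connections for 2 consecutive minutes) can be expressed as a small evaluation rule. This is a sketch of the logic only, not Datadog monitor configuration. The function name `first_breach_index` and the per-minute sample values are hypothetical, since the report does not include the raw connection series.

```python
from typing import Optional, Sequence


def first_breach_index(
    samples: Sequence[int],
    max_connections: int,
    threshold_pct: int = 80,
    consecutive: int = 2,
) -> Optional[int]:
    """Return the index of the sample at which the alert would fire, or None.

    The alert fires once `consecutive` samples in a row exceed
    `threshold_pct` percent of `max_connections`.
    """
    streak = 0
    for i, value in enumerate(samples):
        # Integer comparison avoids floating-point rounding at the boundary.
        if value * 100 > threshold_pct * max_connections:
            streak += 1
        else:
            streak = 0
        if streak >= consecutive:
            return i
    return None


# ASSUMPTION: illustrative one-per-minute DatabaseConnections readings.
per_minute = [62, 71, 79, 84, 91, 97, 100, 100]

fired_at = first_breach_index(per_minute, max_connections=100)
ceiling_at = per_minute.index(100)
print("Alert fires at sample:", fired_at)
print("Ceiling reached at sample:", ceiling_at)
```

With these illustrative readings the rule fires two samples before the pool reaches its ceiling. Requiring consecutive breaches trades a little lead time for fewer pages on short spikes.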
7. Lessons Learned
- A performance test "hold" recommendation must be a hard gate in the release process, not a soft signal that can be overridden without a written exception. Build the gate into the workflow, not the document.
- Alerting on leading indicators (resource saturation) catches incidents earlier than alerting on lagging indicators (error rates). Design alerts to fire before users are affected, not after.
- The audience for an executive summary is people who may act on it without reading further. The summary must carry the full verdict — including recommendations — rather than softening them.
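One way to read "build the gate into the workflow" is a check that the release pipeline runs against the performance verdict. The sketch below uses the PASS / FAIL / HOLD verdicts proposed in the action items. Everything else is an assumption made for illustration: the `PerfReport` fields, the `release_allowed` function, and the mapping of PTR-2024-011's "conditional pass" onto HOLD.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    HOLD = "HOLD"


@dataclass(frozen=True)
class PerfReport:
    """Hypothetical record of a performance test report, for illustration only."""

    report_id: str
    verdict: Verdict
    soak_signed_off: bool = False    # performance tester signed off the soak run
    written_exception: bool = False  # a documented exception was recorded


def release_allowed(report: PerfReport) -> bool:
    """Hard gate: only a clean PASS ships without further sign-off."""
    if report.verdict is Verdict.PASS:
        return True
    if report.verdict is Verdict.FAIL:
        return False
    # HOLD: blocked unless the soak run is signed off or an exception is on record.
    return report.soak_signed_off or report.written_exception


# ASSUMPTION: PTR-2024-011's "conditional pass" is treated as HOLD here.
ptr = PerfReport(report_id="PTR-2024-011", verdict=Verdict.HOLD)
print(ptr.report_id, "release allowed:", release_allowed(ptr))
```

Because the decision depends on a machine-readable verdict rather than on how far a reader gets into the report, a HOLD cannot be missed by skimming the summary.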