Root cause analysis

5 Whys, fishbone, timeline of events, contributing factors, action items.

750 words | RCA | 5 Whys | Fishbone | Post-mortem

Root Cause Analysis — Incident ID / Title

Severity: SEV-1 / SEV-2 / SEV-3
Author: Name
Review date: YYYY-MM-DD
Status: Draft / In review / Final

1. Incident Summary

Field	Detail
What happened	One or two sentences describing the failure mode
When it started	YYYY-MM-DD HH:MM UTC (real start — not when it was detected)
When it was detected	YYYY-MM-DD HH:MM UTC
Who detected it	Person or monitoring system
When it was resolved	YYYY-MM-DD HH:MM UTC
Total duration	X h Y min
User impact	Number of users or % affected; what they could not do

2. Timeline of Events

Time (UTC)	Event	Source / owner
HH:MM	Incident start — what happened in the system	Monitoring / log
HH:MM	Alert fired	PagerDuty / alerting tool
HH:MM	On-call acknowledged	Name
HH:MM	Severity declared and incident channel opened	Name
HH:MM	First hypothesis formed	Name
HH:MM	Mitigation action attempted (describe)	Name
HH:MM	Mitigation successful / unsuccessful — describe	Name
HH:MM	Incident resolved	Name
HH:MM	Comms sent to customers / status page updated	Name

3. Root Cause Analysis

3.1 Five Whys

Problem statement: Describe the observable failure in one sentence.

Why did [the failure] occur? Answer.
Why did [answer 1] happen? Answer.
Why did [answer 2] happen? Answer.
Why did [answer 3] happen? Answer.
Why did [answer 4] happen? Answer — this is typically the root cause.

Root cause: State the root cause in one clear sentence.

3.2 Contributing Factors

Technical

e.g. Lack of connection pool monitoring alert
e.g. No circuit breaker on the downstream dependency

Process

e.g. Configuration change was not peer-reviewed
e.g. Runbook did not cover this failure mode

Organisational

e.g. On-call rotation had a single point of failure — no trained backup

Human

e.g. Engineer was not aware that the setting had a production impact

4. What Worked Well

e.g. The alerting fired within 2 minutes of the error rate exceeding the threshold
e.g. The team assembled quickly and communication was clear throughout
e.g. The rollback procedure worked as documented

5. What Did Not Work Well

e.g. Detection relied on a customer complaint rather than an alert — 18-minute gap
e.g. The runbook did not describe how to verify recovery, causing confusion
e.g. Status page was not updated until 45 minutes after the incident was confirmed

6. Action Items

Action	Owner	Priority	Target date	Status
Fix: describe technical fix	Name	P1 / P2 / P3	Date	Open
Add monitoring alert for [X]	Name	P1 / P2 / P3	Date	Open
Update runbook to cover [scenario]	Name	P1 / P2 / P3	Date	Open
Process improvement: [describe]	Name	P1 / P2 / P3	Date	Open

7. Lessons Learned

Generalised lesson that applies beyond this incident — e.g. "Configuration changes to shared infrastructure should always be peer-reviewed and applied to a staging environment first."
Lesson 2
Lesson 3

Root Cause Analysis — INC-2024-047 Payment Processing Degradation

Severity: SEV-2
Author: Jordan Osei
Review date: 2024-05-20
Status: Final

1. Incident Summary

Field	Detail
What happened	Checkout error rate rose from < 0.1% to 2.3% due to RDS PostgreSQL connection pool exhaustion, causing HTTP 503 responses for approximately 8% of payment attempts
When it started	2024-05-17 13:52 UTC (confirmed from DB connection metric)
When it was detected	2024-05-17 14:04 UTC (alert fired when error rate exceeded 1% for 5 consecutive minutes)
Who detected it	Datadog — CheckoutErrorRate monitor
When it was resolved	2024-05-17 15:38 UTC (connection pool limit increased; error rate returned to < 0.1%)
Total duration	1 h 46 min (12 min detection lag; 1 h 34 min active response)
User impact	~1 800 failed payment attempts; no data loss or incorrect charges; affected users received an HTTP 503 and could retry successfully after 15:38 UTC
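
The duration figures in the summary can be cross-checked directly from the three timestamps. The following is a minimal sketch in Python; the helper names `parse_utc` and `fmt_duration` are illustrative and not part of any incident tooling referenced in this report.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def fmt_duration(delta: timedelta) -> str:
    """Render a timedelta in the 'X h Y min' style used in the summary table."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min" if hours else f"{minutes} min"


# Timestamps taken from the Incident Summary table.
started = parse_utc("2024-05-17 13:52")
detected = parse_utc("2024-05-17 14:04")
resolved = parse_utc("2024-05-17 15:38")

detection_lag = fmt_duration(detected - started)     # "12 min"
active_response = fmt_duration(resolved - detected)  # "1 h 34 min"
total_duration = fmt_duration(resolved - started)    # "1 h 46 min"

print(detection_lag, active_response, total_duration, sep=" | ")
```

Deriving the figures from the raw timestamps, rather than typing them by hand, keeps the summary consistent with the timeline if a timestamp is later corrected.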

2. Timeline of Events

Time (UTC)	Event	Source / owner
13:52	RDS `DatabaseConnections` reached the `max_connections` ceiling (100) — new checkout requests began returning HTTP 503	Datadog RDS dashboard
14:04	`CheckoutErrorRate` Datadog monitor fired (5-min evaluation window elapsed)	PagerDuty → Jordan Osei
14:07	Jordan acknowledged PagerDuty alert; opened `#inc-2024-05-17-payments-degraded`	Jordan Osei
14:09	SEV-2 declared; `#platform-oncall` notified; Engineering Director Fatima Yusuf notified via Slack DM	Jordan Osei
14:15	Ruled out recent deploy — last deploy was 2024-05-15; no feature flag changes in the last 24 h	Jordan Osei
14:22	Identified `DatabaseConnections = 100/100` in Datadog; suspected connection pool exhaustion	Jordan Osei
14:35	DBA on-call (Sam Reid) joined the incident channel	Sam Reid
14:40	Status page updated: "Investigating checkout issues"	Jordan Osei
14:55	Confirmed root cause: v2.4 async payment orchestration layer holds connections longer — pool exhausts under sustained 500 RPS load	Sam Reid
15:20	`max_connections` increased to 200 on RDS (parameter group updated, instance restarted)	Sam Reid
15:38	Error rate returned to < 0.1% for 5 consecutive minutes; recovery confirmed	Jordan Osei
15:40	Incident declared resolved; status page updated to "All Systems Operational"	Jordan Osei
15:45	Resolution comms sent to Customer Success; stakeholders notified	Jordan Osei
16:00	RCA session scheduled for 2024-05-20	Jordan Osei

3. Root Cause Analysis

3.1 Five Whys

Problem statement: Checkout error rate rose to 2.3% on 2024-05-17, causing ~1 800 failed payment attempts over 1 h 46 min.

Why did the checkout error rate rise to 2.3%? Because the API was returning HTTP 503 for a significant proportion of payment requests.
Why was the API returning HTTP 503? Because the RDS PostgreSQL connection pool was fully exhausted — no connections were available for new requests.
Why was the connection pool exhausted? Because the v2.4 async payment orchestration layer holds database connections open for longer (awaiting async callback confirmation) compared to the synchronous v2.3 path, consuming more connections per request at the same throughput level.
Why was the connection pool not sized to account for this? Because the max_connections value was not reviewed or updated when the async orchestration design was introduced in v2.4.
Why was the connection lifecycle impact not addressed before production? Because although the soak run in performance test report PTR-2024-011 identified the connection pool exhaustion, the release was approved before the remediation action was completed. The approval decision was made on the strength of the load and stress test results, which passed.

Root cause: The v2.4 release was approved and deployed before the connection pool exhaustion identified in the pre-release soak test was remediated, because the release approval process did not require soak test sign-off for a conditional pass verdict.
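
The link between connection hold time and pool exhaustion in the third "why" can be approximated with Little's law: average connections in use equals request rate multiplied by average hold time. The sketch below assumes every request holds exactly one database connection. The 500 RPS figure comes from the timeline; the hold times are hypothetical values chosen for illustration, not measurements from v2.3 or v2.4.

```python
def connections_in_use(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: mean concurrency = arrival rate x mean time each item is held."""
    return requests_per_second * hold_seconds


def max_hold_seconds(requests_per_second: float, max_connections: int) -> float:
    """Longest average hold time the connection ceiling can sustain at this rate."""
    return max_connections / requests_per_second


RPS = 500  # sustained load quoted in the timeline

# Hold-time budget at the old and new ceilings.
print(max_hold_seconds(RPS, 100))  # 0.2
print(max_hold_seconds(RPS, 200))  # 0.4

# Hypothetical hold times, for illustration only (not measured values).
for hold in (0.10, 0.25):
    needed = connections_in_use(RPS, hold)
    status = "exceeds" if needed > 100 else "fits within"
    print(f"hold={hold:.2f}s needs ~{needed:.0f} connections; {status} a ceiling of 100")
```

Under these assumptions, a ceiling of 100 connections at 500 RPS leaves an average budget of 0.2 s per request, so any change that pushes the hold time past that point exhausts the pool at unchanged throughput.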

3.2 Contributing Factors

Technical

No alert existed for DatabaseConnections approaching the ceiling — the only alerting was on the downstream error rate, introducing a 12-minute detection lag.
max_connections was set to a default of 100 — never reviewed as throughput grew over the past two years.

Process

The release approval gate required load and stress test sign-off but not soak test sign-off, allowing a known conditional pass to ship.
The performance test report (PTR-2024-011) recommended a hold, but the release manager was not explicitly required to review and sign off on performance test reports before approving releases.

Organisational

Performance testing is owned by QA; release approval is owned by Engineering management. There was no formal handshake between the two, so the "hold" recommendation did not block the release.

Human

The release manager reviewed the executive summary of the performance report (which said "conditional pass") but did not read far enough to see the soak test failure or the "hold release" recommendation.

4. What Worked Well

Monitoring detected the issue within 12 minutes of it starting — fast relative to previous incidents.
The incident channel was opened and stakeholders were notified within 5 minutes of the alert firing (17 minutes after the incident started).
Sam Reid (DBA on-call) joined quickly and diagnosed the root cause within 20 minutes of joining.
The mitigation (increasing max_connections to 200) was simple and effective: after the parameter change and instance restart at 15:20, the error rate was back below 0.1% by 15:38.
The resolution message and status page update were accurate and published promptly.

5. What Did Not Work Well

The 12-minute detection lag came from alerting only on the downstream error rate, which had to exceed 1% for 5 consecutive minutes before the monitor fired. A connection saturation alert could have fired by 13:52, when connections hit the ceiling, rather than at 14:04.
The release approval process allowed a known conditional-pass performance result to reach production without an explicit sign-off from the performance tester.
The executive summary of the performance report was ambiguous — "conditional pass" did not clearly communicate that the recommendation was to hold the release.
The status page was updated at 14:40, 36 minutes after the alert fired and 31 minutes after SEV-2 was declared, which is slower than the 30-minute target in the runbook.

6. Action Items

Action	Owner	Priority	Target date	Status
Add Datadog alert: `DatabaseConnections > 80%` of `max_connections` for 2 consecutive minutes → PagerDuty	Sam Reid	P1	2024-05-24	Open
Increase `max_connections` to 200 on all environments (already done in production; replicate to staging and perf)	Sam Reid	P1	2024-05-24	Open
Update release approval process: soak test sign-off required if performance test verdict is not a clean pass	Engineering Director	P1	2024-05-31	Open
Update performance test report template: use PASS / FAIL / HOLD — remove "conditional pass" wording	Jordan Osei	P2	2024-05-28	Open
Review `max_connections` for all other services — audit for any that have not been reviewed in > 12 months	DBA team	P2	2024-06-14	Open
Add a "connection pool saturation" scenario to the incident runbook	Jordan Osei	P3	2024-06-07	Open
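
The condition in the first action item (connections above 80% of `max_connections` for 2 consecutive minutes) can be sketched as plain evaluation logic. This is not Datadog monitor syntax; the function name `saturation_alert_index` and the per-minute sample values are hypothetical and serve only to show how the threshold and the consecutive-minute requirement interact.

```python
from typing import Optional, Sequence


def saturation_alert_index(
    samples: Sequence[int],
    max_connections: int,
    threshold: float = 0.80,
    consecutive: int = 2,
) -> Optional[int]:
    """Return the index of the sample at which the alert would fire, or None.

    `samples` holds one connection count per minute. The alert fires once
    `consecutive` samples in a row are strictly above threshold * max_connections.
    """
    limit = threshold * max_connections
    streak = 0
    for index, value in enumerate(samples):
        streak = streak + 1 if value > limit else 0
        if streak >= consecutive:
            return index
    return None


# Hypothetical per-minute connection counts ramping towards a ceiling of 100.
ramp = [70, 78, 83, 79, 85, 91, 100]
print(saturation_alert_index(ramp, max_connections=100))  # 5
```

In this made-up ramp the single spike to 83 does not page anyone, while the sustained climb fires one sample before the ceiling is reached. Whether an 80% threshold leaves enough lead time in practice depends on how quickly connections ramp under real load.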

7. Lessons Learned

A performance test "hold" recommendation must be a hard gate in the release process, not a soft signal that can be overridden without a written exception. Build the gate into the workflow, not the document.
Alerting on leading indicators (resource saturation) catches incidents earlier than alerting on lagging indicators (error rates). Design alerts to fire before users are affected, not after.
The audience for an executive summary is people who may act on it without reading further. The summary must carry the full verdict — including recommendations — rather than softening them.
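
The first lesson, building the gate into the workflow, could look roughly like the check below. It is a sketch under stated assumptions: the `Verdict` values follow the PASS / FAIL / HOLD wording proposed in the action items, while `REQUIRED_TESTS`, `release_blockers`, and the example verdicts are hypothetical and do not describe the team's actual release tooling.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    HOLD = "HOLD"


# Hypothetical gate definition: every listed test type needs a clean PASS.
REQUIRED_TESTS = ("load", "stress", "soak")


def release_blockers(verdicts: Dict[str, Verdict]) -> List[str]:
    """List the reasons a release must not proceed; an empty list means approved."""
    blockers = []
    for test in REQUIRED_TESTS:
        verdict = verdicts.get(test)
        if verdict is None:
            blockers.append(f"{test}: no verdict recorded")
        elif verdict is not Verdict.PASS:
            blockers.append(f"{test}: {verdict.value}")
    return blockers


# Illustrative mapping of the v2.4 situation: load and stress passed,
# and the soak result carried a hold recommendation.
v2_4 = {"load": Verdict.PASS, "stress": Verdict.PASS, "soak": Verdict.HOLD}
print(release_blockers(v2_4))  # ['soak: HOLD']
```

Because a missing verdict is treated as a blocker, a test type that was never run cannot pass the gate by omission. A written-exception path, if the team wants one, would need to be an explicit and recorded override rather than a default.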
