On-call handover
Context for the next on-call rotation: open incidents, hot systems, deferred work, watch-outs.
On-call Handover — YYYY-MM-DD
Outgoing on-call: Name
Incoming on-call: Name
Handover period: YYYY-MM-DD HH:MM UTC → YYYY-MM-DD HH:MM UTC
Time zone note: Any time zone considerations for the incoming person
1. Rotation Details
| Field | Detail |
|---|---|
| Outgoing on-call | Name |
| Incoming on-call | Name |
| Shift start (outgoing) | YYYY-MM-DD HH:MM UTC |
| Shift end (outgoing) | YYYY-MM-DD HH:MM UTC |
| Backup on-call (incoming shift) | Name |
2. Open Incidents (Active or Recently Closed)
| Incident | Severity | Status | Owner | Next action | Link |
|---|---|---|---|---|---|
| Incident title or ID | SEV-1/2/3 | Active / Monitoring / Closed | Name | Describe what needs to happen next | Link |
If there are no open incidents, state: "No open incidents at handover."
3. Recent Deploys (Last 48 h)
| Service | Version | Deployed by | Deployed at | Notes / watch-outs |
|---|---|---|---|---|
| Service name | v0.0 | Name | YYYY-MM-DD HH:MM UTC | Anything worth monitoring |
If no deploys in the last 48 h, state: "No deploys in the last 48 h."
4. Hot Systems / Watch-outs
| System | Why it's hot | What to watch for | Mitigation if it goes wrong |
|---|---|---|---|
| System name | Brief context | Specific metric or symptom | Mitigation step or runbook link |
5. Deferred Work
Things that came up during your shift that you did not action — context for the incoming on-call.
- Item: brief description and why it was deferred
- Item: brief description and why it was deferred
If nothing deferred, state: "Nothing deferred."
6. Useful Links
| Resource | Link |
|---|---|
| Monitoring dashboard | URL |
| Alerting console | URL |
| Incident runbook | URL |
| Status page admin | URL |
| On-call calendar | URL |
| Escalation contacts | URL or list |
7. Sign-off
Outgoing on-call: Name, YYYY-MM-DD HH:MM UTC
Incoming on-call confirmed receipt: Name, YYYY-MM-DD HH:MM UTC
Anything else worth noting before you hand over:
On-call Handover — 2024-05-17 (Friday evening → Monday morning)
Outgoing on-call: Jordan Osei
Incoming on-call: Priya Mehta
Handover period: 2024-05-17 18:00 UTC → 2024-05-20 09:00 UTC
Time zone note: Jordan is in London (BST = UTC+1). Priya is in Berlin (CEST = UTC+2). All times in this document are UTC.
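Since every time in this document is UTC, it can help to see the handover window in each person's local time. The snippet below is a minimal sketch that uses the fixed offsets from the time zone note (BST = UTC+1, CEST = UTC+2). The helper name `to_local` is made up for illustration, and the fixed offsets are only valid as long as no daylight-saving change falls inside the handover period.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets taken from the time zone note above. Assumption: no
# daylight-saving change happens during the handover period.
BST = timezone(timedelta(hours=1), "BST")
CEST = timezone(timedelta(hours=2), "CEST")


def to_local(utc_text: str, tz: timezone) -> str:
    """Convert a 'YYYY-MM-DD HH:MM' UTC timestamp to the given zone."""
    utc_time = datetime.strptime(utc_text, "%Y-%m-%d %H:%M").replace(
        tzinfo=timezone.utc
    )
    return utc_time.astimezone(tz).strftime("%Y-%m-%d %H:%M %Z")


handover_start = "2024-05-17 18:00"
handover_end = "2024-05-20 09:00"

for city, tz in (("London", BST), ("Berlin", CEST)):
    print(city, to_local(handover_start, tz), "->", to_local(handover_end, tz))
```

With these offsets, the 18:00 UTC handover is 19:00 in London and 20:00 in Berlin.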
1. Rotation Details
| Field | Detail |
|---|---|
| Outgoing on-call | Jordan Osei |
| Incoming on-call | Priya Mehta |
| Shift start (outgoing) | 2024-05-13 09:00 UTC |
| Shift end (outgoing) | 2024-05-17 18:00 UTC |
| Backup on-call (incoming shift) | Sam Reid (DBA), Marcus Webb (Security) |
2. Open Incidents (Active or Recently Closed)
| Incident | Severity | Status | Owner | Next action | Link |
|---|---|---|---|---|---|
| INC-2024-047 Payment processing degradation | SEV-2 | Closed (monitoring) | Jordan Osei | Monitor CheckoutErrorRate and DatabaseConnections over the weekend. If error rate rises above 0.5%, follow the DB connection pool runbook section in #inc-2024-05-17-payments-degraded. Fatima (Engineering Director) is aware and expects an update Monday morning. | Notion/Incidents/INC-2024-047 |
Context on INC-2024-047: Checkout error rate hit 2.3% on Friday afternoon due to RDS connection pool exhaustion introduced by the v2.4 async payment orchestration layer. We mitigated it by increasing max_connections to 200 (hotfix v2.4.1); the underlying connection lifecycle problem has not been fixed in code yet. The RCA is scheduled for Monday 2024-05-20 at 10:00 UTC, and I'll be on that call. You do not need to prepare anything for it unless something changes over the weekend.
3. Recent Deploys (Last 48 h)
| Service | Version | Deployed by | Deployed at | Notes / watch-outs |
|---|---|---|---|---|
| Payments API | v2.4.1 (hotfix) | Sam Reid | 2024-05-17 15:10 UTC | Increased max_connections to 200. Low-risk config change — required RDS instance restart (< 30 s downtime, no alerts fired). |
| Notification Service | v1.9.3 | Dev Patel | 2024-05-16 14:22 UTC | Routine dependency bump. No issues observed in the 28 h since deploy. |
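As a quick check on the "last 48 h" window, the sketch below computes how long before handover each deploy in the table went out. The function names and the `deploys` dictionary are illustrative only; the timestamps are copied from the table and the handover time from the header.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"
WINDOW = timedelta(hours=48)  # the "last 48 h" reporting window


def age_at_handover(deployed_at: str, handover_at: str) -> timedelta:
    """Time elapsed between a deploy and the handover (both in UTC)."""
    return datetime.strptime(handover_at, FMT) - datetime.strptime(deployed_at, FMT)


def in_window(deployed_at: str, handover_at: str) -> bool:
    """True if the deploy happened within the 48 h before handover."""
    return timedelta(0) <= age_at_handover(deployed_at, handover_at) <= WINDOW


handover = "2024-05-17 18:00"
deploys = {
    "Payments API v2.4.1": "2024-05-17 15:10",
    "Notification Service v1.9.3": "2024-05-16 14:22",
}

for name, deployed_at in deploys.items():
    hours = age_at_handover(deployed_at, handover).total_seconds() / 3600
    print(f"{name}: {hours:.1f} h before handover, "
          f"in window: {in_window(deployed_at, handover)}")
```

Both deploys fall inside the window; the Notification Service deploy is 27 h 38 min old at handover, which is where the rounded "28 h" figure comes from.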
4. Hot Systems / Watch-outs
| System | Why it's hot | What to watch for | Mitigation if it goes wrong |
|---|---|---|---|
| Payments API (RDS connection pool) | Just recovered from INC-2024-047. The fix is live but the root cause (async connection lifecycle) has not been addressed in code yet. | DatabaseConnections metric in Datadog. The alert threshold is 160/200 (80%), but the monitor for it is still in draft (see INFRA-891 under Deferred Work), so check the metric by hand until it is active. If you see it climbing steadily, investigate immediately. | Runbook: #inc-2024-05-17-payments-degraded (pinned). Emergency: increase max_connections again, to 300 (Sam Reid has the RDS credentials and knows the process). |
| Search Service | Elasticsearch cluster has been running at 70% disk usage for two weeks. Ticket INFRA-892 is open. Not urgent but could become SEV-3 if disk fills. | Datadog → Elasticsearch dashboard → Disk Usage %. Alert fires at 85%. | Contact Dev Patel, who owns INFRA-892. Do not delete indices without checking with him first. |
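The numeric watch-outs above (checkout error rate above 0.5%, connections at 80% of max_connections, disk usage at 85%) can be summarised as a small check. This is only a sketch for reasoning about the thresholds, not how the Datadog monitors are configured; the function names are made up, and scaling the 80% ceiling when max_connections is raised to 300 is an assumption.

```python
# Thresholds copied from this handover. Names are illustrative only.
CHECKOUT_ERROR_RATE_LIMIT_PCT = 0.5  # act if the rate rises above this
CONNECTION_ALERT_PCT = 80            # alert at 80% of max_connections
DISK_USAGE_ALERT_PCT = 85            # Elasticsearch disk alert
MAX_CONNECTIONS = 200                # value set by the v2.4.1 hotfix


def connection_alert_count(max_connections: int,
                           alert_pct: int = CONNECTION_ALERT_PCT) -> int:
    """Connection count at which the 80% ceiling is reached."""
    return max_connections * alert_pct // 100


def watch_outs(error_rate_pct: float, db_connections: int,
               disk_usage_pct: float,
               max_connections: int = MAX_CONNECTIONS) -> list[str]:
    """Return the watch-outs that the given readings would trigger."""
    findings = []
    if error_rate_pct > CHECKOUT_ERROR_RATE_LIMIT_PCT:
        findings.append("checkout error rate above 0.5%: follow the DB connection pool runbook")
    if db_connections >= connection_alert_count(max_connections):
        findings.append("DB connections at or above 80% of max_connections: investigate")
    if disk_usage_pct >= DISK_USAGE_ALERT_PCT:
        findings.append("Elasticsearch disk at or above 85%: contact the INFRA-892 owner")
    return findings


print(connection_alert_count(200))  # ceiling with the current setting
print(connection_alert_count(300))  # ceiling if the emergency increase is applied
print(watch_outs(error_rate_pct=0.2, db_connections=120, disk_usage_pct=70))
```

With the current setting the ceiling is 160 connections. If the emergency increase to 300 is ever applied, an 80% ceiling would sit at 240, so the monitor threshold would need updating to match.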
5. Deferred Work
- INFRA-891 (add DatabaseConnections alert): I created the Datadog monitor but it is not yet saved (I ran out of time). The monitor is in draft at Datadog/Monitors/Drafts, named "DB Connections 80% ceiling". Please save and activate it on Monday morning, or ask Sam Reid to do so. This is listed as a P1 action item in the INC-2024-047 RCA.
- PagerDuty schedule audit: Fatima asked us to verify the on-call rotation is correct for June. I have not done this yet. Low urgency, so it can wait until next week. The ask is in #platform-oncall (search "rotation June").
6. Useful Links
| Resource | Link |
|---|---|
| Monitoring dashboard | grafana.acmecorp.internal/d/platform-overview |
| Alerting console | app.datadoghq.com/monitors/manage |
| Incident runbook | Notion/Runbooks/Incident-Response |
| Status page admin | manage.statuspage.io (credentials in 1Password → "StatusPage Admin") |
| On-call calendar | PagerDuty → Schedules → "Platform On-call" |
| Escalation contacts | Notion/Runbooks/Escalation-Contacts |
7. Sign-off
Outgoing on-call: Jordan Osei — 2024-05-17 17:55 UTC
Incoming on-call confirmed receipt: Priya Mehta — 2024-05-17 18:03 UTC
Jordan's parting note: The weekend is likely to be quiet — no planned deploys, no sales events. The only thing I am actively watching is the INC-2024-047 recovery. If anything comes up, Sam Reid is a great first call for anything infrastructure-related. Have a good shift.