Gregory Komissarov

On-call rotation design: schedules, escalation, and avoiding burnout

Designing an on-call rotation that keeps customers happy and engineers sane. Schedule patterns, escalation policies, alert hygiene, compensation, and the day-to-day handoff doc.

Tags: guide, on-call, incidents, sre

A bad on-call rotation does two things to a team. First it leaks customer trust, slowly, through alerts that go unanswered or take too long to escalate. Then it leaks people, faster than you can hire, through burnout that hides behind individual resignations until you realise that everyone who knew the payment service left within a year.

Designing a good rotation is mostly common sense applied with discipline. The shape of the schedule, the escalation policy, the alert hygiene, the handoff process, and the compensation are all separate problems. Most teams get one or two right and lose people on the others.

This guide walks through each piece in the order they should be decided.

What “good” feels like

Before getting into specifics, here’s the shape of a healthy rotation as a quick gut check:

  • Engineers know exactly when they’re on-call, weeks in advance
  • The first hour of a shift is boring more often than not
  • A typical week produces 0–3 actionable pages outside business hours
  • Every page has a runbook, or the absence of one is itself a known gap
  • Handing over to the next shift takes ≤ 10 minutes
  • The rotation has a primary and a secondary, and the primary trusts the secondary will pick up if they don’t ack
  • Engineers can swap shifts without manager approval
  • The on-call window comes with paid time off and explicit compensation

If your rotation feels different from this, it’s worth asking which of the pieces below are out of place.

Schedule patterns

The right shape of a schedule depends on how big the team is and how much your service tolerates handoff latency.

Weekly rotation, single timezone

The simplest shape: one engineer is on-call from Monday morning to the next Monday morning. Works well for teams of 4–8 engineers in roughly the same timezone, where night-time incidents are rare.

Trade-off: the on-call week is heavy. If three pages happen at 03:00 in a single week, the engineer is wiped out for the whole week. A two-tier (primary + secondary) version of this pattern offsets the worst of it: the primary takes the page; the secondary takes over if the primary doesn’t ack within a window.
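As a rough illustration of the two-tier weekly pattern, the sketch below generates a Monday-to-Monday schedule. The function name, the engineer names, and the choice to make next week's primary this week's secondary are all assumptions made for the example, not something the pattern requires.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, first_monday, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    The secondary is the next engineer in the list, so nobody is
    primary and secondary in the same week.
    """
    if len(engineers) < 2:
        raise ValueError("a two-tier rotation needs at least two engineers")
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((first_monday + timedelta(weeks=week), primary, secondary))
    return schedule

team = ["ana", "ben", "chi", "dev"]  # hypothetical names
for start, primary, secondary in weekly_rotation(team, date(2025, 1, 6), 5):
    print(start.isoformat(), "primary:", primary, "secondary:", secondary)
```

One side effect of this ordering is that each secondary becomes primary the following week, so they start their heavy week with some context already.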

Follow-the-sun

Two or three rotations in different timezones, each covering business hours. London 09:00–17:00, then NYC 09:00–17:00, then Sydney 09:00–17:00. Nobody is paged at night.

Works only when the company can support engineers in multiple timezones, the service has good handoff hygiene (otherwise context is lost at every boundary), and the on-call work itself is amenable to async — long-running investigations are hard to hand off mid-stream.

Weekday + weekend split

The 168 hours of a week split into 5 weekday shifts (Mon–Fri 09:00–18:00) and 5 off-hours / weekend shifts (everything else). Engineers rotate through both, but separately.

Useful for teams large enough to support it (10+) where the off-hours load is meaningfully different from the weekday load. The trade-off is that it’s harder to schedule, and the off-hours engineer often has less context about whatever was deployed on Friday afternoon.
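To see how uneven the two halves of this split are, here is a small worked calculation. It assumes the five off-hours shifts are four weeknights (18:00 to 09:00 the next day) plus one long weekend shift from Friday 18:00 to Monday 09:00; other layouts are possible.

```python
HOURS_PER_WEEK = 7 * 24                      # 168

weekday_shift_hours = 18 - 9                 # 09:00-18:00
weekday_hours = 5 * weekday_shift_hours      # Mon-Fri

# Assumed layout of the five off-hours shifts: four weeknights
# (18:00 to 09:00 next day) plus one weekend shift
# (Fri 18:00 to Mon 09:00).
weeknight_hours = (24 - 18) + 9              # 15
weekend_hours = (24 - 18) + 24 + 24 + 9      # 63
off_hours = 4 * weeknight_hours + weekend_hours

# The two halves must add up to the full week.
assert weekday_hours + off_hours == HOURS_PER_WEEK
print(f"weekday: {weekday_hours}h, off-hours: {off_hours}h "
      f"({off_hours / HOURS_PER_WEEK:.0%} of the week)")
```

Under these assumptions the off-hours rotation covers roughly three quarters of the clock, which is one reason to staff and compensate it separately.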

Primary + secondary, always

Whatever the schedule shape, run two tiers. The primary handles every page; the secondary takes over if the primary doesn’t ack inside (typically) 5 minutes. The cost is a second engineer who’s mostly idle; the benefit is the primary can take a shower without the service going down.

Pattern selection cheat sheet

| Team size | Timezone spread | Recommended pattern |
| --- | --- | --- |
| 3–4 | Single timezone | Weekly, primary only |
| 5–8 | Single timezone | Weekly, primary + secondary |
| 5–8 | Spread over 2+ TZs | Weekly with handoff at start of each engineer’s day |
| 9–15 | Spread over 2+ TZs | Follow-the-sun, two tiers each |
| 15+ | Anywhere | Weekday/weekend split, two tiers |

There’s no perfect answer. The right shape is the one your team can actually staff sustainably.
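The cheat sheet can also be written as a lookup, which makes its gaps visible. This is only a sketch: the function name and the tuple layout are made up for the example, rows are checked in table order with the first match winning (so a spread-out team of exactly 15 lands on follow-the-sun), and combinations the table does not list return `None` rather than a guess.

```python
from typing import Optional

# (min_size, max_size, spread, pattern) rows copied from the cheat sheet.
# spread is "single", "multi", or "any"; max_size None means open-ended.
CHEAT_SHEET = [
    (3, 4, "single", "Weekly, primary only"),
    (5, 8, "single", "Weekly, primary + secondary"),
    (5, 8, "multi", "Weekly with handoff at start of each engineer's day"),
    (9, 15, "multi", "Follow-the-sun, two tiers each"),
    (15, None, "any", "Weekday/weekend split, two tiers"),
]

def recommend_pattern(team_size: int, timezones: int) -> Optional[str]:
    """Return the first matching cheat-sheet row, or None if none matches."""
    spread = "single" if timezones <= 1 else "multi"
    for lo, hi, row_spread, pattern in CHEAT_SHEET:
        size_ok = team_size >= lo and (hi is None or team_size <= hi)
        if size_ok and row_spread in (spread, "any"):
            return pattern
    return None

print(recommend_pattern(6, 1))    # single-timezone team of six
print(recommend_pattern(12, 3))   # twelve engineers across three timezones
print(recommend_pattern(12, 1))   # not covered by the table
```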

The handoff

The handoff is where most rotation context dies.

A good handoff is a fixed-format document, written by the outgoing on-call, posted in a known channel at a known time. The format below has held up across a lot of teams I’ve seen:

## On-call handoff — week of <date>

### Active issues
- <one-liner per active issue with link to incident or ticket>

### Recent deploys (last 24h)
- <service> — <change description> — <ship date> — <author>

### Watch list
- <thing that hasn't broken yet but is making noises>
- <flaky monitor we're investigating>

### Runbook updates this week
- <runbook> — <what changed>

### Anything I'd want to know if I were you
- <free text>

It takes the outgoing engineer 5–10 minutes to write and the incoming engineer 5 minutes to read. Skipping it costs 20–60 minutes the first time something pages, because the new engineer has to reconstruct context that was already in someone’s head.

For follow-the-sun rotations, this document is mandatory at every boundary. For weekly rotations, write it on Monday morning and update it daily.

Escalation policies

The escalation policy is the rule for what happens when a page goes unanswered.

A workable default for a team of 8 with two tiers:

| Step | Wait | Notify |
| --- | --- | --- |
| 1 | 0 | primary on-call |
| 2 | +5 min | secondary on-call |
| 3 | +10 min | engineering manager |
| 4 | +20 min | escalation manager / VP eng |

Notice the gap between steps. 5 minutes is enough time for a primary to ack a page they’ve heard but not yet looked at. 10 more minutes is enough to drag in a secondary if the primary is genuinely unavailable. By step 4, you’re 35 minutes into the page, which means something has gone seriously wrong with the rotation itself, not just the incident.
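The timing is easier to check when the policy is written down as data. The sketch below treats each wait as relative to the previous step, which is the reading that gives the 35-minute figure; the class and function names are illustrative and not tied to any particular paging tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    wait_minutes: int   # wait after the previous step before notifying
    notify: str

POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(5, "secondary on-call"),
    EscalationStep(10, "engineering manager"),
    EscalationStep(20, "escalation manager / VP eng"),
]

def timeline(policy):
    """Yield (minutes since the page fired, target) for each step."""
    elapsed = 0
    for step in policy:
        elapsed += step.wait_minutes
        yield elapsed, step.notify

def notified_by(policy, minutes_unacked):
    """Targets already notified if nobody has acked for this long."""
    return [who for at, who in timeline(policy) if at <= minutes_unacked]

for at, who in timeline(POLICY):
    print(f"t+{at:>2} min: {who}")
```

Running the same helper against a proposed policy is a quick way to spot steps that sit too close together before anyone gets paged by them.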

Two common mistakes:

  • Steps too close together. A 1-minute escalation to the manager produces noise; the primary couldn’t even finish reading the alert payload.
  • Last step is “the whole team.” A blast page at the end of the chain incentivises everyone to rely on someone else and trains the team to ignore on-call notifications.

If you ship an alert to a chat channel, pair it with the page — but don’t replace the page with the chat post. Chat is for visibility and collaboration; the page is for guaranteed delivery.

Alert hygiene

The single biggest predictor of on-call sustainability isn’t the schedule shape — it’s how many alerts fire per shift.

A team with 30 actionable alerts a week will burn out regardless of how clever the rotation is. The same team with 3 actionable alerts a week can run an exhausting schedule for a long time without anyone quitting.

The two failure modes:

Too many alerts. Most monitoring deployments accumulate alerts the way old code accumulates TODOs. Every postmortem adds a new alert; nothing ever removes one. After two years, the on-call inbox is unreadable.

The fix is a quarterly alert audit. Every alert should answer three questions:

  1. Is it actionable? If the response is “wait and see,” delete it. That’s a metric, not an alert.
  2. Is it urgent? If you wouldn’t mind seeing it in your morning email, it’s not a page; it’s a ticket. Move it.
  3. Is it correct? If the alert fires more often than the underlying issue actually occurs, retune the threshold or the cardinality.

A good rule: every page during a shift should produce one of three outcomes — fix the underlying issue, write a follow-up to fix it next sprint, or silence the alert because it was wrong. Pages that produce none of these are the kind that drive burnout.
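The three audit questions translate into a simple triage function. The field names here are assumptions about what a team might record per alert during the audit window; the order of the checks follows the order of the questions.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    actionable: bool      # is there something to do besides wait and see?
    urgent: bool          # would a morning email be too late?
    fired: int            # times the alert fired in the audit window
    real_incidents: int   # times the underlying issue actually occurred

def audit(alert: Alert) -> str:
    """Apply the three audit questions in order and return a verdict."""
    if not alert.actionable:
        return "delete (keep as a metric)"
    if not alert.urgent:
        return "move to ticket"
    if alert.fired > alert.real_incidents:
        return "retune threshold or cardinality"
    return "keep as page"

# Hypothetical audit entries.
alerts = [
    Alert("disk-usage-70pct", actionable=False, urgent=False, fired=40, real_incidents=0),
    Alert("cert-expires-30d", actionable=True, urgent=False, fired=1, real_incidents=1),
    Alert("checkout-5xx", actionable=True, urgent=True, fired=12, real_incidents=3),
    Alert("login-unavailable", actionable=True, urgent=True, fired=2, real_incidents=2),
]
for a in alerts:
    print(f"{a.name}: {audit(a)}")
```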

Alerting on causes instead of symptoms. Page on user-visible symptoms (latency, error rate, availability), not on causes (CPU, memory, queue depth). Causes belong on dashboards. Alert on what the user sees and you’ll page on real problems; alert on causes and you’ll page every time the system reroutes itself successfully.

The SLO you set defines the alert thresholds. Page when the burn rate is consuming error budget faster than the SLO allows. Don’t page on “more than 0 errors in the last minute”; that’s not what your SLO promised.
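A minimal sketch of a burn-rate check, assuming a request-based availability SLO. Burn rate here is the observed error rate divided by the error budget (1 minus the SLO target). The paging threshold is a parameter; the 14.4 used in the example is a commonly cited value for a one-hour window against a 30-day SLO, taken here as an assumption and not as a recommendation from this guide.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent in the measured window.

    1.0 means the budget would last exactly the SLO period;
    2.0 means it would be gone in half that time.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget

def should_page(errors, requests, slo_target, max_burn_rate):
    """Page only when the budget is burning faster than allowed."""
    return burn_rate(errors, requests, slo_target) > max_burn_rate

# One error among 20,000 requests is well inside the budget.
print(should_page(errors=1, requests=20_000, slo_target=0.999, max_burn_rate=14.4))
# A 2% error rate burns a 99.9% budget about 20 times too fast.
print(should_page(errors=400, requests=20_000, slo_target=0.999, max_burn_rate=14.4))
```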

Compensation and time off

This is the section most engineers care about and most companies underdo.

A few patterns from teams that retain people through years of on-call:

  • Pay for being on-call, not just for incidents. A flat weekly stipend for the rotation is the simplest version. The exact number depends on country and company, but anything below 5% of base pay per on-call week is too low.
  • Time off after heavy weeks. If a shift produced more than two off-hours pages, the next morning is off. No questions, no calendar requests.
  • No on-call during PTO. If someone’s on vacation, they’re off the rotation. The rest of the team picks up the load. If that breaks the rotation, the rotation was understaffed.
  • No on-call for the first 90 days. New hires shadow before they’re paged. Otherwise you’re testing whether your runbooks work on someone with the least context, which is a poor design.
  • Right to swap. Engineers can swap shifts among themselves without manager approval. The schedule is a tool, not a contract.

These are not perks. They are the cost of running an on-call rotation. A team that runs without them is borrowing against future hiring and retention.
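Several of these rules are mechanical enough to enforce in whatever generates the schedule. The sketch below encodes three of them: the 90-day onboarding window, no on-call during PTO, and the morning off after more than two off-hours pages. The `Engineer` shape and function names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

ONBOARDING_DAYS = 90          # no on-call for the first 90 days
OFF_HOURS_PAGE_LIMIT = 2      # more than this earns the next morning off

@dataclass
class Engineer:
    name: str
    hired: date
    pto: list = field(default_factory=list)   # list of (first_day, last_day)

def eligible(engineer: Engineer, shift_start: date, shift_end: date) -> bool:
    """True if the engineer can be scheduled for the whole shift."""
    if shift_start < engineer.hired + timedelta(days=ONBOARDING_DAYS):
        return False
    # Any overlap between the shift and a PTO range rules them out.
    return not any(first <= shift_end and last >= shift_start
                   for first, last in engineer.pto)

def next_morning_off(off_hours_pages: int) -> bool:
    """True if the shift was heavy enough to earn the next morning off."""
    return off_hours_pages > OFF_HOURS_PAGE_LIMIT
```

If filtering by `eligible` regularly leaves too few people to fill a week, that is the understaffing signal described above, not a reason to relax the rule.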

How Oack models this

The shape of the on-call system in Oack maps directly to the patterns above:

  • Schedules — weekly, follow-the-sun, weekday/weekend, or arbitrary custom shifts
  • Overrides — temporary swaps without rebuilding the schedule
  • Escalation policies — multi-step with per-step wait times and per-step targets (user, schedule, or webhook)
  • Two tiers — primary and secondary built into the same schedule, no parallel rotations to maintain
  • Mobile push notifications — paging that works without depending on a phone vendor’s flaky Slack integration

The product is opinionated about a few things by design. Escalation steps default to 5/10/20 minutes because shorter intervals never improve the outcome and longer intervals push the page outside the budget for response. Schedules expose the override flow prominently because the friction of swapping shifts predicts whether the rotation gets used or worked around.

Where to start

If you don’t have a rotation today:

  1. Pick the simplest pattern that fits the team — usually weekly, single timezone, primary only.
  2. Define one severity above which someone gets paged. Ship every other alert as a chat message or a ticket.
  3. Write a one-page handoff template. Use it the first week even if it feels overformal.
  4. Set up the escalation policy with three steps: primary, secondary or manager, last resort.
  5. Audit alerts at the end of week one. Delete or downgrade anything that fired but wasn’t actionable.
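Step 2 in the list above can be as small as one comparison. In this sketch the severity levels, the choice of `CRITICAL` as the paging threshold, and the `needs_follow_up` flag are all placeholders; the point is that there is exactly one line above which a human gets woken up.

```python
from enum import IntEnum

class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3

PAGE_AT = Severity.CRITICAL   # the single paging threshold (assumed value)

def route(severity: Severity, needs_follow_up: bool = False) -> str:
    """Page at or above the threshold; everything else is a ticket or chat."""
    if severity >= PAGE_AT:
        return "page"
    return "ticket" if needs_follow_up else "chat"

print(route(Severity.CRITICAL))
print(route(Severity.WARNING, needs_follow_up=True))
print(route(Severity.INFO))
```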

If you have a rotation that isn’t working, do the alert audit first. Ninety percent of the time, the problem isn’t the schedule shape — it’s the volume of pages, which masks every other failure mode. Cut the page volume in half and you’ll know whether the rotation needs further redesign or just needed quiet.
