How to run a blameless postmortem (with template)
Blameless doesn't mean no accountability. A working definition, a Markdown template you can copy, and how to run the review meeting so it produces action items that actually ship.
The word “blameless” gets misread two ways. Engineers hear “no accountability” and worry that bad behaviour will go unchecked. Managers hear “no consequences” and quietly resist. Both readings miss the point.
A blameless postmortem isn’t about absolving people. It’s about making the system the subject of investigation instead of the individual. The question shifts from “who pushed the bad config?” to “why was it possible to push a bad config without a check catching it?” The first question gets you a quiet engineer who stops volunteering for risky work. The second gets you a guardrail.
This guide covers when to write a postmortem, the template I use, how to run the review meeting, and how to make sure the action items actually ship.
When to write one
Not every incident deserves a postmortem. The cost of writing one is real — usually 1–2 hours of engineering time across the responder, the writer, and the reviewers, depending on the incident duration and complexity — and if you write them for everything, the bar for what counts as “an incident” rises until people stop reporting near-misses.
A pragmatic rule:
| Trigger | Postmortem? |
|---|---|
| Customer-visible outage > 5 minutes | Yes |
| SLO burn-rate alert that paged | Yes |
| Near-miss caught by automation | Yes (lightweight) |
| Internal-only blip resolved by retry | No |
| Repeated alert from same flaky monitor | No, fix the monitor |
The “near-miss caught by automation” line is the controversial one. Most teams skip these. Don’t. The near-miss is the cheapest postmortem you’ll ever write, because the system’s existing defences already worked — you’re documenting why, so they keep working when the next variant comes along.
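The trigger table can also be written down as a small decision function, which is handy if you want your incident tooling to prompt for a postmortem. This is a minimal sketch: the `Incident` fields, the `Decision` names, and the precedence between rows (a customer-visible outage wins over everything else) are assumptions for illustration, not part of the table.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    FULL = "yes"
    LIGHTWEIGHT = "yes (lightweight)"
    FIX_MONITOR = "no, fix the monitor"
    NONE = "no"


@dataclass
class Incident:
    # Hypothetical fields; map them to whatever your incident records hold.
    customer_visible_minutes: float = 0.0
    slo_burn_alert_paged: bool = False
    near_miss_caught_by_automation: bool = False
    repeat_of_flaky_monitor: bool = False


def postmortem_decision(incident: Incident) -> Decision:
    """Apply the trigger table row by row; the ordering is an assumption."""
    if incident.customer_visible_minutes > 5:
        return Decision.FULL
    if incident.repeat_of_flaky_monitor:
        return Decision.FIX_MONITOR
    if incident.slo_burn_alert_paged:
        return Decision.FULL
    if incident.near_miss_caught_by_automation:
        return Decision.LIGHTWEIGHT
    # Internal-only blips resolved by retry end up here.
    return Decision.NONE
```

For example, `postmortem_decision(Incident(near_miss_caught_by_automation=True))` returns `Decision.LIGHTWEIGHT`.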
The template
Copy this whole block into a new doc when you start a postmortem. Fill in the sections in roughly this order: timeline first, then dig into root cause and contributing factors, then action items, and write the summary last.
# Postmortem: <one-line description of what happened>
**Status:** Draft / In review / Published
**Severity:** SEV-1 / SEV-2 / SEV-3
**Date of incident:** YYYY-MM-DD
**Duration:** <hh:mm>
**Reported by:** Team member / Monitoring / Support / Customer
**Services affected:** <service-a, service-b>
**Incident team:** <names of everyone who actively participated>
**Author:** <name>
**Reviewers:** <names>
**Slack link:** <thread / dedicated channel>
**Status page:** <link to public status page incident>
**Issue tracker:** <link to ticket / epic for follow-up work>
## Summary
<2–4 sentences. What broke, who was affected, how long for, how it was fixed.
A reader who knows nothing about your system should understand the impact from this paragraph alone.>
## Impact
- **Users affected:** <number or %>
- **Requests failed / degraded:** <count and as a fraction of total>
- **Revenue impact (if known):** <amount or "not measurable">
- **SLO budget consumed:** <% of monthly budget>
- **Other systems affected:** <downstream services, customers, etc.>
## Timeline
All times in <timezone>. Use UTC if the responders span timezones.
- HH:MM — <event> (<source: alert / human / log>)
- HH:MM — <event>
- HH:MM — <event>
- HH:MM — Resolved.
## What went well
Things to keep doing. Easy to skip; resist.
- <e.g. responder paged in 90 seconds, runbook was current, rollback worked first try>
## What went poorly
- <e.g. status page update was 20 minutes late, no internal stakeholder comms,
rollback path was undocumented>
## Five whys
<[Read the Wikipedia article.](https://en.wikipedia.org/wiki/Five_whys) In a
nutshell: 5 consecutive root-cause localization steps to overcome the
laziness and inertia of our brains and find the real issue we need to address.>
## Action items
| Owner | Action | Type | Due | Tracking |
| ----- | ------ | ---- | --- | -------- |
| <name> | <verb-led description> | prevent / detect / mitigate / process | YYYY-MM-DD | <issue link> |
| <name> | <…> | … | … | … |
Each action must have a single owner. "The team" is not an owner.
## Reference materials
All factual data — screenshots, command outputs, monitoring chart links,
log excerpts, deploy diffs — goes here for later investigation and analysis.
- <link or embed>
- <link or embed>
You don’t need a tool to start running these. A shared Google Doc or a Markdown file in your incidents repo is fine. Once you have a few, you’ll start to see patterns — that’s when an incident management platform earns its place.
Section-by-section guidance
A few notes on the parts that go wrong most often.
Summary
Write this last, even though it appears first. Until you’ve written the timeline and the root cause, you don’t yet know what the summary should emphasise. Re-read it after you’ve finished the rest of the doc.
The summary is the only part most readers will read. It should answer: what broke, who was affected, how long for, what the fix was. No causes, no action items. Save those for the body.
Timeline
The timeline is the spine of the document. It should be a flat list of events with timestamps, sourced from real evidence wherever possible: alert payloads, Slack messages, deploy logs, audit trails. If you find yourself writing “around 14:20 someone noticed…” go look up the actual minute. Approximate timelines invite approximate analyses.
A common mistake is to include only the response timeline (alert → ack → fix). Include the cause timeline too: when the bad change shipped, when the latent bug became reachable, when traffic crossed the threshold. The gap between cause and detection is often the most useful number in the whole document.
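If the timeline is kept as structured data, the cause-to-detection gap falls out of a subtraction. The sketch below assumes each event is tagged as either `cause` or `response`; the tagging scheme, the function names, and every timestamp in the sample timeline are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical timeline: (ISO 8601 UTC timestamp, kind, event).
timeline = [
    ("2024-01-15T13:42:00+00:00", "cause", "config change deployed"),
    ("2024-01-15T14:05:00+00:00", "cause", "traffic crossed the exhaustion threshold"),
    ("2024-01-15T14:19:00+00:00", "response", "alert paged the on-call engineer"),
    ("2024-01-15T14:23:00+00:00", "response", "page acknowledged"),
    ("2024-01-15T14:51:00+00:00", "response", "rollback complete, resolved"),
]


def first_time(events, kind: str) -> datetime:
    """Earliest timestamp among events of the given kind."""
    # min() raises ValueError if no event has this kind, which is a
    # useful signal that the cause timeline was left out.
    return min(datetime.fromisoformat(ts) for ts, k, _ in events if k == kind)


def detection_gap(events) -> timedelta:
    """Time from the first cause event to the first response event."""
    return first_time(events, "response") - first_time(events, "cause")


print(detection_gap(timeline))  # prints 0:37:00
```

A timeline that contains only response events makes `detection_gap` fail loudly, which mirrors the mistake described above.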
Root cause vs contributing factors
Most outages have one root cause and several contributing factors. The root cause is the thing that, if it hadn’t happened, the incident wouldn’t have occurred. Contributing factors are the things that made the impact worse or recovery harder.
Bad: “Root cause: deploy went out with a bad config.”
Better: “Root cause: a config change to the database connection pool reduced the per-pod max from 100 to 10. At peak traffic, this caused connection exhaustion, returning 500s for the affected service. Contributing factors: (1) the change passed CI because no test exercises connection pool limits; (2) the canary stage saw only 1% of traffic, below the level at which exhaustion manifests; (3) the runbook for “connection pool exhaustion” pointed to a service that no longer exists.”
Five whys
If you only pick up the action items that lie on the surface and skip the deep dive, you can’t build a high-quality service. You’ll fix the symptom, ship a patch, close the postmortem, and watch a variant of the same incident appear three months later in a different service.
Five whys is the simplest tool for forcing yourself past the surface. The technique is exactly what it sounds like: take the answer to “why did this happen?” and ask “why?” again. Then again. Five times is a heuristic, not a quota — sometimes you stop at three, sometimes you keep going to seven. The point is to overcome the natural laziness and inertia of our brains, both of which want to settle on the first plausible answer because that’s where the cognitive load drops off. (Read the Wikipedia article for the long version.)
A worked example. The connection pool incident from above:
- Why did the API return 500s? Connection pool exhaustion in the primary region.
- Why was the pool exhausted? A config change reduced per-pod connections from 100 to 10.
- Why did that change ship? It passed CI and the 1% canary stage without triggering any signal.
- Why didn’t CI catch it? No test exercises connection-pool limits; CI runs against a single-pod sandbox where a 10-connection cap is plenty.
- Why doesn’t CI test against realistic load? Our load-testing rig lives in a separate repository, owned by a different team, and isn’t wired into the deploy pipeline.
The first answer points at the patch (raise the pool size). The fifth answer points at something architectural: the team that owns the deploy pipeline doesn’t own the load-testing rig, and nobody is responsible for making them talk to each other. Without going five steps deep, you fix the connection pool and miss the org-design problem that will produce the next three incidents.
In my view, this is the single most valuable part of the postmortem doc. The wrong architecture, the wrong design pattern, the wrong org chart, the wrong scope of responsibility: none of these sit on the surface. They reveal themselves only when you keep asking why past the point where the answer feels “good enough.”
The discipline that makes it work: stop when an honest answer would be “because humans are fallible” or “because we couldn’t predict that.” At that point you’re moralising, not investigating, and any further “why” is going to produce action items aimed at people instead of systems.
Action items
This is where most postmortems leak value. The action items list is long, owners are vague, deadlines are missing, and six months later nothing has shipped.
Three rules that fix it:
- One owner per action. Always a single name, never a team or “TBD.” If you don’t know who, the action is “<name> finds the right owner by <date>”, which is itself an action.
- Categorise the action. Prevent (stops the cause from recurring), detect (catches the same cause faster next time), mitigate (limits blast radius), or process (changes how the team operates). A postmortem with only “prevent” actions is incomplete; you can’t prevent every cause, so you also need detection and mitigation.
- Track them in your normal issue tracker. Linking from the postmortem to the issue keeps the work visible alongside other engineering work. Action items that live only inside the postmortem document die there.
A useful sanity check at the end of the document: read each action item out loud and ask “if all of these shipped, would the same incident have a smaller impact next time?” If the answer is unclear, the actions are too vague.
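The three rules are mechanical enough to lint. The following is a rough sketch, assuming action items are available as records with the same columns as the template table; the `ActionItem` class, the list of vague owner names, and the separator heuristic for spotting multiple owners are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

CATEGORIES = {"prevent", "detect", "mitigate", "process"}
# Heuristic only: owner values that are clearly not a single person.
VAGUE_OWNERS = {"", "tbd", "team", "the team"}


@dataclass
class ActionItem:
    owner: str
    action: str
    type: str
    due: str
    tracking: str


def lint_action_items(items: List[ActionItem]) -> List[str]:
    """Return a list of human-readable problems; empty means the list passes."""
    problems = []
    for n, item in enumerate(items, start=1):
        owner = item.owner.strip().lower()
        if owner in VAGUE_OWNERS or any(sep in owner for sep in (",", "/", "&")):
            problems.append(f"item {n}: needs exactly one named owner")
        if item.type not in CATEGORIES:
            problems.append(f"item {n}: type must be one of {sorted(CATEGORIES)}")
        if not item.due.strip():
            problems.append(f"item {n}: missing due date")
        if not item.tracking.strip():
            problems.append(f"item {n}: not linked to an issue")
    # A list made up solely of "prevent" actions is incomplete.
    if items and {item.type for item in items} == {"prevent"}:
        problems.append("only 'prevent' actions: add detection or mitigation")
    return problems
```

A check like this catches missing owners, dates, and links, but not vagueness in the action text itself, so the read-aloud check is still worth doing.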
The review meeting
The postmortem document is the artefact; the review meeting is where the team aligns on what it means.
Run it as a 45-minute working session, not a presentation. The author drives, but the goal is for the responders, the people whose code or config caused the incident, and at least one engineer not involved in the incident to read through together and surface anything missing.
A workable agenda:
- 5 min — author reads the summary and impact aloud
- 15 min — walk through the timeline; pause when anyone has questions
- 15 min — discuss root cause and contributing factors; this is where most learning happens
- 10 min — review action items; assign or re-assign owners; set deadlines
Two practical norms make these meetings safer.
Use the document, not the people. When something looks like a mistake, point at the document’s text and ask “should this section say X?” rather than “did you do Y?” This sounds artificial; it works.
The author’s first draft is meant to be wrong. Make it explicit that the room is going to find missing context, wrong timestamps, and weak action items. If the document survives review unchanged, the review meeting failed.
After the meeting, the author updates the document, marks it published, and links it from the relevant Slack channel. Anyone with read access to your engineering docs should be able to find it.
What “blameless” means in practice
Three concrete behaviours separate blameless cultures from the rest:
No naming individuals as causes. Use roles or systems. Not “Alice deployed a bad config”; instead, “the on-call engineer deployed a config that passed CI but failed at runtime.” This is not about hiding who did what — the timeline and Slack history make that obvious — it’s about keeping the analysis focused on systems.
Assume good intent. Write as if every action taken during the incident made sense given what the responder knew at the time. If a decision in retrospect looks bad, the question is “what information would have led to a better decision, and why wasn’t it available?”
No personal action items. “Alice will be more careful” is not an action item. The action item is the system change that catches the next Alice — a CI test, an additional deployment gate, a runbook update.
The bar isn’t “we never blame anyone.” It’s “the incident becomes input to system improvement, not input to a performance review.”
How Oack helps
If you’re using Oack for incident management, the postmortem template is built in. The timeline auto-fills from monitor events, alert deliveries, escalation steps, and any Slack updates posted to the incident channel. The summary, impact, and root-cause sections are AI-pre-filled from the timeline and the linked monitor data — you edit, you don’t start from blank. Action items live alongside the rest of your incident’s metadata, and the published postmortem is reachable from the incident’s permalink for as long as you keep the incident on file.
The pre-fill saves about 60% of the writing time on the postmortems we see go through the system. The reviews still take the same 45 minutes — and that’s the part you don’t want to skip.
What you’ll get out of this
Three months of running postmortems with this template will surface things your monitoring won’t. Patterns: the same pre-deploy gap shows up across multiple incidents, the same on-call handoff window keeps producing slow detection, the same internal service keeps being a contributing factor. Those patterns are the input to a reliability roadmap.
A reliability roadmap is what postmortems are for. The individual document matters; the cross-incident analysis is where the leverage is.
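The cross-incident analysis can start as a simple tally. This sketch assumes each published postmortem has been tagged with its contributing factors; the incident names, the tags, and the `recurring_factors` helper are hypothetical.

```python
from collections import Counter

# Hypothetical data: postmortem id -> contributing-factor tags.
postmortems = {
    "2024-01-15-api-500s": ["no-load-test-in-ci", "stale-runbook", "canary-too-small"],
    "2024-02-03-queue-backlog": ["on-call-handoff", "no-load-test-in-ci"],
    "2024-03-21-login-latency": ["stale-runbook", "no-load-test-in-ci"],
}


def recurring_factors(postmortems, min_incidents=2):
    """Factors that appear in at least `min_incidents` postmortems."""
    # set() so a tag repeated within one postmortem counts once.
    counts = Counter(tag for tags in postmortems.values() for tag in set(tags))
    recurring = [(tag, n) for tag, n in counts.items() if n >= min_incidents]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(recurring, key=lambda pair: (-pair[1], pair[0]))


print(recurring_factors(postmortems))
# [('no-load-test-in-ci', 3), ('stale-runbook', 2)]
```

The factors at the top of that list are candidates for the reliability roadmap.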