Gregory Komissarov

Public status pages: what to say (and not say) during an outage

The job of a status page, the five stages of an incident update with copy you can use, the words that lose customer trust, and the cadence that keeps support inboxes from drowning.

Tags: guide, status-page, incidents, communication

A public status page does three jobs. It tells your users what’s happening so they stop refreshing. It deflects support tickets so your customer success team isn’t drowning in “is it down?” emails. And it builds long-term trust by showing you’re honest when things break.

Most status pages fail at the third job — sometimes at all three — because the writing is bad. Vague language, no timestamps, premature ETAs, no follow-up. Customers learn the page is theatre and stop checking it. The next time you have an outage, your support inbox fills up anyway.

This guide is the part status-page tools don’t ship: what to actually write, when to write it, and what never to write.

The shape of a good incident update

Every status update has the same components. They look obvious until you read a real status page and see how often they’re missing.

| Component | Purpose | How often it appears |
| --- | --- | --- |
| Timestamp (with timezone) | Tells the reader when this is true | Almost always present |
| Affected scope | Which users or features are affected | Often vague |
| What’s happening | What we know is broken | Usually present |
| What we’re doing | What action we’re taking | Sometimes present |
| Workaround (if any) | What the user can do now | Rarely present |
| Next update time | When to come back | Almost never present |

The “next update time” is the highest-leverage missing piece. If you tell users “next update in 30 minutes,” they go away for 30 minutes. If you don’t, they refresh every 90 seconds and then call support. A pre-committed update cadence is a dam against your inbox.
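
The components in the table can be captured as a small data structure, so that an update cannot be rendered without a timezone-aware timestamp and the next-update commitment is hard to forget. This is a minimal Python sketch, not any status-page tool's API: the `StatusUpdate` class, its field names, and the example date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class StatusUpdate:
    """One public incident update. All names here are illustrative."""
    posted_at: datetime                 # must be timezone-aware
    stage: str                          # e.g. "Investigating"
    whats_happening: str                # symptom plus affected scope
    what_were_doing: str                # the action being taken
    next_update_minutes: Optional[int] = None  # None only for a final update
    workaround: Optional[str] = None

    def render(self) -> str:
        if self.posted_at.tzinfo is None:
            raise ValueError("timestamp needs a timezone")
        stamp = self.posted_at.astimezone(timezone.utc).strftime("%H:%M UTC")
        lines = [f"[{stamp}] {self.stage}", self.whats_happening, self.what_were_doing]
        if self.workaround:
            lines.append(f"Workaround: {self.workaround}")
        if self.next_update_minutes is not None:
            lines.append(f"Next update in {self.next_update_minutes} minutes.")
        return "\n".join(lines)


update = StatusUpdate(
    posted_at=datetime(2024, 1, 1, 14, 2, tzinfo=timezone.utc),  # arbitrary example date
    stage="Investigating",
    whats_happening="We're seeing failed requests on /v1/checkout.",
    what_were_doing="We're investigating now.",
    next_update_minutes=15,
)
print(update.render())
```

Rejecting naive timestamps at render time is a design choice of this sketch: it turns a missing timezone into an error before the update is published rather than a question from a customer afterwards.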

The five stages

Every public incident moves through five stages. Each has its own template and its own discipline.

1. Investigating

Posted within 5 minutes of confirming the incident. The goal is to acknowledge before anyone has to ask.

[14:02 UTC] Investigating
We're seeing elevated error rates on the API. Affected users
may experience failed requests on /v1/checkout.
We're investigating now. Next update in 15 minutes.

What this update includes:

  • A timestamp with timezone
  • A specific affected component (/v1/checkout), not “the API”
  • A user-visible symptom (“failed requests”), not a cause
  • An explicit next-update commitment

What it deliberately does not include: a root cause, an ETA, or any speculation. You don’t know any of that yet.

2. Identified

Posted when you have a working hypothesis about the cause. It does not need to be the final root cause — it just needs to be specific enough that you’re not guessing in public.

[14:18 UTC] Identified
The elevated error rate is caused by a database connection
pool exhaustion in our primary region. Roughly 40% of /v1/checkout
requests are returning 503s. We're scaling the pool now.
Next update in 15 minutes.

This is the update where customers form their opinion of how you handle outages. Notice what’s still missing: an ETA. Don’t promise one until you’ve watched the fix work.

3. Monitoring

Posted when you’ve shipped a fix and you’re watching for recovery.

[14:33 UTC] Monitoring
Connection pool has been increased and error rates are returning
to normal. We're monitoring to confirm full recovery.
Next update in 15 minutes if we're still on this stage,
otherwise a "resolved" update.

This stage protects you against celebrating early. If error rates jump again 4 minutes after you posted “resolved,” you’ve just shipped a worse comms outcome than the original incident.
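
If your monitoring exposes error-rate samples, the guard against celebrating early can be made mechanical. The sketch below holds back the "resolved" post until error rates have stayed within bounds for a quiet period (10 minutes by default here). The function name, the sample format, and the threshold are assumptions for illustration, not a feature of any particular tool.

```python
from datetime import datetime, timedelta, timezone


def safe_to_resolve(samples, now, threshold, quiet_minutes=10):
    """Return True once error rates have stayed at or below `threshold`
    for at least `quiet_minutes`.

    samples: list of (timestamp, error_rate) pairs; order does not matter.
    """
    if not samples:
        return False  # no data is not the same as healthy
    breaches = [t for t, rate in samples if rate > threshold]
    # The quiet period starts at the last breach, or at the first sample if none.
    quiet_since = max(breaches) if breaches else min(t for t, _ in samples)
    return now - quiet_since >= timedelta(minutes=quiet_minutes)


def at(hour, minute):
    # Arbitrary example date; only the clock times matter here.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


samples = [(at(14, 30), 0.40), (at(14, 35), 0.02), (at(14, 36), 0.005), (at(14, 45), 0.004)]
print(safe_to_resolve(samples, at(14, 39), threshold=0.01))  # False: 4 quiet minutes
print(safe_to_resolve(samples, at(14, 48), threshold=0.01))  # True: 13 quiet minutes
```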

4. Resolved

Posted when error rates have been within normal bounds for at least 10–15 minutes — long enough that you’d notice another regression.

[14:48 UTC] Resolved
The incident has been resolved. /v1/checkout is operating
normally. A full postmortem will be published within 5 business days.
Total user-visible duration: 14:02–14:35 UTC (33 minutes).

Two things to include here that most teams forget: the total duration with bounds, and the commitment to publish a postmortem. The duration prevents the “how long was that down?” support flood. The postmortem commitment is the trust-building moment.
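
The bounded duration line is easy to get wrong by hand under pressure, so it can be worth generating. A minimal sketch, assuming you have timezone-aware start and end times for the user-visible impact; the function name and the example date are hypothetical.

```python
from datetime import datetime, timezone


def duration_line(start: datetime, end: datetime) -> str:
    """Format the user-visible duration with explicit bounds, in UTC."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("both timestamps need a timezone")
    if end < start:
        raise ValueError("end is before start")
    start, end = start.astimezone(timezone.utc), end.astimezone(timezone.utc)
    minutes = int((end - start).total_seconds() // 60)
    return f"Total user-visible duration: {start:%H:%M}–{end:%H:%M} UTC ({minutes} minutes)."


# Arbitrary example date; the clock times match the incident above.
began = datetime(2024, 1, 1, 14, 2, tzinfo=timezone.utc)
ended = datetime(2024, 1, 1, 14, 35, tzinfo=timezone.utc)
print(duration_line(began, ended))
# Total user-visible duration: 14:02–14:35 UTC (33 minutes).
```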

5. Postmortem (linked separately)

Within 5 business days, link the postmortem from the resolved incident. Most status page tools support attaching a postmortem URL to a closed incident; if yours doesn’t, edit the resolved-update text to add a link once the postmortem is published.

The postmortem doesn’t need to be on the status page itself — a blog post or shared doc is fine. What matters is the discoverable link from the place a customer first looked.
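
The five stages can be modelled as a small state machine so that tooling refuses to skip a step. This is a sketch under assumptions: the forward path follows the stages above, while the two extra transitions (back from Monitoring to Identified when a fix does not hold, and straight from Investigating to Resolved when no impact is found) are illustrative choices rather than rules stated in this guide.

```python
from enum import Enum


class Stage(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"
    POSTMORTEM = "Postmortem"


# Forward path from the five stages, plus two assumed extras:
# Monitoring -> Identified (regression) and Investigating -> Resolved (no impact).
ALLOWED = {
    Stage.INVESTIGATING: {Stage.IDENTIFIED, Stage.RESOLVED},
    Stage.IDENTIFIED: {Stage.MONITORING},
    Stage.MONITORING: {Stage.RESOLVED, Stage.IDENTIFIED},
    Stage.RESOLVED: {Stage.POSTMORTEM},
    Stage.POSTMORTEM: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the transition skips a step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = advance(Stage.INVESTIGATING, Stage.IDENTIFIED)
print(stage.value)  # Identified
```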

Words that lose trust

A short list of phrases that show up on bad status pages and what they look like to a reader.

“Our team is aware.” This is reflexive. Of course you’re aware; you posted the update. Skip it.

“We’re working hard to resolve.” Empty calories. Replace with the specific action: “we’re scaling the connection pool” or “we’re failing over to the backup region.”

“A small number of users.” If you can’t say “about X% of users” or “users on plan Y,” then you don’t yet know who’s affected — say so explicitly: “we’re identifying which users are affected.” The vague version reads as evasion.

“Most users are unaffected.” Same problem from the other direction. The customers reading the page are by definition affected; telling them other people are fine is not comforting.

“Should be resolved shortly.” “Shortly” is meaningless. Either give a real ETA (which you should not unless you’re confident) or say “we’ll update in N minutes.”

“Apologies for any inconvenience.” Boilerplate. Either say something specific (“we know this blocks customers from completing purchases — we’re treating it as a sev-1”) or don’t apologise at all in the update. Save the apology for the resolved update or the postmortem, where you can be more specific.

“Issue resolved.” No bounds, no specifics. “Issue resolved” leaves the reader to guess what’s actually working again.
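
A list like this lends itself to a pre-publish check. The sketch below flags the phrases above with a plain substring match; the phrase list and advice strings paraphrase this section, and the `lint_update` name is hypothetical. It will not catch rewordings, so treat it as a reminder, not a gate.

```python
# Phrase -> short advice, paraphrased from the list above.
VAGUE_PHRASES = {
    "our team is aware": "skip it",
    "working hard": "name the specific action instead",
    "a small number of users": "give a percentage or plan, or say scope is unknown",
    "most users are unaffected": "describe who is affected",
    "shortly": "give a next-update time",
    "any inconvenience": "say something specific or drop the apology",
    "issue resolved": "say what is working again, with bounds",
}


def lint_update(text: str) -> list[str]:
    """Return one warning per vague phrase found in a draft update."""
    lowered = text.lower()
    return [
        f'"{phrase}": {advice}'
        for phrase, advice in VAGUE_PHRASES.items()
        if phrase in lowered
    ]


draft = "Our team is aware and working hard. Should be resolved shortly."
for warning in lint_update(draft):
    print(warning)
```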

Things never to write in a real-time update

Some content belongs in the postmortem only, never in real time:

  • The name of an individual. Even if it’s accurate that “Alice’s deploy” caused the incident, the public update is not where that goes.
  • The name of a vendor as a cause. “Our cloud provider had an issue” is fine if it’s true and you can prove it. “AWS is broken” while the incident is in progress invites a Twitter storm and may turn out to be wrong. Wait for the postmortem.
  • Speculation about future incidents. “This shouldn’t happen again” is something you can defend in a postmortem, after you’ve shipped the fix. In real time, it’s a promise you can’t keep.
  • Internal technical jargon. “Pod eviction loop in the staging cluster cascading to prod” is what the engineers see in Slack. The status page reader needs “the checkout service is failing because a related service is overloaded.” Translate.

Update cadence

The single most important commitment is “next update in N minutes.” The number depends on incident severity:

| Severity | Update cadence | Even if no progress |
| --- | --- | --- |
| Sev-1 (full outage of core feature) | Every 15 minutes | Yes |
| Sev-2 (partial outage, degraded performance) | Every 30 minutes | Yes |
| Sev-3 (minor / single-feature) | At each stage transition | Yes |

The “even if no progress” column matters. A “we’re still investigating, no new information, next update in 15 minutes” message is a feature, not a failure. It tells the reader you haven’t forgotten about them, and it deflects the support ticket they were about to file.

Set a calendar timer when you post “next update in 15 minutes.” Status pages with a consistent cadence get checked again; status pages with sporadic updates are assumed to be broken.
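
The cadence table translates directly into a lookup that tells you when the next update is due. A minimal sketch; the severity keys, the function name, and the example date are assumptions, and the intervals are the ones from the table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Minutes between updates per severity. None means "update at each stage
# transition" instead of on a timer, as for sev-3.
CADENCE_MINUTES = {"sev1": 15, "sev2": 30, "sev3": None}


def next_update_due(severity: str, posted_at: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if there is no timer."""
    minutes = CADENCE_MINUTES[severity]  # KeyError on an unknown severity
    if minutes is None:
        return None
    return posted_at + timedelta(minutes=minutes)


posted = datetime(2024, 1, 1, 14, 2, tzinfo=timezone.utc)  # arbitrary example date
print(next_update_due("sev1", posted))  # 2024-01-01 14:17:00+00:00
```

The due time is what you would feed into the calendar timer or a chat reminder, so the commitment survives even when everyone is heads-down on the fix.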

The subscribe flow

Most status pages have a subscribe button. Most teams treat it as a configuration toggle and never look at it again. A few details materially change adoption.

Make subscription one click for the channels users actually want. Email and SMS are table stakes. Slack/Teams webhooks are increasingly expected for B2B. If the only subscribe option is “enter your email,” half the people who would subscribe won’t.

Make unsubscription as easy as subscription. A user who subscribed during a major incident and forgot will eventually receive a “minor degradation” notification at 03:00. If the unsubscribe flow is friction-free, they’ll go back to the subscribe button next time. If it’s painful, they’ll mark the email as spam and you’ve lost the channel.

Send notifications at every state change, not just at the start. A subscriber who got the “investigating” email and then heard nothing will assume they missed the resolution. Notify on identified, monitoring, and resolved as well.

Don’t notify on every minor sub-component. Most status page tools let you split components (“API,” “Dashboard,” “Background Jobs”). Subscribers should be able to choose which components they care about, and component groupings should match what the customer can perceive — a “checkout” component grouping that includes API, dashboard, and background-job sub-components is more useful than three independent notifications.
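
Per-component subscriptions reduce to a set intersection. The sketch below picks recipients for one incident update; the `Subscriber` class, the convention that an empty set means "all components", and the addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Subscriber:
    address: str
    # Components this subscriber cares about; empty means all components.
    components: set[str] = field(default_factory=set)


def recipients(subscribers: list[Subscriber], affected: set[str]) -> list[str]:
    """Addresses to notify for an update touching the `affected` components."""
    return [
        s.address
        for s in subscribers
        if not s.components or s.components & affected
    ]


subs = [
    Subscriber("a@example.com", {"checkout"}),
    Subscriber("b@example.com", {"dashboard"}),
    Subscriber("c@example.com"),  # subscribed to everything
]
print(recipients(subs, {"checkout"}))  # ['a@example.com', 'c@example.com']
```

Calling the same function at every state change (investigating, identified, monitoring, resolved) keeps the recipient list consistent across the whole incident.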

Maintenance windows

Scheduled maintenance is a different animal from incidents. The patterns:

Pre-announce on a longer window. Maintenance affecting paid users should be announced at least 7 days in advance. 24 hours’ notice may meet the legal threshold; it doesn’t meet the customer-trust threshold.

Post the maintenance as an “in progress” update at the start time. Even if everything goes smoothly. Customers who didn’t see the pre-announcement should be able to find the in-progress entry.

Announce the buffer. “Maintenance scheduled 02:00–04:00 UTC, expected duration 30 minutes.” The buffer between expected and scheduled is your “things might run long” margin and signals competence.

Resolve the maintenance the same way you resolve an incident. “Maintenance complete at 02:42 UTC, total duration 42 minutes. Service is operating normally.” Don’t leave it open or “in progress” indefinitely.
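
The notice period and the buffer are both simple arithmetic, which makes them easy to check before an announcement goes out. A minimal sketch using the 7-day notice and the 02:00–04:00 example from this section; the `MaintenanceWindow` class and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MIN_NOTICE = timedelta(days=7)  # notice period suggested above for paid users


@dataclass
class MaintenanceWindow:
    announced_at: datetime
    starts_at: datetime
    ends_at: datetime
    expected_minutes: int

    def buffer_minutes(self) -> int:
        """Scheduled length minus expected duration."""
        scheduled = int((self.ends_at - self.starts_at).total_seconds() // 60)
        return scheduled - self.expected_minutes

    def has_enough_notice(self) -> bool:
        return self.starts_at - self.announced_at >= MIN_NOTICE


window = MaintenanceWindow(
    announced_at=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),  # arbitrary dates
    starts_at=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    ends_at=datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
    expected_minutes=30,
)
print(window.buffer_minutes())     # 90
print(window.has_enough_notice())  # True
```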

Public vs internal status pages

A surprisingly common question: should we run two status pages?

Run one public page for the customer-visible services. Public means scoped to user impact, written in user language, with enough information to deflect support load and earn trust.

Run a separate internal page for engineering-facing components: build pipelines, internal APIs, observability tooling, the dependency status of vendors you rely on. This page can be more candid, more technical, and updated more frequently. It’s also where you put the status of things that would affect customers if they got worse, so internal teams can see leading indicators.

Don’t try to merge them. The public page must be safe for anyone to read and share; the internal page must be a safe place to be specific. Different audiences, different writing standards.

If you’re a small team and only want to run one page, run the public one. Internal status will end up in chat or in incident channels naturally.

Templates to copy

A short library of message starts you can adapt:

Investigating, scope unknown:

[HH:MM UTC] Investigating: We’re seeing reports of <symptom>. We’re identifying which services and users are affected. Next update in 15 minutes.

Investigating, scope known:

[HH:MM UTC] Investigating: Approximately <X%> of users are experiencing <symptom> on <component>. We’re investigating now. Next update in 15 minutes.

Identified:

[HH:MM UTC] Identified: The cause is <cause>. Approximately <X%> of users on <component> are affected. We’re <action>. Next update in 15 minutes.

Identified with workaround:

[HH:MM UTC] Identified: <cause and scope>. As a temporary workaround, users can <workaround>. Next update in 15 minutes.

Monitoring:

[HH:MM UTC] Monitoring: We’ve deployed <fix> and metrics are returning to normal. We’re monitoring for stability before resolving.

Resolved with bounded duration:

[HH:MM UTC] Resolved: <component> is operating normally. User-visible duration was <HH:MM>–<HH:MM UTC> (<N> minutes). A postmortem will be published within 5 business days.

Resolved with no impact:

[HH:MM UTC] Resolved: Investigation complete. After review, we found <component> was operating normally throughout this period. The initial reports appear to have been caused by <cause>. No user impact identified.
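
If the templates live in code instead of a document, filling them programmatically guards against publishing an update with a blank slot. The sketch below uses Python's standard `string.Template`, whose `substitute` raises `KeyError` when a placeholder has no value. The placeholder names (`$time`, `$percent`, `$symptom`, `$component`, `$minutes`) are hypothetical.

```python
from string import Template

# "Investigating, scope known" as a fill-in template. Placeholder names are
# illustrative; substitute() raises KeyError if any of them is missing.
INVESTIGATING_SCOPE_KNOWN = Template(
    "[$time UTC] Investigating: Approximately $percent% of users are "
    "experiencing $symptom on $component. We're investigating now. "
    "Next update in $minutes minutes."
)

message = INVESTIGATING_SCOPE_KNOWN.substitute(
    time="14:02",
    percent=40,
    symptom="failed requests",
    component="/v1/checkout",
    minutes=15,
)
print(message)
```

Using `substitute` and not `safe_substitute` is deliberate: a loud error in the incident channel is cheaper than a public update that reads "users are experiencing on".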

How Oack handles this

If you’re using Oack for status pages, the five-stage model is the default flow: declaring an incident from a monitor failure pre-fills the affected component, the timestamp, and the suggested first message. Subscribers receive a notification at every state change and can scope their subscription per component. Public and internal status pages are two separate page types that share the same incident backend, so an incident on your internal page can be promoted to public with a click without losing history.

The bit nobody else does well: the “next update in N minutes” commitment is a first-class field. Set it when you post the update; the system reminds you when the timer runs out, even if no new information has arrived. The reminder is the difference between a status page that earns trust and a status page that decays into theatre.

Where to start

If your status page exists but isn’t working:

  1. Audit your last three incidents. For each, count how many of the five stages were posted and how many had a “next update in N minutes” commitment. The result is usually 1–2 out of 5, with no commitments. That’s your baseline.
  2. Write the templates above into your incident runbook. When the next incident happens, you copy and adapt rather than write from scratch under pressure.
  3. Set the cadence. Decide what sev-1 and sev-2 update intervals are. Put them in the runbook.
  4. Commit to a postmortem link from every resolved update. The first one you ship is the proof.
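
The audit in step 1 can be scripted if you can export past updates as (stage, text) pairs. A minimal sketch; the input format, the `audit` function, and the regular expression for spotting a commitment are assumptions, and the sample incident is invented for illustration.

```python
import re

STAGES = ["investigating", "identified", "monitoring", "resolved", "postmortem"]
COMMITMENT = re.compile(r"next update in \d+ minutes", re.IGNORECASE)


def audit(updates: list[tuple[str, str]]) -> dict[str, int]:
    """Count stages posted and next-update commitments for one incident.

    updates: (stage, message text) pairs in the order they were posted.
    """
    posted = {stage.lower() for stage, _ in updates}
    return {
        "stages_posted": sum(1 for stage in STAGES if stage in posted),
        "commitments": sum(1 for _, text in updates if COMMITMENT.search(text)),
    }


# Invented example of a typical weak incident record.
incident = [
    ("Investigating", "We're looking into reports of errors."),
    ("Resolved", "Issue resolved."),
]
print(audit(incident))  # {'stages_posted': 2, 'commitments': 0}
```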

If you don’t have a status page yet, use this order-of-magnitude rule: a public status page becomes worth running when you have more than 50 paid users, or one paid user who will renegotiate the contract over an outage. Below that, an email to affected users is fine.

The status page is communication infrastructure, not a marketing surface. Treat it like the rest of your engineering infrastructure: with templates, cadences, and a runbook.
