On-call rotation design: schedules, escalation, and avoiding burnout
Designing an on-call rotation that keeps customers happy and engineers sane. Schedule patterns, escalation policies, alert hygiene, compensation, and the day-to-day handoff doc.
Engineering insights, product updates, and monitoring best practices.
The job of a status page, the five stages of an incident update with copy you can use, the words that lose customer trust, and the cadence that keeps support inboxes from drowning.
Pingdom, Better Stack, Checkly, UptimeRobot, Datadog, Oack, and more. What each is great at, what each is bad at, and a four-question decision tree to actually pick one.
Blameless doesn't mean no accountability. A working definition, a Markdown template you can copy, and how to run the review meeting so it produces action items that actually ship.
Three letters apart, three completely different meanings. A working definition of each, the math you need to set them up correctly, and the most common mistakes teams make when wiring them to monitoring.
A curated list of books and resources for Site Reliability Engineers — from TCP/IP internals to Linux systems programming. What Google recommends, what I actually studied, and why this knowledge translates directly into higher uptime.
A practical history of browser automation — Selenium, Puppeteer, Playwright — and how the same technology now powers synthetic monitoring that catches what HTTP checks miss.
Set up Oack's MCP server in Claude Code, Claude Desktop, Cursor, or Windsurf in under a minute. Plus: use oackctl from any agent that can run shell commands, and leverage llms.txt for context-aware assistance.
When response times spike, how do you find the bottleneck? Walk through a systematic approach using HTTP timing fractions, TCP metrics, server headers, and percentile analysis to isolate whether the problem is on the network, the CDN, or the origin.
Most monitoring tools stop at HTTP status codes. Oack now captures TCP-level metrics — RTT, retransmits, congestion window — so you can diagnose network issues before they become outages.
Cloud monitoring from cloud regions misses the problems your users actually face. Here's why we built a network that lets you monitor from your own infrastructure.