SLA vs SLO vs SLI: a practical guide for engineering teams
One letter apart, three completely different meanings. A working definition of each, the math you need to set them up correctly, and the most common mistakes teams make when wiring them to monitoring.
Walk into any engineering review and you’ll hear all three terms used interchangeably. “Our SLA is 99.9%.” “We hit our SLO last quarter.” “What’s the SLI we’re tracking?” By the third sentence, half the room has lost track of what’s actually being measured and what’s just being promised.
Start somewhere realistic. You have no SLIs, no SLOs, no SLAs. You want your service to feel reliable — better than the user expects. Maybe leadership is asking for a real, defensible commitment to quality. Maybe your support inbox is filling up with “is it slow today?” tickets and you want a number you can point at. Either way, you’ve decided to be deliberate about how good “good” needs to be.
The first thing you need is a number. As the line attributed to Peter Drucker goes, “what gets measured gets managed.” To talk about service quality at all you have to pick a dimension that actually matters to the user — request availability, end-to-end latency, page load speed, payment success rate — and start measuring it. Once you have a number, you can compare this month to last, this quarter to last, and answer the only question that matters: are we getting better, the same, or worse? You watch that number. You alert when it degrades. That measurement is the SLI.
Once the SLI is real — measured correctly, monitored continuously, accepted by the team as a fair representation of what users experience — you can start setting expectations against it. What does a good quarter look like? What number on this dimension would you commit to internally, before anyone outside the team is involved? The answer is the SLO. It’s the target you set against the SLI. It doesn’t bind you to a customer yet; it binds you to yourselves.
The third step is when other parties enter the picture. You take an SLO and offer it as a commitment — to a customer, to a partner, to another team in the group. They accept it. You both write down what happens if you miss: service credits, refunds, escalation, a re-negotiation. That commitment, with its consequences attached, is the SLA.
The hierarchy is simple. SLI is the measurement. SLO is the target you set on that measurement. SLA is the contract you make from that target. Conflate them and you’ll either over-engineer the math or, more often, set targets you can’t measure.
This guide gives you a working definition of each, the small amount of math you actually need, and the mistakes I see teams repeat.
SLI — the metric
The Google SRE book defines an SLI as:
A carefully defined quantitative measure of some aspect of the level of service that is provided.
In practice that means an SLI (Service Level Indicator) is a number you can measure right now. It’s a ratio of “good events” to “total events” over some window.
The canonical examples:
| SLI | What “good” means |
|---|---|
| Availability | HTTP requests with status < 500 / total HTTP requests |
| Latency | Requests served in < 500 ms / total requests |
| Freshness | Items processed within 60s of arrival / total items |
| Correctness | Responses passing schema validation / total responses |
| Durability | Objects retrievable after 30 days / total objects |
Two things to notice. First, SLIs are always ratios — never raw counts. “We had 1,200 errors” is not an SLI; “1,200 errors out of 4.8M requests = 99.975%” is. Second, the denominator matters as much as the numerator. If you measure availability as “successful requests / total requests” but exclude requests that never made it to your load balancer, you’ve defined away a real outage.
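To make that concrete, here's a minimal sketch in Python of the two SLIs you'll define most often. The `Request` shape and field names are illustrative, not any particular tool's schema:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int         # HTTP status code of the response
    duration_ms: float  # end-to-end time to serve the request

def availability_sli(requests: list[Request]) -> float:
    """Good events / total events: responses that did not fail server-side."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 500) -> float:
    """Fraction of requests served faster than the threshold."""
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)
```

Both return a ratio between 0 and 1 with an explicit numerator and denominator — exactly the property "we had 1,200 errors" lacks.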
For a monitoring tool, the SLI is usually whatever the platform measures by default: a successful HTTP probe, a Playwright test that passed, a TCP handshake that completed. Oack’s HTTP monitors emit one event per probe with a status field; aggregate those over a window and you’ve got your availability SLI without writing any code.
SLO — the target
The SRE book again:
A target value or range of values for a service level that is measured by an SLI.
The canonical Google example is “99% of Get RPC calls will complete in less than 100 ms.” An SLO (Service Level Objective) binds three things:
- An SLI (the thing being measured)
- A target percentage (what fraction of events must be “good”)
- A window (over what period)
Example: 99.9% of HTTP requests return a status code below 500 in any rolling 30-day window.
The window is the part teams forget. “99.9% availability” without a window is meaningless — over an hour, an hour of downtime is 0%; over a year, it’s 99.989%. Standard windows are 7 days, 28 days, 30 days, or rolling quarters.
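Written as code, an SLO is nothing more than those three fields plus a comparison. A minimal sketch, with the event shape and names chosen for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SLO:
    name: str
    target: float                    # fraction of events that must be good, e.g. 0.999
    window_days: int                 # observation window, e.g. 30
    is_good: Callable[[dict], bool]  # classifies a single event

def slo_met(slo: SLO, window_events: Sequence[dict]) -> bool:
    """The SLO holds when good / total over the whole window reaches the target."""
    good = sum(1 for e in window_events if slo.is_good(e))
    return good / len(window_events) >= slo.target

# "99.9% of HTTP requests return a status below 500 in any rolling 30-day window"
availability = SLO("api-availability", target=0.999, window_days=30,
                   is_good=lambda e: e["status"] < 500)
```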
Error budget — the math you actually need
Once you have an SLO, you have an error budget: the amount of “bad” you’re allowed before you’ve broken your target. The math is one subtraction.
For a 99.9% SLO over 30 days:
- Total time = 30 days × 24h × 60m = 43,200 minutes
- Allowed bad time = 0.1% × 43,200 = 43.2 minutes
That’s your monthly error budget. One Saturday-afternoon database failover can eat it. Two and you’ve broken your SLO.
Translated to request volume, the same SLO at 10 million requests/month allows 10,000 failed requests. That number is what your alerts should be calibrated to — a page on every five-minute spike of 50 errors is a page on 0.5% of your monthly budget. Reasonable when you've already burned 95% of the budget and every remaining error counts; pure noise when you're sitting at 30%.
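The whole calculation fits in two functions — a sketch, assuming nothing beyond the arithmetic above:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total downtime the window tolerates before the SLO is broken."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes

def error_budget_requests(slo_target: float, window_requests: int) -> int:
    """Failed requests the same SLO tolerates at a given traffic volume."""
    return int((1 - slo_target) * window_requests)

print(error_budget_minutes(0.999))               # 43.2 minutes per 30 days
print(error_budget_requests(0.999, 10_000_000))  # 10,000 failed requests
```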
Sensible SLO values vary by service tier:
| Tier | Typical SLO | 30-day error budget |
|---|---|---|
| Internal tools | 99% | 7h 12m |
| Customer-facing app | 99.9% | 43m |
| Payments / auth | 99.95% | 21m 36s |
| Infrastructure (DNS, CDN) | 99.99% | 4m 19s |
Want to compute the error budget for any SLO target you’re considering? Use our free uptime / downtime calculator — punch in the percentage and see the per-day, per-week, per-month, and per-year numbers.
Don’t pick five nines unless you have the architecture to back it up. A 99.999% SLO leaves you 26 seconds per month — a single TCP retransmit storm can blow it.
SLA — the contract
From the SRE book:
An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
The key test: if there’s no consequence — financial or otherwise — for missing the target, what you have is an SLO, not an SLA. An SLA (Service Level Agreement) is an SLO with money attached. It’s a commitment to a customer, written into a contract, with explicit penalties when you miss.
Two practical implications:
- SLAs are always looser than SLOs. If you commit to 99.9% in a contract, you should be running an internal SLO at 99.95% so you have headroom before refunds kick in. The gap between the two is your “we can have a bad month” buffer.
- Most internal services don’t have SLAs. They have SLOs. SLAs only exist between you and an external party. If you’re talking about reliability targets between two teams in the same company, what you have is an SLO (sometimes called an OLA — operational-level agreement — but the distinction rarely matters).
Most public-facing SLAs offer service credits, not refunds: 10% of the monthly fee per hour of downtime over the threshold, capped at 100%. Read your cloud provider’s SLA closely and you’ll see how narrowly “downtime” is defined — typically it only counts when an entire region is unavailable, you have to supply the proof, and you have to file the claim within 30 days.
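As a sketch of how that credit structure typically computes out — the 10%-per-hour rate and the dollar figures here are just the example terms above, not any provider's actual schedule:

```python
def service_credit(monthly_fee: float, downtime_hours: float,
                   allowed_hours: float, rate_per_hour: float = 0.10) -> float:
    """Credit owed: a fraction of the monthly fee per hour of downtime
    beyond the allowance, capped at 100% of the fee."""
    excess_hours = max(0.0, downtime_hours - allowed_hours)
    return monthly_fee * min(1.0, excess_hours * rate_per_hour)

# A $500/month plan, 0.72h allowed (99.9% over 30 days), 3 hours of qualifying downtime:
print(service_credit(500, downtime_hours=3, allowed_hours=0.72))  # 114.0 -> $114 credit
```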
Choosing your first SLI and SLO
The hardest part is picking what to measure. Here’s a workable starting point.
Pick the user-visible thing first. If your service is an HTTP API, your first two SLIs are availability (% of 2xx/3xx responses) and latency (% of requests under some threshold). Don’t start with internal metrics like queue depth or CPU — those are causes, not symptoms.
Use the request as your unit, not time. “99.9% availability” measured by request count handles bursty traffic correctly; the same number measured as “minutes of uptime” overweights quiet hours.
Set a threshold you can defend. If your p99 latency today is 800 ms, an SLO of “99% of requests under 500 ms” will fail tomorrow. Either set the SLO at a number you currently meet (and tighten over time) or commit to the engineering work to hit it before turning on alerting.
Pick a sensible window. 28 days is the sweet spot — long enough to absorb weekly traffic patterns, short enough that one bad incident still matters next month.
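To see why the unit matters, here's a worked sketch of the request-versus-time point above. The traffic numbers are made up, but the shape is typical: a 30-minute outage at 3 a.m. looks very different depending on whether you count minutes or requests.

```python
# Hypothetical traffic: 10 req/min for 8 quiet hours, 1,000 req/min for 16 busy hours.
night_requests = 10 * 60 * 8     # 4,800
day_requests = 1_000 * 60 * 16   # 960,000
failed = 10 * 30                 # a 30-minute outage at 3 a.m. hits 300 requests

time_based = (24 * 60 - 30) / (24 * 60)                       # 0.979 -> "97.9% available"
request_based = 1 - failed / (night_requests + day_requests)  # 0.9997 -> "99.97% available"
```

The time-based number punishes a quiet-hour outage as hard as a peak-hour one; the request-based number reflects what users actually experienced.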
Wiring SLOs to monitoring
The hand-off from definitions to dashboards is where teams stall. Three concrete patterns:
Use HTTP-monitor results as the availability SLI. Every Oack HTTP probe records status, code, and duration_ms. Aggregate status = ok over a rolling 30-day window and you’ve got an availability number you can put on a dashboard.
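In code, that aggregation is one filter and one ratio. A sketch over an exported list of probe records, where the field names (`checked_at`, `status`) stand in for whatever your export actually calls them:

```python
from datetime import datetime, timedelta, timezone

def availability_over_window(probes: list[dict], days: int = 30) -> float:
    """Rolling-window availability SLI: probes reporting ok / all probes in the window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    recent = [p for p in probes if p["checked_at"] >= cutoff]
    if not recent:
        return 1.0  # no data in the window: nothing counts against the SLO yet
    good = sum(1 for p in recent if p["status"] == "ok")
    return good / len(recent)
```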
Use latency percentiles, not averages. Average latency is the most lied-about metric in monitoring. p99 latency is the value that 99% of requests come in under — the slowest 1% of your users live at or above it. If your SLO is “99% of requests under 500 ms,” what you’re saying is “p99 latency is below 500 ms.” Oack’s percentile view plots p50/p90/p99/p999 directly so you can see the distribution.
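A small example of why the distinction matters — nearest-rank percentiles over raw samples, with one slow outlier in the mix:

```python
def percentile(durations_ms: list[float], p: float) -> float:
    """Nearest-rank percentile computed over the raw samples (no interpolation)."""
    ranked = sorted(durations_ms)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

latencies = [42, 45, 47, 51, 60, 75, 90, 120, 480, 2300]  # one very slow request
print(sum(latencies) / len(latencies))  # mean = 331 ms -- describes nobody's experience
print(percentile(latencies, 50))        # p50 = 60 ms   -- the typical request
print(percentile(latencies, 99))        # p99 = 2300 ms -- the worst-case request
```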
Alert on burn rate, not single failures. Burn-rate alerting compares how fast you’re consuming error budget against how fast the SLO allows you to consume it. A 14× burn rate sustained over 1 hour is a sev-1 (keep burning at that pace and the monthly budget is gone in roughly two days); a 2× burn rate over 6 hours is a sev-2 (still recoverable, but trending bad). This avoids both the “page on every blip” and “find out at end of month we missed SLO” failure modes.
A simple two-window alert that catches most real problems:
- Page if 5-minute burn rate > 14× and 1-hour burn rate > 14×
- Ticket if 1-hour burn rate > 6× and 6-hour burn rate > 6×
Google’s SRE Workbook has the long version of the math; the short version is “look at two windows so a brief blip doesn’t page, but a slow leak still gets noticed.”
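A minimal sketch of that two-window rule, using the thresholds from the list above; the good/bad counts would come from whatever per-window queries your monitoring exposes:

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the error budget is burning: observed error fraction / allowed error fraction."""
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo_target)

def should_page(bad_5m, total_5m, bad_1h, total_1h, slo_target=0.999) -> bool:
    """Sev-1: both the 5-minute and the 1-hour window burn faster than 14x."""
    return (burn_rate(bad_5m, total_5m, slo_target) > 14 and
            burn_rate(bad_1h, total_1h, slo_target) > 14)

def should_ticket(bad_1h, total_1h, bad_6h, total_6h, slo_target=0.999) -> bool:
    """Sev-2: both the 1-hour and the 6-hour window burn faster than 6x -- a slow leak."""
    return (burn_rate(bad_1h, total_1h, slo_target) > 6 and
            burn_rate(bad_6h, total_6h, slo_target) > 6)
```

The short window keeps pages prompt; requiring the long window to agree is what stops a brief blip from paging anyone.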
Common mistakes
A short list of patterns that show up in almost every retrospective.
Averaging percentiles. You cannot take p99 latency from each minute and average them — the math doesn’t compose. Aggregate raw events first, then compute the percentile across the whole window. If your monitoring tool only stores per-minute p99 values, your numbers are already wrong.
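A quick demonstration of how badly it goes wrong — synthetic data, but the pattern (a quiet minute followed by a busy minute with a slow tail) is the normal case, not an edge case:

```python
import random

random.seed(7)

# Minute one: quiet and fast. Minute two: busy, with a slow tail.
minute_one = [random.uniform(40, 60) for _ in range(100)]
minute_two = ([random.uniform(40, 60) for _ in range(900)]
              + [random.uniform(900, 1200) for _ in range(100)])

def p99(samples: list[float]) -> float:
    ranked = sorted(samples)
    return ranked[int(0.99 * len(ranked)) - 1]

avg_of_p99s = (p99(minute_one) + p99(minute_two)) / 2  # per-minute p99s, averaged
true_p99 = p99(minute_one + minute_two)                # percentile over all raw events

print(round(avg_of_p99s), round(true_p99))  # the two disagree badly -- only the second is the SLI
```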
Treating an SLA as the SLO. Running your service against the contract number leaves no headroom. The first bad week means refunds. Always run internal SLOs tighter than external SLAs.
No observation window. “99.9% availability” with no window can be true and false simultaneously. Always include the window when you write the target down.
Picking an SLO you can’t measure. “99.9% of users have a good experience” is not an SLO — there’s no SLI behind it. If you can’t write down the ratio, you don’t have a target, you have a wish.
Using uptime ping as availability SLI. A single check from a single location every 60 seconds gives you 1,440 data points per day. That’s enough resolution for a 99% SLO and not enough for 99.9%. If you’re trying to measure availability tightly, run probes from multiple locations at higher frequency, or use real request volume from your application logs.
Burn-rate alerting that pages on the first request error. If your alert fires the moment one request fails, you’re not alerting on burn rate — you’re alerting on a single event. Configure your alerting on the rate of error budget consumption, computed over a window.
Where to start tomorrow
If you’re starting from zero, do this in order:
- Pick one service. The user-facing API, the checkout flow, or the public marketing site. Don’t try to define SLOs for everything at once.
- Define two SLIs: availability and latency. Write down both as ratios with explicit numerators and denominators.
- Set a 30-day SLO at a number you currently meet. If your last 90 days of data show 99.7% availability, set 99.5% as your first SLO and tighten over time.
- Compute the error budget. Put the number on a dashboard.
- Add burn-rate alerting at two windows (5m/1h and 1h/6h). Tune the thresholds when they’re noisy.
- Review at the end of the window. If you missed the SLO, freeze feature work until you’ve burned down the reliability backlog.
This is the entire SLO playbook. The terminology is the easy part; the discipline of stopping feature work when you’ve exhausted your error budget is what makes it real.
Start monitoring with Oack
Get TCP telemetry, 5-second alerts, and global coverage — free to start.
Get started free