SLO math for tired engineers
Every time I set up SLO-burn-rate alerts, I have to re-derive the math, and I forget some detail, and the alert is either too sensitive or too chill. So here is the cheat sheet, written for my future self and anyone else who just wants to get this right without reading the entire SRE book this weekend.
The setup
You have an SLO. Let’s say 99.9% availability over 30 days. That means you’re allowed to be unavailable for 0.1% of the time, which is 30 * 24 * 60 * 0.001 = 43.2 minutes of downtime per month.
That 43.2 minutes is your “error budget.” You can spend it however you like, but once it’s gone, you’re out of SLO.
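If you’d rather not redo that arithmetic every time, it’s one line of Python (the function name is mine, not from any library):

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Allowed downtime: window length times the error fraction (1 - SLO)."""
    return window_days * 24 * 60 * (1 - slo)

print(f"{error_budget_minutes(0.999):.1f}")   # 43.2 minutes per 30 days
print(f"{error_budget_minutes(0.9999):.1f}")  # 4.3 -- four nines is brutal
```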
Burn rate
Burn rate is “how fast are you burning through the budget relative to the rate that would exhaust it exactly at the end of the window.”
Burn rate of 1: at this rate, you’ll exhaust your budget in exactly 30 days. Burn rate of 10: you’ll exhaust it in 3 days. Burn rate of 100: you’ll exhaust it in 7.2 hours.
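The conversion is just division, which is somehow easy to forget at 2am. A sketch (my names):

```python
def hours_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """At a constant burn rate, the budget lasts window / burn_rate."""
    return window_days * 24 / burn_rate

for rate in (1, 10, 100):
    print(f"burn rate {rate:>3}: budget exhausted in {hours_to_exhaustion(rate):.1f}h")
# burn rate   1: budget exhausted in 720.0h
# burn rate  10: budget exhausted in 72.0h
# burn rate 100: budget exhausted in 7.2h
```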
The alert question is: at what burn rate, for what duration, should I page?
The Google SRE recipe
The Google SRE workbook recommends multi-window, multi-burn-rate alerts. The usual pairings:
| Severity | Burn rate | Long window | Short window | Budget consumed |
|---|---|---|---|---|
| Page | 14.4 | 1 hour | 5 minutes | 2% |
| Page | 6 | 6 hours | 30 minutes | 5% |
| Ticket | 3 | 24 hours | 2 hours | 10% |
| Ticket | 1 | 72 hours | 6 hours | 10% |
Why two windows? The long window decides “yes, this is a real sustained problem.” The short window decides “this problem is happening right now, not an hour ago.” You want both to agree before you page.
Why those specific numbers? The workbook derives them from a notional “I want to alert if we’d burn 2% of our budget in the long window.” Budget consumed = burn_rate * long_window / total_window. So 14.4 * 1h / 720h = 2% of the month’s budget, 6 * 6h / 720h = 5%, 3 * 24h / 720h = 10%, and 1 * 72h / 720h = 10%.
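If you don’t trust that arithmetic (I never do), it’s a few lines to check the whole table:

```python
# Budget consumed = burn_rate * long_window / total_window, for a 30-day window.
WINDOW_HOURS = 30 * 24  # 720

for burn_rate, long_window_h in [(14.4, 1), (6, 6), (3, 24), (1, 72)]:
    consumed = burn_rate * long_window_h / WINDOW_HOURS
    print(f"{burn_rate:>4}x for {long_window_h:>2}h burns {consumed:.0%} of the budget")
# 14.4x for  1h burns 2% of the budget
#    6x for  6h burns 5% of the budget
#    3x for 24h burns 10% of the budget
#    1x for 72h burns 10% of the budget
```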
The actual PromQL
For an availability SLO of 99.9% where “good” is HTTP status 2xx/3xx/4xx (we count 4xx as success, they’re client errors):
```promql
# 1-hour error rate vs. the 14.4x threshold
sum(rate(http_requests_total{code=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
> (14.4 * 0.001)

# 5-minute error rate vs. the same threshold
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
> (14.4 * 0.001)
```
Combined, writing `error_rate_1h` and `error_rate_5m` for the two ratios above (note you compare the error rate against burn_rate * budget fraction, not against the burn rate itself):

```
(error_rate_1h > 14.4 * 0.001) and (error_rate_5m > 14.4 * 0.001)
```
In a Prometheus rule file:
```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
          )
        for: 2m
        labels: { severity: page }
```
The `for: 2m` is defensive: even if the instantaneous rates flicker above and below the threshold, we want a few minutes of sustained signal before paging.
Latency SLOs
For latency, the definition of “good” is different. Something like “p95 under 500ms over the period,” which is just another way of saying “95% of requests finish within 500ms.” That ratio form is what maps onto bucket-based SLIs:
```promql
# fraction of requests that took longer than 500ms in the last hour
(
  sum(rate(http_request_duration_seconds_count[1h]))
  - sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
) / sum(rate(http_request_duration_seconds_count[1h]))
```
This gives you a fraction of slow requests. Treat it exactly like the “error rate” in the templates above; the only thing that changes is target_error_fraction, which is now 1 minus the latency SLO (0.05 for the p95 target).
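To see what that does to the thresholds, here’s a sketch assuming the “95% of requests under 500ms” target from above:

```python
# For "95% of requests under 500ms", the budget fraction is 1 - 0.95 = 0.05.
LATENCY_BUDGET_FRACTION = 1 - 0.95

for burn_rate in (14.4, 6, 3, 1):
    threshold = burn_rate * LATENCY_BUDGET_FRACTION
    print(f"{burn_rate:>4}x -> alert when slow fraction > {threshold:.3f}")
# 14.4x -> alert when slow fraction > 0.720
#    6x -> alert when slow fraction > 0.300
#    3x -> alert when slow fraction > 0.150
#    1x -> alert when slow fraction > 0.050
```

Note how loose the page threshold gets: with a 5% budget, the 14.4x page only fires when 72% of requests are slow. That’s expected (a loose SLO has a big budget), but it’s worth eyeballing before you ship.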
What tired me out the first time
The thing I always lose track of is the conversion between “budget spent per time window” and “instantaneous error rate threshold.” The formula is:
```
error_rate_threshold = target_error_fraction * burn_rate
```
Where target_error_fraction is 1 - SLO, so 0.001 for 99.9%.
So for a burn rate of 14.4 at SLO 99.9%, your instantaneous error rate needs to be above 14.4 * 0.001 = 0.0144, i.e. 1.44% of requests erroring.
That’s why the “14.4 * 0.001” shows up everywhere. It’s just burn_rate * (1 - SLO).
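The same thing as code, for whatever SLO you actually run (sketch, my names):

```python
def error_rate_threshold(slo: float, burn_rate: float) -> float:
    """Instantaneous error-rate threshold: burn_rate * (1 - SLO)."""
    return burn_rate * (1 - slo)

for burn_rate in (14.4, 6, 3, 1):
    print(f"{burn_rate:>4}x -> error rate > {error_rate_threshold(0.999, burn_rate):.4f}")
# 14.4x -> error rate > 0.0144
#    6x -> error rate > 0.0060
#    3x -> error rate > 0.0030
#    1x -> error rate > 0.0010
```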
Common pitfalls
- Very low traffic periods give noisy burn rates. If you only served 10 requests and 1 errored, your error rate is 10% and your burn rate is 100, so almost any alert fires (see the sketch after this list). For low-traffic services, either combine with a “minimum volume” check or use a longer window.
- SLOs measured over calendar months create month-boundary weirdness. Prefer rolling windows. 30 rolling days is much more forgiving than calendar months.
- Multi-dimensional SLOs are hard. “99.9% availability per route” sounds reasonable until you realize you have 200 routes. Either aggregate (whole-service SLO) or tier (core routes have SLOs, rest don’t).
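Here’s the first pitfall in numbers, since it bites everyone once (my function, not a library):

```python
def burn_rate(errors: int, total: int, slo: float = 0.999) -> float:
    """Observed error rate divided by the budget fraction (1 - SLO)."""
    return (errors / total) / (1 - slo)

print(f"{burn_rate(1, 10):.0f}")      # 100 -- one bad request in ten: page!
print(f"{burn_rate(1, 10_000):.1f}")  # 0.1 -- same single error at real traffic
```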
Reflection
SLO math is 80% “understand burn rate,” 10% “know the recipe,” 10% “don’t build dumb rules.” The textbooks make it seem harder than it is because they’re trying to cover every case. Most teams just need the two page-level alerts from the recipe table above. If that’s you, copy those, set your target_error_fraction correctly, and ship.
Related: We culled 40% of our alerts and nothing bad happened.