SLO math for tired engineers
Abstract
SLO talks usually come from companies with a dozen SREs and a dashboard team. This one was the opposite: I walked through how we ran error budgets on a platform team of three, with two on-call rotations and no dedicated observability person. The thesis: you can get 80% of the benefit of a “proper” SLO program if you are willing to make a few unglamorous tradeoffs, and chasing the last 20% is what burns your team out.
Outline
- What we actually wanted out of SLOs (hint: not dashboards)
- Picking a target with a coin flip is not the worst choice
- The three SLIs we ended up with: availability, p95 latency, and “data staleness for the read path”
- Burn-rate alerts instead of threshold alerts, and why we switched
- The monthly review: 30 minutes, three humans, one Google Doc
- When we cheated and it was fine
- What we would not do again
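For anyone who has not done the error-budget arithmetic behind this outline, here is a minimal sketch. It is illustrative and not the team's actual tooling: the function names, the 30-day window, and the request-based definition of availability are all assumptions made for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO tolerates over the window.

    Assumes a time-based reading of the target, e.g. 0.999 means
    "up 99.9% of the window". The 30-day default is an assumption.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    Assumes a request-based availability SLI: good requests / all requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the calculation.
    print(f"budget at 99.9% over 30 days: {error_budget_minutes(0.999):.1f} minutes")
    print(f"remaining: {budget_remaining(0.999, 1_000_000, 250):.0%}")
```

The useful part for a monthly review is the second function: one number that says how much room is left, which is easier to discuss in 30 minutes than a dashboard.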
What I learned giving it
The SRE crowd at this conference was much more pragmatic than I expected. The question that got the most energy was “how do you handle the quarter where you burned the whole budget in week two?” I had a real answer (we stopped shipping and fixed things, exactly like the book says), but the follow-up, “what if product won’t let you stop?”, was harder and I mostly punted on it.
I also underestimated how many people in the room had never written a burn-rate alert. That should have been 10 minutes with an example, not one slide.
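As a rough idea of what that example could cover, here is a sketch of the logic behind a multi-window burn-rate alert, written in Python rather than as an alerting rule. It is not the alert from the talk: the `WindowStats` type, the function names, the 30-day SLO window, and the "2% of the budget in one hour" paging policy are assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int

    @property
    def error_ratio(self) -> float:
        return self.errors / self.total if self.total else 0.0


def burn_rate(stats: WindowStats, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    A burn rate of 1.0 spends the whole budget exactly at the end of the
    SLO window; 2.0 spends it in half the window.
    """
    return stats.error_ratio / (1.0 - slo_target)


def burn_threshold(budget_fraction: float, alert_window_hours: float,
                   slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(long_window: WindowStats, short_window: WindowStats,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows are burning at or above the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


if __name__ == "__main__":
    slo = 0.999
    threshold = burn_threshold(budget_fraction=0.02, alert_window_hours=1.0)
    last_hour = WindowStats(total=100_000, errors=2_000)   # made-up counts
    last_5_min = WindowStats(total=8_000, errors=200)      # made-up counts
    print(f"threshold={threshold:.1f} page={should_page(last_hour, last_5_min, slo, threshold)}")
```

The contrast with a threshold alert is that nothing here compares the raw error ratio to a fixed number; everything is expressed relative to the budget, so the same rule works for any SLO target.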
What I’d change
- Replace the theory section with a single walkthrough of “here is the PromQL for one real alert, let me explain each clause.”
- Acknowledge earlier that small teams sometimes do not have the luxury of halting releases, and show what that failure mode looks like.
- Drop the joke about the four golden signals. It did not land and I knew it halfway through.
Related posts: /posts/slo-math-for-tired-engineers/, /posts/grafana-alert-fatigue-cull/, /posts/otel-tail-sampling-that-works/.
The conference recorded it, but the audio on the first five minutes is rough. I do not have a link to share; the conference site reshuffled after 2024 and the video seems to have dropped off.