Abstract

SLO talks usually come from companies with a dozen SREs and a dashboard team. This one was the opposite: I walked through how we ran error budgets on a platform team of three, with two on-call rotations and no dedicated observability person. The thesis: you can get 80% of the benefit of a “proper” SLO program if you are willing to make a few unglamorous tradeoffs, and chasing the last 20% is what burns your team out.

Outline

  1. What we actually wanted out of SLOs (hint: not dashboards)
  2. Picking a target with a coin flip is not the worst choice
  3. The three SLIs we ended up with: availability, latency p95, and “data staleness for the read path”
  4. Burn-rate alerts instead of threshold alerts: why we switched
  5. The monthly review: 30 minutes, three humans, one Google Doc
  6. When we cheated and it was fine
  7. What we would not do again

What I learned giving it

The SRE crowd at this conference was much more pragmatic than I expected. The question that got the most energy was “how do you handle the quarter where you burned the whole budget in week two?” I had a real answer (we stopped shipping and fixed things, exactly like the book says), but the follow-up question, “what if product won’t let you stop?”, was harder, and I mostly punted on it.

I also underestimated how many people in the room had never written a burn-rate alert. That should have been 10 minutes with an example, not one slide.
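
For anyone who has not written one, here is a minimal sketch of the arithmetic behind a burn-rate alert, in Python rather than PromQL. It is not the alert from the talk: the 99.9% target, the function names, and the two-window check are illustrative assumptions. The 14.4 threshold corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 hours / 1 hour), a commonly cited paging condition.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    2.0 means the budget would run out in half the SLO period.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_ratio = errors / total
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% target
    return error_ratio / budget


def should_page(long_window, short_window,
                slo_target: float = 0.999,   # assumed target
                threshold: float = 14.4) -> bool:  # assumed threshold
    """Page only if both windows burn faster than the threshold.

    Each window is an (errors, total) pair, for example the last hour
    and the last five minutes. The short window keeps the alert from
    firing long after the problem has already stopped.
    """
    long_burn = burn_rate(*long_window, slo_target)
    short_burn = burn_rate(*short_window, slo_target)
    return long_burn > threshold and short_burn > threshold


# Hypothetical counts: 2% of requests failing in both windows.
paging = should_page(long_window=(2000, 100_000), short_window=(200, 10_000))

# Same bad hour, but the last five minutes have recovered.
recovered = should_page(long_window=(2000, 100_000), short_window=(5, 10_000))
```

The difference from a threshold alert is the division by the budget: the same 2% error ratio is a burn rate of about 20 against a 99.9% target but only about 2 against a 99% target, so the alert scales with the promise you made rather than with a fixed number.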

What I’d change

  • Replace the theory section with a single walkthrough of “here is the PromQL for one real alert, let me explain each clause.”
  • Acknowledge earlier that small teams sometimes just do not have the luxury of halting releases, and show what that failure mode looks like.
  • Drop the joke about the four golden signals. It did not land and I knew it halfway through.

Related posts: /posts/slo-math-for-tired-engineers/, /posts/grafana-alert-fatigue-cull/, /posts/otel-tail-sampling-that-works/.

The conference recorded it, but the audio on the first five minutes is rough. I do not have a link to share; the conference site reshuffled after 2024 and the video seems to have dropped off.