At the start of the year our on-call was miserable. I was averaging 6-8 pages per night on my rotations. Most were noise. A few were genuinely important. We were losing the signal-to-noise game and nobody was sleeping.

We ran a 3-week project to fix it. I want to write down what worked because the usual “use SLOs” advice is correct but unactionable without more detail.

The exercise

Every alert in the system had to justify itself. For each one, we asked four questions:

  1. When this fires, what does the human do? If the answer is “check the dashboard and go back to sleep,” the alert shouldn’t exist.
  2. What’s the consequence if we don’t respond for 30 minutes? If the answer is “nothing user-visible,” it’s not an alert, it’s a ticket.
  3. Has this fired in the last 90 days, and was the action taken meaningful? If it fired 40 times and was ignored 39 times, we either tune it or delete it. (A quick way to check firing history is sketched below.)
  4. Is this downstream of another alert that would have fired first? If yes, delete this one and rely on the upstream alert.

That’s it. We printed them on sticky notes. We stuck them on the wall.
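
Answering question 3 usually means digging through pager history, but if your alerting runs through Prometheus, the synthetic ALERTS series it exports gives a quick first pass. A sketch, with a placeholder alert name; it counts rule-evaluation samples spent in the firing state rather than distinct incidents, and it assumes your retention covers the window:

# How much of the last 90 days has this alert spent firing?
# An empty result means it never fired in the window.
count_over_time(ALERTS{alertname="HighP99Latency", alertstate="firing"}[90d])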

What we deleted

Out of 234 alerts, we deleted 91. That’s 39%. Rough categories:

  • Host-level alerts (CPU, memory, disk) that duplicated what our service-level SLO alerts would catch. Kept one per host, deleted the rest.
  • Alerts on individual members of redundant clusters. If one Redis replica goes down, the cluster-level health alert covers it.
  • “Warning” tier alerts that had never once been acted on. If they’re never acted on, they’re not alerts.
  • Alerts on dependencies we didn’t own. A third-party API having an outage will show up in our error rate; we don’t need an alert on the third-party’s own status page.
  • A handful of alerts that were, on inspection, buggy. They had been firing because of the bug, not because of a real condition. Fun.

What we kept but tuned

About 50 alerts were firing more often than they should but were worth keeping. For each of these we did one of:

  • Raised the threshold. Most alerts had been written when the system was smaller and the threshold was set for that scale. A p99 of 800ms was a problem in 2019 when our SLA said 500ms; now our SLA says 1500ms and 800ms is fine.
  • Added a time filter. “Page me if this is true for 15 minutes” instead of “page me on any spike.” Short blips don’t hurt users, sustained issues do. (A rule sketch follows this list.)
  • Converted to SLO-burn-rate alerts. The Google SRE book formulation (multi-window multi-burn-rate) catches real issues with dramatically fewer false positives than simple thresholds.
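
In Prometheus the time filter is just the for: field on the alerting rule. A minimal sketch, reusing the p99 recording rule from the example below; the alert name and severity label are placeholders:

groups:
  - name: latency
    rules:
      - alert: HighP99Latency
        expr: http_request_duration_seconds:p99 > 1.5
        # Only fire after the condition has held for 15 consecutive minutes.
        for: 15m
        labels:
          severity: page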

An example of the burn-rate conversion. Instead of:

avg_over_time(http_request_duration_seconds:p99[5m]) > 1.5

We use:

(1 - (sum(rate(http_requests_total{code!~"5.."}[1h]))
      / sum(rate(http_requests_total[1h])))) > 14.4 * 0.001
and
(1 - (sum(rate(http_requests_total{code!~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])))) > 14.4 * 0.001

That is: the error rate over the last hour and over the last 5 minutes are both above 14.4x the baseline burn rate, where 1x is the rate that would consume our error budget (the 0.001, i.e. a 99.9% SLO) exactly over the 30-day window; at 14.4x the budget is gone in about two days. Requiring both windows means the problem has to be sustained (the 1-hour window) and still happening (the 5-minute window), so real problems page and short spikes don’t.
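
Here is roughly how that lands in a Prometheus rule file. The alert names and severity labels are illustrative, and the slow-burn pair (6x over 6 hours and 30 minutes) is the other half of the SRE-workbook recipe rather than something we spelled out above; whether slow burn pages or just files a ticket is a local choice:

groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: at this rate the monthly budget is gone in about two days. Pages.
      - alert: ErrorBudgetFastBurn
        expr: |
          (1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > 14.4 * 0.001
          and
          (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) > 14.4 * 0.001
        labels:
          severity: page
      # Slow burn: budget gone in about five days. A ticket, not a page.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))) > 6 * 0.001
          and
          (1 - (sum(rate(http_requests_total{code!~"5.."}[30m])) / sum(rate(http_requests_total[30m])))) > 6 * 0.001
        labels:
          severity: ticket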

What we added

We added a weekly report of alert counts, by category, emailed to the team. Seeing the number go from “6 pages per night average” to “0.4 pages per night average” was motivating. Seeing it tick up week-over-week now triggers an investigation.
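
If the alerts come from Prometheus, one way to generate those counts is the same ALERTS series as before, grouped however you bucket categories (here by a severity label, which assumes every alert carries one):

# Time spent firing over the past week, grouped by severity, for the weekly report.
# Counts rule-evaluation samples, so it tracks the trend rather than exact page counts.
sum by (severity) (count_over_time(ALERTS{alertstate="firing"}[1w]))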

We also added a monthly alert review where we look at any new alerts created in the last month and ask the four questions above. This is institutional memory that prevents backsliding.

The on-call consequences

Our median page count dropped from ~6/night to ~0.4/night. Our MTTR went up slightly — we pay more attention to each page now, so we investigate more thoroughly — but mean time to detect went down because people don’t ignore the pager.

Six months in, nothing has gone horribly wrong that wasn’t caught by the surviving alerts. Two incidents did take slightly longer to detect than they would have under the old regime, because we no longer had a “canary alert” that fired on any slight anomaly. We decided that tradeoff was fine.

The one thing I’d do differently

I’d involve product and customer support earlier. Some of the alerts I wanted to delete turned out to be proxies that support was using to know when customers would start complaining. Deleting them would have meant support finding out from customers. We kept a few of those and converted them to lower-priority “ticket” alerts that don’t page.
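
The page-versus-ticket split is just label routing in Alertmanager. A minimal sketch, assuming a severity label like the ones above and placeholder receiver names for whatever chat, paging, and ticketing integrations you actually use:

route:
  receiver: chat-feed                    # default: post to chat, never pages
  routes:
    - matchers: [ 'severity = "page"' ]
      receiver: pagerduty-oncall         # wakes a human up
    - matchers: [ 'severity = "ticket"' ]
      receiver: ticket-queue             # lands in a queue reviewed during the day

receivers:
  - name: chat-feed
    webhook_configs:
      - url: https://chat.example.com/hook
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: ticket-queue
    webhook_configs:
      - url: https://tickets.example.com/hook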

Related: SLO math for tired engineers goes deeper on the burn-rate alert setup.