Observability
-
SLO math for tired engineers
Enough formulas to write real alerts without spending a weekend in a textbook.
-
Structured logging lessons from four years of zerolog
Structured logging is not the hard part. The hard part is everything around it.
-
Head vs tail sampling: the mental model I wish I'd had
I conflated these for years. Here is a cleaner way to think about them.
-
We culled 40% of our alerts and nothing bad happened
A retrospective on how our team finally beat alert fatigue.
-
Flamegraphs in production without the fear
I used to be scared of perf record in prod. Then I wasn't.
-
The label that killed Prometheus
One innocuous request_id label, 18M active series, and a very bad Friday.
-
Tail sampling that actually saved money
Head sampling is simple. Tail sampling works. Here is a config we run in production without sadness.
-
TIL: Grafana has a 'custom all value' for variables
When $var=All, Grafana was sending literally the string 'All' to my query. There's a setting for that.
-
TIL: histogram_quantile needs a rate, not a raw counter
Forgetting the rate() inside histogram_quantile gives you weird-looking percentiles. Here's why.
-
Reading goroutine traces like a local
How to read the output of runtime/trace and actually understand why your service is slow.
-
TIL: span events are the right place for timestamped annotations
Logs-within-a-span are a first-class OpenTelemetry concept and I had been putting them in attributes.
-
Finding TCP retransmits with bpftrace
A short bpftrace script that pinpoints which process and peer are responsible for TCP retransmits on a noisy box.
-
TIL: OpenTelemetry has a separate 'Baggage' concept
Span attributes vs baggage: they're different, and I had been confusing them.
-
An operator reconcile loop that wouldn't quit
An operator kept thrashing at 300 reconciles per second, and the bug was a single annotation I was setting on the managed resource.
-
TIL: pg_stat_user_tables.n_mod_since_analyze is a thing
Shows an estimate of how many rows have been modified since the last ANALYZE. Great for answering 'do I need to analyze?'