SRE
-
CI dependency graph for a monorepo that doesn't run everything
The core trick that cut our PR CI time from 22 minutes to 4 on average.
-
We tried buildpacks. I don't recommend them for most teams.
A six-month experience report. The value proposition is real; the downsides are too.
-
SLO math for tired engineers
Enough formulas to write real alerts without spending a weekend in a textbook.
-
Structured logging lessons from four years of zerolog
Structured logging is not the hard part. The hard part is everything around it.
-
Head vs tail sampling: the mental model I wish I'd had
I conflated these for years. Here is a cleaner way to think about them.
-
strace revealed our libc mismatch
A service worked on one image and not another. The difference was invisible until we traced syscalls.
-
Debugging a remote core dump without losing your mind
A core dump from production is a gift. Here is how I unwrap it.
-
Dev containers at 30 engineers: the unglamorous middle
Dev containers solve real problems but have their own operational tail.
-
Redis maxmemory, eviction, and the day we served stale for 20 minutes
noeviction is the default, and the default is dangerous when you thought you were running a cache.
-
We culled 40% of our alerts and nothing bad happened
A retrospective on how our team finally beat alert fatigue.
-
Autovacuum tuning, one table at a time
Global autovacuum settings are a lie. I tune per-table now.
-
Logical replication slot lag ate our WAL
A forgotten logical replication slot accumulated 380 GB of WAL before we caught it. Here's what we changed.
-
The label that killed Prometheus
One innocuous request_id label, 18M active series, and a very bad Friday.
-
Tail sampling that actually saved money
Head sampling is simple. Tail sampling works. Here is a config we run in production without sadness.
-
PgBouncer transaction pooling broke our prepared statements
A multi-day outage-adjacent incident caused by prepared statements not making it across pool boundaries.