Debugging
-
Debugging DNS in a kind cluster
CoreDNS inside a kind cluster could resolve cluster names but not external ones, and the problem was in the host's resolver, not Kubernetes
-
btrfs scrub found what smartctl missed
A disk that passed every SMART test had silent corruption in an old set of blocks, and btrfs scrub caught it two weeks before it would have mattered
-
The OOM killer picked postgres. Here's why.
A staging VM's OOM killer reliably picked postgres over a misbehaving test process, and the fix was understanding oom_score_adj
-
MTU, MSS, and a VPN that couldn't stream video
A perfectly working WireGuard tunnel that failed only on video streams, and the diagnosis that made me finally understand MSS clamping
-
A TLS SAN quirk that broke mTLS
An internal service stopped accepting a client cert after a seemingly innocent renewal, and the issue hid in the SAN encoding
-
When journald ate my disk
A misconfigured journal rate-limit and a noisy process combined to fill a 100 GB disk in a week, and recovery was more interesting than it should have been
-
strace on a running production process, carefully
Attaching strace to a live process with -f -p, filtering syscalls, and spotting an fd leak pattern
-
An admission webhook that crashed my cluster
A validating webhook with a circular dependency prevented its own pods from being rescheduled, and the cluster froze
-
Debugging a crashloop pod the way I always do it
The kubectl commands I run, in order, when a pod is in CrashLoopBackOff: from describe, to previous logs, to the fix
-
Terraform state locks and the S3 bucket that wouldn't let go
A CI job was killed mid-apply and left a DynamoDB lock behind, and the recovery taught me to be much more careful about force-unlock
-
A TCP RST that took a week to track down
A long-lived HTTP connection was reset with a TCP RST every 12 hours, and the answer lived in the intersection of conntrack, a load balancer, and a very patient test
-
An operator reconcile loop that wouldn't quit
An operator kept thrashing at 300 reconciles per second, and the bug was a single annotation I was setting on the managed resource
-
systemd timers and the clock drift that ate our backups
Our backups stopped running for nine days and the cause was a quiet combination of OnCalendar, RandomizedDelaySec, and a drifting RTC
-
Why my homelab Pi-hole kept forgetting its DNS override
A local DNS override on my Pi-hole would come back for a day and then mysteriously vanish, and the culprit was a docker-compose restart policy
-
Three days of debugging a cgroup memory accounting bug
A service kept getting OOM-killed with plenty of memory headroom, and the trail led into the cgroup v2 memory controller and its file-backed accounting