Debugging

Oct 17, 2024 Debugging DNS in a kind cluster
CoreDNS inside a kind cluster could resolve cluster names but not external ones, and the problem was in the host's resolver, not k8s
Oct 6, 2024 btrfs scrub found what smartctl missed
A disk that passed every SMART test had silent corruption in an old set of blocks, and btrfs scrub caught it two weeks before it would have mattered
Sep 26, 2024 The OOM killer picked postgres. Here's why.
A staging VM's OOM killer reliably picked postgres over a misbehaving test process, and the fix was understanding oom_score_adj
Sep 16, 2024 MTU, MSS, and a VPN that couldn't stream video
A perfectly working WireGuard tunnel that failed only on video streams, and the diagnosis that made me finally understand MSS clamping
Sep 6, 2024 A TLS SAN quirk that broke mTLS
An internal service stopped accepting a client cert after a seemingly innocent renewal, and the issue hid in the SAN encoding
Aug 27, 2024 When journald ate my disk
A misconfigured journal rate-limit and a noisy process combined to fill a 100 GB disk in a week, and recovery was more interesting than it should have been
Aug 23, 2024 strace on a running production process, carefully
Attaching strace to a live process with -f -p, filtering syscalls, and spotting an fd leak pattern
Jul 27, 2024 An admission webhook that crashed my cluster
A validating webhook with a cycle of dependencies prevented its own webhook pods from being rescheduled, and the cluster froze
Jul 2, 2024 Debugging a crashloop pod the way I always do it
The kubectl commands I run, in order, when a pod is in CrashLoopBackOff, from describe to previous logs to the fix
Jun 26, 2024 Terraform state locks and the S3 bucket that wouldn't let go
A CI job was killed mid-apply and left a DynamoDB lock behind, and the recovery taught me to be much more careful about force-unlock
May 18, 2024 A TCP RST that took a week to track down
A long-lived HTTP connection got RST every 12 hours, and the answer lived in the intersection of conntrack, a load balancer, and a very patient test
May 8, 2024 An operator reconcile loop that wouldn't quit
An operator kept thrashing at 300 reconciles per second, and the bug was a single annotation I was setting on the managed resource
Apr 29, 2024 systemd timers and the clock drift that ate our backups
Our backups stopped running for nine days and the cause was a quiet combination of OnCalendar, RandomizedDelaySec, and a drifting RTC
Apr 20, 2024 Why my homelab Pi-hole kept forgetting its DNS override
A local DNS override on my Pi-hole would come back for a day and then mysteriously vanish, and the culprit was a docker-compose restart policy
Apr 11, 2024 Three days of debugging a cgroup memory accounting bug
A service kept getting OOM-killed with plenty of memory headroom, and the trail led into the cgroup v2 memory controller and its file-backed accounting