Posts
Things I’ve written that are long enough to deserve an introduction. Mostly backend engineering, with the occasional detour into photography and espresso machines.
-
Debugging DNS in a kind cluster
CoreDNS inside a kind cluster could resolve cluster names but not external ones, and the problem was in the host's resolver, not k8s
-
btrfs scrub found what smartctl missed
A disk that passed every SMART test had silent corruption in an old set of blocks, and btrfs scrub caught it two weeks before it would have mattered
-
The OOM killer picked postgres. Here's why.
A staging VM's OOM killer reliably picked postgres over a misbehaving test process, and the fix was understanding oom_score_adj
-
MTU, MSS, and a VPN that couldn't stream video
A perfectly working WireGuard tunnel that failed only on video streams, and the diagnosis that made me finally understand MSS clamping
-
A TLS SAN quirk that broke mTLS
An internal service stopped accepting a client cert after a seemingly innocent renewal, and the issue hid in the SAN encoding
-
When journald ate my disk
A misconfigured journal rate limit and a noisy process combined to fill a 100 GB disk in a week, and recovery was more interesting than it should have been
-
Nomad vs k8s for a homelab in 2024
I ran a small Nomad cluster next to my k3s for a month to compare, and I have unsurprising opinions
-
ndots:5 and the DNS tax in Kubernetes
Every non-cluster DNS lookup in our pods was paying for five failed attempts first, and lowering ndots cut tail latency significantly
-
An admission webhook that crashed my cluster
A validating webhook with a circular dependency prevented its own pods from being rescheduled, and the cluster froze
-
Putting IPv6 on my home network, finally
After years of excuses I enabled IPv6 end to end at home, and most of the friction came from devices I did not expect
-
Finding TCP retransmits with bpftrace
A short bpftrace script that pinpoints which process and peer are responsible for TCP retransmits on a noisy box
-
Terraform state locks and the S3 bucket that wouldn't let go
A CI job was killed mid-apply and left a DynamoDB lock behind, and the recovery taught me to be much more careful about force-unlock
-
Tuning ZFS ARC on my Proxmox box
My Proxmox host kept ballooning to 60 GB of ARC and starving VMs, and the fix was not what I expected
-
Why I finally switched from nginx to Caddy
After a decade on nginx, two weekends of YAML and a lingering distaste for certbot cronjobs made me try Caddy for my homelab ingress
-
WireGuard vs Tailscale in my homelab after a year
After a year of running both, here's where each one earned its keep and where I'd pick differently next time