Linux
-
Switching to TCP BBR on the edge
We flipped our edge servers from CUBIC to BBR and measured the actual win for our traffic, which was smaller than the marketing suggests
-
Capabilities bounded the wrong way
A service that could no longer bind to low ports after an innocent systemd change, and what I learned about capability sets
-
Testing routers with Linux network namespaces
A pattern I now use for testing every firewall and routing change before it touches the edge: cheap, repeatable, and on a laptop
-
io_uring surprised me in a benchmark
Replacing an epoll loop with io_uring gave 1.8x the throughput in a naive benchmark, and 0.9x in a realistic one
-
btrfs scrub found what smartctl missed
A disk that passed every SMART test had silent corruption in an old set of blocks, and btrfs scrub caught it two weeks before it would have mattered
-
The OOM killer picked postgres. Here's why.
A staging VM's OOM killer reliably picked postgres over a misbehaving test process, and the fix was understanding oom_score_adj
-
When journald ate my disk
A misconfigured journal rate-limit and a noisy process combined to fill a 100 GB disk in a week, and recovery was more interesting than it should have been
-
strace on a running production process, carefully
Attaching strace to a live process with -f -p, filtering syscalls, and spotting an fd leak pattern
-
Finding TCP retransmits with bpftrace
A short bpftrace script that pinpoints which process and peer are responsible for TCP retransmits on a noisy box
-
Tuning ZFS ARC on my Proxmox box
My Proxmox host kept ballooning to 60 GB of ARC and starving VMs, and the fix was not what I expected
-
A TCP RST that took a week to track down
A long-lived HTTP connection got RST every 12 hours, and the answer lived in the intersection of conntrack, a load balancer, and a very patient test
-
systemd timers and the clock drift that ate our backups
Our backups stopped running for nine days and the cause was a quiet combination of OnCalendar, RandomizedDelaySec, and a drifting RTC
-
Three days of debugging a cgroup memory accounting bug
A service kept getting OOM-killed with plenty of memory headroom, and the trail led into the cgroup v2 memory controller and its file-backed accounting
-
nftables rule ordering surprised me
A two-hour outage caused by a harmless-looking rule insertion into the wrong chain position, and what I learned about nftables evaluation
-
PLT and GOT: the indirection I never noticed
Dynamic linking on ELF is a two-step dance through the PLT and the GOT, and once you see it you cannot unsee it