I’ve been hoarding useful bpftrace one-liners the way other people hoard sourdough starters. This is one of them.

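The probe attaches to tcp_retransmit_skb and aggregates by (comm, remote_ip). In English: every time the kernel has to retransmit a TCP segment, bump a counter keyed on which process and which peer were involved. Let it run for a minute or two, hit Ctrl-C, and bpftrace dumps the map sorted by count, biggest offenders last. One caveat: retransmits are often triggered from timer or softirq context, so comm is whatever task was on the CPU at that moment, which is not guaranteed to be the socket’s owner. The remote IP is the more trustworthy half of the key.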

Look at the output. Most of the processes have single-digit retransmit counts, which is just the normal cost of TCP on a network with occasional packet loss. The interesting line is app-api -> 10.0.2.41: 219. That’s roughly two orders of magnitude above the single-digit baseline, and 10.0.2.41 also shows up with an elevated count for postgres (the replica). Two different processes struggling with the same peer points at the path to that host rather than at either process.
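The "compare against the baseline" step can be sketched in a few lines of Python. This is a minimal illustration, not part of the original investigation: it assumes the map dump uses bpftrace's usual `@[key1, key2]: value` line format, the function names and the 10x-median threshold are my own choices, and every count in the sample except 219 is made up for the example.

```python
import re
from statistics import median

# Matches map lines such as "@[app-api, 10.0.2.41]: 219"
# (an optional map name after "@" is allowed).
LINE_RE = re.compile(
    r"^@\w*\[(?P<comm>[^,]+),\s*(?P<ip>[^\]]+)\]:\s*(?P<count>\d+)\s*$"
)

def parse_counts(dump):
    """Turn a bpftrace map dump into {(comm, remote_ip): count}."""
    counts = {}
    for line in dump.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            key = (m["comm"].strip(), m["ip"].strip())
            counts[key] = int(m["count"])
    return counts

def outliers(counts, factor=10):
    """Return (key, count) pairs at least `factor` times the median, largest first.

    The threshold is a hypothetical rule of thumb, not a standard.
    """
    if not counts:
        return []
    baseline = max(median(counts.values()), 1)
    flagged = [(k, v) for k, v in counts.items() if v >= factor * baseline]
    return sorted(flagged, key=lambda kv: kv[1], reverse=True)

# Illustrative dump: only the app-api count (219) comes from the post.
SAMPLE_DUMP = """\
@[sshd, 10.0.1.7]: 2
@[node_exporter, 10.0.1.9]: 3
@[nginx, 10.0.3.12]: 4
@[redis-server, 10.0.3.20]: 6
@[postgres, 10.0.2.41]: 57
@[app-api, 10.0.2.41]: 219
"""

for (comm, ip), n in outliers(parse_counts(SAMPLE_DUMP)):
    print(f"{comm} -> {ip}: {n}")
```

With only a handful of rows the median is easy to skew, so treat the flagged list as a prompt for where to look next, not as a verdict.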

The follow-up command, ss -tni dst 10.0.2.41, turns up the smoking gun: one ESTAB connection to the database with retrans:0/38. The first number is retransmits currently outstanding and the second is the total over the life of the connection, so that is 38 retransmits on a single connection. RTT is 600 ms to a DB on the same subnet, where I’d expect well under a millisecond. The comment at the end is my actual conclusion: this is almost certainly a flaky NIC, cable, or switch port.
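To make the two fields concrete, here is a small Python sketch that pulls `retrans:` and `rtt:` out of one info line of `ss -tni` output. It is an assumption-laden illustration: the function name and dictionary keys are hypothetical, and in the sample line only rtt 600 and retrans:0/38 come from the post; the other fields and the rttvar value are invented filler.

```python
import re

# ss prints "retrans:<outstanding>/<total>" and "rtt:<rtt>/<rttvar>" (milliseconds).
# The \b keeps these from matching inside fields like bytes_retrans: or minrtt:.
RETRANS_RE = re.compile(r"\bretrans:(\d+)/(\d+)")
RTT_RE = re.compile(r"\brtt:([\d.]+)/([\d.]+)")

def parse_tcp_info(line):
    """Extract retransmit and RTT fields from one `ss -i` info line."""
    info = {}
    m = RETRANS_RE.search(line)
    if m:
        info["retrans_outstanding"] = int(m.group(1))
        info["retrans_total"] = int(m.group(2))
    m = RTT_RE.search(line)
    if m:
        info["rtt_ms"] = float(m.group(1))
        info["rttvar_ms"] = float(m.group(2))
    return info

# Illustrative line: only rtt 600 and retrans:0/38 are taken from the post.
SAMPLE_INFO_LINE = "cubic wscale:7,7 rto:1408 rtt:600/45 mss:1448 cwnd:2 retrans:0/38"

info = parse_tcp_info(SAMPLE_INFO_LINE)
print(info)
```

A line with no retransmits has no `retrans:` field at all, which is why the parser returns only the keys it actually found.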

Two things I want to call out:

  • bpftrace is the shortest path between “I have a theory” and “I have data.” The whole investigation is one probe and one ss call.
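  • The probe only runs when a retransmit happens, and even on this bad day that was a few hundred events over a minute or two, so its overhead for this workload is negligible. You can leave it running for the duration of the incident.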

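If you do one thing differently from me, learn where the struct field paths live (here args->sk->__sk_common.skc_daddr, the peer’s IPv4 address; IPv6 peers live in skc_v6_daddr). The kernel’s BTF type information has it all, and bpftrace -lv can print a probe’s arguments and struct layouts from it.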