Finding TCP retransmits with bpftrace

Every so often a server on the oncall rotation shows elevated TCP retransmit counters in netstat -s or in the TcpRetransSegs metric. Knowing that the count is up is not very helpful; knowing which process and which peer are behind it is actionable. Here is the bpftrace script I keep in ~/ops/bpf/ for this.
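To put a number on "elevated" before reaching for bpftrace, here is a minimal Python sketch that compares two snapshots of /proc/net/snmp. It assumes the usual layout of that file (a header row of field names followed by a row of values per protocol). The function names and the sample numbers are made up for illustration. OutSegs does not count retransmitted segments, so the result is a ratio of retransmitted to originally sent segments rather than a strict fraction.

```python
def parse_snmp(text):
    """Return {protocol: {field: int}} from /proc/net/snmp-style text."""
    tables = {}
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Rows come in pairs: field names first, then the matching values.
    for header, values in zip(rows[0::2], rows[1::2]):
        proto = header[0].rstrip(":")
        tables[proto] = dict(zip(header[1:], (int(v) for v in values[1:])))
    return tables


def retrans_ratio(before, after):
    """Retransmitted segments per originally sent segment between two snapshots."""
    sent = after["Tcp"]["OutSegs"] - before["Tcp"]["OutSegs"]
    retrans = after["Tcp"]["RetransSegs"] - before["Tcp"]["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


# Made-up snapshots; on a real box, read /proc/net/snmp twice a few seconds apart.
HEADER = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
          "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
          "InErrs OutRsts InCsumErrors\n")
BEFORE = HEADER + "Tcp: 1 200 120000 -1 100 200 3 4 50 900000 1000000 500 0 10 0\n"
AFTER = HEADER + "Tcp: 1 200 120000 -1 101 202 3 4 51 915000 1020000 700 0 10 0\n"

ratio = retrans_ratio(parse_snmp(BEFORE), parse_snmp(AFTER))
print(f"retransmit ratio over the interval: {ratio:.2%}")
```

A ratio like this only says that something is wrong on the box, not where, which is what the script below is for.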

The script

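#!/usr/bin/env bpftrace

// Definitions of struct sock and AF_INET. With BTF, only AF_INET needs a header.
#ifndef BPFTRACE_HAVE_BTF
#include <linux/socket.h>
#include <net/sock.h>
#else
#include <sys/socket.h>
#endif

kprobe:tcp_retransmit_skb
{
    $sk = (struct sock *)arg0;
    $family = $sk->__sk_common.skc_family;
    // IPv4 only; other address families are ignored.
    if ($family == AF_INET) {
        // skc_dport is stored in network byte order.
        $dport = bswap($sk->__sk_common.skc_dport);
        $daddr = ntop($sk->__sk_common.skc_daddr);
        $saddr = ntop($sk->__sk_common.skc_rcv_saddr);
        printf("%s pid=%d %s -> %s:%d\n", comm, pid, $saddr, $daddr, $dport);
        @retrans[comm, $daddr, $dport] = count();
    }
}

// Every 10 seconds, print the per-process, per-peer counts and start over.
interval:s:10 {
    print(@retrans);
    clear(@retrans);
}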

Save as tcp_retransmits.bt, run with sudo bpftrace tcp_retransmits.bt. Output:

nginx pid=12411 10.20.1.5 -> 10.20.7.23:8081
nginx pid=12411 10.20.1.5 -> 10.20.7.23:8081
kafka pid=13009 10.20.1.5 -> 10.20.9.112:9093
...
@retrans[nginx, 10.20.7.23, 8081]: 147
@retrans[kafka, 10.20.9.112, 9093]: 12

That’s usually enough to know where the bleeding is. In our case last week, the nginx-to-upstream path was seeing 147 retransmits in 10 seconds, and the upstream turned out to be a misbehaving instance with long GC pauses.

Why bpftrace over tcpdump

tcpdump can certainly show retransmits. The problem is volume. A busy server moves enough packets that filtering retransmits in userspace is painful, and you usually end up with huge pcaps and grep-fu. The bpftrace probe runs in the kernel and only fires on the specific event we care about (tcp_retransmit_skb), so the overhead is trivial on all but the most pathological workloads.

Also, bpftrace gives us process context. tcpdump cannot tell you “these retransmits came from pid 12411”, because by the time a packet is on the wire the task context is gone. The kprobe fires inside the kernel with the socket in hand, and often while the task that owns the socket is still the current one (see the caveat below).

A caveat about process attribution

bpftrace’s pid and comm on this probe describe whichever task is on the CPU when tcp_retransmit_skb fires. That may be the process that owns the socket, but on sendpage or softirq paths it can be a kthread or an unrelated task. You will sometimes see ksoftirqd/3 (or swapper/N, the idle task) show up. That is a sign that the retransmit fired in softirq context and we did not get the owning task. You can usually still identify the connection by correlating on the peer address.
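One way to do that correlation is to group the per-event lines by peer and ignore the comm whenever it looks like a kernel thread. The Python below is a minimal sketch, assuming the line format produced by the printf in the script above. The ksoftirqd/3 sample line, the prefix list and the function name are illustrative assumptions, not captured output.

```python
import re
from collections import defaultdict

# Line format from the script's printf: "<comm> pid=<pid> <saddr> -> <daddr>:<dport>"
EVENT_RE = re.compile(
    r"^(?P<comm>.+) pid=(?P<pid>\d+) (?P<saddr>\S+) -> (?P<daddr>\S+):(?P<dport>\d+)$"
)

# Assumption: comms with these prefixes mean the owning task was not captured.
KERNEL_PREFIXES = ("ksoftirqd/", "swapper/")


def group_by_peer(lines):
    """Count events per (daddr, dport) and collect the non-kernel comms seen."""
    peers = defaultdict(lambda: {"count": 0, "comms": set()})
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # skip map dumps and anything else that is not an event line
        peer = peers[(m["daddr"], int(m["dport"]))]
        peer["count"] += 1
        if not m["comm"].startswith(KERNEL_PREFIXES):
            peer["comms"].add(m["comm"])
    return dict(peers)


SAMPLE = [
    "nginx pid=12411 10.20.1.5 -> 10.20.7.23:8081",
    "ksoftirqd/3 pid=35 10.20.1.5 -> 10.20.7.23:8081",  # hypothetical softirq hit
    "kafka pid=13009 10.20.1.5 -> 10.20.9.112:9093",
]

for (daddr, dport), info in group_by_peer(SAMPLE).items():
    owners = ", ".join(sorted(info["comms"])) or "unknown"
    print(f"{daddr}:{dport} retransmits={info['count']} likely owner: {owners}")
```

The softirq-context event still counts toward the peer's total, and the owner comes from the events where a real process was current.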

For serious attribution I have a longer script that joins against the socket’s owning task via sock->sk_socket->file->f_owner, but it relies on kernel internals that drift between versions, so I stopped bothering.

Counting by TCP state

I sometimes want to break it down by which state the connection was in. There is a tracepoint for that:

#!/usr/bin/env bpftrace

tracepoint:tcp:tcp_retransmit_skb
{
    @[args->state] = count();
}

With state values from include/net/tcp_states.h:

1  ESTABLISHED
2  SYN_SENT
3  SYN_RECV
4  FIN_WAIT1
...

When SYN_SENT retransmits dominate, you have a connection-establishment problem (firewall, remote down, etc.). When ESTABLISHED retransmits dominate, you have a packet-loss problem mid-stream. The treatment is different.
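The map that the tracepoint script prints is keyed by the raw state number, so a small lookup makes it easier to read. The Python sketch below assumes bpftrace prints the map as lines of the form @[1]: 140. The state names follow include/net/tcp_states.h with the TCP_ prefix dropped. The sample counts and the function name are made up.

```python
import re

# State numbers from include/net/tcp_states.h, without the TCP_ prefix.
TCP_STATES = {
    1: "ESTABLISHED",
    2: "SYN_SENT",
    3: "SYN_RECV",
    4: "FIN_WAIT1",
    5: "FIN_WAIT2",
    6: "TIME_WAIT",
    7: "CLOSE",
    8: "CLOSE_WAIT",
    9: "LAST_ACK",
    10: "LISTEN",
    11: "CLOSING",
    12: "NEW_SYN_RECV",
}

# Assumed shape of one printed map entry: "@[<state>]: <count>"
MAP_LINE_RE = re.compile(r"^@\[(\d+)\]:\s+(\d+)$")


def label_states(output):
    """Turn bpftrace map output keyed by state number into {state name: count}."""
    counts = {}
    for line in output.splitlines():
        m = MAP_LINE_RE.match(line.strip())
        if m:
            state = int(m.group(1))
            counts[TCP_STATES.get(state, f"UNKNOWN({state})")] = int(m.group(2))
    return counts


SAMPLE_OUTPUT = "@[2]: 3\n@[1]: 140\n"  # made-up counts

for name, count in sorted(label_states(SAMPLE_OUTPUT).items(), key=lambda kv: -kv[1]):
    print(f"{name:<12} {count}")
```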

Combining with ss

Once bpftrace points me at a peer, I use ss to look at the specific connection details:

ss -tin dst 10.20.7.23
# State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
# ESTAB   0       4194304 10.20.1.5:42312     10.20.7.23:8081
#       cubic wscale:7,7 rto:230 rtt:15.324/4.112 mss:1410 pmtu:1500
#       rcvmss:536 advmss:1448 cwnd:10 ssthresh:10 bytes_sent:18MB
#       bytes_retrans:3MB segs_out:12384 segs_in:9422 data_segs_out:12310
#       send 7.4Mbps lastsnd:0 lastrcv:12 lastack:0 pacing_rate 14.9Mbps
#       delivery_rate 8.1Mbps delivered:11104 app_limited busy:23ms
#       retrans:0/1276 rcv_rtt:27.3 rcv_space:14480 minrtt:8.2

retrans:0/1276 means “0 retransmitted segments currently outstanding, 1276 retransmits over the lifetime of this connection”. On a connection that has sent only about 12k segments (segs_out:12384), that is a retransmit rate of roughly 10%, which is enormous. That settles it: the peer is bad.
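The arithmetic behind that 10% figure is simply lifetime retransmits divided by segs_out. As a minimal Python sketch, assuming the ss -tin text for a single connection is passed in (the function name is illustrative, and it only looks at the first retrans: and segs_out: fields it finds):

```python
import re


def retrans_ratio(ss_text):
    """Lifetime retransmits divided by segs_out, or None if either field is missing."""
    # \b keeps these from matching inside bytes_retrans: and data_segs_out:.
    total = re.search(r"\bretrans:\d+/(\d+)", ss_text)
    segs = re.search(r"\bsegs_out:(\d+)", ss_text)
    if not total or not segs or int(segs.group(1)) == 0:
        return None
    return int(total.group(1)) / int(segs.group(1))


# The two relevant lines from the ss output shown above.
SS_SAMPLE = """\
      bytes_retrans:3MB segs_out:12384 segs_in:9422 data_segs_out:12310
      retrans:0/1276 rcv_rtt:27.3 rcv_space:14480 minrtt:8.2
"""

ratio = retrans_ratio(SS_SAMPLE)
print(f"retransmit ratio: {ratio:.1%}")
```

ss omits the retrans: field on connections that have never retransmitted, which is why the function returns None rather than 0 in that case.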

Running this without installing bpftrace

bpftrace is usually available as a distro package (apt install bpftrace on Debian 12+, dnf install bpftrace on Fedora). On an older system where bpftrace is not packaged, the static binary from upstream works in a pinch. I carry a copy in my ~/ops/bin for machines where I cannot install packages on a whim.

Reflection

This is one of those scripts that I wrote once, keep in my dotfiles, and run a couple of times a month. It has paid back its cost many times over. If you are responsible for Linux boxes where TCP retransmits ever show up in your dashboards, it is worth the ten minutes to learn bpftrace enough to write something like this.

Related: see my post on MTU, MSS and a VPN that couldn’t stream video for a case where retransmits were the first symptom but not the real issue.