A TCP RST that took a week to track down
Our event processor maintained a single long-lived HTTP connection to an upstream service. It was supposed to be cheap: open once, stream forever. Instead we kept seeing reconnects exactly every 12 hours, each preceded by a TCP RST. Nothing in our code explained it. Nothing in the upstream’s docs explained it. I spent most of a week chasing this.
First observations
From the application logs:
2024-05-10T03:14:22Z INFO reconnecting upstream: unexpected EOF
2024-05-10T15:14:23Z INFO reconnecting upstream: unexpected EOF
2024-05-11T03:14:23Z INFO reconnecting upstream: unexpected EOF
That is too regular to be anything but a timer. The one-second jitter is the tell of something firing on a hard schedule but being handled asynchronously.
I set up a tcpdump capture on the client side, pinned to the specific remote IP:
tcpdump -i any -s 0 -w /var/log/pcaps/upstream-$(date +%s).pcap \
'host 10.42.11.8 and tcp port 443'
Let it run for a day. The pcap showed the server sending a RST, not a FIN, which rules out a clean close. The TTL on the RST was consistent with the upstream’s network, so it was not an obvious middlebox injecting the reset.
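If you want to pull just the teardown packets back out of a capture like that, something along these lines does it (the pcap path below is a stand-in for whatever the capture above actually wrote):
# Stand-in path; point this at the file the capture produced.
PCAP=/var/log/pcaps/upstream-example.pcap
# -nn keeps addresses and ports numeric, -v prints the TTL, and the filter
# keeps only segments with RST or FIN set, so the teardown is easy to spot.
tcpdump -nn -v -r "$PCAP" 'tcp[tcpflags] & (tcp-rst|tcp-fin) != 0'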
The first red herring
I checked keepalive settings on both sides. The connection was TLS, HTTP/1.1 with Connection: keep-alive. The client had tcp_keepalive_time at 7200 seconds (2 hours) and a probe interval of 75 seconds; the server had a 60-second keepalive timeout. That mismatch is a well-known recipe for a surprise RST, but it would fire far more often than every 12 hours. The regularity did not fit a keepalive problem.
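For reference, these are the client-side knobs I mean; the values in the comments are the ones quoted above plus the stock Linux probe count, not a recommendation:
sysctl net.ipv4.tcp_keepalive_time     # 7200 - idle seconds before the first probe
sysctl net.ipv4.tcp_keepalive_intvl    # 75   - seconds between probes
sysctl net.ipv4.tcp_keepalive_probes   # 9    - stock default; unanswered probes before giving up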
The second red herring
I checked the upstream’s load balancer for a max connection lifetime. Their docs mentioned an “idle” timeout of 350 seconds. No max lifetime advertised. I emailed their team. They did not reply for three days, which is why this took a week.
What I actually found
I noticed that the interval was not exactly 12 hours. It was drifting slightly. I plotted the reconnect times and fit a line. The slope was a couple of seconds per day. That killed “some server-side cron”, because cron does not drift. It pointed at a timeout somewhere whose clock was starting from when the connection opened.
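Getting the intervals out of the log is a small job. A rough sketch, assuming the log lines look like the ones above and sit in a file I will call app.log:
# Pull the reconnect timestamps, convert to epoch seconds, and print the
# gap between consecutive reconnects. A constant gap smells like a schedule;
# a slowly drifting gap smells like a timer that starts at connection open.
grep 'reconnecting upstream' app.log \
  | awk '{print $1}' \
  | while read -r ts; do date -d "$ts" +%s; done \
  | awk 'NR > 1 { print $1 - prev } { prev = $1 }'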
The client boxes had conntrack. I ran:
sysctl net.netfilter.nf_conntrack_tcp_timeout_established
# net.netfilter.nf_conntrack_tcp_timeout_established = 43200
43200 seconds is 12 hours. There is the number.
What was happening: our outbound traffic went through a conntrack-stateful NAT box (a VM handling egress for our lab network). If conntrack sees no packet in either direction for 12 hours, it expires the entry and drops the flow on the floor. Our connection was not actually idle for 12 hours; bytes moved both ways regularly. But, and this seemed like the subtle part, the stream had a lopsided pattern: our client sent only occasional application data, the upstream sent back a lot, and periodic HTTP-level keepalive probes kept the connection alive from the application’s point of view without, I assumed, crossing the egress box in a way that refreshed conntrack’s established timer.
Actually, that is what I thought at first. Watching the pcap more carefully, packets were flowing regularly, so the conntrack timer should have been getting refreshed. But conntrack -L on the NAT box showed:
conntrack -L | grep 10.42.11.8
# tcp 6 43199 ESTABLISHED src=10.0.0.12 dst=10.42.11.8 sport=49314 dport=443 ...
Watching that entry over time, the timeout only ever counted down; the packets flowing through it never pushed it back toward 43200. The counter was not being refreshed at all. That was the real bug. And it was because the NAT box had nf_conntrack_tcp_loose = 0 combined with an earlier problem in our traffic: there had been a brief asymmetry during a failover six hours earlier, and some packets had taken a different path. conntrack had seen a sequence number jump it did not like and had started treating the flow’s packets as INVALID, meaning they no longer refreshed the timer. It still let the packets through (because our rules did not drop INVALID), but they did not reset the expiry.
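The cheap way to do that watching, assuming you can run conntrack on the NAT box, is to sample the entry’s remaining timeout twice and see whether traffic pushes it back up:
# The third column of a conntrack entry is the remaining timeout in seconds.
# Sample it, wait five minutes while traffic flows, sample it again.
# Refreshed: the second value is back near 43200. Not refreshed: it is ~300 lower.
conntrack -L -d 10.42.11.8 2>/dev/null | awk '{print $3}'
sleep 300
conntrack -L -d 10.42.11.8 2>/dev/null | awk '{print $3}'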
The fix
Three things:
- Bump nf_conntrack_tcp_timeout_established to 86400 seconds. An extra day of buffer costs nothing but table memory.
- Set nf_conntrack_tcp_loose = 1 on the egress box so that it re-syncs conntrack state if a flow shows up mid-stream. This is not a universal recommendation (it can hide real issues), but for our egress it is fine. Both sysctls are sketched below.
- Move the long-lived connection to gRPC with HTTP/2 PING frames at 30-second intervals. Even if conntrack gets grumpy, the PINGs push packets and are defensible in terms of byte cost.
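The sysctl half of that, roughly as we persisted it; the file name is ours, and the values are the ones from the list above:
# /etc/sysctl.d/90-egress-conntrack.conf (our name for it)
# Established flows get a full day before conntrack forgets them.
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
# Re-sync tracking state for flows that show up mid-stream instead of
# silently refusing to refresh them. Fine for this egress box, not a blanket default.
net.netfilter.nf_conntrack_tcp_loose = 1
# Apply without a reboot:
sysctl --system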
Reflection
I learned to plot reconnect intervals early. Drift versus constant periodicity tells you whether it is a timer from connection start or a scheduled action. I also learned that conntrack’s INVALID state is silent: the flow keeps working, the timer just stops ticking. I now alert on nf_conntrack invalid packet counters because they are usually a leading indicator of something like this.
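The counters I watch come from conntrack’s own stats; something like this is enough to feed a check (the alert threshold is whatever fits your traffic):
# Per-CPU tracking stats. The invalid= field counts packets conntrack saw
# but refused to associate with a tracked flow - exactly the state that
# stops the timer from refreshing while everything still "works".
conntrack -S | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^invalid=/) print $1, $i }'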
If you are running a stateful NAT in front of anything that holds a connection for more than an hour, print that timeout on your wall. Our team now has a diagram of every conntrack-managed hop in our network with its expiry value. Related: see my post on MTU, MSS and a VPN that couldn’t stream video for another “everything was working except the bits that weren’t” story.