Every few years an “everyone should be running TCP BBR” post makes the rounds. BBR is genuinely good for a lot of workloads. I finally got around to testing it properly on our edge fleet. The results were positive, with caveats worth spelling out.

Background

TCP congestion control is how a sender decides how fast to send. CUBIC, the Linux default for over a decade, uses loss as the primary signal; when a packet drops, back off. BBR (Bottleneck Bandwidth and RTT) uses measured bandwidth and minimum RTT to estimate the path’s capacity directly and paces to fit.

In theory, BBR does better on paths with shallow buffers, where a loss-based sender backs off before the pipe is full, and on paths with occasional loss that is not from congestion (wireless, some hops with AQM). On deep-buffered paths, CUBIC keeps growing its window until the buffer overflows, which is the classic bufferbloat pattern; BBR should keep the queue short instead.

Our traffic

Our edge boxes serve HTTPS to a global audience. Average response size is 15 KB (JSON API). P99 response size is 200 KB. Clients are distributed; a lot of mobile, some corporate, some home broadband. We had been on CUBIC forever.

The switch

Changing congestion control is a one-liner:

sysctl -w net.ipv4.tcp_congestion_control=bbr
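
A quick sanity check around that: see what the kernel offers and what it is actually using. tcp_bbr is usually built as a module and the kernel will normally load it on demand, but checking is cheap (the output here is what you would expect to see, not a capture from our boxes):

sysctl net.ipv4.tcp_available_congestion_control
# net.ipv4.tcp_available_congestion_control = reno cubic bbr
sysctl net.ipv4.tcp_congestion_control
# net.ipv4.tcp_congestion_control = bbr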

To persist:

echo 'net.ipv4.tcp_congestion_control=bbr' >> /etc/sysctl.d/99-bbr.conf
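
Settings in that file only take effect on the next boot unless you reload them, e.g.:

sysctl -p /etc/sysctl.d/99-bbr.conf
# or reload everything under /etc/sysctl.d/:
sysctl --system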

BBR’s pacing was originally tied to the fq qdisc (packet scheduler); kernels from 4.13 onwards can pace inside the TCP stack itself, so fq is recommended rather than strictly required. Check what you are running:

tc qdisc show
# qdisc fq_codel 0: dev eth0 root refcnt 2 ...

We were already on fq_codel by default on Debian 12, and it behaved fine. To make fq the default anyway:

sysctl -w net.core.default_qdisc=fq
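
Note that default_qdisc only applies when a qdisc is (re)created, i.e. for new interfaces or an explicit replace; to swap the root qdisc on a live interface, tc does it directly (eth0, the handle, and the parameters in the output are illustrative):

tc qdisc replace dev eth0 root fq
tc qdisc show
# qdisc fq 8001: dev eth0 root refcnt 2 limit 10000p flow_limit 100p ...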

How we measured

Rolled BBR to one edge box out of four. Left the others on CUBIC. Let real traffic flow for a week. Compared response time distributions via our metrics.
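
One extra check during that week: confirm the test box’s sockets really negotiated bbr. ss -ti prints per-socket TCP info, including the congestion controller and BBR’s bandwidth and min-RTT estimates (the port filter and the sample output below are illustrative):

ss -ti 'sport = :443' | grep -c bbr
# individual sockets show something roughly like:
#   bbr:(bw:18.5Mbps,mrtt:42.3,pacing_gain:2.88672,cwnd_gain:2.88672)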

edge-01 (BBR):
  p50:   28 ms
  p95:   142 ms
  p99:   310 ms

edge-02, 03, 04 (CUBIC):
  p50:   32 ms
  p95:   176 ms
  p99:   390 ms

P50 improvement: 12%. P95: 19%. P99: 20%. These are response-time percentiles over a full week of real traffic; the tail is dominated by the larger responses.

Who got the win

Not everyone. Breaking down by client geography and ASN, the clients that saw the biggest win were:

  • Mobile on cellular (expected: some loss, BBR shines)
  • Long-distance connections (high BDP, BBR’s correct pacing helps fill the pipe)
  • Clients behind flaky Wi-Fi (similar to mobile)

Clients that saw essentially no change:

  • Other cloud providers in the same data centre
  • Local clients on our own network
  • Any path with very low RTT

This matches the theory. BBR is not magic for short, fat, clean paths. It is magic for long, thin, or lossy paths. If your users all live in one data centre, you will see no win.

The bufferbloat test

Independent of BBR, I wanted to know whether my edge was suffering from bufferbloat. flent has an rrul test that characterizes the problem:

flent rrul -H edge-01.example.net -l 30 -t our-edge-bbr
flent rrul -H edge-02.example.net -l 30 -t our-edge-cubic
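
If you want the classic four-panel rrul plot straight out of the run, flent will render it for you; the plot name and output file here are just examples:

flent rrul -H edge-01.example.net -l 30 -t our-edge-bbr -p all_scaled -o edge-01-rrul.png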

The results showed the CUBIC box had noticeable induced latency under load (a classic bufferbloat signature). The BBR box held latency much more steadily under the same load. For interactive workloads on the same TCP connection (not our case, but a reminder) this would matter a lot.

What didn’t change

  • Throughput for large file transfers was roughly equivalent. Both congestion controllers fill the pipe; BBR just fills it more evenly.
  • Connection establishment time (the 3-way handshake) is not affected by the congestion controller.
  • TLS handshake time is not affected either, beyond whatever small amount is shaved off by having less bufferbloat on the path.

Caveats

  • BBRv1 can be unfair to CUBIC flows on the same bottleneck. If your server is sharing a bottleneck with CUBIC consumers, BBR may consume more than its share. For a public edge box, this is fine; for a more controlled environment you might want to think about it.
  • BBR version matters. The default on current Linux kernels is still BBRv1. BBRv2 is in some kernels behind a flag and is better about fairness. BBRv3 is in development.
  • BBR does not help with loss-sensitive flows. If you are doing real-time video over RTP, that is a different animal.
  • Some middleboxes are surprised by BBR’s pacing and may mangle it. I have not seen this in our environment but I have read about it happening.

Sysctl tuning alongside

BBR benefits from some adjacent tuning. We had already done most of this:

# allow larger TCP windows
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# enable window scaling
net.ipv4.tcp_window_scaling = 1

# fq qdisc for BBR pacing
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# TCP Fast Open where we can
net.ipv4.tcp_fastopen = 3

The TCP buffer max values want to be large enough that the bandwidth-delay product (BDP) of your largest flow fits. 16 MB is ample for most public-internet paths.
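
As a rough sanity check on that 16 MB (the link speeds and RTTs here are illustrative, not measurements from our fleet):

# BDP = bandwidth x RTT
# 100 Mbit/s x 200 ms = 12.5 MB/s x 0.2 s = 2.5 MB   (well inside 16 MB)
# 1 Gbit/s x 100 ms   = 125 MB/s x 0.1 s  = 12.5 MB  (close to the ceiling)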

Rollout

We rolled BBR to all edge boxes over a week, one at a time, checking that P95/P99 improved or at least did not regress. On the fourth box we saw a minor regression on one specific ASN (a corporate VPN that exits somewhere weird); we moved on anyway because the overall curve was strongly positive.

Reflection

BBR is a worthwhile switch if your traffic patterns match ours. The 20% P99 improvement for real public internet traffic is legitimately nice. For internal-to-internal or data-centre-local traffic, the change is effectively nothing. If anyone tells you BBR is a flat “always better”, they are oversimplifying.

Simple rule: if you serve mobile or geographically distant users and you are on Linux, try BBR. Measure. Keep it if it helps. Related: see my posts on finding TCP retransmits with bpftrace and on MTU, MSS and a VPN saga for more TCP-layer fun.