I was curious whether io_uring could speed up a small service we have that mostly shuffles bytes between sockets and disk. I benchmarked. The numbers were surprising in both directions, depending on what I measured.

The service

A log-processing daemon: reads newline-delimited JSON from a TCP socket, validates, appends to a file, acks back on the socket. Single-threaded, CPU-modest, I/O-heavy. Written in C with an epoll loop. I wrote an io_uring variant to compare.
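
For context, here is roughly the shape of the epoll loop (a simplified sketch, not the real daemon; handle_event stands in for the read/validate/append/ack path):

int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

for (;;) {
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, -1);  /* block until something is readable */
    for (int i = 0; i < n; i++)
        handle_event(&events[i]);              /* accept, or read -> validate -> append -> ack */
}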

The naive benchmark

Stream 10 million small log lines from a local client on localhost, measure throughput.

# epoll version
./logd-epoll &
yes '{"ts":"now","level":"info","msg":"hello"}' | head -10000000 | \
  nc -N 127.0.0.1 9000
# 10M lines / 12.4 s = 806k lines/s

# io_uring version
./logd-iouring &
yes '{"ts":"now","level":"info","msg":"hello"}' | head -10000000 | \
  nc -N 127.0.0.1 9000
# 10M lines / 6.9 s = 1.45M lines/s

1.8x improvement. I was feeling great. “io_uring for the win”, etc.

The realistic benchmark

Same thing but with the service running in its actual prod-ish configuration: disk is a non-local XFS mount, TLS is enabled on the inbound socket, and the client is a real producer that sends in batches with pauses.

./logd-tls-epoll &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 198.2k lines/s, cpu: 73%

./logd-tls-iouring &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 181.4k lines/s, cpu: 88%

Slower and more expensive. What happened?

Instrumenting

I ran both under perf stat -d to see what the CPU was doing:

perf stat -d ./logd-epoll ...
# 4,213,221,002 cycles
# 3,112,009,421 instructions ( 0.74 insns per cycle )
# 142,332 context-switches
# 12,401,001 cache-misses

perf stat -d ./logd-iouring ...
# 5,881,223,411 cycles
# 4,002,118,223 instructions ( 0.68 insns per cycle )
# 38,102 context-switches
# 14,221,003 cache-misses

Context switches were way down with io_uring, as expected. But total cycles were up, and cache misses went up too.

Reading the flame graph, the io_uring version was spending a nontrivial chunk of time in TLS record handling. Specifically, SSL_write on the OpenSSL path was being called many more times with smaller buffers. My io_uring loop handled each read's CQE the moment it arrived, and I was feeding the result of each read directly into SSL_write without buffering. The epoll version was implicitly batching: reads only completed at epoll_wait granularity, so each wakeup processed several lines at once.

In other words, my naive port of epoll to io_uring had unintentionally reduced batching. The service spent more time in TLS per byte.
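
To make the difference concrete, the two paths looked roughly like this (simplified; conn and the ack buffer management are placeholders):

/* naive io_uring port: one SSL_write per completed read,
 * so every small ack becomes its own TLS record */
SSL_write(conn->ssl, ack, ack_len);

/* batched: accumulate acks for the whole batch, then emit one record */
memcpy(conn->ackbuf + conn->acklen, ack, ack_len);
conn->acklen += ack_len;
/* ...after the batch is drained: */
SSL_write(conn->ssl, conn->ackbuf, conn->acklen);
conn->acklen = 0;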

The fix

Two changes:

  1. Batch io_uring completions before processing. Instead of processing each CQE as it becomes available, I now drain up to N CQEs per loop iteration and process them together:

    struct io_uring_cqe *cqe;
    struct io_uring_cqe batch[BATCH_SIZE];
    unsigned head;
    unsigned count = 0;

    /* drain up to BATCH_SIZE completions without advancing the ring yet */
    io_uring_for_each_cqe(&ring, head, cqe) {
        batch[count++] = *cqe;  /* copy out: the CQ slot is reused after advance */
        if (count >= BATCH_SIZE)
            break;
    }
    io_uring_cq_advance(&ring, count);  /* mark all drained CQEs seen at once */
    process_batch(batch, count);
    
  2. Use io_uring_prep_writev for the disk appends with a gathered vector, so multiple small writes from different sources collapse into a single I/O operation (sketched below).
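
Here is a sketch of the second change. The names are illustrative (logfd is the append-mode log file; line_buf/line_len pull payloads out of the processed batch):

struct iovec iov[BATCH_SIZE];
unsigned nvec = 0;
for (unsigned i = 0; i < count; i++) {
    iov[nvec].iov_base = line_buf(&batch[i]);  /* hypothetical accessors */
    iov[nvec].iov_len  = line_len(&batch[i]);
    nvec++;
}

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
/* offset -1 means "use the file's current position"; logfd is opened O_APPEND */
io_uring_prep_writev(sqe, logfd, iov, nvec, -1);
io_uring_submit(&ring);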

After:

./logd-tls-iouring-v2 &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 207.4k lines/s, cpu: 61%

Now io_uring wins by about 5% in throughput at noticeably lower CPU. Not the 1.8x my naive test promised, but still a win, and more importantly I now understand where the gain comes from.

What I took away

  • Much of io_uring’s benefit comes from batching, not just fewer syscalls. If your code does not batch to take advantage, you give part of that benefit back.
  • The naive “translate every read into an SQE” port is a known antipattern. The io_uring_prep_* functions are expressive enough that you can let the kernel do more work per completion.
  • SQPOLL mode is another lever. Setting IORING_SETUP_SQPOLL moves submission queue processing into a kernel thread, so userspace does not need to enter the kernel to submit. That helped another workload of mine by a measurable 10% (setup sketched below).
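
For reference, SQPOLL is a one-flag change at ring setup. A minimal sketch (QUEUE_DEPTH and the idle timeout are illustrative values):

struct io_uring ring;
struct io_uring_params params;
memset(&params, 0, sizeof(params));
params.flags = IORING_SETUP_SQPOLL;
params.sq_thread_idle = 2000;  /* ms of idle before the poller thread sleeps */

int ret = io_uring_queue_init_params(QUEUE_DEPTH, &ring, &params);
if (ret < 0)
    /* pre-5.11 kernels need elevated privileges for SQPOLL */
    fprintf(stderr, "sqpoll setup failed: %s\n", strerror(-ret));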

When io_uring is not the right choice

  • If your app is CPU-bound, io_uring does not help you; it cuts I/O submission overhead, not compute.
  • If your app depends on synchronous libraries (OpenSSL can be driven asynchronously, but most apps don’t), your I/O model may not map cleanly onto completion-based semantics.
  • On older kernels (<5.11 or so), bugs were real. On modern kernels (6.x) it is stable in my experience.
  • Your OS security posture may matter: some container runtimes disable io_uring via seccomp. Check runc’s defaults if you are planning to ship this (see the probe sketched below).
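
That last point is cheap to guard against: probe at startup and fall back to epoll. A minimal sketch (use_epoll_fallback is a placeholder for whatever your fallback path is):

struct io_uring probe;
int ret = io_uring_queue_init(8, &probe, 0);
if (ret < 0) {
    /* typically -EPERM when seccomp blocks io_uring_setup,
     * -ENOSYS on kernels without io_uring */
    use_epoll_fallback();
} else {
    io_uring_queue_exit(&probe);  /* io_uring works; use the uring path */
}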

Reflection

Benchmarks are a great way to learn what a feature actually does. The “1.8x win on localhost” number would have been a very misleading pull quote if I had published that alone. Your workload is your workload. If you want to know whether io_uring helps, port the hot path, instrument, batch correctly, and measure under realistic conditions.

Related: see my post on finding TCP retransmits with bpftrace for another “measure what the kernel is actually doing” tool.