io_uring surprised me in a benchmark
I was curious whether io_uring could speed up a small service we have that mostly shuffles bytes between sockets and disk. I benchmarked, and the numbers were surprising in both directions depending on what I measured.
The service
A log-processing daemon: it reads newline-delimited JSON from a TCP socket, validates each line, appends to a file, and acks back on the socket. Single-threaded, CPU-modest, I/O-heavy. Written in C with an epoll loop. I wrote an io_uring variant to compare.
The naive benchmark
Stream 10 million small log lines from a local client on localhost, measure throughput.
# epoll version
./logd-epoll &
yes '{"ts":"now","level":"info","msg":"hello"}' | head -10000000 | \
nc -N 127.0.0.1 9000
# 10M lines / 12.4 s = 806k lines/s
# io_uring version
./logd-iouring &
yes '{"ts":"now","level":"info","msg":"hello"}' | head -10000000 | \
nc -N 127.0.0.1 9000
# 10M lines / 6.9 s = 1.45M lines/s
A 1.8x improvement. I was feeling great. "io_uring for the win", etc.
The realistic benchmark
Same thing but with the service running in its actual prod-ish configuration: disk is a non-local XFS mount, TLS is enabled on the inbound socket, and the client is a real producer that sends in batches with pauses.
./logd-tls-epoll &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 198.2k lines/s, cpu: 73%
./logd-tls-iouring &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 181.4k lines/s, cpu: 88%
Slower and more expensive. What happened?
Instrumenting
I ran both under perf stat -d to see what the CPU was doing:
perf stat -d ./logd-epoll ...
# 4,213,221,002 cycles
# 3,112,009,421 instructions ( 0.74 insns per cycle )
# 142,332 context-switches
# 12,401,001 cache-misses
perf stat -d ./logd-iouring ...
# 5,881,223,411 cycles
# 4,002,118,223 instructions ( 0.68 insns per cycle )
# 38,102 context-switches
# 14,221,003 cache-misses
Context switches were way down with io_uring, as expected. But total cycles were up, and cache misses went up too.
Reading the flame graph, the io_uring version was spending a nontrivial chunk of time in TLS record handling. Specifically, SSL_write on the OpenSSL path was being called far more often, with smaller buffers. The io_uring setup I had written submitted one SQE per read and handled each CQE individually, and I was feeding each read directly into SSL_write without buffering. The epoll version was implicitly batching, because reads only completed at epoll readiness granularity and each read drained several lines at once.
In other words, my naive port of epoll to io_uring had unintentionally reduced batching. The service spent more time in TLS per byte.
The fix
Two changes:
1. Batch io_uring completions before processing. Instead of processing each CQE as it becomes available, I now drain up to N CQEs per loop iteration and process them together:

struct io_uring_cqe *cqe;
unsigned head;
unsigned count = 0;
io_uring_for_each_cqe(&ring, head, cqe) {
    batch[count++] = *cqe;
    if (count >= BATCH_SIZE)
        break;
}
io_uring_cq_advance(&ring, count);
process_batch(batch, count);

2. Use io_uring_prep_writev for the disk appends with a gathered vector. This lets multiple small writes from different sources collapse into one syscall.
After:
./logd-tls-iouring-v2 &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 207.4k lines/s, cpu: 61%
Now io_uring wins by about 5% throughput at lower CPU. Not the 1.8x my naive test promised, but still a win, and more importantly I now understand where the gain comes from.
What I took away
- io_uring moves some of the benefit from “fewer syscalls” to “better batching”. If your code does not batch to take advantage, you give some of the benefit back.
- The naive "translate every read into its own SQE" port is a known antipattern. The io_uring_prep_* functions are expressive enough that you can let the kernel do more work per completion.
- SQPOLL mode is another lever. Setting IORING_SETUP_SQPOLL moves submission queue processing into a kernel thread, so userspace does not need to enter the kernel to submit. That helped another workload of mine by a measurable 10%.
When io_uring is not the right choice
- If your app is CPU-bound, io_uring does not help you.
- If your app uses synchronous libraries (OpenSSL can be used async but most apps don’t), your I/O model may not map cleanly onto completion-based semantics.
- On older kernels (<5.11 or so), bugs were real. On modern kernels (6.x) it is stable in my experience.
- Your OS security posture may matter: some container runtimes disable io_uring via seccomp. Check runc's defaults if you are planning to ship this.
Reflection
Benchmarks are a great way to learn what a feature actually does. The “1.8x win on localhost” number would have been a very misleading pull quote if I had published that alone. Your workload is your workload. If you want to know whether io_uring helps, port the hot path, instrument, batch correctly, and measure under realistic conditions.
Related: see my post on finding TCP retransmits with bpftrace for another “measure what the kernel is actually doing” tool.