I was curious whether io_uring could speed up a small service we have that mostly shuffles bytes between sockets and disk. I benchmarked. The numbers were surprising in both directions, depending on what I measured.

The service

A log-processing daemon: reads newline-delimited JSON from a TCP socket, validates, appends to a file, acks back on the socket. Single-threaded, CPU-modest, I/O-heavy. Written in C with an epoll loop. I wrote an io_uring variant to compare.
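
For context, here is roughly the shape of the epoll loop (a simplified sketch, not the real daemon; handle_event stands in for the read/validate/append/ack path):

int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

for (;;) {
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, -1);  /* block until something is readable */
    for (int i = 0; i < n; i++)
        handle_event(&events[i]);              /* accept, or read -> validate -> append -> ack */
}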

The naive benchmark

Stream 10 million small log lines from a local client on localhost, measure throughput.

# epoll version
./logd-epoll &
yes '{"ts":"now","level":"info","msg":"hello"}' | head -10000000 | \
  nc -N 127.0.0.1 9000
# 10M lines / 12.4 s = 806k lines/s

# io_uring version
./logd-iouring &
yes '{"ts":"now","level":"info","msg":"hello"}' | head -10000000 | \
  nc -N 127.0.0.1 9000
# 10M lines / 6.9 s = 1.45M lines/s

1.8x improvement. I was feeling great. “io_uring for the win”, etc.

The realistic benchmark

Same thing but with the service running in its actual prod-ish configuration: disk is a non-local XFS mount, TLS is enabled on the inbound socket, and the client is a real producer that sends in batches with pauses.

./logd-tls-epoll &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 198.2k lines/s, cpu: 73%

./logd-tls-iouring &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 181.4k lines/s, cpu: 88%

Slower and more expensive. What happened?

Instrumenting

I ran both under perf stat -d to see what the CPU was doing:

perf stat -d ./logd-epoll ...
# 4,213,221,002 cycles
# 3,112,009,421 instructions ( 0.74 insns per cycle )
# 142,332 context-switches
# 12,401,001 cache-misses

perf stat -d ./logd-iouring ...
# 5,881,223,411 cycles
# 4,002,118,223 instructions ( 0.68 insns per cycle )
# 38,102 context-switches
# 14,221,003 cache-misses

Context switches were way down with io_uring, as expected. But total cycles were up, and cache misses went up too.

Reading the flame graph, the io_uring version was spending a nontrivial chunk of time in TLS record handling. Specifically, SSL_write on the OpenSSL path was being called many more times with smaller buffers. My io_uring loop handled each read's CQE the moment it arrived, and I was feeding the result of each read directly into SSL_write without buffering. The epoll version was implicitly batching: reads only completed at epoll_wait granularity, so each wakeup processed several lines at once.

In other words, my naive port of epoll to io_uring had unintentionally reduced batching. The service spent more time in TLS per byte.
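
To make the difference concrete, the two paths looked roughly like this (simplified; conn and the ack buffer management are placeholders):

/* naive io_uring port: one SSL_write per completed read,
 * so every small ack becomes its own TLS record */
SSL_write(conn->ssl, ack, ack_len);

/* batched: accumulate acks for the whole batch, then emit one record */
memcpy(conn->ackbuf + conn->acklen, ack, ack_len);
conn->acklen += ack_len;
/* ...after the batch is drained: */
SSL_write(conn->ssl, conn->ackbuf, conn->acklen);
conn->acklen = 0;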

The fix

Two changes:

  1. Batch io_uring completions before processing. Instead of processing each CQE as it becomes available, I now drain up to N CQEs per loop iteration and process them together:

    struct io_uring_cqe *cqe;
    struct io_uring_cqe batch[BATCH_SIZE];
    unsigned head;
    unsigned count = 0;

    /* drain up to BATCH_SIZE completions without advancing the ring yet */
    io_uring_for_each_cqe(&ring, head, cqe) {
        batch[count++] = *cqe;  /* copy out: the CQ slot is reused after advance */
        if (count >= BATCH_SIZE)
            break;
    }
    io_uring_cq_advance(&ring, count);  /* mark all drained CQEs seen at once */
    process_batch(batch, count);
    
  2. Use io_uring_prep_writev for the disk appends with a gathered vector, so multiple small writes from different sources collapse into a single I/O operation (sketched below).
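
Here is a sketch of the second change. The names are illustrative (logfd is the append-mode log file; line_buf/line_len pull payloads out of the processed batch):

struct iovec iov[BATCH_SIZE];
unsigned nvec = 0;
for (unsigned i = 0; i < count; i++) {
    iov[nvec].iov_base = line_buf(&batch[i]);  /* hypothetical accessors */
    iov[nvec].iov_len  = line_len(&batch[i]);
    nvec++;
}

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
/* offset -1 means "use the file's current position"; logfd is opened O_APPEND */
io_uring_prep_writev(sqe, logfd, iov, nvec, -1);
io_uring_submit(&ring);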

After:

./logd-tls-iouring-v2 &
./bench-client --rate 200k --duration 60s --target prod-like-host:9000
# throughput: 207.4k lines/s, cpu: 61%

Now io_uring wins by about 5% in throughput at noticeably lower CPU. Not the 1.8x my naive test promised, but still a win, and more importantly I now understand where the gain comes from.

What I took away

  • Much of io_uring’s benefit comes from batching, not just fewer syscalls. If your code does not batch to take advantage, you give part of that benefit back.
  • The naive “translate every read into an SQE” port is a known antipattern. The io_uring_prep_* functions are expressive enough that you can let the kernel do more work per completion.
  • SQPOLL mode is another lever. Setting IORING_SETUP_SQPOLL moves submission queue processing into a kernel thread, so userspace does not need to enter the kernel to submit. That helped another workload of mine by a measurable 10% (setup sketched below).
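
For reference, SQPOLL is a one-flag change at ring setup. A minimal sketch (QUEUE_DEPTH and the idle timeout are illustrative values):

struct io_uring ring;
struct io_uring_params params;
memset(&params, 0, sizeof(params));
params.flags = IORING_SETUP_SQPOLL;
params.sq_thread_idle = 2000;  /* ms of idle before the poller thread sleeps */

int ret = io_uring_queue_init_params(QUEUE_DEPTH, &ring, &params);
if (ret < 0)
    /* pre-5.11 kernels need elevated privileges for SQPOLL */
    fprintf(stderr, "sqpoll setup failed: %s\n", strerror(-ret));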

When io_uring is not the right choice

  • If your app is CPU-bound, io_uring does not help you; it cuts I/O submission overhead, not compute.
  • If your app depends on synchronous libraries (OpenSSL can be driven asynchronously, but most apps don’t), your I/O model may not map cleanly onto completion-based semantics.
  • On older kernels (<5.11 or so), bugs were real. On modern kernels (6.x) it is stable in my experience.
  • Your OS security posture may matter: some container runtimes disable io_uring via seccomp. Check runc’s defaults if you are planning to ship this (see the probe sketched below).
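
That last point is cheap to guard against: probe at startup and fall back to epoll. A minimal sketch (use_epoll_fallback is a placeholder for whatever your fallback path is):

struct io_uring probe;
int ret = io_uring_queue_init(8, &probe, 0);
if (ret < 0) {
    /* typically -EPERM when seccomp blocks io_uring_setup,
     * -ENOSYS on kernels without io_uring */
    use_epoll_fallback();
} else {
    io_uring_queue_exit(&probe);  /* io_uring works; use the uring path */
}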

Reflection

Benchmarks are a great way to learn what a feature actually does. The “1.8x win on localhost” number would have been a very misleading pull quote if I had published that alone. Your workload is your workload. If you want to know whether io_uring helps, port the hot path, instrument, batch correctly, and measure under realistic conditions.

Related: see my post on finding TCP retransmits with bpftrace for another “measure what the kernel is actually doing” tool.