I spent a weekend last month convincing myself that a lookup-table-based decoder was faster than a branch-based one, only to discover that the benchmark was being ruined by my CPU’s branch predictor. Here’s the story.

The problem: decoding a stream of small integers encoded with a variable-length scheme. Each byte carries 7 bits of payload; the high bit is a continuation flag, set when more bytes follow and clear on the byte that terminates the value. Classic varint.
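
For reference, the encoding side looks roughly like this (a minimal sketch; encode_varint is a placeholder name, not code from the actual system):

size_t encode_varint(uint64_t v, uint8_t *out) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  // low 7 bits, continuation bit set
        v >>= 7;
    }
    out[n++] = (uint8_t)v;               // final byte, high bit clear
    return n;                            // bytes written (at most 10 for 64 bits)
}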

Version A, branch-based:

uint64_t decode_branch(const uint8_t *buf, size_t *pos) {
    uint64_t v = 0;
    int shift = 0;
    while (1) {
        uint8_t b = buf[(*pos)++];
        v |= ((uint64_t)(b & 0x7f)) << shift;  // fold in 7 payload bits
        if ((b & 0x80) == 0) break;            // high bit clear: last byte
        shift += 7;
    }
    return v;
}

Version B, table-driven, with a lookup table indexed by byte:

// lookup table: for each possible byte, stores (value & 0x7f) and (has_continuation)
struct entry { uint8_t val; uint8_t cont; };
struct entry table[256];

uint64_t decode_table(const uint8_t *buf, size_t *pos) {
    uint64_t v = 0;
    int shift = 0;
    struct entry e;
    do {
        e = table[buf[(*pos)++]];  // payload bits + continuation flag in one load
        v |= ((uint64_t)e.val) << shift;
        shift += 7;
    } while (e.cont);
    return v;
}
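
One thing the listing above leaves out: the table has to be populated before first use. A minimal init, following the comment on the struct (init_table is a placeholder name):

void init_table(void) {
    for (int b = 0; b < 256; b++) {
        table[b].val  = (uint8_t)(b & 0x7f);        // the 7 payload bits
        table[b].cont = (uint8_t)((b & 0x80) != 0); // 1 if more bytes follow
    }
}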

My microbenchmark generated a buffer of random varints, then decoded them in a loop and summed. Version B was 1.3x faster than Version A. I was about to deploy it when my colleague Aaron asked me to test with real data.

Real data has a distribution: most values are small (fit in one byte), and the larger ones tend to be correlated within a narrow range. A realistic buffer might be 80% single-byte values, 18% two-byte, and 2% three-or-more. My random buffer, by contrast, drew each byte uniformly at random, which yields a geometric length distribution: about 50% single-byte, 25% two-byte, and so on.
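
For concreteness, here's roughly how the two input shapes could be generated; make_buffer and the exact percentages are illustrative rather than my actual harness, and it reuses the encode_varint sketch from above:

#include <stdint.h>
#include <stdlib.h>

// Fill buf with varints whose encoded lengths follow the given cumulative
// percentages, e.g. {80, 98, 100} for 80% 1-byte, 18% 2-byte, 2% 3-byte.
size_t make_buffer(uint8_t *buf, size_t cap, size_t count, const int cum_pct[3]) {
    size_t n = 0;
    for (size_t i = 0; i < count && n + 10 <= cap; i++) {
        int r = rand() % 100;
        int len = (r < cum_pct[0]) ? 1 : (r < cum_pct[1]) ? 2 : 3;
        // pick a value that needs exactly len bytes (7 payload bits per byte)
        uint64_t lo = (len == 1) ? 0 : (1ull << (7 * (len - 1)));
        uint64_t hi = (1ull << (7 * len)) - 1;
        uint64_t v = lo + (uint64_t)rand() % (hi - lo + 1);
        n += encode_varint(v, buf + n);
    }
    return n;  // bytes actually written
}

Passing {80, 98, 100} approximates the production shape; {50, 75, 100} roughly approximates the uniform-byte shape.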

When I reran with realistic distribution, Version A was faster. I flipped back to uniform, Version B was faster. What the hell was going on?

The culprit is the branch predictor. The branch in Version A is if ((b & 0x80) == 0) break;. With a uniform-byte distribution, this branch has almost no pattern — sometimes the byte has the high bit set, sometimes it doesn’t, and the sequence is effectively random. The predictor is wrong a lot, and every mispredict costs ~15-20 cycles.

With realistic data (80% single-byte values), the exit branch resolves the same way 80% of the time, a bias the predictor learns within a handful of iterations. Mispredicts drop to a few percent, and overall decode speed improves dramatically.

Version B still has a data-dependent branch (while (e.cont)), so it doesn't escape prediction entirely. The table itself is tiny (512 bytes) and stays hot in L1, but the load sits on the critical path ahead of the branch, so every iteration pays for the indirection and a mispredict can't resolve until the load completes. The net effect: Version B's performance is more stable across inputs, but it has a higher floor.

I wish I had known this when I was writing the benchmark. Instead, I wrote a naive benchmark, got what looked like a clear answer, and was one rubber-stamp review away from shipping the wrong thing.

What I did next:

  1. Used realistic inputs. I dumped a few hundred MB of our actual varint streams from prod, copied the file to my benchmark machine, and replayed it. Numbers matched what I’d expect in production.

  2. Used perf stat to see the actual branch mispredict rate. On Linux:

perf stat -e branches,branch-misses ./bench

This shows you exactly how many branches were executed and how many were mispredicted. Version A on uniform data had a 28% mispredict rate; on realistic data, 3%. Version B sat consistently around 4-5% because it executes fewer branches, though the one it keeps is just as data-dependent.

  3. Tested on multiple input shapes: uniform, realistic, worst-case (all multi-byte values), best-case (all single-byte values). Which decoder won changed based on which case you cared about (timing sketch below).
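
The timing loop itself was nothing fancy. A rough sketch (bench_shape is illustrative; summing the decoded values keeps the compiler from optimizing the work away):

#include <stdio.h>
#include <time.h>

// Decode an entire buffer once and return throughput in MB/s.
double bench_shape(uint64_t (*decode)(const uint8_t *, size_t *),
                   const uint8_t *buf, size_t len) {
    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t pos = 0; pos < len; )
        sum += decode(buf, &pos);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    fprintf(stderr, "sum=%llu\n", (unsigned long long)sum);  // keep sum live
    return (double)len / secs / 1e6;
}

Call it once per decoder per shape, e.g. bench_shape(decode_branch, buf, n) versus bench_shape(decode_table, buf, n).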

The fundamental issue is that modern CPUs have very good branch predictors, and microbenchmarks with unrealistic inputs often underestimate branchy code. Your production workload probably has exploitable patterns — user IDs that come in bursts, request types that correlate, small values that dominate — and the predictor will find them. A “branchless” version that looks great in a microbench may be slower in production because it forgoes the prediction benefit.

Conversely, if your workload is truly random (cryptographic data, hashed indices, white noise), branch-heavy code can be genuinely slow, and branchless wins.

A few other tools and tricks from that afternoon:

Verify with perf stat on any real benchmark. The base metrics — branches, branch-misses, cycles, instructions — tell you a lot about why one version beats another.
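
For example, grabbing those four counters in one run:

perf stat -e branches,branch-misses,cycles,instructions ./bench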

Test on different CPUs. Branch predictors differ: an Apple M1 predicts differently than a Xeon Skylake, which predicts differently than a Zen 4. What works best on one CPU family can lose outright on another.

If your code is branch-heavy in a hot loop, lean on __builtin_expect (GCC/Clang) or the usual likely()/unlikely() wrapper macros where applicable. Better yet, structure the code so the common case is the straight-through path (sketched here as a hypothetical decode_fastpath):

// better: the common case goes straight through
uint64_t decode_fastpath(const uint8_t *buf, size_t *pos) {
    uint8_t b = buf[(*pos)++];
    if (__builtin_expect((b & 0x80) == 0, 1)) {
        // single-byte fast path
        return b & 0x7f;
    }
    // slow path: multi-byte, continue as in Version A
    ...
}

SIMD. For long streams of varints, there's a published technique using AVX2 that decodes ~8 varints in parallel by masking. We didn't need it; Version A with realistic data was plenty fast. But if you really need throughput, SIMD varint decoding exists and can be around 10x faster.

The meta-lesson: the CPU is doing a lot of work to make your naive code fast, and microbenchmarks that use unrealistic inputs can tell you the wrong story about what’s fast. When in doubt, use real data, and check the hardware counters.

See also my post on pprof and assembly, which is the Go-specific version of this kind of investigation.