I spent a good chunk of last year rewriting parts of an encoding library for speed, and during that time I wrote, ran, and misinterpreted a lot of benchmarks. This is a collection of ways go test -bench has lied to me and how I learned to avoid each.

Mistake 1: the compiler optimizing away your benchmark.

func BenchmarkHash(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Hash([]byte("hello world"))
    }
}

If Hash is a pure function and its result is never used, the Go compiler might eliminate the call entirely. You’ll see absurdly fast numbers: fractions of a nanosecond per op, which is really just loop overhead. The fix is to sink the result into a package-level variable the compiler can’t reason about:

var hashResult uint64

func BenchmarkHash(b *testing.B) {
    var h uint64
    for i := 0; i < b.N; i++ {
        h = Hash([]byte("hello world"))
    }
    hashResult = h
}

The trick: hashResult is package-level, so the compiler can’t prove it’s dead. You’re forcing the result to be “observed.”

Mistake 2: allocating inside the timer.

func BenchmarkEncode(b *testing.B) {
    msg := &BigProto{Fields: makeFields(10000)}
    for i := 0; i < b.N; i++ {
        buf, _ := proto.Marshal(msg) // Marshal returns ([]byte, error)
        _ = buf
    }
}

If makeFields is expensive, you want it out of the timed loop. Use b.ResetTimer():

func BenchmarkEncode(b *testing.B) {
    msg := &BigProto{Fields: makeFields(10000)}
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        buf, _ := proto.Marshal(msg)
        _ = buf
    }
}

Without ResetTimer, your benchmark includes the setup time, which skews everything.

Mistake 3: same input, cold vs warm cache.

Benchmarking Hash([]byte("hello world")) a billion times is not the same as hashing a billion different strings. Your CPU cache is warm; branch predictors have learned; memory is hot. Real-world code hashes diverse inputs. If you want representative numbers, vary your input:

var hashSink uint64 // sink, per mistake 1

func BenchmarkHash(b *testing.B) {
    inputs := make([][]byte, 1024)
    for i := range inputs {
        inputs[i] = []byte(fmt.Sprintf("key-%d", i))
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        hashSink = Hash(inputs[i%len(inputs)])
    }
}

This is still not realistic (a real workload has a skewed distribution), but it’s closer.

Mistake 4: not using b.ReportAllocs() when you should.

For allocation-sensitive code, b.ReportAllocs() (or the -benchmem flag) gives you allocs/op and bytes/op. A benchmark that’s fast in time but allocates 20KB per op is not actually fast — you’re pushing GC work forward to other code.

func BenchmarkMarshal(b *testing.B) {
    b.ReportAllocs()
    // ...
}

Mistake 5: relying on a single run.

Benchmark noise is real. Run with -count=10 and use benchstat (from golang.org/x/perf/cmd/benchstat) to compare:

go test -bench=. -count=10 > old.txt
# make changes
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt

benchstat computes confidence intervals and p-values. If it reports something like “p=0.089 (n=10)”, your “improvement” might be noise. I ignore any change with p > 0.05.

Mistake 6: benchmarking in an unpinned environment.

On a laptop with thermal throttling and CPU frequency scaling, your first 10 seconds of benchmarking can run at 3.5GHz and the next 10 at 2.1GHz because the CPU got hot. Turn off frequency scaling if you can; on Linux, use the performance governor. On macOS I’ve stopped trying to get perfect numbers from my laptop; I run benchmarks in CI on a dedicated machine.
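On Linux, the pinning before a serious run looks roughly like this (assumes the cpupower and taskset tools are installed, which varies by distro; the core number is illustrative):

```shell
sudo cpupower frequency-set -g performance          # lock the governor, no scaling
taskset -c 2 go test -run='^$' -bench=. -count=10   # pin the run to core 2
```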

Mistake 7: the loop counter isn’t free.

func BenchmarkAdd(b *testing.B) {
    var x int
    for i := 0; i < b.N; i++ {
        x += i
    }
    _ = x
}

This benchmarks addition, but also the loop overhead. For things that take 1 or 2 nanoseconds, the loop itself can be a significant fraction. Use b.N to your advantage — do more work per iteration:

for i := 0; i < b.N; i += 4 {
    x += i
    x += i + 1
    x += i + 2
    x += i + 3
}

Or better, benchmark something that takes at least 10-20ns so the loop is negligible.

Mistake 8: sub-benchmarks that share state.

func BenchmarkCache(b *testing.B) {
    cache := newCache()
    b.Run("Small", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            cache.Put("k", smallValue())
        }
    })
    b.Run("Large", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            cache.Put("k", largeValue())
        }
    })
}

By the time Large runs, the cache has already absorbed every Put from Small, so the two sub-benchmarks see different state, and reordering them changes the numbers. Re-create state inside each sub-benchmark if it matters:

b.Run("Small", func(b *testing.B) {
    cache := newCache() // fresh state for this sub-benchmark
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        cache.Put("k", smallValue())
    }
})

Mistake 9: forgetting to pass -run=^$.

If you run go test -bench=., the tests ALSO run, before the benchmarks. That adds time and can warm the package up (caches, lazily initialized state) before measurement starts. For clean benchmarks:

go test -run=^$ -bench=. -benchmem -count=10

-run=^$ is a regex that matches no test.

Mistake 10: comparing numbers across Go versions without saying so.

The Go compiler gets better every release. A benchmark that showed “X is 2x faster than Y” on Go 1.17 might show “X is 1.1x faster” on Go 1.21 because the compiler got better at Y. Always record your Go version; I bake it into my saved benchmark output as a comment line at the top of the file.

None of this is exotic. It’s mostly just discipline. Benchmarks that are set up carelessly give you numbers you don’t understand; numbers you don’t understand lead to optimizations that don’t actually help. I’ve lost whole afternoons chasing “improvements” that were noise, or that came from the compiler eliminating my benchmark loop. A little paranoia goes a long way.