I spent a week vectorizing a checksum routine in Rust last month, and I came away with a genuine soft spot for std::simd and for Rust’s broader SIMD story. It’s still nightly-only, which is an obstacle, but the ergonomics are good, the codegen is good, and once you’ve done it once it’s way less scary than it looks.

The problem: we had a custom streaming checksum over byte buffers, basically a 32-bit running sum with some mixing:

fn checksum(bytes: &[u8]) -> u32 {
    let mut sum: u32 = 0;
    for &b in bytes {
        sum = sum.wrapping_add(b as u32);
        sum = sum.wrapping_mul(0x01000193); // FNV-style
    }
    sum
}

This ran at about 400 MB/s on my machine. It’s the kind of workload where you’d hope for multi-GB/s with SIMD.
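
For context, here's a minimal sketch of the kind of wall-clock harness that produces numbers like this (not the exact one behind these measurements); it just runs the scalar checksum above over a large buffer and divides bytes by seconds:

use std::time::Instant;

fn main() {
    // 256 MiB of filler input; big enough that timer noise doesn't matter.
    let data = vec![0xA5u8; 1 << 28];
    let start = Instant::now();
    // black_box keeps the optimizer from outsmarting the measurement.
    let sum = checksum(std::hint::black_box(&data));
    let secs = start.elapsed().as_secs_f64();
    println!("checksum = {sum:#010x}, {:.0} MB/s", data.len() as f64 / 1e6 / secs);
}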

The three paths.

Rust has three ways to write SIMD:

  1. std::arch intrinsics: platform-specific, #[cfg(target_arch = "x86_64")], hand-written for each architecture. What you’d write in C with <immintrin.h>.
  2. std::simd (nightly): portable SIMD types like u32x8. Rust lowers these to the best instructions for your target.
  3. packed_simd / wide crates: third-party takes on the same idea. packed_simd is the now-unmaintained precursor of std::simd; wide gives you similar types on stable.

For a one-off like this, I reached for std::simd:

#![feature(portable_simd, slice_as_chunks)] // as_chunks is still feature-gated on this toolchain
use std::simd::Simd; // cast() is an inherent method, so no extra traits needed here

fn checksum_simd(bytes: &[u8]) -> u32 {
    const LANES: usize = 16;
    // as_chunks returns (chunks, remainder)
    let (chunks, suffix) = bytes.as_chunks::<LANES>();
    // chunks: &[[u8; 16]]

    let mut sum = Simd::<u32, LANES>::splat(0);
    let mult = Simd::<u32, LANES>::splat(0x01000193);

    for chunk in chunks {
        let v = Simd::<u8, LANES>::from_array(*chunk);
        let v32: Simd<u32, LANES> = v.cast();
        sum = (sum + v32) * mult;
    }

    // horizontal reduce: combine the LANES lanes
    let lanes = sum.to_array();
    let mut total = 0u32;
    for l in lanes {
        total = total.wrapping_add(l).wrapping_mul(0x01000193);
    }
    for &b in suffix {
        total = total.wrapping_add(b as u32).wrapping_mul(0x01000193);
    }
    total
}

A few subtleties:

  • The semantic change. My SIMD version computes 16 parallel partial checksums, then combines them. This isn’t the same value as the sequential checksum. For a checksum, that’s fine — it just has to be deterministic and collision-resistant, not compatible with the scalar version. (The scalar model sketched after this list spells out exactly what the SIMD version does compute.)
  • as_chunks::<N>() splits a slice into fixed-size array chunks plus a remainder. On the nightly I used it still sits behind the slice_as_chunks feature; chunks_exact gives you the same shape on stable.
  • cast() does a lane-wise type conversion. u8 to u32 is a zero-extension.
  • The horizontal reduce at the end is doing the same checksum across the 16 lane results. There are other ways to combine (XOR, just add, etc.), but whatever combiner you pick needs to preserve the checksum’s properties.
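
To make that first bullet concrete, here's a scalar model of what checksum_simd computes (a sketch for intuition; it should agree with the SIMD version, assuming I've transcribed the recurrence faithfully). Lane i accumulates bytes i, i + 16, i + 32, and so on from the chunked prefix, then the lane results and the leftover bytes get the same mixing pass:

fn checksum_simd_model(bytes: &[u8]) -> u32 {
    let mut lanes = [0u32; 16];
    let chunked_len = bytes.len() / 16 * 16;

    // Each byte of the chunked prefix feeds the accumulator for its lane.
    for (i, &b) in bytes[..chunked_len].iter().enumerate() {
        let l = i % 16;
        lanes[l] = lanes[l].wrapping_add(b as u32).wrapping_mul(0x01000193);
    }

    // Same reduce + suffix handling as checksum_simd.
    let mut total = 0u32;
    for l in lanes {
        total = total.wrapping_add(l).wrapping_mul(0x01000193);
    }
    for &b in &bytes[chunked_len..] {
        total = total.wrapping_add(b as u32).wrapping_mul(0x01000193);
    }
    total
}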

Performance: ~4.2 GB/s, roughly 10x the scalar version. On my M2 (via std::simd lowered to ARM NEON) it’s about 8x. Not bad for a port that took an afternoon.

Verifying the codegen.

I used cargo asm (from cargo-show-asm) to look at what actually got generated:

cargo asm --simplify checksum_simd

The inner loop had vpaddd and vpmulld instructions — that’s AVX2 32-bit add and multiply. If I’d compiled without -C target-cpu=native, it would have fallen back to SSE2 versions. On an ARM target, it uses add.4s and mul.4s NEON instructions.

Things that surprised me.

Compile-time checking of lane widths. If you try to operate on Simd<u32, 4> and Simd<u32, 8>, it doesn’t compile. The lane count is part of the type. This caught a bug where I’d accidentally used LANES=16 for one operation and LANES=8 for another.
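
For example (a toy snippet, not from the checksum code), uncommenting the addition below is a type error, not a runtime surprise:

#![feature(portable_simd)]
use std::simd::Simd;

fn mismatched_lanes() {
    let a = Simd::<u32, 4>::splat(1);
    let b = Simd::<u32, 8>::splat(2);
    // let c = a + b; // error[E0308]: mismatched types (Simd<u32, 4> vs Simd<u32, 8>)
    let _ = (a, b);
}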

Horizontal reductions are slow. sum.reduce_sum() works, but it compiles to a sequence of shuffle-and-add, which is significantly slower than the per-lane operations. If you have a choice, accumulate in SIMD lanes and reduce at the end, not in the loop.
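
A minimal sketch of the shape I mean (trait paths have moved between nightlies, so I'm using the prelude here):

#![feature(portable_simd)]
use std::simd::prelude::*;

// Accumulate per lane inside the loop; reduce exactly once afterwards.
// Calling acc.reduce_sum() on every iteration would emit a shuffle-and-add
// chain each time instead.
fn sum_all(words: &[[u32; 8]]) -> u32 {
    let mut acc = Simd::<u32, 8>::splat(0);
    for w in words {
        acc += Simd::from_array(*w); // cheap per-lane add
    }
    acc.reduce_sum() // one horizontal reduction, total
}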

Unaligned loads are mostly fine. On x86_64, vmovdqu (unaligned load) is the same speed as vmovdqa (aligned load) on modern CPUs. On ARM, unaligned loads are also fine. I don’t bother forcing alignment for typical use.

target-cpu=native is a must for development benchmarks. Without it, the compiler targets a baseline x86-64 which means SSE2 only. My development machine has AVX-512 — using it requires either -C target-cpu=native or -C target-feature=+avx512f. For production builds, you probably want to target a specific baseline (e.g., x86-64-v3) rather than native, since your users might not have the same CPU as the build machine.
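
One way to pass it for a local benchmark run (a [build] rustflags entry in .cargo/config.toml is the persistent equivalent):

RUSTFLAGS="-C target-cpu=native" cargo bench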

Autovectorization is surprisingly good. I tried compiling the original scalar version with -C opt-level=3 and -C target-cpu=native just to see. It was about 800 MB/s — 2x faster than the naive build, with no code changes. LLVM’s autovectorizer is quite good for simple loops. My hand-written SIMD still wins (4.2 GB/s vs 800 MB/s) because the autovectorizer is conservative about the FNV-like multiply chain, but it’s closer than you’d expect.

What I didn’t vectorize.

Code with branches. The overhead of SIMD-masking branches — computing both sides and blending — is usually worth it only when the branches are unpredictable AND the code is simple. For our checksum, there were no branches in the inner loop, so this didn’t come up.
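
To make "computing both sides and blending" concrete, here's a toy (hypothetical) example with portable SIMD; nothing like it appears in the checksum, it's just what masking looks like:

#![feature(portable_simd)]
use std::simd::prelude::*;

// Every lane pays for both arms; the mask picks which result survives.
fn add_then_cap(a: Simd<u32, 8>, b: Simd<u32, 8>) -> Simd<u32, 8> {
    let sum = a + b;                 // arm 1: the add
    let cap = Simd::splat(1000);     // arm 2: the cap
    let over = sum.simd_gt(cap);     // per-lane "did we exceed the cap?"
    over.select(cap, sum)            // blend: take the cap where the mask is true
}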

Code with variable-length dependencies. If each iteration depends on the last, SIMD is out. That’s why I had to restructure the checksum to compute parallel partial sums — the sequential version’s dependency chain killed parallelism.

Code that’s not actually hot. SIMD is worth writing for code that runs billions of times a day. For a one-off parser, just write the scalar version.

Is std::simd stable yet? As of Rust 1.77 (March 2024), it’s still nightly-only. The wide crate (and similar stable shims of the portable-SIMD idea) lets you write similar code on stable, with slightly different APIs. For production code I don’t want to ship on nightly, I’d use wide or fall back to std::arch intrinsics with #[cfg] gates (sketched below).
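
For reference, here's roughly what that std::arch fallback path looks like. This is a hedged sketch, not code from this project: checksum_avx2 and checksum_dispatch are made-up names, it processes 8 lanes per iteration instead of 16, and so it yields yet another (still deterministic) value.

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn checksum_avx2(bytes: &[u8]) -> u32 {
    use std::arch::x86_64::*;

    let mult = _mm256_set1_epi32(0x01000193);
    let mut sum = _mm256_setzero_si256();

    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        // Load 8 bytes and zero-extend them to 8 x u32.
        let v8 = _mm_loadl_epi64(chunk.as_ptr() as *const __m128i);
        let v32 = _mm256_cvtepu8_epi32(v8);
        sum = _mm256_mullo_epi32(_mm256_add_epi32(sum, v32), mult);
    }

    // Reduce the 8 lanes the same way as the portable version.
    let lanes: [u32; 8] = std::mem::transmute(sum);
    let mut total = 0u32;
    for l in lanes {
        total = total.wrapping_add(l).wrapping_mul(0x01000193);
    }
    for &b in chunks.remainder() {
        total = total.wrapping_add(b as u32).wrapping_mul(0x01000193);
    }
    total
}

fn checksum_dispatch(bytes: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: we just verified AVX2 is available at runtime.
            return unsafe { checksum_avx2(bytes) };
        }
    }
    checksum(bytes) // scalar fallback everywhere else
}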

Bonus tip: std::simd::u8x32::from_slice(&bytes[0..32]) gives you a load from the front of the slice. Useful when chunking manually.
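
A tiny (made-up) example of that manual-chunking pattern; from_slice reads the first 32 bytes and panics if the slice is shorter, so chunks_exact keeps it in range:

#![feature(portable_simd)]
use std::simd::u8x32;

// OR together every full 32-byte block of the input.
fn or_of_blocks(bytes: &[u8]) -> u8x32 {
    let mut acc = u8x32::splat(0);
    for chunk in bytes.chunks_exact(32) {
        acc |= u8x32::from_slice(chunk);
    }
    acc
}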

SIMD is one of those things that feels mystical until you sit down with a real example. After one checksum, a second routine (a UTF-8 validator) took me half a day. After that, I just reach for it whenever I have a hot loop over bytes.

Related: the branch predictor post covers the flip side — sometimes autovectorization is NOT what you want because you lose the branch predictor’s help on realistic data.