A colleague, Dani, asked me to look at a hash table implementation that was benchmarking well in isolation but was a bottleneck when embedded in a real service. In microbenchmarks it was close to the best-published general-purpose hash tables. In situ, it was 5x slower than std::unordered_map, which nobody would accuse of being fast. I spent an afternoon with perf and came away with a much better intuition for CPU caches than I’d had before.

The structure was an open-addressing table with linear probing. The design's supposedly cache-friendly idea was to store keys and values in separate parallel arrays, so that scanning keys during a lookup would only touch the key array, and the value array would only be accessed on a hit.

typedef struct {
    uint32_t *keys;      // sized 2^n
    void **values;       // same size
    size_t capacity;
    size_t count;
} HashTable;

void* table_lookup(HashTable *t, uint32_t key) {
    size_t mask = t->capacity - 1;      // capacity is a power of two
    size_t i = hash(key) & mask;
    while (t->keys[i] != 0) {           // key 0 marks an empty slot
        if (t->keys[i] == key) {
            return t->values[i];        // hit: a second fetch, from the value array
        }
        i = (i + 1) & mask;             // linear probe
    }
    return NULL;                        // reached an empty slot: not present
}

In microbenchmarks with one table and one query pattern, this is fine. In the real service, there were 40 different tables being queried in an interleaved fashion, and the cache miss rate was 11%. Let me explain why that’s a disaster.

A modern CPU has roughly:

  • L1 data cache: 32-48 KB, ~1 ns access
  • L2 cache: 256 KB - 1 MB, ~3-10 ns access
  • L3 cache: 4-32 MB shared, ~10-30 ns access
  • Main memory: ~100-200 ns access

A hash table lookup that hits L1 cache is fast. A lookup that misses to L3 is 10-30x slower. A lookup that misses to main memory is 100-200x slower than an L1 hit. The difference between “fast” and “slow” is almost entirely determined by where the data lives.
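
To put a number on that 11% miss rate: with a hit at ~1 ns and a miss at ~100 ns (pessimistically assuming every miss goes all the way to DRAM), the average load costs roughly 0.89 × 1 ns + 0.11 × 100 ns ≈ 12 ns. That is an order of magnitude slower than staying in L1, from missing on just one load in nine.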

For a single hash table with a working set smaller than L1, you’ll always hit L1, and the theoretical design advantages matter. But with 40 interleaved tables, each several MB in size, nothing stays in L1. Every lookup is a gamble with the CPU cache.

Dani’s design made this worse: with keys and values in separate arrays, even a successful lookup had to touch TWO cache lines in two different memory regions, one for the key and one for the value. Every hit cost two cache-line fetches.

I proposed switching to a slot-based layout:

typedef struct {
    uint32_t key;
    void *value;
} Slot;

typedef struct {
    Slot *slots;
    size_t capacity;
    size_t count;
} HashTable;

void* table_lookup(HashTable *t, uint32_t key) {
    size_t mask = t->capacity - 1;
    size_t i = hash(key) & mask;
    while (t->slots[i].key != 0) {        // key 0 still marks an empty slot
        if (t->slots[i].key == key) {
            return t->slots[i].value;     // key and value share a cache line
        }
        i = (i + 1) & mask;
    }
    return NULL;
}

Each slot is 16 bytes on a 64-bit target: a 4-byte key, 4 bytes of padding, and an 8-byte pointer. A cache line is 64 bytes, so four slots fit per line. A lookup that walks a collision chain can examine up to four slots for each cache line it fetches. And a hit reads the key AND the value from the same line, so the value read is essentially free.
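
The 16-byte figure is worth pinning down rather than trusting, since it depends on padding and the target ABI. A minimal way to check it at compile time (C11 _Static_assert, assuming a typical 64-bit target):

#include <stdint.h>

// Slot as above: a 4-byte key, 4 bytes of compiler-inserted padding,
// then an 8-byte pointer on a 64-bit target.
typedef struct {
    uint32_t key;
    void *value;
} Slot;

// Fail the build if the layout assumption behind "four slots per
// 64-byte cache line" ever stops holding.
_Static_assert(sizeof(Slot) == 16, "Slot expected to be 16 bytes");
_Static_assert(64 % sizeof(Slot) == 0, "slots expected to tile a cache line");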

Benchmarking against the same real service: about 2.5x faster. Still not as fast as I’d hoped, but a large improvement. perf stat -e cache-references,cache-misses showed the miss rate drop from 11% to 3.5%.

The other optimization: reducing the hash table count. Forty separate tables means each table’s metadata (and some hot slots) competes for cache. When we could, we merged small tables into one larger one keyed by (type tag, key). This reduced the number of cache-hot regions and let the CPU cache what was actually hot.
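
The merge itself can be as simple as widening the key. A sketch of the shape of it (assuming 32-bit keys and a small integer type tag; it implies a table keyed by uint64_t):

#include <stdint.h>

// Pack (type tag, key) into a single 64-bit key so several small
// per-type tables can live in one larger table. The tag sits in the
// high 32 bits, the original key in the low 32 bits. The 0-means-empty
// sentinel needs the same care as before.
static inline uint64_t merged_key(uint32_t type_tag, uint32_t key) {
    return ((uint64_t)type_tag << 32) | (uint64_t)key;
}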

Things I’ve internalized about caches:

Locality is king. Accessing things close together in memory is fast. Accessing things scattered across memory is slow. Data structure layout matters more than algorithmic complexity for small-ish working sets.

Cache line size is 64 bytes on x86-64 and on ARM server chips like Graviton; Apple Silicon uses 128-byte lines. If you care about performance, you care about cache lines. Struct layout matters. Align your structures deliberately.
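
Here’s a concrete example of the layout slack that’s easy to leave on the table (sizes assume a typical 64-bit ABI):

#include <stdint.h>

// Fields ordered carelessly: 24 bytes, because the pointer forces
// 7 bytes of padding after the flag and the struct gets padded to a
// multiple of 8 at the end.
struct Loose {
    uint8_t  flag;
    void    *ptr;
    uint32_t id;
};

// Same fields, widest first: 16 bytes. Four per cache line instead
// of two.
struct Tight {
    void    *ptr;
    uint32_t id;
    uint8_t  flag;
};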

False sharing is real. Two threads writing to different variables that share a cache line will contend. For variables written by different threads, use alignas(64) to put them on separate cache lines.

struct Counters {
    alignas(64) std::atomic<uint64_t> reads;
    alignas(64) std::atomic<uint64_t> writes;
};

Without the alignment, reads and writes share a cache line. With contention on either, both suffer. Putting them on separate lines fixes it.

Prefetching helps linear scans. If you’re walking a large array sequentially (or with a small constant stride), the prefetcher figures it out. If you’re chasing pointers or loading through an array of pointers, it can’t see the next address, and you eat the full miss cost. __builtin_prefetch(&arr[i + 8]) can manually hint the CPU a few iterations ahead. In my experience, this rarely helps unless you’ve already optimized everything else.
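
For what it’s worth, here’s the kind of access pattern where a manual prefetch can plausibly pay for itself: walking an array of pointers, where the prefetcher can follow the array but not the objects it points at. A sketch (the distance of 8 is a tuning knob, not a magic number):

#include <stddef.h>
#include <stdint.h>

// Sum values reached through an array of pointers. The address of
// items[i] is predictable, but *items[i] is scattered across the heap,
// so the hardware prefetcher can't help with the second load.
// Prefetching a few iterations ahead hides part of that miss.
uint64_t sum_indirect(const uint64_t *const *items, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n) {
            __builtin_prefetch(items[i + 8]);
        }
        sum += *items[i];
    }
    return sum;
}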

L1 instruction cache matters too. A hot function with a large body that doesn’t fit in L1i runs slower than a small function that does. Monomorphized generic code can bloat to the point of evicting itself. Measure binary size and instruction cache miss rates if you’re paranoid.

Tools:

  • perf stat -e cache-misses,cache-references on Linux
  • Instruments on macOS (less detailed but has a cache-line view)
  • likwid-perfctr if you want per-function cache statistics

I don’t reach for cache-level analysis on every perf problem. Most bugs are in higher-level things: algorithmic complexity, inefficient allocations, contention. But when you’ve optimized those and you’re still slow, cache behavior is usually the remaining lever. And it’s surprising how often a simple layout change — slot-based instead of parallel arrays — can get a 2-3x win.

See also the branch predictor post for another way the CPU can make your benchmarks lie.