The pprof graph that taught me to read assembly
Most of the time, a pprof graph tells you a clear story: some function is taking up 40% of your CPU, you click in, and it’s a hash computation or a JSON encode or a regex you didn’t notice you were compiling per request. You fix it, and you move on. I had one case last year that wasn’t like that. The hot function was six lines long, mostly a map lookup, and pprof said it was taking a third of my CPU. I didn’t believe it, and in trying to figure out why, I learned a fair amount about reading Go’s generated assembly.
The code looked like this:
type Classifier struct {
	rules map[string]int
}

func (c *Classifier) Classify(ctx context.Context, req *Request) int {
	key := fmt.Sprintf("%s:%s:%d", req.Tenant, req.Route, req.Version)
	if v, ok := c.rules[key]; ok {
		return v
	}
	return c.fallback(req)
}
Classify was 32% of CPU. At first I assumed it was the fmt.Sprintf. That alone shouldn't be 32%, but it should be a chunk: Sprintf allocates and parses the format string on every call. So I replaced it with plain concatenation:
key := req.Tenant + ":" + req.Route + ":" + strconv.Itoa(req.Version)
CPU dropped a couple of percent, not much; Classify was still around 28%. That's when I pulled up the pprof disasm view:
go tool pprof -http=:9090 cpu.pprof
# in the UI: View -> Disassembly, or cli: disasm Classify
The disassembly showed something I wasn't expecting. The hot instructions were inside the map lookup itself: the string hash computation, and runtime.mapaccess2_faststr, Go's specialized map-access routine for string keys. Map lookups are not free, but they shouldn't be 28%.
So I ran go tool pprof -alloc_objects to see where the allocations came from. The winner was, surprisingly, the key-building line. Not strconv.Itoa itself, which usually doesn't allocate for small ints, but the string concatenation, which builds a new string on every call. And the allocation is only half the cost: the map lookup then hashes that string, which takes time proportional to its length. I was allocating and hashing a ~40-byte string on every classification.
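A minimal sketch of that allocation, using testing.Benchmark and hypothetical tenant/route values (not the production code); the sink variable forces the key to escape so the per-call heap allocation shows up in the count:

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

var sink string // forces the key to escape so the allocation is visible

// buildKey mirrors the concatenation in Classify (hypothetical inputs).
func buildKey(tenant, route string, version int) string {
	return tenant + ":" + route + ":" + strconv.Itoa(version)
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = buildKey("acme-corp", "/api/v1/orders", 7)
		}
	})
	// Reports a heap allocation on every call, just to build the lookup key.
	fmt.Println("allocs/op:", res.AllocsPerOp())
}
```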
The fix was to precompute per-request. Since req has the same values for the life of the request, I could compute the classification key once and cache it:
func (req *Request) cacheKey() string {
	if req.cachedKey != "" {
		return req.cachedKey
	}
	req.cachedKey = req.Tenant + ":" + req.Route + ":" + strconv.Itoa(req.Version)
	return req.cachedKey
}
But the more interesting fix, and the one that actually landed, was restructuring the rules map to not need a string key at all:
type classifierKey struct {
	tenant  string // interned string, reused
	route   string
	version int
}

type Classifier struct {
	rules map[classifierKey]int
}
To be precise about the win: for a struct key, Go generates a hash function that still hashes the string fields' contents, so the hashing work doesn't disappear entirely. What disappears is the per-call allocation: the classifierKey value lives on the stack, and there's no concatenation to build. And because the tenant and route strings are interned and reused, the equality check can short-circuit when two strings share the same length and backing pointer. The map lookup is now hash-plus-compare over data we already have, instead of over a string we just allocated.
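Here's a sketch of how Classify looks with the struct key; the Request fields and the -1 fallback stand-in are assumptions, not the production code:

```go
package main

import "fmt"

type classifierKey struct {
	tenant  string
	route   string
	version int
}

type Request struct {
	Tenant, Route string
	Version       int
}

type Classifier struct {
	rules map[classifierKey]int
}

func (c *Classifier) Classify(req *Request) int {
	// The key is a stack value: no allocation, no concatenation.
	k := classifierKey{req.Tenant, req.Route, req.Version}
	if v, ok := c.rules[k]; ok {
		return v
	}
	return -1 // stand-in for c.fallback(req)
}

func main() {
	c := &Classifier{rules: map[classifierKey]int{
		{"acme", "/orders", 3}: 42,
	}}
	fmt.Println(c.Classify(&Request{Tenant: "acme", Route: "/orders", Version: 3})) // rule hit
	fmt.Println(c.Classify(&Request{Tenant: "acme", Route: "/orders", Version: 4})) // falls back
}
```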
The real win, though, was understanding what pprof was showing me. The disasm view told me that runtime.mapaccess2_faststr was a huge chunk of the self time. The faststr part is important: Go has specialized map-access routines for common key types (faststr, fast32, fast64) plus a generic path. If your CPU is dominated by mapaccess2_faststr, your string keys are the bottleneck, not the map itself. Integer keys get you mapaccess2_fast64; even my struct key, which goes through the generic path, was a big win, because the expensive part was building and hashing the key string rather than the lookup.
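A micro-benchmark sketch of the faststr-vs-fast64 contrast, with hypothetical keys (the int64 imagines tenant ID, route ID, and version packed into one word); absolute numbers vary by machine, the point is that the string map hashes and compares key bytes while the int map hashes a single word:

```go
package main

import (
	"fmt"
	"testing"
)

// Hypothetical rule tables: same logical rule, two key shapes.
var strRules = map[string]int{"acme-corp:/api/v1/orders:7": 1}
var intRules = map[int64]int{(100 << 32) | (200 << 8) | 7: 1}

func main() {
	s := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_, _ = strRules["acme-corp:/api/v1/orders:7"] // mapaccess2_faststr
		}
	})
	n := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_, _ = intRules[(100<<32)|(200<<8)|7] // mapaccess2_fast64
		}
	})
	fmt.Printf("string key: %d ns/op\n", s.NsPerOp())
	fmt.Printf("int64 key:  %d ns/op\n", n.NsPerOp())
}
```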
A few other things I’ve learned to look for in pprof’s disasm view:
- Bounds checks (CMP followed by JAE and a call into runtime.panicIndex): Go bounds-checks every slice access, and the compiler tries to hoist the checks out of loops. If you see repeated bounds checks in a loop, sometimes s := s[:len(s)] before the loop, or a range-with-index rewrite, helps.
- Interface conversions (runtime.convT2Estring, runtime.convT2Enoptr): means you're boxing a concrete type into an interface value somewhere, which allocates.
- Write barriers (runtime.gcWriteBarrier): you're doing a pointer write that the GC needs to observe. Usually fine, but a hot loop with many pointer writes can eat real CPU.
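A small sketch of the bounds-check point, using the up-front-check idiom (the same trick encoding/binary uses): one check on the highest index lets the compiler drop the checks on the loads that follow. On recent toolchains you can see the remaining checks with go build -gcflags='-d=ssa/check_bce'.

```go
package main

import "fmt"

// read32 decodes a little-endian uint32 from b.
func read32(b []byte) uint32 {
	_ = b[3] // single bounds check up front; proves b[0..3] are in range
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

func main() {
	fmt.Printf("0x%08x\n", read32([]byte{0x78, 0x56, 0x34, 0x12})) // 0x12345678
}
```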
Reading Go assembly is surprisingly approachable. The calling convention is simple, the register pressure is manageable, and the one-function-per-disasm-block view in pprof is clear. I don’t claim to write Go assembly, but reading a hot 40-instruction block now takes me a couple of minutes instead of being a wall of noise.
This is the post I’d have given past-me to save myself an afternoon of squinting. Pprof at the function level tells you where. Pprof at the disasm level tells you why. And for hot paths, the why is often a tiny specialization question like “strings vs ints” that you can only see once you’ve looked at the generated code.