The pprof graph that taught me to read assembly
Most of the time, a pprof graph tells you a clear story: some function is taking up 40% of your CPU, you click in, and it’s a hash computation or a JSON encode or a regex you didn’t notice you were compiling per request. You fix it, and you move on. I had one case last year that wasn’t like that. The hot function was six lines long, mostly a map lookup, and pprof said it was taking a third of my CPU. I didn’t believe it, and in trying to figure out why, I learned a fair amount about reading Go’s generated assembly.
The code looked like this:
type Classifier struct {
	rules map[string]int
}

func (c *Classifier) Classify(ctx context.Context, req *Request) int {
	key := fmt.Sprintf("%s:%s:%d", req.Tenant, req.Route, req.Version)
	if v, ok := c.rules[key]; ok {
		return v
	}
	return c.fallback(req)
}
Classify was 32% of CPU. At first I assumed it was the fmt.Sprintf. That alone shouldn't be 32%, but it should be a chunk: Sprintf allocates and parses the format string on every call. So I replaced it with plain concatenation:
key := req.Tenant + ":" + req.Route + ":" + strconv.Itoa(req.Version)
CPU dropped a couple of percent, not much; Classify was still around 28%. That's when I pulled up the pprof disasm view:
go tool pprof -http=:9090 cpu.pprof
# in the UI: View -> Disassembly, or cli: disasm Classify
The disassembly showed something I wasn't expecting. The hot instructions were inside the map lookup itself: the string hash computation, and runtime.mapaccess2_faststr, Go's specialized map-access routine for string keys. Map lookups are not free, but they shouldn't be 28%.
So I ran go tool pprof -alloc_objects to see where the allocations came from. The winner was, surprisingly, the key-building line. Not strconv.Itoa itself, which usually doesn't allocate for small ints, but the string concatenation, which builds a new string on every call. And the allocation is only half the cost: the map lookup then hashes that string, which takes time proportional to its length. I was allocating and hashing a ~40-byte string on every classification.
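A minimal sketch of that allocation, using testing.Benchmark and hypothetical tenant/route values (not the production code); the sink variable forces the key to escape so the per-call heap allocation shows up in the count:

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

var sink string // forces the key to escape so the allocation is visible

// buildKey mirrors the concatenation in Classify (hypothetical inputs).
func buildKey(tenant, route string, version int) string {
	return tenant + ":" + route + ":" + strconv.Itoa(version)
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = buildKey("acme-corp", "/api/v1/orders", 7)
		}
	})
	// Reports a heap allocation on every call, just to build the lookup key.
	fmt.Println("allocs/op:", res.AllocsPerOp())
}
```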
The fix was to precompute per-request. Since req has the same values for the life of the request, I could compute the classification key once and cache it:
func (req *Request) cacheKey() string {
	if req.cachedKey != "" {
		return req.cachedKey
	}
	req.cachedKey = req.Tenant + ":" + req.Route + ":" + strconv.Itoa(req.Version)
	return req.cachedKey
}
But the more interesting fix, and the one that actually landed, was restructuring the rules map to not need a string key at all:
type classifierKey struct {
	tenant  string // interned string, reused
	route   string
	version int
}

type Classifier struct {
	rules map[classifierKey]int
}
To be precise about the win: for a struct key, Go generates a hash function that still hashes the string fields' contents, so the hashing work doesn't disappear entirely. What disappears is the per-call allocation: the classifierKey value lives on the stack, and there's no concatenation to build. And because the tenant and route strings are interned and reused, the equality check can short-circuit when two strings share the same length and backing pointer. The map lookup is now hash-plus-compare over data we already have, instead of over a string we just allocated.
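Here's a sketch of how Classify looks with the struct key; the Request fields and the -1 fallback stand-in are assumptions, not the production code:

```go
package main

import "fmt"

type classifierKey struct {
	tenant  string
	route   string
	version int
}

type Request struct {
	Tenant, Route string
	Version       int
}

type Classifier struct {
	rules map[classifierKey]int
}

func (c *Classifier) Classify(req *Request) int {
	// The key is a stack value: no allocation, no concatenation.
	k := classifierKey{req.Tenant, req.Route, req.Version}
	if v, ok := c.rules[k]; ok {
		return v
	}
	return -1 // stand-in for c.fallback(req)
}

func main() {
	c := &Classifier{rules: map[classifierKey]int{
		{"acme", "/orders", 3}: 42,
	}}
	fmt.Println(c.Classify(&Request{Tenant: "acme", Route: "/orders", Version: 3})) // rule hit
	fmt.Println(c.Classify(&Request{Tenant: "acme", Route: "/orders", Version: 4})) // falls back
}
```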
The real win, though, was understanding what pprof was showing me. The disasm view told me that runtime.mapaccess2_faststr was a huge chunk of the self time. The faststr part is important: Go has specialized map-access routines for common key types (faststr, fast32, fast64) plus a generic path. If your CPU is dominated by mapaccess2_faststr, your string keys are the bottleneck, not the map itself. Integer keys get you mapaccess2_fast64; even my struct key, which goes through the generic path, was a big win, because the expensive part was building and hashing the key string rather than the lookup.
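A micro-benchmark sketch of the faststr-vs-fast64 contrast, with hypothetical keys (the int64 imagines tenant ID, route ID, and version packed into one word); absolute numbers vary by machine, the point is that the string map hashes and compares key bytes while the int map hashes a single word:

```go
package main

import (
	"fmt"
	"testing"
)

// Hypothetical rule tables: same logical rule, two key shapes.
var strRules = map[string]int{"acme-corp:/api/v1/orders:7": 1}
var intRules = map[int64]int{(100 << 32) | (200 << 8) | 7: 1}

func main() {
	s := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_, _ = strRules["acme-corp:/api/v1/orders:7"] // mapaccess2_faststr
		}
	})
	n := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_, _ = intRules[(100<<32)|(200<<8)|7] // mapaccess2_fast64
		}
	})
	fmt.Printf("string key: %d ns/op\n", s.NsPerOp())
	fmt.Printf("int64 key:  %d ns/op\n", n.NsPerOp())
}
```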
A few other things I’ve learned to look for in pprof’s disasm view:
- Bounds checks (CMP followed by JAE and a call into runtime.panicIndex): Go bounds-checks every slice access, and the compiler tries to hoist the checks out of loops. If you see repeated bounds checks in a loop, sometimes s := s[:len(s)] before the loop, or a range-with-index rewrite, helps.
- Interface conversions (runtime.convT2Estring, runtime.convT2Enoptr): means you're boxing a concrete type into an interface value somewhere, which allocates.
- Write barriers (runtime.gcWriteBarrier): you're doing a pointer write that the GC needs to observe. Usually fine, but a hot loop with many pointer writes can eat real CPU.
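A small sketch of the bounds-check point, using the up-front-check idiom (the same trick encoding/binary uses): one check on the highest index lets the compiler drop the checks on the loads that follow. On recent toolchains you can see the remaining checks with go build -gcflags='-d=ssa/check_bce'.

```go
package main

import "fmt"

// read32 decodes a little-endian uint32 from b.
func read32(b []byte) uint32 {
	_ = b[3] // single bounds check up front; proves b[0..3] are in range
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

func main() {
	fmt.Printf("0x%08x\n", read32([]byte{0x78, 0x56, 0x34, 0x12})) // 0x12345678
}
```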
Reading Go assembly is surprisingly approachable. The calling convention is simple, the register pressure is manageable, and the one-function-per-disasm-block view in pprof is clear. I don’t claim to write Go assembly, but reading a hot 40-instruction block now takes me a couple of minutes instead of being a wall of noise.
This is the post I’d have given past-me to save myself an afternoon of squinting. Pprof at the function level tells you where. Pprof at the disasm level tells you why. And for hot paths, the why is often a tiny specialization question like “strings vs ints” that you can only see once you’ve looked at the generated code.