Reading goroutine traces like a local
Abstract
pprof gets all the attention, and it is the right first tool when a service is “slow” in the sense of “CPU is pegged.” But when the service is slow in the sense of “everything is fine according to pprof, but tail latency is still awful,” the answer is almost always in the runtime trace. The problem is that the trace viewer looks like a fighter jet cockpit and nobody teaches you how to fly it.
This was a 25-minute meetup talk where I walked through a real trace from a service I had been debugging the week before. No slides of theory, just a screen share of go tool trace and a running commentary.
Outline
- pprof vs trace: when each one is the right tool
- Capturing a trace in production without burning the server
- The six views in go tool trace: which ones are useful and which ones you can ignore
- Reading a goroutine’s timeline: scheduling, syscalls, network waits
- The pattern I care about most: a goroutine that is runnable but not running
- The pattern that bit us: channel send blocking on a slow consumer, invisible in pprof
- Questions
What I learned giving it
The live-demo part was the right call. Half the audience said afterwards they had tried to read a trace once, bounced off, and never opened it again. Watching someone else zoom and pan through the view seems to be the unlock.
The meta-lesson: I spent three slides on GOMAXPROCS and goroutine scheduling theory that nobody needed. If the audience has written Go for more than a year, they know what a P is. Skip it.
What I’d change
- Record the screen instead of presenting it live. The zoom levels in the trace viewer are finicky at conference projector resolution.
- Have a second, shorter trace ready. The first one ran long because I kept explaining small features I noticed mid-demo.
- Include a cheat-sheet PDF with the keyboard shortcuts. The number of people who did not know that W/S zoom and A/D pan the timeline was high.
Related posts: /posts/reading-the-go-scheduler-traces-for-the-first-time/, /posts/pprof-flamegraphs-in-prod/, /posts/the-goroutine-leak-i-didnt-notice-for-six-weeks/.
The talk was not recorded; it was a bar-room meetup. The slides PDF is linked above; it is mostly screenshots without my narration, so it is only half useful on its own.