Flamegraphs in production without the fear
For years my relationship with flamegraphs was: run them in dev, hope the shape looked the same in prod, complain when it didn’t. Something was always different — real concurrency, cache behavior, a CDN quirk, actual user patterns. The flamegraph I wanted was the one I could never get.
Then last year I bit the bullet and started running perf record on a prod box for 30 seconds at a time. It was fine. Here’s what I learned.
The actual risk profile
perf record in its default mode is a sampling profiler driven by the CPU-cycles hardware counter. At 99 Hz (the -F 99 you'll see in every guide; perf's own default is much higher), the overhead is genuinely tiny; I've measured it at well under 1% on our API boxes. The scary overhead numbers you sometimes see come from tracing every syscall or attaching kprobes, which is a different job entirely.
The real risk isn’t CPU. It’s:
- Disk: perf record writes a perf.data file. At high frequency on a box with lots of cores, that can be hundreds of MB for a 30-second capture.
- Stack unwinding: DWARF-based unwinding copies a chunk of stack with every sample and gets heavy fast. If the binary has frame pointers, use --call-graph fp instead; Go has compiled them in by default since 1.7 on amd64 (see the comparison just after this list).
- Symbol resolution: happens later, during perf script, and only on whatever machine you run that on. You need the binary plus debug info. This is where most people get stuck.
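When you can't rely on frame pointers (a C++ service built with -fomit-frame-pointer, say), DWARF unwinding still works; it just costs more and produces a much larger perf.data. A sketch of both variants over the same 30-second window:
# frame-pointer unwinding: cheap, but needs frame pointers compiled into the binary
sudo perf record -F 99 -a --call-graph fp -o /tmp/fp.data -- sleep 30
# DWARF unwinding: no frame pointers needed, but copies a stack snapshot with every sample
sudo perf record -F 99 -a --call-graph dwarf -o /tmp/dwarf.data -- sleep 30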
The workflow I use
On the prod box:
# 30s capture, fp unwinding, CPU cycles
sudo perf record -F 99 -a --call-graph fp -o /tmp/perf.data -- sleep 30
# compress and grab
tar cJf /tmp/perf.tar.xz -C /tmp perf.data
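Optionally, perf can bundle the binaries and debug files it recorded build-ids for, which saves copying them around by hand later. A sketch using perf archive; whether that subcommand is present depends on how your distro packages perf:
# bundle symbol files referenced by perf.data (writes perf.data.tar.bz2 next to it)
cd /tmp && sudo perf archive perf.data
# later, on the laptop, unpack into the build-id cache before running perf script
mkdir -p ~/.debug && tar xjf perf.data.tar.bz2 -C ~/.debug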
Pull the tarball off the box. On my laptop, with a matching copy of the binary:
perf script -i perf.data > out.stacks
./FlameGraph/stackcollapse-perf.pl out.stacks > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg
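If the local copy of the binary doesn't sit at the same path it had in prod, perf won't find it on its own. Two ways to point it there, sketched against a hypothetical ./prod-root directory that mirrors the prod filesystem layout; --symfs is there in the perf versions I've used, and perf buildid-cache is the fallback:
# resolve symbols relative to a local mirror of the prod filesystem
perf script -i perf.data --symfs ./prod-root > out.stacks
# or register just the one binary with the build-id cache
perf buildid-cache --add ./prod-root/usr/local/bin/api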
The Brendan Gregg FlameGraph repo is the standard and works fine. I also sometimes use inferno, the Rust rewrite of the same scripts.
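If you go the inferno route, the two Perl scripts are replaced by binaries from cargo install inferno and the pipeline collapses to one line; a sketch:
perf script -i perf.data | inferno-collapse-perf | inferno-flamegraph > flame.svg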
For Go specifically, net/http/pprof is often easier — you get a single HTTP endpoint, you grab a profile, go tool pprof -http=:8080 cpu.prof and you have an interactive view. I use pprof for Go and perf for C extensions, Python, or anything system-level.
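For completeness, the pprof flow, assuming the service already imports net/http/pprof and serves it on :6060 (adjust the port for your setup):
# 30-second CPU profile over HTTP, then an interactive view in the browser
curl -s -o cpu.prof 'http://localhost:6060/debug/pprof/profile?seconds=30'
go tool pprof -http=:8080 cpu.prof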
What I actually find when I do this
About 70% of the time: nothing surprising. The flamegraph matches what the code obviously looks like. CPU is in the places I’d guess.
About 20% of the time: a surprising line in a library. Last one: 18% of CPU on a machine was inside the logging library’s JSON formatter because we had a log line in a hot loop that serialized a large struct. Easy fix, big win.
About 10% of the time: something really weird. One time I found a third of CPU in futex calls because a memcache client was doing per-request allocator locking. Fixed by reusing clients.
A small-but-real recurring category is runtime/GC overhead. Both Go and Python have characteristic flamegraph shapes when you're under memory pressure (runtime.mallocgc in Go, PyEval_EvalFrame interleaved with GC frames in Python). Learn to spot them.
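A quick sanity check you can run against the folded stacks from earlier, before even opening the SVG. grep -c counts distinct stacks rather than sample-weighted time, so treat it as a rough signal only:
# how many distinct stacks pass through the Go allocator or background GC
grep -c 'runtime.mallocgc' out.folded
grep -c 'runtime.gcBgMarkWorker' out.folded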
On-prem vs cloud
On cloud, you might not have perf at all, depending on the distro and kernel, and you definitely don't have CAP_SYS_ADMIN by default in containers. Depending on your orchestrator you may need to mount /sys/kernel/debug or run the collector as a privileged DaemonSet. On our Kubernetes nodes we run a DaemonSet that ssh's into its host node and runs perf record on demand, triggered by a Slack slash command.
# on the node:
sudo perf record -F 99 -p <pid> --call-graph fp -o /tmp/pid.data -- sleep 30
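Before wiring any of that up, it's worth checking whether the node will let perf run at all. Two sysctls to look at; the exact thresholds vary by kernel and distro, so treat this as a rough guide:
# 2 is the common restrictive default; -1 allows everything (root bypasses the check anyway)
sysctl kernel.perf_event_paranoid
# values above 0 can hide kernel addresses from unprivileged readers of /proc/kallsyms
sysctl kernel.kptr_restrict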
When you don't have perf, async-profiler for JVMs or py-spy for Python are excellent; they attach straight to the running process and need far less setup, though py-spy may still want ptrace access when attaching across a container boundary.
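Both attach to a running process by pid. Hedged examples, with flags from the versions I've used:
# py-spy: 30-second capture straight to a flamegraph SVG
py-spy record --pid <pid> --duration 30 -o flame.svg
# async-profiler: sample CPU for 30 seconds and write a flamegraph
./profiler.sh -e cpu -d 30 -f /tmp/flame.html <pid>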
The permission question
Most teams’ first reaction is “we don’t want to run profilers on prod.” I get it. A few counter-arguments:
- Your APM already runs in prod and has more overhead than perf.
- You can target one canary box, not the fleet.
- If you can’t get data from prod, you don’t actually know where your time is going. You’re guessing.
I run profiles on prod about once a month. Maybe once a quarter I catch something real. The cost is ~5 minutes of my time and zero measurable impact on users. The value is enormous relative to staring at APM dashboards.
Reflection
The thing that changed my mind was realizing the mythology around “production profiling is dangerous” is mostly old wisdom from when profiling was dangerous. Modern perf is cheap, well-behaved, and the kernel has done the hard work of making it safe. The tools are fine. Go use them.
Related: gdb remote core dump workflow covers the next step when profiling points you at something genuinely weird.