strace on a running production process, carefully

strace-on-production is one of those phrases that makes people wince, and fair enough: if you attach without filters to a busy process you can absolutely make it slower than it was. But with care it’s one of the best tools we have.

The process I’m attaching to here is imagepipe, a service that reads uploads from a temp directory and pushes them to redis for a worker pool. It’s been eating too much CPU. I attach with strace -f -p 4421 -e trace=network,openat -tt -s 120 and pipe through head -40 so I don’t end up in a firehose I can’t read.

Notice the pattern in the output. For every job: openat the file, read it, connect to redis, sendto LPUSH, recvfrom the reply, close the file, close the socket. Open. Read. Connect. Close. Open. Read. Connect. Close. The redis connection is being opened and torn down for every single job, and then the file is being opened, read, and closed one job at a time.

My conclusion (the commented-out shell line at the bottom) is the bug in one sentence: “we never cache the fd.” A pooled redis client and a small LRU for open files would cut the syscall rate by 10x.

Callouts:

-f attaches to all threads and forks. Without it you only see the main thread, which is often useless.
-e trace=network,openat restricts the calls to what I care about. On a high-throughput process this is the difference between “useful” and “unusable.”
-tt timestamps with microseconds, which makes it obvious how quick the loop is.

Always detach cleanly (Ctrl-C is fine) and never leave strace running on a hot path.