Debugging a remote core dump without losing your mind
A service in production crashes. You have the core dump. You’re on your laptop, not on the prod machine. The prod machine runs a different distro than you do. The binary is stripped. Nothing in the error message is useful. Now what.
This is a run-through of the process I follow, in more detail than I usually see written down.
Step 1: grab the ingredients
You need four things. Missing any one of them makes the next several steps much harder.
- The core file itself.
- The exact binary that produced it.
- Any dynamic libraries loaded at the time of the crash.
- Debug info, ideally in a separate .debug file.
The binary on the production host might already be different from what produced the core if you’ve since redeployed. Check the build/commit hash against the core’s recorded info:
# for a process you still have, grab the exact binary it loaded:
cp /proc/$PID/exe ./mybinary    # works even if the file on disk was replaced by a redeploy
file ./mybinary                 # reports the GNU build ID
./mybinary --version
For the core itself, file core will tell you which executable it came from, and the file’s timestamp tells you approximately when.
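If elfutils happens to be installed on the host, eu-unstrip can cross-check this: it lists every module recorded inside the core along with its build ID, which is a quick way to confirm that the binary and libraries you are about to collect really are the ones the process had loaded.
# list the modules and build IDs recorded in the core (requires elfutils)
eu-unstrip -n --core=core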
Step 2: get the libraries
The easy mistake is copying the core to your laptop and opening it against your laptop’s copies of the binary and libraries. The versions don’t match, and gdb either refuses to load them or hands you a nonsense backtrace.
What I do instead, on the prod host before leaving:
# pull the actual libraries the process loaded
awk '$6 ~ /^\// {print $6}' /proc/$PID/maps | sort -u > /tmp/libs.txt
tar cjf /tmp/crash.tar.bz2 -T /tmp/libs.txt ./mybinary core
If the process is already dead, you can extract the set of needed libraries from the core itself:
gdb ./mybinary core -ex "info shared" -ex quit 2>&1 | awk '/0x/ {print $NF}'
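That only prints the list; to package it up, save it to a file and feed it into the same tar invocation as above:
gdb ./mybinary core -ex "info shared" -ex quit 2>&1 | awk '/0x/ {print $NF}' | sort -u > /tmp/libs.txt
tar cjf /tmp/crash.tar.bz2 -T /tmp/libs.txt ./mybinary core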
Step 3: set up a sysroot
Now I set up a directory that mimics the prod host’s layout:
mkdir -p sysroot/lib64 sysroot/usr/lib64
# untar the collected libs preserving paths
tar xjf crash.tar.bz2 -C sysroot/
Then tell gdb about it:
(gdb) set sysroot ./sysroot
(gdb) file ./sysroot/path/to/mybinary
(gdb) core-file core
If gdb warns that a library's .dynamic section "is not at the expected address", you have a missing or mismatched library. Iterate.
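A quick way to see how far you have gotten is to ask gdb which shared objects it actually resolved:
(gdb) info sharedlibrary
Anything reported as not found, or listed with "No" in the Syms Read column, is a library (or its debug info) you still need to copy into the sysroot.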
Step 4: the actual debugging
bt full is the 80/20 tool:
(gdb) bt full
#0 0x00007f... in handle_request (req=0x...) at src/handler.c:142
buf = 0x7f... "{\"type\":\"signup\", ...}"
n = 8192
i = 4096
Things I look for:
- Is there a null pointer in the locals? req=0x0 is a gift.
- Is there a string that looks like user input? That might be the trigger.
- Is the crash in someone else’s library, or ours? If it’s in ours, that’s easier.
- The stack depth itself. A 200-frame stack where the same few functions repeat is usually a recursion bug. (A few gdb commands for these checks are sketched below.)
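The commands themselves are nothing exotic; buf here is the local from the example frame above:
(gdb) info locals
(gdb) x/s buf
(gdb) bt
info locals surfaces null pointers and suspicious values in the current frame, x/s dumps a buffer as a string so you can see whether it looks like user input, and a plain bt lets you eyeball the depth.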
Step 5: go three levels up
Backtrace tells you where it crashed. It doesn’t tell you how the program got into that state in the first place. I always move up a few frames and poke around:
(gdb) frame 3
(gdb) print *some_struct
(gdb) print some_array[0]@10
The @N syntax prints N elements starting at the pointer. Very handy for dumping buffers.
In a recent crash, one frame had buf_len = 8192 but the buffer was declared as char buf[1024]. Classic buffer overflow. Walking up the frames to find what wrote to it revealed a function that assumed the caller had allocated 8KB.
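A sketch of what that looked like, with a hypothetical frame number and the variable names from that particular crash:
(gdb) frame 2
(gdb) print buf_len
$1 = 8192
(gdb) print sizeof(buf)
$2 = 1024
Note that print sizeof(buf) only tells the truth when the debug info still records buf as an array; once it has decayed to a pointer you just get the pointer size.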
Step 6: when the binary is stripped
Usually the answer is “debug info got stripped out, you need the separate debuginfo package.” If you have it:
(gdb) set debug-file-directory ./debuginfo
(gdb) symbol-file ./sysroot/path/to/mybinary
If you don’t, you can often rebuild the exact same commit with debug info and load that build’s symbol table:
(gdb) symbol-file ./mybinary.debug
…as long as the build ID matches. file and readelf -n on both binaries will show you the build IDs to compare. Stripping only removes symbols, not the actual code layout, so the debug file from a matching build will still line up.
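If you control the build, the usual way to produce that separate .debug file is the objcopy split; a minimal sketch, assuming an unstripped build named mybinary:
# split the debug info out of the unstripped build, then link the two together
objcopy --only-keep-debug mybinary mybinary.debug
objcopy --strip-debug mybinary
objcopy --add-gnu-debuglink=mybinary.debug mybinary
# both files should report the same Build ID note
readelf -n mybinary mybinary.debug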
Step 7: post-mortem notes
After every core dump investigation I now:
- Save the bt full output somewhere.
- Note which frame was the first frame in our code.
- Write a single sentence describing the cause.
Over time this becomes a catalog that’s searchable. “Have we seen this before?” is a useful question that’s hard to answer without a searchable archive.
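The catalog needs no tooling. Mine is a directory of plain text files (the path below is just an example), so the question becomes a grep:
grep -rli "handle_request" ~/crash-notes/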
Reflection
Core dumps intimidated me for years. The intimidation is entirely about the setup — getting the binary, the libs, the sysroot. Once you’re in gdb with matching symbols, it’s not harder than reading a stack trace in any other context. It’s just an older interface.
If your service produces core dumps (and if it doesn’t, consider enabling them), the ten minutes you spend setting up the collection pipeline will pay for itself the first time you need it.
Related: strace revealed the libc problem.