A service in production crashes. You have the core dump. You’re on your laptop, not on the prod machine. The prod machine runs a different distro than you do. The binary is stripped. Nothing in the error message is useful. Now what.

This is a run-through of the process I follow, in more detail than I usually see written down.

Step 1: grab the ingredients

You need four things. Losing any one of them makes the next several steps much harder.

  1. The core file itself.
  2. The exact binary that produced it.
  3. Any dynamic libraries loaded at the time of the crash.
  4. Debug info, ideally in a separate .debug file.

The binary on the production host might already be different from what produced the core if you’ve since redeployed. Check the build/commit hash against the core’s recorded info:

# for a process that's still running, check which binary it's actually executing:
readlink /proc/$PID/exe        # prints " (deleted)" if the file was replaced since startup
file ./mybinary                # recent versions of file print the BuildID
./mybinary --version

For the core itself, file core will tell you which executable (and command line) produced it; the file’s mtime tells you approximately when.
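
The output looks roughly like this (exact fields vary with the version of file, and the command line shown here is invented):

# illustrative output; the 'from' field shows the original command line
file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style,
  from './mybinary --port 8080', execfn: './mybinary'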

Step 2: get the libraries

The easy mistake is copying the core to your laptop and opening it against your laptop’s libraries. The versions don’t match, and gdb either bails or hands you a plausible-looking but garbage backtrace.

What I do: on the prod host, before leaving:

# pull the actual libraries the process loaded
awk '$6 ~ /^\// {print $6}' /proc/$PID/maps | sort -u > /tmp/libs.txt
tar cjf /tmp/crash.tar.bz2 -T /tmp/libs.txt ./mybinary core

If the process is already dead, you can extract the set of needed libraries from the core itself:

gdb -batch -ex "info shared" ./mybinary core 2>&1 | awk '/^0x/ {print $NF}'

Step 3: set up sysroot

Now I set up a directory that mimics the prod host’s layout:

mkdir -p sysroot/lib64 sysroot/usr/lib64
# untar the collected libs preserving paths
tar xjf crash.tar.bz2 -C sysroot/
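
Before firing up gdb I do a quick check that every path from the step-2 list now exists under the sysroot (this assumes you copied /tmp/libs.txt over along with the tarball):

while read -r p; do
  [ -e "sysroot$p" ] || echo "missing: $p"
done < /tmp/libs.txt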

Then tell gdb about it:

(gdb) set sysroot ./sysroot
(gdb) file ./sysroot/path/to/mybinary
(gdb) core-file core

If you get a warning about a .dynamic section not being at the expected address (“wrong library or version mismatch?”), gdb is reading a library that doesn’t match the one the process actually had loaded: a copy is missing from the sysroot or is the wrong version. Get the right file and iterate until the warnings stop.
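
Since you usually go around that loop a couple of times, it’s worth putting the setup into a gdb command file instead of retyping it. A minimal sketch (crash.gdb is an arbitrary name, and the binary path is the same placeholder as above):

cat > crash.gdb <<'EOF'
set sysroot ./sysroot
file ./sysroot/path/to/mybinary
core-file core
EOF
gdb -q -x crash.gdb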

Step 4: the actual debugging

bt full is the 80/20 tool:

(gdb) bt full
#0  0x00007f... in handle_request (req=0x...) at src/handler.c:142
    buf = 0x7f... "{\"type\":\"signup\", ...}"
    n = 8192
    i = 4096

Things I look for:

  • Is there a null pointer in the locals? req=0x0 is a gift.
  • Is there a string that looks like user input? That might be the trigger.
  • Is the crash in someone else’s library, or ours? If it’s in ours, that’s easier.
  • The stack depth itself. A 200-frame stack is usually a recursion bug.
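
A few gdb commands that map onto that checklist, using the locals from the hypothetical session above (the struct name in the output is invented; the point is the 0x0):

(gdb) print req
$1 = (struct request *) 0x0
(gdb) x/s buf
(gdb) bt -10

x/s prints the buffer as a C string so you can eyeball it for user input, and bt -10 shows only the outermost ten frames, which on a suspiciously deep stack tells you where the recursion started.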

Step 5: go three levels up

The backtrace tells you where it crashed. It doesn’t tell you how the program got into the state that made it crash. I always move up a few frames and poke around:

(gdb) frame 3
(gdb) print *some_struct
(gdb) print some_array[0]@10

The @N syntax prints N consecutive elements starting at the one you name. Very handy for dumping buffers.
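
When the buffer is binary rather than text, x with an explicit format is the complement (some_array is the same placeholder as above; $rsp is the x86-64 stack pointer):

(gdb) x/32xb some_array
(gdb) x/8xg $rsp

That’s 32 bytes in hex starting at the array, then eight 8-byte words starting at the saved stack pointer.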

In a recent crash, one frame had buf_len = 8192 while the buffer it described was declared as char buf[1024]: a classic buffer overflow. Walking up the frames to find what wrote to it revealed a function that assumed the caller had allocated 8 KB.
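
The gdb session for that kind of check is short; roughly this, with the variable names taken from that crash (so treat them as illustrative):

(gdb) up
(gdb) print sizeof(buf)
$2 = 1024
(gdb) print buf_len
$3 = 8192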

Step 6: when the binary is stripped

Usually the answer is “debug info got stripped out, you need the separate debuginfo package.” If you have it:

(gdb) set debug-file-directory ./debuginfo
(gdb) symbol-file ./sysroot/path/to/mybinary

If you don’t, you can often rebuild the exact same commit with debug info and attach that build’s symbol table:

(gdb) symbol-file ./mybinary.debug

…as long as the build IDs match. file and readelf -n will show you the build ID on both sides. Stripping removes only the symbols, not the code layout, so the debug file from a matching build will still line up.
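
Concretely, producing the detached debug file from an unstripped build of the same commit and checking that both sides agree looks like this (mybinary-unstripped is a stand-in for wherever your unstripped build lands; the sysroot path is the placeholder from above):

# split the debug info out of the unstripped build
objcopy --only-keep-debug ./mybinary-unstripped ./mybinary.debug
# both must report the same Build ID before the line numbers mean anything
readelf -n ./sysroot/path/to/mybinary | grep 'Build ID'
readelf -n ./mybinary.debug | grep 'Build ID'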

Step 7: post-mortem notes

After every core dump investigation I now:

  1. Save the bt full output somewhere.
  2. Note which frame was the first frame in our code.
  3. Write a single sentence describing the cause.
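
A gdb -batch one-liner makes the first item cheap enough to actually do every time (paths as in the sysroot setup; the notes directory name is arbitrary):

mkdir -p crash-notes
gdb -batch -ex "set sysroot ./sysroot" -ex "bt full" \
    ./sysroot/path/to/mybinary core > crash-notes/$(date +%F)-mybinary.txt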

Over time this becomes a catalog that’s searchable. “Have we seen this before?” is a useful question that’s hard to answer without a searchable archive.

Reflection

Core dumps intimidated me for years. The intimidation is entirely about the setup — getting the binary, the libs, the sysroot. Once you’re in gdb with matching symbols, it’s not harder than reading a stack trace in any other context. It’s just an older interface.

If your service produces core dumps (and if it doesn’t, consider enabling them), the ten minutes you spend setting up the collection pipeline will pay for itself the first time you need it.
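
What “enabling them” looks like depends on the distro; if systemd-coredump is in use it already handles collection, and for a plain setup a minimal sketch is:

# let the process write a core at all (for a systemd unit: LimitCORE=infinity)
ulimit -c unlimited
# write cores somewhere predictable: %e = executable name, %p = pid, %t = unix time
mkdir -p /var/crash
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t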

Related: strace revealed the libc problem.