strace revealed our libc mismatch
Same binary. Same config. Runs fine on our old Ubuntu 20.04 image, crashes immediately on our new Ubuntu 22.04 image. No log output at all.
Not even “the process started.” Just exit code 1.
The investigation
./service --version also exited 1 and printed nothing (echo $? confirmed it). Next, strace ./service --version:
execve("./service", ["./service", "--version"], 0x...) = -1 ENOENT (No such file or directory)
ENOENT on a binary that clearly exists? That's the classic sign that the kernel or dynamic linker couldn't find something the binary needs in order to start (usually the ELF interpreter or a shared library), not the binary itself. The shell's error message doesn't tell you that; strace does.
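When execve itself fails on a file that exists, two quick checks narrow down what's actually missing. These are standard file/binutils commands, nothing specific to this service:
# is this the architecture and libc flavor you expect?
file ./service
# which ELF interpreter does the binary ask the kernel to load?
readelf -l ./service | grep -i interp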
Drilling in with ldd:
$ ldd ./service
./service: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.38' not found (required by ./service)
The binary was built against glibc 2.38 (on a dev machine). The Ubuntu 22.04 image ships with glibc 2.35. ldd told me this in one line, but I’d never run ldd because I assumed “binary runs on my laptop, must be fine.”
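The check I should have run before shipping a binary to a different base image is two lines: which glibc symbol versions does the binary ask for, and which glibc does the target actually have? A sketch:
# highest GLIBC_* symbol version the binary references
objdump -T ./service | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1
# glibc version on the image that will run it
ldd --version | head -1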
What made this hard
Normally the dynamic linker error prints a message. Something like:
./service: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.38' not found ...
…on stderr, before the process exits. For some reason (we never fully figured out why) the container’s init was swallowing the main process’s stderr. The container just ended with exit code 1 and no output. So the actionable error message was being produced, and then filtered out before we ever saw it.
Running under strace made the failure visible because strace prints its syscall-level detail to its own output, which in our debugging session was plumbed correctly.
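In hindsight there was also a zero-tooling way to prove the message existed: rerun the failing command with stderr redirected to a file inside the container, so the stream never passes through whatever was eating it. A sketch, assuming you can get a shell in the container:
# capture stderr to a file, then read it back
./service --version 2>/tmp/service.err; echo "exit=$?"; cat /tmp/service.err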
The fix
Two fixes, depending on urgency.
Short-term: build the service against the older glibc. In our case, the build image was newer than the runtime image. We pinned both to Ubuntu 22.04 and the problem went away.
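A cheap guard for the short-term fix (a sketch; my-build-image is a placeholder for whatever your CI actually builds in): check that the build and runtime images report the same glibc before trusting a binary across them.
# both lines should print the same glibc version (2.35 for Ubuntu 22.04)
docker run --rm ubuntu:22.04 ldd --version | head -1
docker run --rm my-build-image ldd --version | head -1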
Long-term: static link where possible. Our service is a Go binary, so CGO_ENABLED=0 go build -o service produces something that has no libc dependency at all. After that, any base image works, including scratch. For services that genuinely need cgo (we have a few that use libsqlite with FTS5), we use a musl-based build.
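For the pure-Go case, the whole build plus a sanity check is three commands. A sketch, assuming a plain main package with no cgo anywhere in the dependency tree:
# static build; file and ldd confirm nothing dynamic sneaked back in
CGO_ENABLED=0 go build -o service .
file service   # should report "statically linked"
ldd service    # should report "not a dynamic executable"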
What strace is actually good for
The pattern I keep coming back to: strace is perfect when the complaint is “thing doesn’t work and there’s no useful output.”
- Service exits silently: strace shows the last syscall. Often enough.
- Service hangs: strace -p $PID shows what syscall it’s blocked on.
- Service is slow and you don’t know where: strace -c summarizes syscall counts and time, great for “spending 40% of time in futex” type findings.
- Config file not being read: strace the open() calls and find out what path the service is actually looking at (often not what you think); there’s a sketch of this right after the list.
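For that last case, a minimal sketch (the grep pattern is a placeholder for whatever your config file is called; openat is what glibc turns open() into on modern systems):
# which paths does the service actually try to open?
strace -f -e trace=openat ./service 2>&1 | grep -i conf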
A short tour of useful flags:
# attach to a running process
strace -p 12345
# follow children (useful for shell scripts / init-like processes)
strace -f ./service
# only file-related syscalls, with full paths
strace -e trace=file ./service
# summary table of syscalls
strace -c ./service --help
# write to a file instead of stderr (useful for long-running traces)
strace -o trace.log -f -T ./service
When strace is not the right tool
- When you can’t afford to slow the process down (strace can be a 50x slowdown on syscall-heavy workloads). Use perf or eBPF-based tools like bpftrace instead.
- When you want to trace across processes or the whole system. Use perf trace or bpftrace.
- When the issue is inside a single syscall (e.g., an epoll_wait that never returns). strace only shows you that the syscall started.
For those, I use bpftrace scripts or perf. But for “why did my binary exit silently,” strace is still my first move.
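For comparison, this is the shape of bpftrace one-liner I mean: the whole-system version of the config-file trick above, no per-process attach, needs root and a reasonably recent kernel:
# every openat() on the system, with the name of the process doing it
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'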
Reflection
I’ve been doing this long enough that “what does strace say” is basically muscle memory for certain categories of bug. What surprised me here wasn’t strace being useful — it was how invisible the problem was without it. We had an error message on stderr, and container stdout/stderr handling ate it. Modern container runtimes are supposed to be better about this; ours was not. Now I reach for strace earlier than I used to.
Related: Debugging a remote core dump without losing your mind.