A compiler optimization I didn't know existed until it broke
A homelab project of mine — a small Lisp interpreter in C — started crashing under deep recursion after I switched from -O1 to -O2. It took me three days to figure out that a compiler optimization I didn’t know existed was interacting with a recursion pattern I’d written, and that the crash was less a bug in my code than a lesson about my assumptions about what the compiler can and can’t do.
The code was a classic pattern for interpreters: an eval function that dispatches on the expression type and recurses. For most forms, the recursion goes through several stack frames. For some forms — tail-position continuations — the recursion could be arbitrarily deep. I’d written the interpreter in what I thought was idiomatic C:
Value eval(Env *env, Expr *e) {
    switch (e->type) {
    case EXPR_NUMBER: return number_value(e->num);
    case EXPR_VAR:    return env_lookup(env, e->name);
    case EXPR_IF: {
        Value cond = eval(env, e->cond);
        if (is_truthy(cond)) {
            return eval(env, e->then);
        } else {
            return eval(env, e->else_);
        }
    }
    case EXPR_CALL: {
        Value fn = eval(env, e->fn);
        // evaluate args...
        return call(fn, args);
    }
    // ...
    }
}
At -O1, this worked fine. At -O2, a tight loop in my test suite caused a segfault. The backtrace was massive — the crash was at the very bottom of 200,000 nested calls.
I added debug prints, watched the stack grow, couldn’t see anything unusual. The difference between -O1 and -O2 was… the optimizer. Eventually, I remembered that -O2 enables -foptimize-sibling-calls, GCC’s restricted form of tail call optimization. And indeed, when I looked at the generated assembly:
# -O1: recursive call preserves the frame
call eval
# -O2: turns into a tail jump
jmp eval
At -O2, the tail call to eval in the EXPR_IF case became a direct jmp, which reused the stack frame. Great — that’s tail call optimization, and it should mean I get unlimited recursion depth in that path.
Except… the crash was still happening. What gives?
Here’s what I missed: tail call optimization in C is not guaranteed. The C standard doesn’t require it. GCC and Clang do it sometimes, when they can prove it’s safe. In my case, the switch dispatch structure was preventing TCO on some paths but not others. The EXPR_IF tail call was optimized. The EXPR_CALL tail call… was not, because the compiler couldn’t prove the args array’s lifetime was okay to discard.
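To make this concrete, here is a minimal sketch (a toy function, not the interpreter itself) of a textbook tail call. Whether it runs in constant stack space is entirely up to the optimizer:

```c
#include <assert.h>

/* A textbook accumulator-passing tail call. GCC and Clang usually
 * compile the recursive call into a jump at -O2, but the C standard
 * makes no promise: at -O0 the same function grows the stack on
 * every call and will eventually overflow for large enough n. */
static long sum_to(long n, long acc) {
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);  /* call in tail position */
}
```

If you need the guarantee rather than the hope, Clang supports a `musttail` statement attribute on the return, which forces the tail call or fails the build; recent GCC has added the same attribute.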
My confusion came from the mental model that “the optimizer makes things faster.” I hadn’t internalized that the optimizer can change stack usage, and can make some paths tail-call while leaving others as real calls, depending on subtleties invisible in the source. My test suite was crashing because the pathological test case happened to hit the non-TCO path a lot.
The fix was to convert to an explicit trampoline:
// Explicit tail call via trampoline
Value eval(Env *env, Expr *e) {
    for (;;) {
        switch (e->type) {
        case EXPR_NUMBER: return number_value(e->num);
        case EXPR_VAR:    return env_lookup(env, e->name);
        case EXPR_IF: {
            Value cond = eval(env, e->cond);  // condition is not in tail position
            // explicit tail: rebind e and loop instead of recursing
            e = is_truthy(cond) ? e->then : e->else_;
            continue;
        }
        case EXPR_CALL: {
            Value fn = eval(env, e->fn);
            // ... evaluate args, bind them into new_env ...
            if (is_tail_call(e)) {
                // explicit tail: rebind env and e, loop
                env = new_env;
                e = fn->body;
                continue;
            }
            return call(fn, args);
        }
        }
    }
}
Now the tail positions are explicitly looped — no recursion, no stack growth. Deep recursion is bounded by heap (the environment chain), not by the C stack. This is how every serious Scheme interpreter does it.
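The same trampoline idea, boiled down to a self-contained toy (a hypothetical Step type and parity function, not the interpreter’s real structures): instead of recursing, each step returns a description of the next step, and a loop drives it.

```c
#include <stdbool.h>

/* A Step either carries a final result or describes the next call. */
typedef struct {
    bool done;
    long result;  /* valid when done */
    long n;       /* remaining work when not done */
    bool even;    /* running parity when not done */
} Step;

static Step parity_step(long n, bool even) {
    if (n == 0)
        return (Step){ .done = true, .result = even };
    /* instead of recursing, describe the next step */
    return (Step){ .done = false, .n = n - 1, .even = !even };
}

/* The trampoline: loop until a step reports it is done.
 * Stack usage is constant no matter how large n gets. */
static long is_even(long n) {
    Step s = parity_step(n, true);
    while (!s.done)
        s = parity_step(s.n, s.even);
    return s.result;
}
```

The loop plays the role the `continue` statements play in the interpreter above: one stack frame, arbitrarily many logical calls.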
The broader lesson: assume the compiler doesn’t optimize tail calls in C. Some languages — Scheme, OCaml, Scala (in certain cases), Haskell — guarantee it. Others — C, C++, Rust, Go — do not. If you care about stack depth, you write the trampoline yourself.
Other compiler optimizations I’ve been bitten by, briefly:
Strict aliasing. -O2 turns on strict aliasing by default. If you cast a float* to an int* and read through it, the compiler can assume the two pointers don’t alias and reorder reads and writes accordingly. The fix is accessing through char* (which is allowed to alias anything), -fno-strict-aliasing, or memcpy.
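The memcpy form is the one I reach for, since compilers lower a fixed-size memcpy to a single register move:

```c
#include <stdint.h>
#include <string.h>

/* Read a float's bit pattern without violating strict aliasing.
 * Casting &f to uint32_t* and dereferencing would be UB under the
 * aliasing rules; memcpy through the object representation is fine. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);  /* compiled down to a plain move */
    return u;
}
```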
Reordering of atomics with nearby non-atomic ops. The compiler (and the CPU) can reorder non-atomic operations around an atomic in ways that feel wrong. In C11, you need the right memory_order to prevent it: memory_order_relaxed allows essentially arbitrary reordering of the surrounding code, while release/acquire pairs establish the ordering you were probably assuming.
Empty infinite loops getting removed. C++ allows the compiler to assume that loops without observable side effects terminate, so an empty while (1) {} can be compiled to… nothing. C is more conservative (C11 exempts loops whose controlling expression is a constant, like while (1)), but it’s still sometimes surprising. Always have an observable effect in your loop.
Undefined behavior triggering weird paths. If your code has UB in it (signed overflow, null deref, etc.) the compiler is allowed to assume the UB never happens and optimize based on that. This leads to “the compiler ate my bounds check” cases where a check after a dereference gets deleted because the compiler reasoned: if the dereference didn’t crash, the pointer wasn’t null, so the null check is dead.
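The classic “eaten check” shape, sketched as a hypothetical function:

```c
#include <stddef.h>

/* The null check below may be deleted at -O2: the dereference on the
 * first line is UB if p is NULL, so the optimizer is entitled to
 * assume p != NULL from that point on and fold the branch away.
 * The fix is to check before dereferencing, not after. */
static int deref_then_check(const int *p) {
    int v = *p;        /* UB if p == NULL */
    if (p == NULL)     /* "provably dead" in the compiler's eyes */
        return -1;
    return v;
}
```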
Inlining is context-sensitive. A function might be inlined in one call site and not another. Performance can depend on things like call count, size budget, and compiler version. The inline keyword is a hint, not a guarantee. __attribute__((always_inline)) is a stronger hint.
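For reference, the GCC/Clang spelling; the attribute turns a silent heuristic decision into inline-at-every-call-site (or a hard error if inlining is impossible):

```c
/* `inline` alone is a hint the optimizer is free to ignore;
 * always_inline makes GCC/Clang inline this at every call site. */
static inline __attribute__((always_inline)) int square(int x) {
    return x * x;
}
```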
LTO changes what’s possible. Link-time optimization lets the compiler see across compilation units, which enables more aggressive optimization — and sometimes more surprising ones. The first time my non-LTO build and LTO build had different behavior, I was baffled. It was almost always strict-aliasing related.
The meta-lesson after this interpreter incident: my mental model of “what the compiler can do” was about a decade out of date. Modern compilers can do far more than I gave them credit for, and the resulting code is often faster than I’d expect — but also sometimes behaves differently from a naive reading of the source. For safety-critical or performance-critical code, I now make it a habit to look at the generated assembly with cargo asm (Rust) or objdump -d (C) to verify my assumptions.
The homelab Lisp interpreter now handles deep recursion and runs at roughly 70% of the speed of the non-trampoline version. That’s a cost, but it’s a cost I can afford, and the explicit trampoline is easier to reason about than “hope the compiler does TCO.”
Related: the branch predictor post and CPU caches post for other ways the hardware and compiler surprise me.