Here’s a trap I’ve seen Go engineers fall into repeatedly: they spawn a goroutine thinking “if this fails, it’ll just fail in its own little world, and the main service keeps going.” Nope. A panic in any goroutine crashes the entire process unless you recover it locally.

The code that took us down looked roughly like this:

func (s *Service) Start() error {
    go s.runPeriodicCleanup()
    go s.runMetricsExporter()
    return nil
}

func (s *Service) runPeriodicCleanup() {
    for {
        time.Sleep(5 * time.Minute)
        s.cleanupExpired()
    }
}

Looks innocuous. runPeriodicCleanup has no error handling because, well, it shouldn’t error. Until s.cleanupExpired() accidentally dereferences a nil pointer because a config field wasn’t initialized, and the whole process goes down with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x...]

goroutine 23 [running]:

Our SRE paged me while I was on call: we’d gone from serving traffic fine to all pods crashlooping in about 40 seconds. The panic was deterministic; every replica hit the same bug immediately after startup, in the background goroutine.

Go’s panic/recover model is simple: a panic propagates up the panicking goroutine’s stack, and if nothing recovers it before the stack fully unwinds, the runtime terminates the entire process. There is no “one goroutine dies but the rest keep going”; Go isn’t Erlang.
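
Don’t take my word for it; here’s about the smallest repro there is:

package main

import (
    "fmt"
    "time"
)

func main() {
    go func() {
        panic("boom") // nothing recovers this inside the goroutine
    }()
    time.Sleep(time.Second)
    fmt.Println("never reached") // the process is already dead by now
}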

The fix is explicit recover at the top of every goroutine that you don’t want to take down the process:

func (s *Service) Start() error {
    go s.safeRun("cleanup", s.runPeriodicCleanup)
    go s.safeRun("metrics", s.runMetricsExporter)
    return nil
}

func (s *Service) safeRun(name string, fn func()) {
    defer func() {
        if r := recover(); r != nil {
            slog.Error("goroutine panicked",
                "name", name,
                "panic", r,
                "stack", string(debug.Stack()))
            panicCounter.WithLabelValues(name).Inc()
        }
    }()
    fn()
}
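
That snippet assumes log/slog and runtime/debug are imported, and panicCounter is a Prometheus counter; there’s a sketch of it further down.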

A few additional considerations once you start doing this:

The goroutine dies after recover. The recovered goroutine doesn’t resume: recover lets the process continue, but the goroutine’s stack has already unwound past the point of the panic. If you want the work to keep happening, you need to restart it. Usually I wrap it in a restart loop:

func (s *Service) safeRunForever(name string, fn func()) {
    for {
        func() {
            defer func() {
                if r := recover(); r != nil {
                    slog.Error("goroutine panicked, restarting",
                        "name", name,
                        "panic", r,
                        "stack", string(debug.Stack()))
                }
            }()
            fn()
        }()
        // restart after a brief backoff
        time.Sleep(1 * time.Second)
    }
}

The inner anonymous function creates a new recoverable scope each iteration, so a panic crashes just one iteration.
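
One design note: a flat one-second sleep is fine as a floor, but if the panic is deterministic (like ours was), the goroutine will just crash-loop at 1 Hz forever. Consider exponential backoff with a cap, so a persistent bug produces a slow trickle of stack traces instead of a steady burn.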

Don’t swallow panics silently. Always log and increment a counter. In production, I’ve seen a silent recover burn so much CPU on a panic-and-retry loop that the service was essentially dead but still looked “up.” A metric on panic count is essential.
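
For reference, here’s a sketch of the panicCounter used above, assuming prometheus/client_golang; the metric name is mine, so pick whatever fits your conventions:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Incremented by safeRun on every recovered panic; alert on any
// nonzero rate, since it surfaces a crash-loop long before CPU graphs do.
var panicCounter = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "goroutine_panics_total",
        Help: "Panics recovered in background goroutines.",
    },
    []string{"name"},
)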

Don’t recover everywhere. Recover belongs at the top of a goroutine or at a well-defined service boundary (like an HTTP middleware). Don’t scatter defer recover() through your code — it makes debugging impossible and hides real bugs.

HTTP middleware should recover. net/http has a partial safety net built in: if your handler panics, the server recovers it so the process survives, logs the stack trace, and closes the connection. It does NOT write a 500 for you; the client just sees a dropped connection. And in some frameworks (I’m looking at you, certain gRPC handlers), panics in handlers are NOT recovered at all by default and will take down the process. Check your framework.
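
If you want a clean 500 instead of a dropped connection, the middleware is short. A sketch, assuming log/slog and runtime/debug; recoverMiddleware is my name for it, not a stdlib API:

func recoverMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer func() {
            if rec := recover(); rec != nil {
                slog.Error("handler panicked",
                    "path", r.URL.Path,
                    "panic", rec,
                    "stack", string(debug.Stack()))
                // If the handler already wrote a response, this status is
                // silently ignored, but it covers the common case.
                http.Error(w, "internal server error", http.StatusInternalServerError)
            }
        }()
        next.ServeHTTP(w, r)
    })
}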

Goroutine leaks aren’t panics. A leaked goroutine — one that’s blocked forever — doesn’t crash anything, but it does eat memory. Different problem, different solution. See my post on goroutine leaks.

fatal error: concurrent map writes is NOT a panic. The Go runtime detects some classes of misbehavior and aborts with a fatal error rather than a panic, and fatal errors bypass the panic/recover machinery entirely. Concurrent map writes is one. Stack exhaustion is another. recover() doesn’t save you from these. The only fix is to not hit them in the first place.
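
A repro, if you want to convince yourself; the deferred recover never fires, and the process dies with fatal error: concurrent map writes:

package main

import "time"

func main() {
    m := map[int]int{}
    for i := 0; i < 2; i++ {
        go func() {
            // Useless: a fatal error is not a panic, so the runtime
            // aborts the process without ever running this defer.
            defer func() { recover() }()
            for j := 0; ; j++ {
                m[j] = j // unsynchronized writes from two goroutines
            }
        }()
    }
    time.Sleep(time.Second)
}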

os.Exit is not a panic. If you (or a library you depend on) call os.Exit(1), no deferred functions run and there’s no recovery. I’ve seen libraries call log.Fatal inside goroutines; log.Fatal prints and then calls os.Exit, so there’s nothing to recover. log.Fatal in library code is always a bug.
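
This one is easy to demonstrate; the deferred cleanup silently never runs:

package main

import (
    "fmt"
    "log"
)

func main() {
    defer fmt.Println("cleanup") // never printed: log.Fatal calls os.Exit(1),
    log.Fatal("bailing out")     // and os.Exit skips all deferred functions
}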

After the incident, I added a lint rule to our codebase: every go statement must be wrapped in a function that starts with a defer recover() (we implemented it as a grep for the go keyword plus manual review). It’s not bulletproof, since you can still start a goroutine in cgo or via a library that doesn’t recover, but it catches 95% of cases.

The irony is that this bug had been latent for months. The background cleanup only occasionally had work to do, and the nil path required a misconfigured config flag, something that had only ever happened in staging, where nobody caught it. A production config push then put every replica into the bad state simultaneously. We had solid recovery in our HTTP middleware, so all user-facing panics were contained. The background worker had none, and it was the background worker that killed us.

If you’re running Go in production, take a few minutes and audit every go keyword in your codebase. Any goroutine you start that you expect to live for the life of the process should have a recover. Library code should probably NOT recover (let the caller decide), but the code in your own main binary absolutely should.