We’d been seeing a slow memory creep in our billing service for about a week when Priya, who was on call that rotation, started asking pointed questions in the channel. The graph wasn’t dramatic, just up and to the right, a couple hundred MB a day, reset every time we deployed. Because we deploy several times a week, the restarts kept masking it; the leak went unnoticed for about six weeks, until a holiday deploy freeze let the pod run long enough to finally OOM.

The first thing I did was pull a heap profile. It was boring: a lot of []byte buffers from our protobuf decoder, which I’d seen before and knew was fine. The numbers didn’t add up to what the container was reporting, either; RSS was way bigger than the heap. That’s usually a sign that either a native allocator is misbehaving or you have more goroutines than you think, because goroutine stacks count toward RSS but never show up in the heap profile.

I ran curl 'localhost:6060/debug/pprof/goroutine?debug=1' and there it was: 184,000 goroutines, all parked in the same place. Every single one was blocked on a send to a channel that used to be drained by a worker we’d deleted four releases ago.
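
If you haven’t wired this endpoint up before, it comes from net/http/pprof. A minimal setup, assuming you don’t already have a debug server, looks like this:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    // Loopback-only, so the profiling endpoints aren't exposed publicly.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}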

The offending code looked something like this:

import "context"

type AuditLogger struct {
    events chan Event // buffered; a worker used to drain this
}

func NewAuditLogger() *AuditLogger {
    return &AuditLogger{
        events: make(chan Event, 256),
    }
}

// Record spawns a goroutine to do the send so the caller never blocks.
// Note that ctx is accepted but never consulted: once the goroutine is
// spawned, nothing can cancel the send.
func (a *AuditLogger) Record(ctx context.Context, ev Event) {
    go func() {
        a.events <- ev
    }()
}

The Record function spawns a goroutine to do the send so the caller is never blocked. That was the “feature”. Once the 256-slot buffer fills up (because nobody is reading), every subsequent Record call parks a goroutine on a send that can never complete, and a parked goroutine lives forever, stack and all. And since we log an audit event on every request, we were leaking at roughly request rate.
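
If you want to watch the mechanics in isolation, here’s a minimal standalone repro; the Event stand-in and the numbers are mine, not our service code:

package main

import (
    "fmt"
    "runtime"
    "time"
)

type Event struct{} // stand-in; the real one carries request details

type AuditLogger struct{ events chan Event }

func NewAuditLogger() *AuditLogger {
    return &AuditLogger{events: make(chan Event, 256)}
}

func (a *AuditLogger) Record(ev Event) {
    go func() { a.events <- ev }() // the fire-and-forget send
}

func main() {
    a := NewAuditLogger()
    for i := 0; i < 10000; i++ {
        a.Record(Event{})
    }
    time.Sleep(200 * time.Millisecond) // let the sends park
    // 256 sends land in the buffer and those goroutines exit;
    // the other ~9,744 block forever.
    fmt.Println("goroutines:", runtime.NumGoroutine())
}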

There are a bunch of ways to fix this. The smallest change is to accept that you’ll sometimes drop events:

// Non-blocking send: if the buffer is full, drop the event instead of
// parking a goroutine.
func (a *AuditLogger) Record(ctx context.Context, ev Event) {
    select {
    case a.events <- ev:
    default:
        // Dropped; bump a counter so the drops are at least visible.
        droppedEvents.Inc()
    }
}
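
If an immediate drop feels too aggressive, a middle ground (my sketch, not what we shipped) is to let the caller’s context bound the wait; this reuses the types from above:

func (a *AuditLogger) Record(ctx context.Context, ev Event) {
    select {
    case a.events <- ev:
    case <-ctx.Done():
        // The caller gave up before the buffer had room; count it
        // like a drop.
        droppedEvents.Inc()
    }
}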

But honestly, the real fix is to not have a hidden worker channel at all. If nothing reads it, why is it there? When I went digging, the reader had been removed in a cleanup PR; the events channel was supposed to be deleted too, but the author had missed one reference, and Go, being Go, doesn’t warn you about a channel with no receivers. Unused imports and unused local variables are compile errors, but a live channel that nobody reads is perfectly fine as far as the compiler is concerned.
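
In our case the right end state is no channel at all. One shape that could take (Sink is a hypothetical interface, not our actual code):

// No hidden channel, no spawned goroutine: Record writes through
// synchronously, and the caller finally gets to see the error.
type Sink interface {
    Write(ctx context.Context, ev Event) error
}

type AuditLogger struct {
    sink Sink
}

func (a *AuditLogger) Record(ctx context.Context, ev Event) error {
    return a.sink.Write(ctx, ev)
}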

A few lessons that I keep trying to internalize:

  1. Always alert on goroutine count, not just memory. Memory was a lagging indicator; goroutine count would have flagged this the day it started. Our Prometheus setup now scrapes go_goroutines and alerts on sustained growth over a six-hour window (there’s a wiring sketch after this list). Simple rule, has paid for itself twice.

  2. go func() { ch <- x }() is almost always wrong. If you find yourself writing this, you’re trying to fire-and-forget into a channel, which means you should either (a) use a buffered channel with a select and a drop, or (b) not use a channel at all. I’ve started treating it as a code review red flag.

  3. pprof goroutine is shockingly good. I’d been using it for months, but I’d never really read the output carefully. Each group is labeled by its stack, so 184,000 goroutines parked at the same line of code show up as a single stanza with a count in front (there’s an example after this list). It’s almost too easy once you see it.
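
On point 1: go_goroutines comes from the Prometheus client’s standard Go collector, so exporting it is a few lines. A sketch, assuming github.com/prometheus/client_golang (the port is arbitrary):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/collectors"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    reg := prometheus.NewRegistry()
    // The Go collector is what exports go_goroutines (plus GC and
    // memory stats).
    reg.MustRegister(collectors.NewGoCollector())
    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    log.Fatal(http.ListenAndServe(":2112", nil))
}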
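
On point 3: the stanza for a leak like ours looks roughly like this; the addresses, path, and exact counts are invented for illustration:

goroutine profile: total 184000
183994 @ 0x43a2d6 0x406c3d 0x406bd8 0x5b21a4 0x468fc1
#	0x5b21a3	billing/audit.(*AuditLogger).Record.func1+0x23	/go/src/billing/audit/logger.go:21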

The fix itself was tiny: the non-blocking send above. The PR took five minutes. Getting over the fact that it had been quietly eating our memory for six weeks took longer. If you’re running a Go service that you’ve been making changes to, go look at /debug/pprof/goroutine?debug=1 right now. It’ll take 30 seconds, and you might be surprised by what you find.

For anyone doing this seriously, I’ve got more notes on reading profiles in general in my post on pprof graphs. But for this specific class of bug, the goroutine profile is the only tool you need.