This bug started as a mildly flaky test. It ended as a load-bearing lesson in what “don’t close a channel from the receiver” actually means. Two weeks of my life.

The symptom

A Go service would intermittently panic with:

panic: send on closed channel

About once a day in production, randomly. No obvious trigger. The stack trace pointed at a channel used to coordinate shutdown between a worker pool and the supervisor goroutine.
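
For the record, the panic itself is trivial to reproduce in isolation. A minimal sketch (not our production code):

// Sending on a channel after it has been closed always panics with this message.
ch := make(chan int)
close(ch)
ch <- 1 // panic: send on closed channel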

The initial theory

I assumed there was a race between the supervisor closing the channel (during shutdown) and a worker sending on it. The standard fix is to have only one side close the channel, and to coordinate shutdown via a context or a separate done channel.
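
The coordination in question looks, in sketch form, something like this (hypothetical names: produce, ch):

// The supervisor closes done, a signal-only channel; workers stop sending
// when done is closed. Nobody closes the work channel from the receiving side.
done := make(chan struct{})

go func() { // worker
    for {
        select {
        case <-done:
            return
        case ch <- produce():
        }
    }
}()

// supervisor, at shutdown:
close(done)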

I added that coordination. The panic kept happening.

The second theory

I figured I must have missed a call site. Ran rg 'close\(.*channel' --type go everywhere. Found three places. One was intentional. One was dead code. One was… suspicious.

The suspicious one was in a retry wrapper that closed the channel on exhaustion of retries. Removed it. Panic kept happening.

The logging trick

At this point, in desperation, I added a log statement right before every send and every close:

log.Printf("sending on ch, goroutine=%d", runtime.Goid())
ch <- work

The panic stopped happening. Completely. I shipped the code with the logging. Panic gone for a week.

What was actually wrong

The logging was adding enough latency to hide a race. Specifically: two goroutines were simultaneously doing the close-after-drain dance, and without the logging, the close could race with an in-flight send. The logging slowed down the send path by a microsecond or two, which changed the scheduling enough to hide the race in practice.

The underlying bug: a sync.Once was being used to close the channel, but its Do was called from multiple goroutines, including one on the receive side of ch. That caller could trigger the close while another goroutine was still about to send.

The Go memory model doesn’t guarantee anything about the order of these events except within a happens-before relationship. Our code had no happens-before between “finished sending” and “closing the channel.” The Once protected against double-close, but not against close-before-last-send.
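
In sketch form (hypothetical names), the shape we had was something like this; the Once makes the close idempotent, but nothing orders it after the last send:

var closeOnce sync.Once

// Called from several goroutines, including the receiving side.
func shutdown(ch chan Work) {
    closeOnce.Do(func() {
        close(ch) // runs at most once, but possibly while a send is still in flight
    })
}

// Meanwhile a sender may still be executing:
//     ch <- w // panics if shutdown() has already run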

The actual fix

Two patterns that I now treat as non-negotiable:

  1. Never close a channel from the receiver. Only the sender closes. If there are multiple senders, coordinate so only one closes, and only after all others have committed to no more sends.
  2. Signal “stop producing” with a context, not by closing the work channel. Consumers range over ch until it is closed, but the channel is closed only after all senders have exited, and the senders exit when the context is cancelled.

Rewritten:

ctx, cancel := context.WithCancel(parent)
defer cancel() // good hygiene; the real shutdown signal is a call to cancel() (or the parent being cancelled)

ch := make(chan Work) // Work stands in for whatever the senders produce

var senders sync.WaitGroup
senders.Add(numSenders)
for i := 0; i < numSenders; i++ {
    go func() {
        defer senders.Done()
        for {
            // Note: produceWork runs when the select is evaluated, so one unit
            // of work may be produced and then dropped at cancellation.
            select {
            case <-ctx.Done():
                return // this sender has now committed to no more sends
            case ch <- produceWork(ctx):
            }
        }
    }()
}

// Exactly one closer, and it runs only after every sender has returned.
go func() {
    senders.Wait()
    close(ch)
}()

// consumers: range drains ch and exits once it is closed
for w := range ch {
    handle(w)
}

This pattern closes the channel exactly once, after all sends are guaranteed complete. The ctx.Done() is the signal for “stop producing.” The close(ch) is the signal for “no more work, exit the range loop.”

The tooling

I wasted several days before I reached for go test -race. I was running it in CI, but my local runs weren’t using it. -race would have caught this in an afternoon. I now have a make test-race target and a pre-commit hook reminder.

go test -race -timeout 60s ./...

The race detector is one of the best debugging tools in Go. It finds these things. Use it.
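
A hypothetical reduction of the bug as a test: under go test -race it should either report the unsynchronized close/send pair or, on an unlucky schedule, panic outright.

func TestCloseVersusSend(t *testing.T) {
    ch := make(chan int, 1)
    done := make(chan struct{})
    go func() {
        defer close(done)
        ch <- 1 // nothing orders this send before the close below
    }()
    close(ch) // receiver-side close racing with the send
    <-done
}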

Reflection

The “my bug went away when I added logging” experience is humbling every time. The instinct is to say “it’s fine now.” It’s never fine now. It’s fine until something else changes — a scheduler update, a different goroutine count, a CPU with different memory ordering — and then it’s catastrophic.

The honest work is to find why it went away. In this case, adding logging changed goroutine scheduling enough to avoid the race. The race was still there. Ship logging as a fix and you’ve just put the bug on a delay timer.

Lesson number two: close channels from the sending side. Every Go tutorial says this. I’ve internalized it more deeply now.

Related: Flaky tests triage workflow.