Structured logging lessons from four years of zerolog
We’ve been running zerolog in production Go services for about four years. Structured logging sold itself. Every log line is JSON, fields are typed, parsing is free, Kibana queries are sane. What nobody told me: you don’t get the benefit for free. You get it through discipline.
What I’d do the same
Typed fields everywhere. logger.Info().Str("user_id", id).Dur("latency", d).Msg("request complete") is a lot nicer than log.Printf("request complete user=%s latency=%v", id, d). Consumers can filter on latency > 1s without regex. Field names standardize over time.
A single logger with a context. We inject a *zerolog.Logger into every request context. Every log line from that request is tagged with request ID, tenant ID, and route. You can filter “everything that happened during this request” trivially. This is the most-used debugging feature we have.
Log levels used properly. DEBUG for “interesting to developers during feature work, muted in prod.” INFO for “normal operation events worth capturing.” WARN for “something is wrong but we handled it.” ERROR for “something is wrong and we didn’t handle it.” No level shouting match.
What I’d do differently
Don’t log inside tight loops. Even with structured logging, the cost of logger.Debug().X().Msg(...) is not zero. In a hot path we had a per-iteration Debug call that, once we enabled debug logging for troubleshooting, added ~15% CPU overhead. The fix was a sampled logger — only log 1 in 100 iterations.
// emit roughly 1 in 100 iterations (rand is math/rand)
if rand.Intn(100) == 0 {
	logger.Debug().Int("iter", i).Msg("progress")
}
Sampling debug in prod is essential. We eventually wrapped zerolog in a “sampled” logger that emits at most N messages per minute per log message template. Prevents accidental log explosions when a hot path starts erroring.
Don’t log full request/response bodies by default. One team decided it was helpful to dump POST bodies on errors. Worked great until someone sent a 40MB PDF upload and we got 40MB log lines that crashed our log shipper. We now truncate any field larger than 2KB, and it’s a pattern we enforce via a linter.
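The truncation guard itself is only a few lines; a sketch (the 2KB limit matches ours, the helper name is illustrative):

```go
package main

import "fmt"

// maxFieldBytes caps any single log field; larger values are cut down
// before they reach the logger.
const maxFieldBytes = 2048

// truncateField caps a string field, keeping the original size in a
// marker so the log line still tells you what was dropped.
func truncateField(s string) string {
	if len(s) <= maxFieldBytes {
		return s
	}
	return fmt.Sprintf("%s...(truncated, %d bytes total)", s[:maxFieldBytes], len(s))
}
```

Call sites wrap any unbounded field, e.g. logger.Error().Str("body", truncateField(body)).Msg("bad request"), and the linter flags request-derived fields that skip the wrapper.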
Context key conventions. We started without conventions. Every service decided whether to use user_id or userId or uid. This haunted us for a year as we tried to write dashboards that worked across services. We now have a document called log-fields.md with the canonical names and types. New fields require a PR.
The log volume surprise
Structured logs are more verbose than line-based logs. JSON has overhead. Four years of “add a field just in case” adds up. Our log volume grew faster than our traffic did.
We now have a quarterly “log field review” where we look at field usage from the last 90 days and drop fields that nobody filters on. The rule is: if no dashboard or saved query in the last 90 days used a field, it’s deadweight. Removing it is rarely controversial.
Correlation with traces
zerolog + OpenTelemetry + a trace ID field means you can click from a slow trace to its log lines. This turned out to be bigger than I expected. The mental shift from “find the traces” to “find the logs from the traces” is a real productivity win.
// attach the active span's IDs so every log line can be joined with its
// trace (trace is go.opentelemetry.io/otel/trace)
span := trace.SpanFromContext(ctx)
logger = logger.With().
	Str("trace_id", span.SpanContext().TraceID().String()).
	Str("span_id", span.SpanContext().SpanID().String()).
	Logger()
Put this in a middleware. Wire it through every request. Future-you will thank current-you.
The one thing I still don’t love about zerolog
The builder pattern (logger.Info().Str(...).Msg(...)) is easy to get wrong. If you forget .Msg() at the end, nothing is logged. If you call .Msg() twice, you get two lines (zerolog disposes the event after the first Msg, so a second call is technically undefined; duplicates are what we saw). The API is more error-prone than a plain function call would be.
// bug: no Msg at the end, this line is silently dropped
logger.Info().Int("count", n)
// bug: two Msg calls on one event; zerolog disposes the event after
// the first, so the second call is undefined (here it fires again)
evt := logger.Info().Int("count", n)
evt.Msg("starting")
evt.Msg("starting") // this also fires
We have a linter that catches the first, not the second. Still, the pattern is there. Newer loggers (slog in the standard library, charmbracelet/log, etc.) use a function-call API that’s less error-prone. If I were starting fresh today I might pick slog.
Reflection
Structured logging is not about the logs, really. It’s about the ability to ask aggregate questions of your log data without string-parsing. The discipline around field naming, log volume, sampling, and correlation is what makes that ability real. Without the discipline, you have JSON-shaped garbage instead of text-shaped garbage. The shape isn’t the point; the queryability is.
Related: SLO math for tired engineers.