docs/design.md

# Design

lambdalog is a small structured logger. This document is the "why":
why it exists, why it doesn't reuse slog or zerolog or zap, and how
its threading model is shaped by AWS Lambda.

If you want to read code, start at
[`lambdalog.go`](/src/lambdalog/lambdalog-go/) and
[`context.go`](/src/lambdalog/context-go/). For cold-start specifics,
see [docs/cold-start.md](/src/lambdalog/docs-cold-start-md/).

## The problem

The specific nuisance: Lambda bills in 1 ms increments of
wall-clock time, so a function that spends 4 ms formatting log
records per invocation costs you an extra 4 ms * (invocation
count) every month. At the scale I was running when I wrote this
(a few hundred million invocations per month), that rounds to
real money and real p99 latency.

I profiled with `-cpuprofile` inside a Lambda and found:

- `json.Marshal` on log records: ~40% of logger CPU
- `runtime.mallocgc` from string concatenation: ~25%
- `sync.Mutex.Lock` on `os.Stdout` when two goroutines logged
  concurrently: ~15%
- attribute map copies: ~10%

Three of those four are fixable by a library; one is a Go standard
library decision. The pragmatic path was: build a logger that avoids
allocations in the hot path, writes directly to a buffered writer,
and picks a concurrency model suited to the Lambda runtime.

## Why not slog

`log/slog` (Go 1.21+) is the obvious candidate. It is well-designed
and I use it outside Lambda. For this workload:

- Its attribute representation (`Attr`) boxes values into an
  `interface{}`. Fine for normal servers, a measurable hit at
  Lambda scale.
- The `TextHandler` and `JSONHandler` use `sync.Pool`ed encoders,
  which helps, but they still allocate on cold paths.
- Request-ID correlation is not built in. You have to wire it
  through `slog.With` on every handler, which is exactly the thing
  I wanted to make automatic via `FromContext`.

I started by writing a `slog.Handler` that did the right thing.
After a while it was 70% of a from-scratch logger with 30% of the
performance, and I deleted it. slog is a fine target for users who
want a drop-in replacement; if someone wants to write a
`slog.Handler` backed by lambdalog, the plumbing would be small and
I'd accept the patch.

## Why not zap or zerolog

Both are excellent and both optimised for high-throughput servers.
The reasons they don't fit:

- **zap**: its field-typing approach (`zap.String("k", v)`) is
  zero-alloc and beautiful, but the API surface is large and the
  cold-start cost of wiring it up is non-trivial. `Sugar` gives a
  looser API at an allocation cost. In a Lambda you want a small
  binary and minimal init work at cold start.
- **zerolog**: builder-pattern API (`log.Info().Str("k",v).Msg("...")`)
  is ergonomic and fast, but the builder relies on `bytes.Buffer`
  pools that don't always win inside Lambda's constrained runtime.
  Also, its JSON is mostly-but-not-quite what I want (it double-
  encodes floats differently from stdlib).

Both are great when you have a long-running server. lambdalog's
niche is "the process lives for 100 ms to 15 minutes and is
restarted constantly." That shapes everything.

## What lambdalog optimises for

- **Predictable cost per log record.** 0 allocations on the happy
  path when attributes are primitives. One allocation for the line
  itself if you are unlucky with buffer sizing.
- **Request-ID correlation that is free to use.** A
  `Logger.FromContext(ctx)` call pulls the Lambda request ID from
  the context once and caches it on the returned `*Logger`.
- **Adaptive sampling.** Gives you a knob when a log statement
  misbehaves; see `sampler.go`.
- **No reflection.** The hot path writes primitives with a small
  set of helpers (`appendString`, `appendInt`, `appendBool`,
  `appendFloat64`).
- **ISO-8601 timestamps with nanoseconds.** Matches what
  CloudWatch's logs ingestion wants; one less transformation
  downstream.

## API shape

The exported API is small:

    type Logger struct { ... }

    func New(w io.Writer) *Logger
    func (l *Logger) With(key string, value any) *Logger
    func (l *Logger) FromContext(ctx context.Context) *Logger

    func (l *Logger) Debug(msg string, kv ...any)
    func (l *Logger) Info(msg string, kv ...any)
    func (l *Logger) Warn(msg string, kv ...any)
    func (l *Logger) Error(msg string, kv ...any)

    func (l *Logger) Sampled(tag string, oneIn int) *Logger

Everything is a method on `*Logger`. `With` returns a new logger
with an extra attribute copied onto a new map; we deep-copy to
avoid tearing across goroutines (`2c14b07`). `FromContext` is the
one place where we pull data out of the context; users don't need
to reach for `context.Value` themselves.

The `kv ...any` variadic is intentionally untyped. slog-like
typed attribute helpers are a bigger API; if you want them, wrap
`lambdalog` in your own helpers.

## Threading model

A Lambda container serves one invocation at a time by default.
Provisioned concurrency containers serve one invocation at a time
each. Standard concurrency gives you multiple containers, each
with one. There is effectively no intra-process concurrency for
most Lambdas, which changes a lot.

lambdalog does two things that reflect this:

1. **`os.Stdout` serialization.** Lambda writes to stdout, which
   is captured by the runtime and forwarded to CloudWatch. Two
   concurrent writers produce interleaved bytes. We guard stdout
   with a single `sync.Mutex` in the `Logger`, and write in full
   records per lock acquisition. This is the "contention" I saw
   in the profile - eliminated by batching into a 4 KiB scratch
   buffer before writing.
2. **Buffered writer.** `logger.w` is a `bufio.Writer` sized at
   4 KiB by default. For most Lambdas that is larger than the per-
   invocation log volume, and we get one syscall per invocation
   rather than one per record. The buffer flushes on every `Warn`
   and `Error` so important records aren't held.

This is overkill for a CLI tool and underkill for an nginx
replacement. It is right-sized for Lambda.

## Context correlation

`context.go` knows two things:

- How to extract the Lambda request ID from a `context.Context`.
  The `aws-lambda-go/lambdacontext.FromContext` function returns a
  struct with an `AwsRequestID` field. We pull it and attach it
  as the `rid` field on the child logger.
- How to deep-copy the attribute map when creating a child logger.

Why deep-copy instead of pointer-share? Because `FromContext` may
be called from a goroutine that then further `With`-es the
logger. If two goroutines share the underlying map, they see each
other's writes. Deep-copying is one allocation per `With` call;
cheap compared to the alternative bug.

## JSON format

One object per line. Stable field order: `ts`, `lvl`, `msg`, then
any attributes in insertion order. Attributes with the same name
as a reserved field are suffixed with `_1`, `_2`, etc., so a user
who writes `log.Info("x", "ts", ...)` doesn't stomp the timestamp.

Fields:

- `ts` - ISO-8601 with nanoseconds, UTC
- `lvl` - `"debug"`, `"info"`, `"warn"`, `"error"`
- `msg` - the first positional argument
- `rid` - (optional) Lambda request ID if `FromContext` was used
- `service`, others - user-supplied via `With`

Strings are UTF-8, encoded with stdlib-equivalent escaping rules
implemented in `appendString` without going through `json.Marshal`.
Floats use `strconv.AppendFloat` with `'g'` + precision `-1`.

## Sampling

`Sampled(tag, oneIn)` wraps the logger such that calls to `Info`
(and above) on high-QPS sites are decimated. The counter lives in
a sharded atomic map keyed by `tag`; contention is low because
each tag is its own shard. See
[`sampler.go`](/src/lambdalog/sampler-go/).

The adaptive piece (`9cc1270`): when `tag` sees more than
`threshold` events in a rolling window, the sampler activates and
starts dropping records down to the 1-in-N ratio. When the rate
falls back, sampling turns off. This way, normal operation logs
everything; a runaway log statement during an incident doesn't
flood CloudWatch.

Sampler state is per-container. Cold start resets it. That is the
behaviour I want (`8f5d11a`) - each new container starts with
fresh observation.

## What lambdalog does not do

- **No file output.** Lambda writes to stdout. If you need file
  output, use something else.
- **No log level filtering by package.** You can filter by tag via
  the sampler, but global level is the only hierarchical control.
  Lambda functions are small; I have not needed more.
- **No auto-redaction.** There are many opinionated libraries that
  do this. It should be composable: write a small `io.Writer`
  wrapper that redacts, and pass it to `New`.

## Testing

`lambdalog_test.go` hits every code path including the JSON
output. Benchmarks live next door (`8f5d11a` kept them fast; `f3b8e55`
fixed a flaky timing test by moving it to a subtest).

I also run a fuzz test (`go test -fuzz=FuzzRecord`) over attribute
combinations to catch JSON escaping bugs; it runs for an hour in a
periodic CI job.

## Roadmap

- A `slog.Handler` wrapper so you can use lambdalog from slog-based
  codebases without rewriting.
- A `cloudwatch` helper that also emits EMF metrics (CloudWatch
  Embedded Metric Format) from log records. I have a draft.
- `cold-start.md` for the tricks I use to keep init cheap.

All of the above are additive. The v0.2 API is frozen (`71ae9d4`)
and I'm not going to break it.