docs/cold-start.md

# Cold Start

Every feature in lambdalog is measured against its cold-start cost.
This document is a list of the tricks the library uses to keep that
cost close to zero, plus the trade-offs behind each one.

For the design rationale, see
[docs/design.md](design.md).

## What "cold start" means

A Lambda cold start is the wall-clock time from "container created"
to "handler function first called". It is billed. For Go Lambdas
running on AL2, a reasonable baseline is 120-200 ms. Any library
you import is on that critical path.

I care about two numbers:

1. **Added init time.** The milliseconds that importing lambdalog
   adds to init.
2. **Added first-log time.** The time from `lambdalog.New()` to the
   first emitted record, which affects the first invocation after a
   cold start.

Target: both under 0.5 ms on my laptop, well under a millisecond
on Lambda's runtime.

## Measurement

    go test -run NONE -bench BenchmarkNew -benchmem ./...

Inside Lambda I set `LAMBDALOG_BENCH_COLD_START=1` and log the time
from init to first write. That value is the ground truth - bench
numbers are indicative.

## The tricks

### 1. No init functions

`lambdalog` exports no `init()` and imports nothing beyond the
standard library. Package-level imports aren't free, even when each
is fast, because the runtime initialises every one of them before
calling your handler. I pay that cost only for the stdlib imports I
can't avoid (`encoding/json` appears in tests, not on the main path)
and for nothing else.

    $ go list -deps . | wc -l
    18

18 packages in total. Most are stdlib. Nothing is reflective.

### 2. Lazy buffer allocation

The 4 KiB scratch buffer in `*Logger` is allocated on first write,
not at `New`. Cold starts that never log (health checks from AWS,
for instance) don't pay for the buffer.

    func (l *Logger) scratch() *bytes.Buffer {
        if l.buf == nil {
            l.buf = bytes.NewBuffer(make([]byte, 0, 4096))
        }
        return l.buf
    }

There is a subtle race here that I handle with a once-guard in
concurrent code paths. Lambda's usual single-invocation model
means the guard is rarely contended.
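A sketch of that once-guard, with the struct trimmed down and field names assumed rather than taken from lambdalog's actual source:

```go
package main

import (
	"bytes"
	"sync"
)

// Logger is a minimal stand-in; the real struct carries more fields.
type Logger struct {
	once sync.Once
	buf  *bytes.Buffer
}

// scratch allocates the 4 KiB buffer on first use. sync.Once makes
// the lazy allocation safe if two goroutines race to the first write,
// at the cost of one atomic load on the fast path.
func (l *Logger) scratch() *bytes.Buffer {
	l.once.Do(func() {
		l.buf = bytes.NewBuffer(make([]byte, 0, 4096))
	})
	return l.buf
}

func main() {
	l := &Logger{}
	_ = l.scratch()
}
```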

### 3. Compiled timestamp format

The timestamp layout is a constant string; the formatter uses
`time.Time.AppendFormat` against a preallocated byte slice. No
`Sprintf`, no reflection. This landed in commit `d7a0b13`, which
moved from `time.Format` to `AppendFormat` and saved one allocation
per record.

### 4. No reflection in hot paths

The `appendValue` switch handles the small set of types that appear
in practice: string, int (all widths), uint (all widths), bool,
float64, time.Time, []byte, error. Everything else falls through to
`fmt.Sprintf("%v", v)` and does allocate - but only for the
unusual case, which I explicitly do not optimise.
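A trimmed sketch of such a switch, covering a subset of the integer widths; this is the shape of the idea, not lambdalog's exact code:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// appendValue formats v into dst without reflection for the common
// types; anything else falls back to fmt.Sprintf and allocates.
func appendValue(dst []byte, v any) []byte {
	switch x := v.(type) {
	case string:
		return strconv.AppendQuote(dst, x)
	case int:
		return strconv.AppendInt(dst, int64(x), 10)
	case int64:
		return strconv.AppendInt(dst, x, 10)
	case uint64:
		return strconv.AppendUint(dst, x, 10)
	case bool:
		return strconv.AppendBool(dst, x)
	case float64:
		return strconv.AppendFloat(dst, x, 'g', -1, 64)
	case time.Time:
		return x.AppendFormat(dst, time.RFC3339Nano)
	case []byte:
		return append(dst, x...)
	case error:
		return strconv.AppendQuote(dst, x.Error())
	default:
		return append(dst, fmt.Sprintf("%v", x)...) // rare, allocating path
	}
}

func main() {
	fmt.Println(string(appendValue(nil, 42))) // 42
}
```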

### 5. Sampler state reset on cold start

The sampler keeps per-tag counters in a sharded map. On cold start
we want a fresh slate so the new container starts observing from
zero, not carrying state over from a container that no longer
exists. `8f5d11a` was the commit that made this explicit; before
then, Go's zero value happened to be right but the intent wasn't
obvious.

    func (s *Sampler) resetOnColdStart() {
        s.window.Reset()
        s.activeTags.Range(...)  // zero counters
    }

This runs at most once per container life.
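One concrete way to write that reset, assuming the counters are stored as `*int64` values in a `sync.Map`; the field names are guesses for illustration, not lambdalog's source:

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Sampler is a stand-in; the real type also carries the window state.
type Sampler struct {
	activeTags sync.Map // tag string -> *int64 counter
}

// resetOnColdStart zeroes every per-tag counter so the new container
// observes from a clean slate. Runs at most once per container life.
func (s *Sampler) resetOnColdStart() {
	s.activeTags.Range(func(_, value any) bool {
		atomic.StoreInt64(value.(*int64), 0)
		return true
	})
}

func main() {
	s := &Sampler{}
	n := int64(7)
	s.activeTags.Store("checkout", &n)
	s.resetOnColdStart()
}
```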

### 6. No logger singleton

I considered exposing a package-level `lambdalog.Default()` that
memoised a logger. Decided against it: a singleton means that the
initial `With("service", ...)` call either (a) mutates the
singleton, which has concurrency implications, or (b) creates a
clone on every call, which defeats the memoization.

Instead, the idiomatic pattern is:

    var logger = lambdalog.New(os.Stdout).With("service", "checkout")

    func Handle(ctx context.Context, ev Event) error {
        log := logger.FromContext(ctx)
        ...
    }

The top-level `logger` value is initialised once per container,
during package initialisation, before the handler runs. We don't
need a helper for it.

### 7. Minimal imports

The library's total dependencies:

    stdlib only

No `github.com/...` imports. Every extra import adds compile time,
binary size, and cold-start cost. The decision not to pull in
`aws-lambda-go` as a dependency was deliberate: lambdalog reads from
the `context.Context` that `aws-lambda-go` populates, but never
imports its packages.

### 8. Buffered writer flush policy

The `bufio.Writer` flushes on:

- `Warn` and `Error` records (so tail latencies on important logs
  are bounded)
- `Logger.Flush()` called explicitly
- buffer fill (4 KiB)

It does not flush on every `Info` or `Debug`. That lets most
invocations emit all their logs in a single syscall to stdout.
Lambda's stdout capture is cheap per call, but fewer syscalls are
faster still.

## What I consciously don't optimise

- **Extracting the Lambda request ID** is a context lookup plus a
  type assertion. ~100 ns. Not worth caching at package level.
- **UTF-8 validation.** stdlib `strconv.AppendQuote` handles it, and
  I don't second-guess it. If this ever showed up in a profile, I'd
  switch to `json.HTMLEscape`-style manual escaping; it hasn't.
- **Attribute map size.** The map starts at zero capacity. If you
  add 100 attributes, you pay for every rehash along the way. I
  consider 100 attributes a user error.

## Benchmarks (as of 2025-03-04)

On my laptop, release build:

    BenchmarkNew-8               5000000    210 ns/op     0 B/op    0 allocs/op
    BenchmarkInfo_Primitive-8    2000000    502 ns/op     0 B/op    0 allocs/op
    BenchmarkWith-8              3000000    340 ns/op   192 B/op    1 allocs/op
    BenchmarkFromContext-8       5000000    150 ns/op     0 B/op    0 allocs/op

`With` is the one that allocates, because it deep-copies the
attribute map (`2c14b07`). The alternative - shared map with
reference counting - was not worth the complexity.

## Lambda-observed numbers

From a production Go 1.22 Lambda over one week, init breakdown:

- Go runtime init: ~70 ms
- `lambdalog` package init: < 1 ms
- First `logger.Info` call: 0.4 ms p50, 1.1 ms p99

The p99 is where the buffer gets allocated. It is an acceptable
tail and I have not tried to flatten it further.

## Summary

The cold-start story is boring, which is exactly the goal. The only
interesting line of code is the sampler reset, and even that
amounts to "zero some counters." Everything else is just the normal
Go practice of avoiding reflection and unnecessary work.