The label that killed Prometheus
Friday afternoon. Someone merged a PR that added a request_id label to our HTTP histogram. By 6pm the Prometheus server was OOM-ing every 20 minutes, our alerting pipeline was down, and I was eating cold noodles in front of a terminal.
What went wrong in one line
Every unique request_id creates a new time series. One for every bucket of the histogram, plus the _sum and _count series. The math goes off a cliff fast.
The math, in detail
Before the PR, the histogram looked like:
http_request_duration_seconds_bucket{
method="GET", route="/users/:id", status="200", le="0.005"
}
Cardinality of unique label combinations: roughly methods * routes * statuses * buckets = 5 * 80 * 10 * 12 = ~48k active series. Comfortable.
After the PR, request_id joined that list. Request IDs are, by construction, unique per request. We do maybe 20 requests per second per instance, times however many instances were serving traffic. Active series exploded to ~18M in the first hour.
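Back-of-envelope, using the bucket layout above: each request ID mints 14 series (12 buckets plus _sum and _count), so 20 requests per second is 20 * 3600 * 14 ≈ 1M brand-new series per hour from a single instance. Multiply by however many instances are reporting and 18M stops looking mysterious.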
Prometheus ingest rate was fine. Storage was fine. The WAL started growing, but not alarmingly. What died was the TSDB head index, the in-memory structure that maps label values to series references. At some point it outgrew physical RAM and head compaction fell permanently behind.
How we saw it (eventually)
The first symptom was a query timeout on an unrelated alert. Then the UI started hanging. Then OOM. I kept hitting /metrics on prometheus itself and eventually spotted:
prometheus_tsdb_head_series 18284112
prometheus_tsdb_head_series_created 1923471023
prometheus_tsdb_head_truncations 14
18M head series. For context, we had been running at about 180k. Two full orders of magnitude.
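If you are hunting the culprit rather than just the symptom, a few ad-hoc PromQL queries narrow it down quickly. The metric and label names below are the ones from this incident; the queries are just the generic cardinality-hunting ones I'd reach for:

# which metrics own the most head series right now (expensive on a struggling server)
topk(10, count by (__name__) ({__name__=~".+"}))

# how many series the suspect histogram holds
count(http_request_duration_seconds_bucket)

# how many distinct values the suspect label has
count(count by (request_id) (http_request_duration_seconds_bucket))

Newer Prometheus versions also surface the top label names and series counts under Status > TSDB Stats in the UI, though by the time I thought of that our UI was already hanging.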
Mitigation
Step one was to drop the offending series at the ingest side. We weren't using remote-write, but we were using a VictoriaMetrics agent for intake. A metric relabeling rule dropped everything from that histogram that carried the toxic label:
metric_relabel_configs:
  - source_labels: [__name__, request_id]
    regex: "http_request_duration_seconds_.*;.+"
    action: drop
This doesn’t retroactively remove the series from the existing TSDB — they still had to age out. We took the downtime, deleted the head block, and restarted:
# warning: lose unflushed data
systemctl stop prometheus
rm -rf /var/lib/prometheus/wal /var/lib/prometheus/chunks_head
systemctl start prometheus
That got us back online, minus the last ~2 hours of metrics.
The permanent fix
Three changes:
Code review checklist: no unbounded labels. We document “unbounded” as anything that can have more than ~1000 distinct values over the life of the service.
Prometheus config with sample_limit and label_limit to hard-stop runaway ingest:

scrape_configs:
  - job_name: api
    sample_limit: 50000
    label_limit: 15
    label_value_length_limit: 64

An alert on prometheus_tsdb_head_series growth rate (sketched below). If it doubles in an hour, page. We'd have caught this in 30 minutes instead of 3 hours.
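Here is roughly what that alert looks like as a Prometheus rule file. The group and alert names, the 10-minute for, and the severity label are my choices, not anything canonical:

groups:
  - name: cardinality
    rules:
      - alert: HeadSeriesDoubledInAnHour
        expr: prometheus_tsdb_head_series > 2 * (prometheus_tsdb_head_series offset 1h)
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "TSDB head series doubled in the last hour on {{ $labels.instance }}"

On our numbers that fires long before the server is in real trouble: doubling from ~180k to ~360k is noisy but survivable, 18M is not.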
What belongs in labels and what belongs in traces
The real lesson is an old one, but I was the one who had to re-learn it: per-request identifiers belong in traces, not metrics. Metrics are for dimensions you care about in aggregate. If a label's value is only meaningful at the individual-request level, it's trace data.
The PR author was trying to debug a specific user’s behavior. That’s a valid thing to want. The right tool was to add request_id to the trace span attributes, not to a histogram label. We went back and did that, and it works fine.
Reflection
I hate that I didn’t catch this in review. The PR looked innocuous. “Add request_id to the latency metric” is the kind of line you skim past. We’re now running a linter that flags any request_id, user_id, trace_id, session_id label addition to a Prometheus metric and requires explicit justification.
Related: "Tail sampling that actually saved money" was the sibling project for tracing.