Tail sampling that actually saved money
Our tracing bill was hurting: about $14k/month for span ingest that was 95% noise — health checks, successful CRUD on internal admin routes, warmup traffic, the usual. Head sampling at 10% helped, but it dropped the rare slow traces we actually needed. Classic problem.
We moved to tail sampling. It’s not a trivial setup, but it cut the ingest cost and improved our signal-to-noise.
Head vs tail in one paragraph
Head sampling makes the keep/drop decision when a trace starts. Cheap, simple, uncorrelated across services. Tail sampling waits until the trace is complete (or mostly complete) and then decides. Expensive to implement, but you can sample based on things like “was there an error?” or “did any span take more than 500ms?” — things you can’t know at trace start.
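For contrast, head-style sampling can live in the collector too. A minimal sketch using the collector’s probabilistic_sampler processor, which decides per trace ID with no knowledge of latency or errors (the 10% mirrors our old rate):

```yaml
processors:
  # Head-style: keep ~10% of traces, decided by a hash of the trace ID.
  # Consistent per trace, but blind to latency and errors.
  probabilistic_sampler:
    sampling_percentage: 10
```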
Our config
We run the OpenTelemetry Collector’s tail_sampling processor. Roughly:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 2000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: rare-tenants
        type: string_attribute
        string_attribute:
          key: tenant.tier
          values: [enterprise]
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```
Four policies, OR’d together:
- Keep all errors.
- Keep all slow traces.
- Keep everything from enterprise tenants (we want to debug for them even when things look fine).
- Keep 5% of the rest as a baseline.
The rest get dropped at the collector, before they hit the backend.
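For completeness: the processor only does anything if it sits in the traces pipeline. A minimal wiring sketch, with placeholder receiver and exporter names:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # tail_sampling must see every span before the exporter does;
      # batch goes after it so the surviving spans still get batched.
      processors: [tail_sampling, batch]
      exporters: [otlp]
```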
The math
We were ingesting about 40k spans/sec at peak. Head sampling at 10% got that down to 4k. Tail sampling gets us to about 3.5k on average, but with 100% error coverage and 100% slow-trace coverage. The 5% baseline accounts for most of the remaining volume — at peak, 5% of 40k is 2k spans/sec on its own.
Cost dropped from ~$14k/month to ~$2.4k/month. That paid for the engineering time in its first week.
Things that made this harder than the docs suggested
Decision wait vs. long-running traces. We set decision_wait: 10s because most of our traces complete well under that. A small tail of long-running traces (background jobs, streaming responses) doesn’t finish in time, so their sampling decisions get made without full information. We handle those separately — background-job spans bypass tail sampling entirely and go through a head sampler with a higher rate; one way to wire that up is sketched below.
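A sketch of that split using the collector’s routing connector — the background-worker service name is a placeholder, and the routing statements here match on resource attributes:

```yaml
connectors:
  routing:
    default_pipelines: [traces/tail]   # everything else gets tail-sampled
    table:
      - statement: route() where attributes["service.name"] == "background-worker"
        pipelines: [traces/jobs]       # job spans skip tail sampling

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [routing]
    traces/tail:
      receivers: [routing]
      processors: [tail_sampling, batch]
      exporters: [otlp]
    traces/jobs:
      receivers: [routing]
      processors: [probabilistic_sampler, batch]  # head-style, higher rate
      exporters: [otlp]
```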
Memory. num_traces: 100000 sounds generous, but each trace holds all its spans in memory until the decision is made. We had to bump the collector to 8GB before it stopped getting OOM-killed. YMMV but size it realistically.
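Related: put the memory_limiter processor first in the pipeline so the collector sheds load gracefully instead of getting OOM-killed. A sketch sized for an 8GB box — the numbers are illustrative, not tuned values:

```yaml
processors:
  # Goes first in the processors list; refuses data before the OOM killer acts.
  memory_limiter:
    check_interval: 1s
    limit_mib: 6800        # hard limit, safely under the 8GB container limit
    spike_limit_mib: 1400  # burst headroom; soft limit is limit_mib minus this
```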
Trace fragmentation across collectors. Tail sampling needs all spans for a trace to land on the same collector. We use a load balancer in front of our collectors that hashes on trace_id, which was fine once we configured it. If your spans are fragmented across multiple collectors the sampling decision is inconsistent and you get partial traces, which is worse than useless.
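We use an external load balancer, but if you’d rather keep this inside OpenTelemetry, the collector’s loadbalancing exporter does the trace-ID affinity for you: a first tier of dumb collectors fans out to the tail-sampling tier. A sketch (the DNS name is a placeholder):

```yaml
exporters:
  # First tier: send every span of a given trace to the same
  # second-tier (tail-sampling) collector, keyed on trace ID.
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: tail-collectors.internal  # resolves to the tail-sampling tier
```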
Probabilistic “baseline” is not literally a baseline. The probabilistic policy is evaluated independently like every other policy — it keeps roughly 5% of all traces by trace-ID hash — but erroring and slow traces were getting kept regardless, so the volume it adds is roughly 5% of the traces no other policy caught. For cost modeling, think effective keep rate ≈ exceptions + 0.05 × (1 − exceptions), not exceptions plus five points across the board.
The trap I almost walked into
Initially I configured a policy “keep 100% of traces with any span slower than 500ms” and a policy “keep 100% of traces with any error.” Then I tried to add “drop everything under 50ms.” Policies in tail_sampling don’t compose that way — they’re OR’d and evaluated independently, and each one can only vote to keep. My “drop fast traces” policy was a no-op because there’s no way to express “drop this trace unless some other policy kept it.”
The workaround is to think in terms of “what is my baseline keep rate, and what are my exceptions to the baseline.” Don’t try to drop things; just don’t keep them.
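One nuance: the contrib processor does ship an and policy type for combining keep-conditions — “keep enterprise traces, but only the slow ones,” say — but the result is still just another keep vote; it can’t veto a keep from a different policy. A sketch, not our config:

```yaml
policies:
  - name: enterprise-and-slow
    type: and
    and:
      and_sub_policy:
        - name: enterprise
          type: string_attribute
          string_attribute:
            key: tenant.tier
            values: [enterprise]
        - name: slow
          type: latency
          latency: { threshold_ms: 500 }
```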
Reflection
The real win here wasn’t the cost. It was that our on-call stopped missing signal. Head sampling had been dropping the rare hard cases for years, and we’d been covering for it with overly defensive logging. Now, we drop the boring stuff and trust the tracing for the rest.
Related: High cardinality in Prometheus was what pushed us to tighten metric cardinality too.