docs/performance.md

# Performance

This is not a benchmarking paper. It's a record of how I measure
ripgrab so I can tell when a change makes things better or worse,
plus some numbers from my laptop for context.

## What I care about

Three numbers, in this order:

1. **Added latency per matching line.** How long between "kernel
   delivers a write event" and "renderer writes to stdout". This is
   the number a user actually feels.
2. **Throughput at the filter.** Lines per second the filter can
   evaluate. Matters when a log is bursting.
3. **Memory high-water mark.** Small, bounded, no leaks. I don't run a
   log tailer that grows over days.

I do not optimise for "cold start time" - ripgrab starts in ~5 ms
and that's fine.

## Setup

- ThinkPad X1 Carbon Gen 11
- 13th gen i7, 16 GB RAM
- Linux 6.11, btrfs, NVMe SSD
- Rust 1.79 stable (MSRV is 1.76)
- Release build: `cargo build --release`

Numbers are averages of five runs, machine otherwise idle. All
tests write synthetic log lines of 120 bytes. Real logs vary a lot
in per-byte cost, so treat these as "orders of magnitude" rather
than authoritative.

## Benchmarks

### 1. Pure throughput

    cargo build --release
    yes "2025-06-09T12:34:56Z INFO rid=deadbeef ms=12 path=/foo" \
      | head -n 5000000 > /tmp/big.log
    time ./target/release/ripgrab --no-follow --match 'INFO' /tmp/big.log \
      > /dev/null

    real    0m1.94s
    user    0m1.82s
    sys     0m0.11s

That's ~2.6 M lines/s piped through a single `--match` pattern.
Replacing the include with a `RegexSet` of 10 patterns (`--match`
repeated) drops it to ~1.9 M lines/s.
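
For context, the include-set path is plain `regex::RegexSet`. A
minimal sketch of the shape (function names invented, not ripgrab's
actual code):

    use regex::RegexSet;

    // Hypothetical shape of the multi-pattern include path: all
    // --match patterns compile into one RegexSet, and each line is
    // tested in a single pass.
    fn build_includes(patterns: &[String]) -> RegexSet {
        RegexSet::new(patterns).expect("patterns validated at startup")
    }

    fn line_matches(set: &RegexSet, line: &str) -> bool {
        // We only need "did anything match", not which pattern did.
        set.is_match(line)
    }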

### 2. Extract

    time ./target/release/ripgrab --no-follow \
      --extract 'rid=(?P<rid>\w+) ms=(?P<ms>\d+)' /tmp/big.log \
      > /dev/null

    real    0m3.81s
    user    0m3.62s
    sys     0m0.14s

Capture allocation dominates. Every matched line does two
`to_string()` calls inside the extractor. I have tried swapping to
`&str` slices of the input buffer but the lifetimes got ugly and
the win was only ~15%. Parked.
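
For concreteness, the hot path looks roughly like this (a sketch
with invented names - `Row`, `extract_row` - not the real
extractor):

    use regex::Regex;

    // Illustrative only: each matched line turns its two named
    // captures into owned Strings. These are the per-line
    // allocations that dominate this benchmark.
    struct Row {
        rid: String,
        ms: String,
    }

    fn extract_row(re: &Regex, line: &str) -> Option<Row> {
        let caps = re.captures(line)?;
        Some(Row {
            rid: caps["rid"].to_string(),
            ms: caps["ms"].to_string(),
        })
    }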

### 3. Tail latency

With a fresh file being appended to by a Python script at 10 kHz:

    # writer: stdout-logger.py
    # reader: ripgrab --match 'marker' live.log | ts '%H:%M:%.S'

With a tiny per-line `time.time()` stamp on both writer and reader,
the p50 added latency is around 1.3 ms and the p99 is 4.2 ms.
Those numbers are dominated by the inotify path + tokio task hop;
the filter contribution is sub-microsecond.

The fallback poller path (no inotify) of course can't beat the poll
cadence - 200 ms (`c17d3f0`). Not used on Linux in practice.
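
The poller itself is the boring loop you'd expect. A sketch
(assuming tokio; `poll_tail` is an invented name, and the fstat for
rotation from `1a0f562` is omitted):

    use std::io::SeekFrom;
    use tokio::fs::File;
    use tokio::io::{AsyncReadExt, AsyncSeekExt};
    use tokio::time::{interval, Duration};

    // Remember the last offset, wake on a fixed 200 ms cadence, read
    // whatever was appended since.
    async fn poll_tail(path: &str) -> std::io::Result<()> {
        let mut file = File::open(path).await?;
        let mut offset = file.seek(SeekFrom::End(0)).await?;
        let mut tick = interval(Duration::from_millis(200));
        let mut buf = vec![0u8; 64 * 1024];
        loop {
            tick.tick().await;
            file.seek(SeekFrom::Start(offset)).await?;
            let n = file.read(&mut buf).await?;
            if n > 0 {
                offset += n as u64;
                // hand &buf[..n] to the line splitter / filter here
            }
        }
    }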

### 4. Memory

RSS steady at ~14 MB whether tailing one file or twenty. The
per-file watcher tasks each hold a fixed-size line buffer and a
bounded channel; there's no state that grows with time.
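
A minimal sketch of that shape - bounded channel plus backpressure -
with an illustrative capacity, not ripgrab's actual code:

    use tokio::sync::mpsc;

    // A producer (stand-in for a watcher task) feeds a consumer
    // (stand-in for the filter) through a channel of fixed capacity.
    // If the consumer is slow, send() suspends the producer, so
    // nothing queues without bound.
    #[tokio::main]
    async fn main() {
        let (tx, mut rx) = mpsc::channel::<String>(1024);

        tokio::spawn(async move {
            for i in 0..10_000 {
                // This await parks the producer once 1024 lines queue up.
                if tx.send(format!("line {i}")).await.is_err() {
                    break;
                }
            }
        });

        while let Some(line) = rx.recv().await {
            let _ = line; // filter + render would happen here
        }
    }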

Extraction tables buffer whole fields to compute column widths, so
streams with pathological 1 MB fields do spike (bounded by the line
length). I truncate to terminal width at render time (`77c2a8b`).
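
The truncation itself can be as small as this (a simplified sketch:
it counts chars, not terminal cells, so wide glyphs aren't handled):

    // Cap a field at `width` characters for rendering. Slicing at a
    // char boundary keeps this panic-free on multi-byte UTF-8.
    fn truncate_field(field: &str, width: usize) -> &str {
        match field.char_indices().nth(width) {
            Some((byte_idx, _)) => &field[..byte_idx],
            None => field,
        }
    }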

## Methodology quirks

Things I learned the hard way:

- **Always run release builds.** Debug builds of the regex engine
  are 5-20x slower and will tell you lies.
- **Warm up the file cache.** `cat /tmp/big.log > /dev/null` once
  before benchmarking, or vary order.
- **Use `hyperfine` before claiming anything is faster.** Plain
  `time` is fine for CPU-bound runs, but any latency claim gets a
  `hyperfine` run first.
- **Don't trust `perf stat` on laptops with thermal throttling.**
  Take the best of three runs, not the mean.

## When things got slower

Regressions I caught because I ran the bench:

- `1a0f562` "tail: handle truncation / rotation on inode change"
  added an `fstat` per poll iteration. Made the polling fallback
  ~15% slower. Acceptable; the correctness win matters more.
- `bf21c40` "filter: structured extraction via named capture
  groups" introduced the capture allocation cost described above.
  Extract was always going to be slower than match; it went from
  negligible to the dominant cost.
- `9b4aa82` "filter: compile regex set once per session" was me
  fixing my own earlier regression. I'd been rebuilding the set
  whenever `--since` was re-evaluated (don't ask). Fixing that got
  throughput back up.

## When things got faster

- `77c2a8b` (truncate long fields) paradoxically sped up the
  rendering path for pathological inputs: we write fewer bytes, so
  stdout flushes less often.
- Switching to `aho-corasick`-backed include sets (part of `regex`
  1.10.x) shaved ~10% on multi-pattern runs; nothing I wrote, just
  tracking the compiler.

## Comparing to ripgrep

I'm often asked. ripgrep is faster on one-shot searches, of course.
It has no tailing mode, so the comparison is really "grep a file
once with ripgrep" vs "same with ripgrab --no-follow". ripgrep wins
by ~1.3x on the pure-match benchmark above because it has more
optimisations in the byte-level matcher. For the use case ripgrab
exists to serve - concurrent tail of several files - ripgrep
doesn't compete and I haven't written that benchmark.

## Things I'd look at if I needed another factor of 2

I don't need it. If I did:

- **Avoid `String` at the line boundary.** Keep `Bytes` until the
  renderer, decode UTF-8 lazily. Probably 15-20% throughput on
  simple matches.
- **A single-thread runtime for the simple case.** We already do
  this, but I'd look at `current_thread` vs `multi_thread` and
  retire the latter. Tokio's multi-thread runtime pays overhead we
  don't need.
- **Compiled capture-to-column mapping.** The extractor currently
  hashes capture names to column indices on every match. A table
  built at config time would cut that; see the sketch after this
  list.
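
A sketch of what "built at config time" means for the last item
(`build_column_map` and the column ordering are invented;
`capture_names()` is real `regex` API). As a bonus it hands back
borrowed slices, which is the first item's point too:

    use regex::Regex;

    // Resolve each named group to its numeric index once, up front.
    // The hot path then indexes captures by number instead of
    // hashing the group name on every match.
    fn build_column_map(re: &Regex, columns: &[&str]) -> Vec<usize> {
        columns
            .iter()
            .map(|col| {
                re.capture_names()
                    .position(|name| name == Some(*col))
                    .expect("column refers to a named group")
            })
            .collect()
    }

    fn extract<'t>(re: &Regex, map: &[usize], line: &'t str) -> Option<Vec<&'t str>> {
        let caps = re.captures(line)?;
        // Borrowed slices of the input line: no per-field String.
        Some(map.iter()
            .map(|&i| caps.get(i).map_or("", |m| m.as_str()))
            .collect())
    }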

None of these justify the complexity unless ripgrab is in your hot
loop, in which case you want a different tool.

## Reproducibility

To run the basic bench on your own machine:

    cargo build --release
    make bench    # (in my dev tree; not published)

Without the Makefile, the commands under benchmarks 1-3 above are
everything. I don't check in fixtures because synthetic data
compresses too well and masks I/O realism.

Latency measurement needs a writer you control; I use a small Python
script in `bench/` in my local tree, not checked in. If you want the
exact script, open an issue and I'll paste it.

## Summary

ripgrab is fast enough. The bottlenecks are all I/O. The filter
and render stages are cheap; the places to look if something feels
slow are the regex patterns themselves, not the tool.