docs/architecture.md

# Architecture

ripgrab is organised as a small, predictable pipeline: sources pump
bytes in, filters pick which lines survive, and a renderer formats
what reaches the terminal. Everything runs inside a single tokio
runtime; the only threads are the runtime's worker pool.

This document covers the pipeline, backpressure, and the shape of
each stage. For the filter grammar, see
[docs/filter-dsl.md](/src/ripgrab/docs-filter-dsl-md/). For how I
measure performance, see [docs/performance.md](/src/ripgrab/docs-performance-md/).

## The pipeline

                +--------+     +--------+     +----------+
     N files -> | source | --> | filter | --> | renderer | -> stdout
                +--------+     +--------+     +----------+
                (N tasks)      (1 task)       (1 task)

- Sources are per-file tasks that tail a path and produce `Line`
  values. They own the OS-specific watcher.
- Filter is a single task that owns the compiled regex set and the
  extraction patterns. It sees the merged stream and decides what
  passes.
- Renderer is the last task. It formats rows (plain or tabular),
  applies color, and writes to stdout.

Each arrow is a bounded `tokio::sync::mpsc` channel. The bounds are
deliberately small - 256 lines - because a full downstream buffer is
how we push back on a source that produces faster than the renderer
can draw.

## Why one filter task and not N

Early versions did filtering per source. It felt parallel and clean.
Two things made me collapse it into a single task:

1. **Regex sharing.** The `regex::RegexSet` is expensive to build;
   compiling once and sharing read-only across tasks is simpler
   than handing each source its own copy. The `Arc<RegexSet>` lives
   in the filter task's state.
2. **Deterministic ordering.** When two sources see a line at the
   same instant, the merged order depends on which task the runtime
   polled first. A single filter task drains the merged stream in one
   `tokio::select!` loop, which at least gives stable output within a
   single run.

Filtering itself is cheap compared to I/O. The single-task choice
costs nothing measurable.

## Sources

`src/tail.rs` implements the tail logic. Per platform:

- **Linux**: `inotify` via the `inotify` crate, watching for
  `IN_MODIFY`, `IN_MOVE_SELF`, `IN_CLOSE_WRITE`. On `IN_MOVE_SELF` we
  reopen the path (common pattern with logrotate).
- **macOS**: `kqueue` through the `notify` crate. Fewer event types
  to distinguish; rotation still works via filesystem events.
- **Fallback**: a 200ms poll (`c17d3f0`). Not ideal, but every
  platform has something that doesn't notify reliably.

Truncation detection is separate from rotation. We track `(inode,
size)`. If `size` shrinks, we seek back to 0 and start again. If
`inode` changes, we close and reopen. Commit `1a0f562` added both.
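
Roughly, the per-read check looks like this. A sketch only: the
`TailState` type and field names are illustrative, not the ones in
`src/tail.rs`, and it leans on `MetadataExt::ino`, so it is Unix-only.

    use std::fs::File;
    use std::io::{Seek, SeekFrom};
    use std::os::unix::fs::MetadataExt;
    use std::path::Path;

    // Illustrative state; the real tail task tracks more than this.
    struct TailState {
        file: File,
        inode: u64,
        offset: u64, // how far we have read
    }

    fn check_rotation(state: &mut TailState, path: &Path) -> std::io::Result<()> {
        let meta = std::fs::metadata(path)?;
        if meta.ino() != state.inode {
            // Rotation: the path now names a different file. Close and reopen.
            state.file = File::open(path)?;
            state.inode = meta.ino();
            state.offset = 0;
        } else if meta.len() < state.offset {
            // Truncation: same inode, smaller size. Seek back to 0 and start again.
            state.offset = state.file.seek(SeekFrom::Start(0))?;
        }
        Ok(())
    }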

The source yields `Line` structs:

    pub struct Line {
        pub source: Arc<str>,   // label, usually the filename
        pub body: String,       // UTF-8; invalid bytes replaced
        pub when: SystemTime,   // capture time, not log time
    }

UTF-8 substitution uses `String::from_utf8_lossy`. Logs are usually
UTF-8 these days, and I would rather keep the line with the invalid
bytes replaced than drop it.
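
Building a `Line` from a raw buffer is then mechanical. A sketch;
`label` and `buf` stand in for whatever the read loop has in hand:

    use std::sync::Arc;
    use std::time::SystemTime;

    fn make_line(label: &Arc<str>, buf: &[u8]) -> Line {
        Line {
            source: Arc::clone(label),
            // Invalid UTF-8 becomes U+FFFD replacement characters, not a dropped line.
            body: String::from_utf8_lossy(buf).into_owned(),
            // Capture time, not whatever timestamp the log line itself carries.
            when: SystemTime::now(),
        }
    }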

## Filter

`src/filter.rs` holds:

- an `Arc<RegexSet>` for the include patterns (`--match`)
- an `Arc<RegexSet>` for the exclude patterns (`--exclude`)
- a `Vec<Regex>` for extraction patterns (`--extract`)
- a `Since` duration filter for `--since`

A line passes if:

    (includes empty OR any include matches) AND
    (no exclude matches) AND
    (within --since window)

When any extractor matches, the line is emitted as `Line::Extracted`
carrying the capture map; otherwise it is `Line::Raw`. The renderer
handles both variants.
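
The pass check itself is three boolean tests. A sketch, assuming
`--since` is stored as an `Option<Duration>` and compared against the
capture time; the parameter names are illustrative, not the fields in
`src/filter.rs`:

    use std::time::{Duration, SystemTime};
    use regex::RegexSet;

    fn passes(
        includes: &RegexSet,
        excludes: &RegexSet,
        since: Option<Duration>,
        line: &Line,
    ) -> bool {
        let include_ok = includes.is_empty() || includes.is_match(&line.body);
        let exclude_ok = !excludes.is_match(&line.body);
        let since_ok = since
            .map(|window| line.when >= SystemTime::now() - window)
            .unwrap_or(true);
        include_ok && exclude_ok && since_ok
    }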

The regex set is compiled once per session (`9b4aa82`). Before that
commit, I was recompiling on every config change in interactive mode,
which hurt.

## Renderer

`src/render.rs` owns stdout. Two output modes:

- **stream**: one row per line, prefixed with the source label. ANSI
  colors if the TTY supports them and `NO_COLOR` is not set
  (`e2b9a41`).
- **table**: triggered when any `--extract` patterns are configured.
  Extracted lines go into a column-aligned table; unmatched lines
  continue in stream mode above the table.

Field widths for the table come from a running max with a small
ceiling, so a sudden 400-character field doesn't blow out every row
(`77c2a8b` truncates to the terminal width).
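
The width rule is nothing more than a clamped running max per column.
A sketch, with a made-up ceiling constant:

    /// Illustrative ceiling; the real cap is derived from the terminal width.
    const MAX_COL_WIDTH: usize = 40;

    fn update_width(current: usize, field: &str) -> usize {
        // Grow to fit the widest field seen so far, but never past the ceiling.
        current.max(field.chars().count()).min(MAX_COL_WIDTH)
    }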

The renderer uses `crossterm` only for color detection and raw mode.
There is no alternate screen; output is plain stdout you can pipe.
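
The color decision boils down to a TTY check plus the `NO_COLOR`
convention. A sketch using std's `IsTerminal` rather than crossterm,
purely to keep it self-contained:

    use std::io::IsTerminal;

    fn use_color() -> bool {
        // Respect NO_COLOR and skip ANSI codes when stdout is piped.
        std::env::var_os("NO_COLOR").is_none() && std::io::stdout().is_terminal()
    }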

## Async task model

Every stage is a `tokio::task`. The runtime is single-threaded by
default (one worker + blocking pool), chosen because ripgrab's
workload is almost entirely I/O wait. Users who really want more
can set `RIPGRAB_THREADS`, but I have never needed to.
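
The runtime setup is a small branch on that variable. A sketch,
assuming `RIPGRAB_THREADS` holds a plain worker count; the real setup
may differ:

    fn build_runtime() -> std::io::Result<tokio::runtime::Runtime> {
        let threads = std::env::var("RIPGRAB_THREADS")
            .ok()
            .and_then(|v| v.parse::<usize>().ok());
        match threads {
            Some(n) if n > 1 => tokio::runtime::Builder::new_multi_thread()
                .worker_threads(n)
                .enable_all()
                .build(),
            // Default: one worker; blocking work still goes to the blocking pool.
            _ => tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build(),
        }
    }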

The task graph is spawned in `src/lib.rs::run()`:

    let (src_tx, src_rx) = mpsc::channel(256);
    let (flt_tx, flt_rx) = mpsc::channel(256);

    for path in paths {
        task::spawn(source::run(path, src_tx.clone()));
    }
    // Drop the original sender so only the source tasks hold src_tx;
    // the channel then closes when the last source exits.
    drop(src_tx);
    task::spawn(filter::run(filter_cfg, src_rx, flt_tx));
    task::spawn(render::run(render_cfg, flt_rx)).await?;

Dropping the last clone of `src_tx` (all sources finished) closes
the channel, which cleanly shuts the filter down, which then closes
the filter channel, shutting the renderer. No explicit shutdown
message needed.

## Backpressure

Each channel is bounded. When the renderer can't keep up:

1. `flt_tx.send()` in the filter task awaits.
2. The filter task stops draining `src_rx`.
3. `src_tx.send()` in each source task awaits.
4. Source tasks stop reading from inotify / the file.
5. The kernel's inotify buffer fills, but that's OK - on drain, the
   sources read from the current file offset, which has moved
   forward in the meantime. We don't lose anything from the file
   itself; we just don't see events in real time until the terminal
   catches up.

The consequence is that a terminal that redraws slowly makes ripgrab
lag, not crash. A user scrolling in their terminal emulator won't cause
memory to balloon.

## Signals and shutdown

SIGINT triggers a `shutdown` flag checked by every task. Sources
close their file handles and watchers; the filter drains remaining
work; the renderer flushes stdout and returns.
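
The hook behind that flag is one small task. A sketch, assuming the
flag is an `Arc<AtomicBool>`; the concrete type in the code may be
something else:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    async fn watch_sigint(shutdown: Arc<AtomicBool>) {
        // ctrl_c() resolves on the first SIGINT; tasks poll the flag between lines.
        if tokio::signal::ctrl_c().await.is_ok() {
            shutdown.store(true, Ordering::SeqCst);
        }
    }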

Exit status:

- 0: clean SIGINT
- 2: bad path (one of the input files couldn't be opened)
- 64: CLI or regex compile error

`--no-follow` skips the source watcher entirely: we read each file to
EOF and then shut down cleanly.

## Error handling

I use `anyhow` in `main.rs` for user-facing errors and `thiserror`
for typed errors in `src/filter.rs`. The boundary is "things the
user might see" versus "things other code might want to match on."
Only the filter and tail modules have typed errors, because they are
the only ones with failure modes interesting enough to discriminate.
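
In practice that means a small `thiserror` enum for the filter. The
variants below are illustrative, not a listing of the real ones:

    use thiserror::Error;

    /// Hypothetical variants; the actual enum in src/filter.rs differs.
    #[derive(Debug, Error)]
    pub enum FilterError {
        #[error("bad --match pattern: {0}")]
        BadPattern(#[from] regex::Error),
        #[error("unparseable --since value: {0}")]
        BadSince(String),
    }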

A tight rule: errors inside the pipeline are logged to stderr and do
not crash the process unless they're unrecoverable. An unreadable
file logs "open failed" and ends that source task; the other sources
keep going.

## Files and what's in them

| Path                                                               | Purpose                                    |
|-------------------------------------------------------------------|--------------------------------------------|
| [`src/main.rs`](/src/ripgrab/src-main-rs/)                        | CLI parsing (clap), log init, call `run`   |
| [`src/lib.rs`](/src/ripgrab/src-lib-rs/)                          | `run()` entry point, task graph setup      |
| [`src/tail.rs`](/src/ripgrab/src-tail-rs/)                        | per-file watcher + line reader             |
| [`src/filter.rs`](/src/ripgrab/src-filter-rs/)                    | include/exclude/extract evaluation         |
| [`src/render.rs`](/src/ripgrab/src-render-rs/)                    | stream and table output                    |
| [`tests/cli.rs`](/src/ripgrab/tests-cli-rs/)                      | `assert_cmd` end-to-end tests              |

## Design choices worth calling out

- **No config file.** I add flags. If you want defaults, wrap
  ripgrab in a shell alias.
- **Regex only, no glob or substring shortcuts.** The distinction
  was always going to be fuzzy; regex is what users already know.
- **No structured output.** There was a pull request once for
  `--json`; I closed it because the use case (feeding another tool)
  is better served by a separate program that reads the same files.
- **MSRV pinned to 1.76.** Tracks the stable Rust from ~6 months
  before the last 0.x release. Bumped in `58a63d1`.

## Tests

`tests/cli.rs` uses `assert_cmd` to spawn the binary against fixture
files in `tests/fixtures`. Each test writes a fixture, runs the
binary with flags, and asserts on stdout. It's the layer I trust
most; the unit tests inside each module cover the trickier logic
(regex-set compilation, capture group naming, rotation detection).
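
A typical test in that file fits in a dozen lines. A sketch, assuming
`tempfile` is available as a dev-dependency; the fixture contents and
test name are made up:

    use assert_cmd::Command;
    use predicates::prelude::*;

    #[test]
    fn match_filters_lines() {
        // Write a throwaway fixture.
        let dir = tempfile::tempdir().unwrap();
        let log = dir.path().join("app.log");
        std::fs::write(&log, "INFO ok\nERROR boom\n").unwrap();

        // --no-follow reads to EOF and exits, so the test doesn't hang.
        Command::cargo_bin("ripgrab")
            .unwrap()
            .args(["--match", "ERROR", "--no-follow"])
            .arg(&log)
            .assert()
            .success()
            .stdout(predicate::str::contains("ERROR boom"));
    }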

CI runs `cargo test` on Linux (ubuntu-latest) and macOS (macos-14)
with the MSRV and stable toolchains.

## What I would change if I started over

Probably not much. The pipeline shape is the right fit for the
problem, backpressure behaves the way I want, and the code is small
enough that changes cost little. If anything, I'd start with
`tokio::sync::broadcast` so I could cleanly tee the stream to a
logger for replay tests. I can still bolt that on.