docs/filter-dsl.md

7.8 KB · 231 lines · 2025-06-09 · 9b4aa82
# Filter DSL

ripgrab's "DSL" is deliberately thin: the regex dialect from the
`regex` crate, plus four flags that compose. This document describes
exactly what each flag accepts, how the flags interact, and how
named captures become extracted fields.

If you're looking for the big picture, see
[docs/architecture.md](/src/ripgrab/docs-architecture-md/).

## Grammar

In EBNF-ish:

    filter       = include* exclude* extract* since?

    include      = "--match" regex
    exclude      = "--exclude" regex
    extract      = "--extract" regex_with_named_captures
    since        = "--since" duration

    regex        = any regex supported by the `regex` crate
                   (Rust regex syntax, no lookarounds, no backrefs)

    duration     = <number><unit>
    unit         = "s" | "m" | "h" | "d"

A line passes the filter iff:

    (includes is empty  OR  at least one include matches) AND
    (no exclude matches) AND
    (since is unset OR line.when >= now - since)

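The order of flags on the command line does not matter; the grammar
lists them in evaluation order only. You can repeat `--match`,
`--exclude`, and `--extract` as many times as you want. `--since`
appears at most once.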
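
The pass rule above can be sketched in Rust roughly as follows. This is an illustration only, not ripgrab's actual code: `Filter`, `Line`, and `passes` are hypothetical names, and plain substring tests stand in for compiled regexes so the sketch needs nothing beyond the standard library.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical model of a parsed filter. Substring needles stand in for
/// compiled regexes so the sketch stays std-only.
struct Filter {
    includes: Vec<String>,
    excludes: Vec<String>,
    since: Option<Duration>,
}

/// Hypothetical model of one captured line; `when` is the capture time.
struct Line<'a> {
    text: &'a str,
    when: SystemTime,
}

impl Filter {
    /// Mirrors the three-clause rule: includes, then excludes, then since.
    fn passes(&self, line: &Line<'_>, now: SystemTime) -> bool {
        let included = self.includes.is_empty()
            || self.includes.iter().any(|p| line.text.contains(p.as_str()));
        let excluded = self.excludes.iter().any(|p| line.text.contains(p.as_str()));
        let fresh = match self.since {
            None => true,
            // If `now - since` cannot be represented, treat every line as fresh.
            Some(window) => now
                .checked_sub(window)
                .map_or(true, |cutoff| line.when >= cutoff),
        };
        included && !excluded && fresh
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside `passes`, keeps the predicate deterministic and easy to test.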

## Regex dialect

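The underlying engine is the `regex` crate, the de facto standard
regex library for Rust. It guarantees linear-time matching, which is
why some PCRE features are missing. Important differences from PCRE:

- No lookahead or lookbehind. If you need these, pre-filter the log.
- No backreferences, for the same reason. Pre-filter here too.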
- Unicode by default. `\w` matches letters in any script. Use
  `(?-u:\w)` for ASCII-only.
- Case-insensitive with `(?i)` inline flag: `(?i)error`.

Nothing here should surprise you if you've used ripgrep.

## Includes

    ripgrab --match 'timeout|refused' app.log

Multiple `--match` flags OR together:

    ripgrab --match 'error' --match 'panic' app.log

is equivalent to `ripgrab --match 'error|panic' app.log`. The first
form builds a `RegexSet` internally; on large lists the set form is
faster because the engine tests every pattern in a single pass over
the line.

When no `--match` is present, every line is considered for the next
stage.

## Excludes

    ripgrab --exclude 'GET /healthz' --exclude 'GET /metrics' app.log

Any `--exclude` match rejects the line. Applied after includes.

The usual pattern I use:

    ripgrab \
      --match 'ERROR|WARN' \
      --exclude 'context canceled' \
      --exclude 'transport is closing' \
      service.log

That is: "show me errors and warnings, but I know about these two
noisy ones."

## Extract

    ripgrab --extract '(?P<rid>[a-f0-9]{16}).*latency=(?P<ms>\d+)ms' api.log

Every named capture becomes a column in the output. On a match:

    rid               ms    line
    a1b2c3d4e5f60718  142   ...original line truncated to width...

Unmatched lines continue in the default stream output. That interplay
is deliberate: you can have a running commentary of everything
happening (stream mode) and still promote structured events (table
mode) without losing context.

You can have several `--extract` patterns. They are tried in order;
the first match wins for a given line. Capture names across patterns
should agree if you want a single coherent table; if they don't, you
get a wider table with `-` in the unfilled cells.
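
How mismatched capture names widen the table can be sketched like this. It is a guess at the shape of the logic, not ripgrab's implementation: `table_columns` and `table_row` are hypothetical helpers, capture names and values are passed in as plain string pairs rather than taken from a regex match, and the trailing `line` column is left out.

```rust
/// Union of capture names across all extract patterns, in order of first
/// appearance. Each inner slice lists one pattern's capture names.
fn table_columns<'a>(patterns: &[&[&'a str]]) -> Vec<&'a str> {
    let mut columns = Vec::new();
    for names in patterns {
        for &name in names.iter() {
            if !columns.contains(&name) {
                columns.push(name);
            }
        }
    }
    columns
}

/// Builds one row from the captures of the pattern that matched first.
/// Columns that pattern does not define are rendered as "-".
fn table_row(columns: &[&str], captures: &[(&str, &str)]) -> Vec<String> {
    columns
        .iter()
        .map(|col| {
            captures
                .iter()
                .find(|(name, _)| name == col)
                .map_or_else(|| "-".to_string(), |(_, value)| value.to_string())
        })
        .collect()
}
```

With one pattern capturing `rid` and `ms` and another capturing `rid` and `status`, the table gets three columns, and a line matched by the second pattern leaves the `ms` cell as `-`.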

Named capture syntax is `(?P<name>...)`. Rust's regex crate also
accepts `(?<name>...)` since 1.8.0; either is fine.

### Capture group naming rules

- Must start with a letter
- Letters, digits, and underscores only
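- Duplicate names across patterns share one column; since the first
  matching pattern wins, only that pattern fills it for a given line
- Duplicate names within a single pattern are rejected by the `regex`
  crate when the pattern is compiled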
- Anonymous groups (`(...)`) are allowed but ignored in output
- Empty matches don't produce a row; the line falls to stream mode
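
The first two naming rules are simple enough to check by hand. A minimal sketch, assuming "letter" means an ASCII letter (the rules above don't say either way); `is_valid_capture_name` is a hypothetical helper, not a ripgrab function:

```rust
/// Returns true if `name` starts with an ASCII letter and continues with
/// ASCII letters, digits, or underscores only.
/// Assumption: "letter" is read as ASCII-only here.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) if first.is_ascii_alphabetic() => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        // Empty names and names starting with a digit or underscore fail.
        _ => false,
    }
}
```

Under these rules `rid` and `ms_2` are accepted, while `2ms`, `_rid`, and `req-id` are not.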

## Since

    ripgrab --since 10m app.log
    ripgrab --since 2h app.log
    ripgrab --since 1d audit.log

Parsed in the obvious way: a whole number followed by one of `s`,
`m`, `h`, `d`. Combining units (`1h30m`) is not supported; express the
duration in a single unit instead, e.g. `90m`.
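
For reference, a parser for this duration format fits in a few lines of Rust. This is a sketch of the format as documented here, not ripgrab's own parser; `parse_since` and its error messages are made up for illustration.

```rust
use std::time::Duration;

/// Parses `<number><unit>` where unit is one of s, m, h, d.
/// Hypothetical helper; anything after the first unit letter is an error.
fn parse_since(input: &str) -> Result<Duration, String> {
    // Split at the first non-digit: everything before it is the number,
    // everything after it must be exactly one unit letter.
    let split = input
        .find(|c: char| !c.is_ascii_digit())
        .ok_or_else(|| format!("missing unit in {input:?}"))?;
    let (digits, unit) = input.split_at(split);
    if digits.is_empty() {
        return Err(format!("missing number in {input:?}"));
    }
    let n: u64 = digits
        .parse()
        .map_err(|e| format!("bad number in {input:?}: {e}"))?;
    let secs_per_unit: u64 = match unit {
        "s" => 1,
        "m" => 60,
        "h" => 3_600,
        "d" => 86_400,
        other => return Err(format!("unsupported unit {other:?} in {input:?}")),
    };
    n.checked_mul(secs_per_unit)
        .map(Duration::from_secs)
        .ok_or_else(|| format!("duration overflows in {input:?}"))
}
```

A combined value such as `1h30m` fails in the unit match, because everything after the leading digits (`h30m`) is treated as the unit.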

`--since` works on the capture time, not the log's embedded
timestamp. For follow mode this is moot (every line is captured
"now"). For `--no-follow` it means ripgrab skips stale lines from
the file's past, using `st_mtime` as a hint for where to seek.

## Interaction examples

1. Extract request ids for errors only:

        ripgrab \
          --match 'ERROR' \
          --extract 'rid=(?P<rid>[a-f0-9]{16})' \
          api.log

   Lines without `ERROR` are dropped. Lines with `ERROR` but no `rid=`
   show in stream mode. Lines with both show in the table.

2. Merge two files, tag source, exclude a known flake:

        ripgrab \
          --exclude 'flaky-test-name' \
          svc-a.log svc-b.log

   Source label is the filename.

3. Show structured latency, ignore fast requests:

        ripgrab \
          --extract 'path=(?P<p>\S+) ms=(?P<ms>\d+)' \
          --match 'ms=[0-9]{4,}' \
          access.log

   `--match` ensures 4+ digit ms values; extract gives you the
   table. Note ripgrab doesn't sort or aggregate - pipe to `sort`
   or `awk` if you need that.
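
The three outcomes described in example 1 (dropped, stream, table) amount to a small routing decision. A rough sketch, with the same caveat as elsewhere: `Outcome` and `route` are hypothetical names, substring tests stand in for regexes, and `--since` is ignored to keep it short.

```rust
/// Where a line ends up. Names are hypothetical, chosen for this sketch.
#[derive(Debug, PartialEq)]
enum Outcome {
    Dropped,
    Stream,
    Table,
}

/// Substring needles stand in for the --match / --exclude / --extract regexes.
fn route(line: &str, includes: &[&str], excludes: &[&str], extracts: &[&str]) -> Outcome {
    let included = includes.is_empty() || includes.iter().any(|p| line.contains(*p));
    // An exclude hit drops the line before extract is ever consulted.
    if !included || excludes.iter().any(|p| line.contains(*p)) {
        return Outcome::Dropped;
    }
    if extracts.iter().any(|p| line.contains(*p)) {
        Outcome::Table
    } else {
        Outcome::Stream
    }
}
```

Checking excludes before extracts is what makes an `--exclude` hit final: the extract stage never sees the line.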

## Precedence and pitfalls

- Filters apply in order: includes, excludes, extract, since. An
  `--exclude` that matches a line cannot be overridden by an
  `--extract`.
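- Anchors matter: `^ERROR` can be faster than `ERROR` on long lines,
  because an anchored search can reject a line after its first few
  bytes instead of scanning all of it. The two are not equivalent,
  though: `^ERROR` only matches lines that start with `ERROR`. The
  regex engine already optimises literal prefixes where it can.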
- Unicode case folding is expensive. If your input is ASCII, use
  `(?i-u:foo)` instead of `(?i)foo`.

## What the DSL will never have

Listed here so I remember when someone asks:

- **No "and" except via composition of flags.** Multiple `--match`
  flags OR together, so AND has to live inside a single regex. The
  PCRE idiom `(?=.*foo)(?=.*bar)` is a lookaround, which ripgrab's
  engine doesn't support. For two terms you can spell out both
  orders, `foo.*bar|bar.*foo`; beyond that, pre-grep or use
  `ripgrep`'s `--pcre2`. ripgrab stays out of this.
- **No arithmetic on captures.** Once you want to compare `ms > 500`
  you have left "quick tail filter" territory and want something
  like Vector, Bento, or a proper query engine.
- **No config files.** The flags are the config.
- **No callbacks / hooks.** Stays a read-only pipe.

## Performance notes

- `--match` and `--exclude` compile into a single `RegexSet` each.
  That means "match any of N patterns" is one pass over the line
  rather than N separate scans.
- `--extract` patterns don't share a RegexSet because we need named
  captures, which RegexSet doesn't provide. Each pattern is tried in
  turn, so keep the number of extract patterns small (< 10) or the
  cost becomes noticeable.
- Capture allocation is the dominant cost on the happy path. If you
  only care about one field, write the regex tightly around it
  rather than capturing everything and throwing most away.

## Testing your filter

`ripgrab --no-follow --match '...' --exclude '...' file.log` is a
one-shot mode that prints what passes and exits. Handy for
iterating on a pattern before you leave it running.

Pair with `--extract` to sanity-check capture names:

    printf 'foo rid=abc123 bar\n' | ripgrab --no-follow \
      --extract 'rid=(?P<rid>\w+)' /dev/stdin

(The `/dev/stdin` usage is supported; any readable file is valid.)

## The grammar, strict form

For completeness, here's the grammar I would implement if I were
writing a parser. Today this is implicit in clap's argument parsing.
Unlike the EBNF-ish form in the Grammar section, it does not cap
`--since` at one occurrence.

    FilterSpec ::= { "--match" Regex
                   | "--exclude" Regex
                   | "--extract" ExtractRegex
                   | "--since" Duration
                   }
    Regex      ::= <Rust regex syntax>
    ExtractRegex ::= Regex containing one or more (?P<name>...) groups
    Duration   ::= Digit+ ("s" | "m" | "h" | "d")
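
If that parser were ever written by hand, it might look something like the sketch below. Everything in it is hypothetical (`FilterSpec` as a struct, `parse_filter_spec`, `is_duration`, the error strings), regexes are kept as raw strings rather than compiled, and the named-capture check is a crude substring test that only recognises the `(?P<name>...)` form from the grammar.

```rust
/// Hypothetical parsed form of the strict grammar. Patterns are kept as
/// raw strings; compiling them is out of scope for this sketch.
#[derive(Debug, Default, PartialEq)]
struct FilterSpec {
    matches: Vec<String>,
    excludes: Vec<String>,
    extracts: Vec<String>,
    since: Vec<String>,
}

/// Duration ::= Digit+ ("s" | "m" | "h" | "d")
fn is_duration(s: &str) -> bool {
    match s.char_indices().last() {
        Some((i, unit)) => {
            matches!(unit, 's' | 'm' | 'h' | 'd')
                && i > 0
                && s[..i].bytes().all(|b| b.is_ascii_digit())
        }
        None => false,
    }
}

/// Consumes flag/value pairs; anything outside the grammar is an error.
fn parse_filter_spec(args: &[&str]) -> Result<FilterSpec, String> {
    let mut spec = FilterSpec::default();
    let mut it = args.iter();
    while let Some(&flag) = it.next() {
        let target = match flag {
            "--match" => &mut spec.matches,
            "--exclude" => &mut spec.excludes,
            "--extract" => &mut spec.extracts,
            "--since" => &mut spec.since,
            other => return Err(format!("unknown flag {other:?}")),
        };
        let value = *it.next().ok_or_else(|| format!("{flag} needs a value"))?;
        // Crude stand-in for "contains one or more (?P<name>...) groups".
        if flag == "--extract" && !value.contains("(?P<") {
            return Err(format!("--extract needs a named capture: {value:?}"));
        }
        if flag == "--since" && !is_duration(value) {
            return Err(format!("bad duration {value:?}"));
        }
        target.push(value.to_string());
    }
    Ok(spec)
}
```

File arguments are deliberately not handled: the grammar above only covers the filter flags, so a bare `app.log` is reported as an unknown flag here.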