# Filter DSL
ripgrab's "DSL" is deliberately thin: the regex dialect from the
`regex` crate, plus four flags that compose. This document describes
exactly what each flag accepts, how the flags interact, and how
named captures become extracted fields.
If you're looking for the big picture, see
[docs/architecture.md](/src/ripgrab/docs-architecture-md/).
## Grammar
In EBNF-ish:

    filter   = include* exclude* extract* since?
    include  = "--match" regex
    exclude  = "--exclude" regex
    extract  = "--extract" regex_with_named_captures
    since    = "--since" duration
    regex    = any regex supported by the `regex` crate
               (Rust regex syntax, no lookarounds, no backrefs)
    duration = <number><unit>
    unit     = "s" | "m" | "h" | "d"

A line passes the filter iff:

    (includes is empty OR at least one include matches) AND
    (no exclude matches) AND
    (since is unset OR line.when >= now - since)

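The order of flags on the command line does not matter; the grammar
lists them in a fixed order only for readability. `--match`,
`--exclude`, and `--extract` can be repeated as many times as you
want; `--since` appears at most once.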
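The pass condition is small enough to model directly. Below is a minimal sketch in Python, using the standard `re` module as a stand-in for the `regex` crate (so it models the boolean logic, not the dialect). The function name `passes`, its parameters, and the sample timestamps are hypothetical, chosen for illustration only.

```python
import re
from datetime import datetime, timedelta

def passes(line, when, includes, excludes, since=None, now=None):
    """Return True if a line survives the include/exclude/since stages.

    `when` is the line's capture time; `since` is a timedelta or None.
    """
    now = now if now is not None else datetime.now()
    include_ok = not includes or any(re.search(p, line) for p in includes)
    exclude_ok = not any(re.search(p, line) for p in excludes)
    since_ok = since is None or when >= now - since
    return include_ok and exclude_ok and since_ok

# Hypothetical capture times, fixed so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, 0)
fresh = now - timedelta(minutes=1)
stale = now - timedelta(hours=3)

print(passes("ERROR timeout", fresh, ["ERROR"], [], timedelta(hours=1), now))
print(passes("ERROR timeout", stale, ["ERROR"], [], timedelta(hours=1), now))
```

The three sub-conditions are independent of each other, which is why flag order on the command line cannot change the result.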
## Regex dialect
The underlying engine is the `regex` crate, the de facto standard
regex library for Rust. Important differences from PCRE:
- No lookahead or lookbehind. If you need these, pre-filter the log.
- No backreferences. Same advice: pre-filter the log.
- Unicode by default. `\w` matches letters in any script. Use
`(?-u:\w)` for ASCII-only.
- Case-insensitive with `(?i)` inline flag: `(?i)error`.
Nothing here should surprise you if you've used ripgrep.
## Includes

    ripgrab --match 'timeout|refused' app.log

Multiple `--match` flags OR together:

    ripgrab --match 'error' --match 'panic' app.log

is equivalent to `ripgrab --match 'error|panic' app.log`. The
repeated-flag form builds a `RegexSet` internally; with a long list of
patterns the set form is faster because the engine tests all of them
in a single pass over the line.
When no `--match` is present, every line is considered for the next
stage.
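To convince yourself that repeated `--match` flags and a single alternation accept the same lines, here is a small sketch in Python. It uses the standard `re` module as a stand-in for the `regex` crate, so it shows the OR semantics only and says nothing about `RegexSet` performance. The helper names and sample lines are hypothetical.

```python
import re

def matches_any(patterns, line):
    """OR over separate patterns, like repeating --match."""
    return any(re.search(p, line) for p in patterns)

def matches_alternation(patterns, line):
    """The same patterns joined into one alternation."""
    combined = "|".join(f"(?:{p})" for p in patterns)
    return re.search(combined, line) is not None

patterns = ["error", "panic"]
lines = ["disk error on sda", "thread panic at main", "all good"]

for line in lines:
    # Both forms agree on every line.
    print(line, matches_any(patterns, line), matches_alternation(patterns, line))
```

Each pattern is wrapped in a non-capturing group before joining, so a pattern that itself contains `|` keeps its meaning.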
## Excludes

    ripgrab --exclude 'GET /healthz' --exclude 'GET /metrics' app.log

Any `--exclude` match rejects the line. Applied after includes.
The usual pattern I use:

    ripgrab \
      --match 'ERROR|WARN' \
      --exclude 'context canceled' \
      --exclude 'transport is closing' \
      service.log

That is: "show me errors and warnings, but I know about these two
noisy ones."
## Extract

    ripgrab --extract '(?P<rid>[a-f0-9]{16}).*latency=(?P<ms>\d+)ms' api.log

Every named capture becomes a column in the output. On a match:

    rid               ms   line
    a1b2c3d4e5f60718  142  ...original line truncated to width...

Unmatched lines continue in the default stream output. That interplay
is deliberate: you can have a running commentary of everything
happening (stream mode) and still promote structured events (table
mode) without losing context.
You can have several `--extract` patterns. They are tried in order;
the first match wins for a given line. Capture names across patterns
should agree if you want a single coherent table; if they don't, you
get a wider table with `-` in the unfilled cells.
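The first-match-wins rule and the wider table with `-` cells can be sketched as follows. This is a Python model using the standard `re` module, not ripgrab's implementation; `extract_table`, the sample patterns, and the sample lines are hypothetical, and the column order (first appearance across patterns) is my assumption.

```python
import re

def extract_table(patterns, lines):
    """Split lines into table rows (first matching pattern wins) and stream lines."""
    compiled = [re.compile(p) for p in patterns]
    # Columns: capture names in order of first appearance across patterns.
    # (The column order is an assumption; it is not specified above.)
    columns = []
    for rx in compiled:
        for name in sorted(rx.groupindex, key=rx.groupindex.get):
            if name not in columns:
                columns.append(name)
    rows, stream = [], []
    for line in lines:
        for rx in compiled:
            m = rx.search(line)
            if m:
                caps = m.groupdict()
                # Columns this pattern does not define are shown as "-".
                rows.append([caps[c] if caps.get(c) is not None else "-" for c in columns])
                break
        else:
            stream.append(line)  # no pattern matched: stays in stream output
    return columns, rows, stream

patterns = [r"rid=(?P<rid>[a-f0-9]+) ms=(?P<ms>\d+)", r"user=(?P<user>\w+)"]
lines = ["rid=a1b2 ms=142 GET /x", "login user=alice", "noise"]
columns, rows, stream = extract_table(patterns, lines)
print(columns, rows, stream)
```

Because the two patterns disagree on capture names, the table has three columns and every row leaves at least one cell unfilled.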
Named capture syntax is `(?P<name>...)`. Rust's regex crate also
accepts `(?<name>...)` since 1.8.0; either is fine.
### Capture group naming rules
- Must start with a letter
- Letters, digits, and underscores only
- Duplicate names across patterns share one column; since the first
  matching pattern wins, only that pattern fills it for a given line
- Anonymous groups (`(...)`) are allowed but ignored in output
- Empty matches don't produce a row; the line falls to stream mode
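The first two naming rules fit in a one-line check. A minimal sketch in Python, assuming "letter" means an ASCII letter (the rules above don't say either way); `is_valid_capture_name` is a hypothetical helper, not part of ripgrab.

```python
import re

# Starts with a letter, then letters, digits, or underscores.
# Assumption: "letter" means ASCII A-Z / a-z.
NAME_RULE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_capture_name(name):
    """Check a capture group name against the naming rules."""
    return NAME_RULE.fullmatch(name) is not None

for candidate in ["rid", "req_id", "ms2", "2ms", "_rid", "req-id"]:
    print(candidate, is_valid_capture_name(candidate))
```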
## Since

    ripgrab --since 10m app.log
    ripgrab --since 2h app.log
    ripgrab --since 1d audit.log

Parsed in the obvious way. `s`, `m`, `h`, `d`. Combining units
(`1h30m`) is not supported; use a single unit instead, e.g. `90m`.
`--since` works on the capture time, not the log's embedded
timestamp. For follow mode this is moot (every line is captured
"now"). For `--no-follow` it means ripgrab skips stale lines from
the file's past, using `st_mtime` as a hint for where to seek.
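For reference, the duration format is simple enough to parse in a few lines. This is a sketch in Python of how I read the grammar (`<number><unit>`, one unit only), not ripgrab's actual parser; `parse_since` and `UNIT_SECONDS` are hypothetical names.

```python
import re
from datetime import timedelta

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_since(text):
    """Parse a duration like '10m' or '2h'; combined units are rejected."""
    m = re.fullmatch(r"([0-9]+)([smhd])", text)
    if m is None:
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=int(m.group(1)) * UNIT_SECONDS[m.group(2)])

print(parse_since("10m"))
print(parse_since("1d"))
```

A line then passes the `--since` stage when its capture time is at or after `now - parse_since(value)`.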
## Interaction examples
1. Extract request ids for errors only:

       ripgrab \
         --match 'ERROR' \
         --extract 'rid=(?P<rid>[a-f0-9]{16})' \
         api.log

   Lines without `ERROR` are dropped. Lines with `ERROR` but no `rid=`
   show in stream mode. Lines with both show in the table.

2. Merge two files, tag source, exclude a known flake:

       ripgrab \
         --exclude 'flaky-test-name' \
         svc-a.log svc-b.log

   Source label is the filename.

3. Show structured latency, ignore fast requests:

       ripgrab \
         --extract 'path=(?P<p>\S+) ms=(?P<ms>\d+)' \
         --match 'ms=[0-9]{4,}' \
         access.log

   `--match` ensures 4+ digit ms values; extract gives you the
   table. Note ripgrab doesn't sort or aggregate - pipe to `sort`
   or `awk` if you need that.

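The three outcomes in the first example (dropped, stream, table) can be modelled as a tiny routing function. This is a Python sketch using the standard `re` module with the patterns from that example; `route` and the sample lines are hypothetical, and it handles one `--match` and one `--extract` only.

```python
import re

def route(line, match, extract):
    """Classify a line as 'dropped', 'stream', or 'table'."""
    if not re.search(match, line):
        return "dropped"  # fails the include stage
    if re.search(extract, line):
        return "table"    # named captures become a row
    return "stream"       # passes the filter, but nothing to extract

match = r"ERROR"
extract = r"rid=(?P<rid>[a-f0-9]{16})"

for line in [
    "INFO rid=a1b2c3d4e5f60718 ok",
    "ERROR no request id here",
    "ERROR rid=a1b2c3d4e5f60718 upstream timeout",
]:
    print(route(line, match, extract), "|", line)
```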
## Precedence and pitfalls
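- Filters apply in a fixed order: includes, excludes, extract, since,
  regardless of the order of flags on the command line. An
  `--exclude` that matches a line cannot be overridden by an
  `--extract`.
- Anchors matter: `^ERROR` is much faster than `ERROR` on long lines,
  but it only matches lines that begin with `ERROR`, so use it when
  that is what you mean. The regex engine prefix-literal-optimises
  where it can; anchors help it.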
- Unicode case folding is expensive. If your input is ASCII, use
`(?i-u:foo)` instead of `(?i)foo`.
## What the DSL will never have
Listed here so I remember when someone asks:
- **No "and" except via composition of flags.** The obvious way to AND
  two patterns in a single regex is `(?=.*foo)(?=.*bar)`, which
  ripgrab's engine doesn't support, because that is a lookaround. If
  you need that, pre-grep or use `ripgrep`'s `--pcre2`. ripgrab stays
  out of this.
- **No arithmetic on captures.** Once you want to compare `ms > 500`
you have left "quick tail filter" territory and want something
like Vector, Bento, or a proper query engine.
- **No config files.** The flags are the config.
- **No callbacks / hooks.** Stays a read-only pipe.
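For the common two-pattern case there is a lookaround-free way to write AND: spell out both orders as an alternation. A minimal sketch in Python, using the standard `re` module as a stand-in for the `regex` crate; `and_pattern` is a hypothetical helper, and the construction only covers occurrences that do not overlap in the line.

```python
import re

def and_pattern(a, b):
    """Build a regex matching lines that contain both a and b, in either order.

    Caveat: requires the two matches not to overlap (e.g. 'ab' and 'bc' in 'abc').
    """
    return f"(?:{a}).*(?:{b})|(?:{b}).*(?:{a})"

rx = re.compile(and_pattern("foo", "bar"))
for line in ["foo then bar", "bar then foo", "only foo", "foobar"]:
    print(line, bool(rx.search(line)))
```

The pattern grows factorially with the number of terms, so beyond two or three it is easier to pre-filter with another tool.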
## Performance notes
- `--match` and `--exclude` compile into a single `RegexSet` each.
  That means "match any of N patterns" is one pass over the line
  rather than N separate passes.
- `--extract` patterns don't share a `RegexSet` because we need named
  captures, which `RegexSet` doesn't provide. Keep the number of
  extract patterns small (< 10) or the per-line cost becomes
  noticeable.
- Capture allocation is the dominant cost on the happy path. If you
  only care about one field, write the regex tightly around it
  rather than capturing everything and throwing most away.
## Testing your filter
`ripgrab --no-follow --match '...' --exclude '...' file.log` is a
one-shot mode that prints what passes and exits. Handy for
iterating on a pattern before you leave it running.
Pair with `--extract` to sanity-check capture names:

    printf 'foo rid=abc123 bar\n' | ripgrab --no-follow \
      --extract 'rid=(?P<rid>\w+)' /dev/stdin

(The `/dev/stdin` usage is supported; any readable file is valid.)
## The grammar, strict form
For completeness, here's the grammar I would implement if I were
writing a parser. Today this is implicit in clap's argument parsing.

    FilterSpec   ::= { "--match" Regex
                     | "--exclude" Regex
                     | "--extract" ExtractRegex
                     | "--since" Duration
                     }
    Regex        ::= <Rust regex syntax>
    ExtractRegex ::= Regex containing one or more (?P<name>...) groups
    Duration     ::= Digit+ ("s" | "m" | "h" | "d")
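If that parser ever gets written, its shape would be roughly the following. This is a sketch in Python rather than Rust, with hypothetical names (`parse_filter_spec`, the `spec` dict); it validates `Duration` and does a crude textual check for a named group, but does not validate the regex syntax itself or handle file arguments. Letting a later `--since` replace an earlier one is my assumption.

```python
import re

FLAGS = ("--match", "--exclude", "--extract", "--since")

def parse_filter_spec(argv):
    """Parse a flat list of flag/value pairs into a filter spec dict."""
    spec = {"match": [], "exclude": [], "extract": [], "since": None}
    i = 0
    while i < len(argv):
        flag = argv[i]
        if flag not in FLAGS:
            raise ValueError(f"unknown flag: {flag}")
        if i + 1 >= len(argv):
            raise ValueError(f"{flag} needs a value")
        value = argv[i + 1]
        if flag == "--since":
            # Duration ::= Digit+ ("s" | "m" | "h" | "d")
            if not re.fullmatch(r"[0-9]+[smhd]", value):
                raise ValueError(f"invalid duration: {value!r}")
            spec["since"] = value  # assumption: last --since wins
        elif flag == "--extract":
            # Crude check for at least one (?P<name>...) or (?<name>...) group.
            if not re.search(r"\(\?P?<[A-Za-z][A-Za-z0-9_]*>", value):
                raise ValueError("--extract needs a named capture group")
            spec["extract"].append(value)
        else:
            spec[flag[2:]].append(value)
        i += 2
    return spec

print(parse_filter_spec(["--match", "ERROR", "--since", "10m"]))
```

Flags are accepted in any order and accumulated into lists, which matches the braces (repetition) in the grammar.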