Half the time I reach for pandas, I do not actually need a DataFrame. I need to read a big CSV row by row, filter or project a few fields, and write the result somewhere. csv.DictReader is the right answer, but the built-in version shows no progress, and it handles ragged rows quietly: a row that is too short gets padded with None, and extra fields get bundled into a list under a None key, so bad data sails through without a word.
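A ten-second check of that behavior, with toy data:

import csv, io

# Three-column header, then a row that is too short and one that is too long.
sample = "a,b,c\n1,2\n1,2,3,4\n"
for row in csv.DictReader(io.StringIO(sample)):
    print(row)

# {'a': '1', 'b': '2', 'c': None}
# {'a': '1', 'b': '2', 'c': '3', None: ['4']}

This is the wrapper I keep in my scratch directory.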

import csv, os, sys, time

def stream_csv(path, *, encoding="utf-8", tick=100_000):
    """Yield rows as dicts, logging progress to stderr every `tick` rows."""
    size = os.path.getsize(path)
    with open(path, encoding=encoding, newline="") as f:
        reader = csv.DictReader(f)
        start = time.monotonic()
        for i, row in enumerate(reader, 1):
            yield row
            if i % tick == 0:
                # f.tell() would raise "OSError: telling position disabled by
                # next() call" here, because csv iterates the file with next().
                # The underlying binary buffer still knows its byte offset; it
                # runs slightly ahead of the decoded position because of
                # read-ahead, which is close enough for a progress estimate.
                pct = f.buffer.tell() / size * 100
                dt = time.monotonic() - start
                rate = i / dt
                print(
                    f"  {i:>10,} rows  {pct:5.1f}%  {rate:,.0f} rows/s",
                    file=sys.stderr,
                )

if __name__ == "__main__":
    # Example: count rows whose "status" column says "error".
    total = 0
    for row in stream_csv(sys.argv[1]):
        if row.get("status") == "error":
            total += 1
    print(f"errors: {total:,}")

It streams, it reports progress to stderr (so piping stdout somewhere still works), and it gives a real rows-per-second number, so I know whether the bottleneck is my filter step or the CSV decoding itself.
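When the job is filter-and-project rather than counting, the same generator feeds csv.DictWriter directly. A minimal sketch, with stream_csv from above in scope; the id, latency_ms, and status columns are invented for the example:

import csv, sys

cols = ["id", "latency_ms"]  # hypothetical columns; swap in your own
writer = csv.DictWriter(sys.stdout, fieldnames=cols)
writer.writeheader()
for row in stream_csv(sys.argv[1]):
    if row.get("status") == "error":
        writer.writerow({c: row.get(c) for c in cols})

Data goes to stdout and progress to stderr, so redirecting stdout to a file still leaves the row counter visible in the terminal.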

See also /posts/flaky-tests-triage-workflow/.