Streaming a large CSV in Python without loading it
Half the time I reach for pandas, I do not actually need a DataFrame. I need to read a big CSV row by row, filter or project a few fields, and write the result somewhere. csv.DictReader is the right answer, but the built-in version does not show progress and silently eats malformed rows on some dialects. This is the wrapper I keep in my scratch directory.
```python
import csv
import io
import os
import sys
import time


def stream_csv(path, *, encoding="utf-8", tick=100_000):
    """Yield CSV rows as dicts, printing progress to stderr every `tick` rows."""
    size = os.path.getsize(path)
    # Open in binary and wrap with TextIOWrapper: csv consumes the text
    # stream via next(), which disables tell() on a text-mode file, so
    # progress has to come from the underlying binary handle instead.
    with open(path, "rb") as raw:
        f = io.TextIOWrapper(raw, encoding=encoding, newline="")
        reader = csv.DictReader(f)
        start = time.monotonic()
        for i, row in enumerate(reader, 1):
            yield row
            if i % tick == 0:
                # raw.tell() runs slightly ahead of the rows yielded
                # (decoder read-ahead), which is fine for a progress line.
                pct = raw.tell() / size * 100
                dt = time.monotonic() - start
                rate = i / dt
                print(
                    f"  {i:>10,} rows  {pct:5.1f}%  {rate:,.0f} rows/s",
                    file=sys.stderr,
                )


if __name__ == "__main__":
    total = 0
    for row in stream_csv(sys.argv[1]):
        if row.get("status") == "error":
            total += 1
    print(f"errors: {total:,}")
```
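For reference, "eats malformed rows" is really DictReader's default padding behavior: short rows get missing fields filled with `restval` (`None` by default), and surplus fields on long rows are collected under the `None` key. A tiny demo with made-up data, so ragged rows never crash, they just mutate:

```python
import csv
import io

# A header plus one short row and one long row (made-up data).
data = "a,b,c\n1,2\n1,2,3,4\n"
rows = list(csv.DictReader(io.StringIO(data)))

# Short row: the missing field is padded with None (restval's default).
print(rows[0])  # {'a': '1', 'b': '2', 'c': None}

# Long row: surplus fields are gathered into a list under the None key.
print(rows[1])  # {'a': '1', 'b': '2', 'c': '3', None: ['4']}
```

Passing `restval=""` or an explicit `restkey` makes the behavior visible instead of silent, which is usually what you want when auditing a dirty file.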
It streams, reports progress to stderr (so piping stdout still works), and gives a real rows-per-second number, so I know whether the bottleneck is my filter step or the decoder.
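The "write the result somewhere" half of the workflow pairs naturally with csv.DictWriter, whose `extrasaction="ignore"` handles the projection. A minimal sketch, assuming hypothetical column names and a caller-supplied `keep` predicate (neither is from the script above):

```python
import csv
import io


def project_filtered(rows, out, fields, keep):
    """Write only `fields` from the rows passing `keep` to the `out` stream."""
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        if keep(row):
            writer.writerow(row)


# Demo with an in-memory CSV; real use would pass stream_csv(path) and a
# destination file opened with newline="".
src = io.StringIO("id,status,msg\n1,ok,hi\n2,error,boom\n")
out = io.StringIO()
project_filtered(
    csv.DictReader(src),
    out,
    fields=["id", "msg"],
    keep=lambda r: r["status"] == "error",
)
print(out.getvalue())  # "id,msg\r\n2,boom\r\n" (csv's default line ending)
```

Because `project_filtered` takes any iterable of dicts, it composes directly with the generator above without materializing the file.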
See also /posts/flaky-tests-triage-workflow/.