TIL: DuckDB reads Parquet straight from S3 URLs
Was debugging a data issue that lived in a Parquet file on S3. My reflex was to aws s3 cp it locally, then open it. DuckDB can just read it:
INSTALL httpfs;   -- one-time; downloads the extension
LOAD httpfs;      -- load it into the current session
SET s3_region='us-west-2';
SET s3_access_key_id='...';
SET s3_secret_access_key='...';

-- Reads straight from S3, no local copy
SELECT * FROM read_parquet('s3://mybucket/path/to/events/*.parquet') LIMIT 10;
No downloads. DuckDB does HTTP range requests against the Parquet file to read just the columns and row groups it needs for your query. For a 2GB Parquet file where I only need one column for 100 rows, I’ve seen ~50KB of transfer instead of the full 2GB.
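For example, a single-column query with a simple filter only pulls the file footers plus the matching column chunks (the column names here are hypothetical, same bucket as above):

-- DuckDB fetches each file's footer first, then only the byte ranges for
-- user_id and event_type in row groups whose min/max stats survive the filter.
SELECT user_id
FROM read_parquet('s3://mybucket/path/to/events/*.parquet')
WHERE event_type = 'purchase'
LIMIT 100;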
Same pattern works for CSV:
SELECT * FROM read_csv('s3://bucket/path/*.csv');
And JSON:
SELECT * FROM read_json('s3://bucket/path/*.json');
Glob patterns work — useful for “this dataset is split across 400 files.”
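A related trick: the filename parameter tags each row with the S3 object it came from, which makes it easy to see how rows are spread across those files (sketch, same hypothetical bucket):

-- filename=true adds a filename column holding each row's source S3 path
SELECT filename, count(*) AS row_count
FROM read_parquet('s3://mybucket/path/to/events/*.parquet', filename=true)
GROUP BY filename
ORDER BY row_count DESC;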
If you don’t want to type AWS creds inline, DuckDB can pull them from the standard AWS credential chain (environment variables, ~/.aws/credentials, IAM roles). In DuckDB 0.10+ the idiomatic way is the CREATE SECRET command, with the credential_chain provider from the aws extension doing the chain lookup.
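A minimal sketch of the secrets route, assuming the aws extension for the credential_chain provider (the secret name is made up):

INSTALL aws;   -- provides the credential_chain secret provider
LOAD aws;

-- Resolve keys via the usual AWS chain: env vars, ~/.aws/credentials, IAM role, ...
CREATE SECRET s3_creds (
    TYPE S3,
    PROVIDER credential_chain
);

-- Subsequent s3:// reads pick up the secret automatically
SELECT count(*) FROM read_parquet('s3://mybucket/path/to/events/*.parquet');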
This has completely changed my “quickly inspect this data” workflow. Used to be a 10-minute download plus local setup. Now it’s a single SQL query from inside DuckDB.