TIL: DuckDB reads Parquet straight from S3 URLs
Was debugging a data issue that lived in a Parquet file on S3. My reflex was to aws s3 cp it locally, then open it. DuckDB can just read it:
INSTALL httpfs;   -- one-time; downloads the extension
LOAD httpfs;      -- load it into the current session
SET s3_region='us-west-2';
SET s3_access_key_id='...';
SET s3_secret_access_key='...';

-- Reads straight from S3, no local copy
SELECT * FROM read_parquet('s3://mybucket/path/to/events/*.parquet') LIMIT 10;
No downloads. DuckDB does HTTP range requests against the Parquet file to read just the columns and row groups it needs for your query. For a 2GB Parquet file where I only need one column for 100 rows, I’ve seen ~50KB of transfer instead of the full 2GB.
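For example, a single-column query with a simple filter only pulls the file footers plus the matching column chunks (the column names here are hypothetical, same bucket as above):

-- DuckDB fetches each file's footer first, then only the byte ranges for
-- user_id and event_type in row groups whose min/max stats survive the filter.
SELECT user_id
FROM read_parquet('s3://mybucket/path/to/events/*.parquet')
WHERE event_type = 'purchase'
LIMIT 100;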
Same pattern works for CSV:
SELECT * FROM read_csv('s3://bucket/path/*.csv');
And JSON:
SELECT * FROM read_json('s3://bucket/path/*.json');
Glob patterns work — useful for “this dataset is split across 400 files.”
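A related trick: the filename parameter tags each row with the S3 object it came from, which makes it easy to see how rows are spread across those files (sketch, same hypothetical bucket):

-- filename=true adds a filename column holding each row's source S3 path
SELECT filename, count(*) AS row_count
FROM read_parquet('s3://mybucket/path/to/events/*.parquet', filename=true)
GROUP BY filename
ORDER BY row_count DESC;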
If you don’t want to type AWS creds inline, DuckDB can pull them from the standard AWS credential chain (environment variables, ~/.aws/credentials, IAM roles). In DuckDB 0.10+ the idiomatic way is the CREATE SECRET command, with the credential_chain provider from the aws extension doing the chain lookup.
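A minimal sketch of the secrets route, assuming the aws extension for the credential_chain provider (the secret name is made up):

INSTALL aws;   -- provides the credential_chain secret provider
LOAD aws;

-- Resolve keys via the usual AWS chain: env vars, ~/.aws/credentials, IAM role, ...
CREATE SECRET s3_creds (
    TYPE S3,
    PROVIDER credential_chain
);

-- Subsequent s3:// reads pick up the secret automatically
SELECT count(*) FROM read_parquet('s3://mybucket/path/to/events/*.parquet');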
This has completely changed my “quickly inspect this data” workflow. Used to be a 10-minute download plus local setup. Now it’s a single SQL query from inside DuckDB.