The Gap DuckDB Fills
Analytics workloads occupy an awkward position in the data tooling landscape. pandas is fast to write but runs out of RAM on datasets larger than a few gigabytes. Apache Spark handles arbitrary scale but demands a cluster, JVM tuning, a 10-minute cold-start, and an operator. SQL databases like PostgreSQL are optimised for OLTP row-level operations, not full-table scans and aggregations. For years, the only middle ground was spinning up a cloud warehouse — BigQuery, Snowflake, Redshift — and paying query costs for every exploration.
DuckDB fills this gap. It is an in-process analytical SQL engine — embedded in your Python process, no server, no network round-trip — that executes columnar vectorised queries using all available CPU cores. A laptop query over a 50 GB Parquet file on S3 that would time out in pandas and cost dollars in BigQuery completes in DuckDB in 8 seconds on local NVMe or 45 seconds over the network, using 2 GB of peak RAM.
The design choice that makes this possible is late materialisation: DuckDB reads only the columns referenced in a query and applies filters before expanding compressed columnar blocks, so scanning a 100-column Parquet file for a three-column aggregation reads roughly 3% of the bytes that a row-oriented engine would read.
pandas
Great DX, Python-native. Hits RAM ceiling at ~10 GB. Single-threaded by default.
DuckDB
Columnar SQL, multi-core, streaming execution. Sweet spot: 1 GB – 1 TB on one machine.
Apache Spark
Distributed scale beyond a single machine. High operational overhead; overkill under 500 GB.
Installation and First Queries
DuckDB ships as a single Python wheel with no native dependencies beyond the operating system standard library. Install it with pip and you have a full analytical SQL engine.
# Install DuckDB
pip install duckdb
# Optional: Arrow and pandas integration
pip install duckdb pyarrow pandas polarsThe Python API exposes a connection object that accepts SQL strings. Queries can reference Python variables directly via FROM read_parquet(), in-memory pandas DataFrames by name, or local DuckDB table objects. There is no ORM — you write SQL.
import duckdb
# In-memory connection (default)
con = duckdb.connect()
# Query a Parquet file directly — no load step
result = con.execute("""
SELECT
event_date,
COUNT(*) AS events,
SUM(revenue_usd) AS total_revenue,
AVG(latency_ms) AS avg_latency
FROM read_parquet('data/events/*.parquet')
WHERE event_date >= '2026-01-01'
GROUP BY event_date
ORDER BY event_date
""").df() # .df() returns a pandas DataFrame
print(result.head())
# Persistent database file — survives process restarts
con2 = duckdb.connect("analytics.duckdb")
con2.execute("CREATE TABLE IF NOT EXISTS events AS SELECT * FROM result")
con2.close()Note
.duckdb file concurrently. For multi-threaded pipelines, create one connection per thread or use the connection pool pattern.Reading Parquet, CSV, and JSON Without a Load Step
DuckDB's table functions scan files directly without requiring an explicit LOAD or COPY step. Glob patterns, lists of files, and nested directory trees are all supported. The planner applies predicate and projection pushdown at the file level — it reads the Parquet row group statistics and skips entire row groups whose min/max ranges exclude the WHERE clause predicates.
-- Query Parquet with glob pattern (hive-partitioned layout)
SELECT region, product_category, SUM(quantity) AS units_sold
FROM read_parquet('s3://data-lake/sales/year=2026/month=*/day=*/*.parquet',
hive_partitioning = true)
WHERE region = 'EU' AND month BETWEEN 1 AND 3
GROUP BY 1, 2
ORDER BY units_sold DESC
LIMIT 20;
-- Auto-detect CSV schema — DuckDB samples the file and infers types
SELECT * FROM read_csv_auto('reports/monthly_summary.csv', header = true);
-- JSON lines (NDJSON) — one object per line
SELECT
json_extract_string(payload, '$.user_id') AS user_id,
json_extract_string(payload, '$.event_type') AS event_type,
epoch_ms(CAST(json_extract(payload, '$.ts') AS BIGINT)) AS event_time
FROM read_json('logs/2026-06-*.jsonl', format = 'newline_delimited');
-- Union multiple file types in one query
SELECT 'parquet' AS source, event_id, ts FROM read_parquet('archive/*.parquet')
UNION ALL
SELECT 'csv' AS source, event_id, ts FROM read_csv_auto('incoming/*.csv');For CSV files, read_csv_auto() samples up to 10,000 rows to infer column names, data types, delimiters, and quote characters. You can override any inference with explicit parameters: sep='|', columns={'id': 'INTEGER', ...}, dateformat='%d/%m/%Y'.
Reading from S3 and Remote Storage with httpfs
The httpfs extension adds S3-compatible object storage support. Once loaded, any s3:// path in a table function is transparently fetched over HTTPS. DuckDB requests only the byte ranges it needs for a specific query thanks to HTTP range requests combined with Parquet footer pre-fetching — it reads the file footer to get row group metadata, then fetches only the row groups that pass the pushed-down predicate filters.
-- Install and load the httpfs extension (bundled since DuckDB 0.10)
INSTALL httpfs;
LOAD httpfs;
-- Configure S3 credentials
CREATE SECRET aws_creds (
TYPE s3,
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
REGION 'eu-west-1'
);
-- Or use environment variable credentials (IAM role / instance profile)
SET s3_use_ssl = true;
SET s3_url_style = 'path';
-- Query a partitioned Delta Lake table stored in S3
-- (requires the delta extension — see below)
SELECT COUNT(*) FROM delta_scan('s3://my-bucket/delta/orders/');
-- Direct Parquet scan — DuckDB fetches only needed row groups
SELECT
customer_id,
SUM(amount) AS lifetime_value
FROM read_parquet('s3://my-bucket/parquet/transactions/year=2026/**/*.parquet',
hive_partitioning = true)
GROUP BY customer_id
HAVING lifetime_value > 10000;Note
SET threads = 8 and SET s3_max_parallel_requests = 16 to saturate your uplink and reduce wall-clock time significantly.Apache Arrow Zero-Copy Integration with Pandas and Polars
DuckDB's internal execution engine is built on Apache Arrow columnar memory. This means that converting DuckDB query results to pandas or Polars DataFrames — and vice versa — involves zero data copying when the memory layouts are compatible. There is no serialisation, no CSV round-trip, no pickle protocol: the Arrow buffers produced by DuckDB are handed directly to the receiving library via the C Data Interface.
This zero-copy interop is especially valuable when combining DuckDB with Polars for large-scale processing — DuckDB handles the SQL aggregation and filter pushdown against Parquet files, then hands the result Arrow table directly to Polars for complex window logic or string transformations that are more expressive in the Polars expression API.
import duckdb
import pandas as pd
import polars as pl
con = duckdb.connect()
# ── Pandas ──────────────────────────────────────────────────────────────
# .df() returns pandas DataFrame (Arrow-backed via pyarrow)
df_pd: pd.DataFrame = con.execute("""
SELECT customer_id, SUM(amount) AS total
FROM read_parquet('transactions.parquet')
GROUP BY customer_id
""").df()
# Register a pandas DataFrame as a DuckDB virtual table
con.register("customers_df", df_pd)
result = con.execute("SELECT * FROM customers_df WHERE total > 500").df()
# ── Polars ───────────────────────────────────────────────────────────────
# .pl() returns a Polars DataFrame (zero-copy Arrow exchange)
df_pl: pl.DataFrame = con.execute("""
SELECT year, month, SUM(revenue) AS revenue
FROM read_parquet('sales/**/*.parquet', hive_partitioning = true)
GROUP BY year, month
ORDER BY year, month
""").pl()
# Apply Polars window expression on the DuckDB result
df_enriched = df_pl.with_columns(
pl.col("revenue").rolling_sum(window_size=3).over("year").alias("rolling_3m")
)
# ── Arrow ────────────────────────────────────────────────────────────────
# .arrow() returns a pyarrow.Table — pass to any Arrow-compatible system
arrow_table = con.execute("SELECT * FROM events LIMIT 1000000").arrow()
# Feed an Arrow table back into DuckDB (zero-copy)
con.register("arrow_events", arrow_table)
con.execute("SELECT COUNT(*) FROM arrow_events WHERE status = 'error'").fetchone()Advanced SQL: Window Functions, QUALIFY, and PIVOT
DuckDB implements the full SQL:2023 standard including window functions, common table expressions, lateral joins, QUALIFY for post-window filtering, and PIVOT/UNPIVOT for reshaping. These SQL features replace complex Python reshaping code and execute as vectorised columnar operations.
-- QUALIFY: filter rows based on window function results
-- Find the most recent order per customer (no subquery needed)
SELECT
customer_id,
order_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM orders
QUALIFY rn = 1;
-- Running totals with percent-of-total using window functions
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS cumulative_revenue,
daily_revenue / SUM(daily_revenue) OVER () * 100 AS pct_of_total
FROM (
SELECT order_date, SUM(amount) AS daily_revenue
FROM orders
GROUP BY order_date
);
-- PIVOT: reshape metrics from rows to columns
PIVOT (
SELECT product, month, revenue
FROM monthly_revenue
WHERE year = 2026
)
ON month
USING SUM(revenue)
GROUP BY product;
-- UNNEST: expand array columns into rows
SELECT
user_id,
UNNEST(purchase_history) AS purchased_sku
FROM user_profiles
WHERE array_length(purchase_history) > 0;
-- Approximate COUNT DISTINCT with HLL sketch (huge performance win on big data)
SELECT
event_date,
APPROX_COUNT_DISTINCT(user_id) AS dau_approx,
COUNT(DISTINCT user_id) AS dau_exact -- slower
FROM events
GROUP BY event_date;Performance Tuning: Memory, Threads, and Persistent Storage
DuckDB uses streaming execution with a configurable memory budget. When query intermediate results exceed the memory limit, it spills to disk automatically — enabling out-of-core processing without changing query syntax. The key performance levers are thread count, memory limit, and storage format.
Unlike Apache Spark where performance tuning involves executor memory topology and shuffle partitions, DuckDB tuning is a handful of PRAGMA settings. Most production workloads need only memory_limit and threads.
import duckdb
con = duckdb.connect("analytics.duckdb", config={
"threads": 16, # default: number of CPU cores
"memory_limit": "24GB", # default: 80% of available RAM
"temp_directory": "/mnt/fast-nvme/duckdb-tmp", # spill location
})
# These can also be SET at query time
con.execute("SET threads = 8")
con.execute("SET memory_limit = '16GB'")
# ── Persistent storage format ─────────────────────────────────────────
# DuckDB's native .duckdb file uses its own columnar storage (ART index,
# dictionary encoding, RLE, bit-packing). For repeated queries on the
# same dataset, materialise to a persistent table:
con.execute("""
CREATE TABLE IF NOT EXISTS orders_2026 AS
SELECT * FROM read_parquet('s3://bucket/orders/year=2026/**/*.parquet',
hive_partitioning = true)
""")
# Create a zone-map friendly sort order for range scans
con.execute("CREATE INDEX ON orders_2026 (order_date)")
# ── Profiling: EXPLAIN ANALYZE ────────────────────────────────────────
plan = con.execute("""
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM orders_2026
WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
GROUP BY customer_id
""").fetchall()
for row in plan:
print(row[1]) # prints physical plan with timing per operatorNote
memory_limithigh enough to avoid spills — spilling to disk reduces throughput by 5-10×. For datasets that don't fit, benchmark with NVMe storage for the temp directory; DuckDB's spill I/O pattern is sequential and benefits enormously from NVMe over spinning disks.Extension Ecosystem: Delta Lake, Iceberg, Spatial, and More
DuckDB's extension system loads additional table functions, data types, and SQL syntax at runtime without recompiling the engine. Extensions are downloaded from a signed repository and cached locally. The most useful production extensions are:
-- Install and load extensions (auto-loaded since DuckDB 1.0 for bundled extensions)
INSTALL delta; LOAD delta; -- Read Delta Lake tables (Rust kernel)
INSTALL iceberg; LOAD iceberg; -- Read Apache Iceberg tables
INSTALL httpfs; LOAD httpfs; -- S3, GCS, Azure Blob, HTTP
INSTALL spatial; LOAD spatial; -- 50+ geospatial functions (PostGIS-compatible)
INSTALL json; LOAD json; -- Enhanced JSON functions
INSTALL fts; LOAD fts; -- Full-text search index
-- ── Delta Lake reader ─────────────────────────────────────────────────
-- Read a Delta table with time travel
SELECT * FROM delta_scan('s3://bucket/delta/orders/')
LIMIT 100;
-- Read a specific version (snapshot)
SELECT COUNT(*) FROM delta_scan('s3://bucket/delta/orders/',
version = 42);
-- ── Apache Iceberg reader ─────────────────────────────────────────────
SELECT * FROM iceberg_scan('s3://bucket/iceberg/events/metadata/v3.metadata.json')
WHERE event_date = '2026-06-01';
-- ── Spatial extension ─────────────────────────────────────────────────
SELECT
store_id,
ST_AsText(location) AS geom_wkt,
ST_Distance(location,
ST_Point(13.405, 52.520)) AS dist_to_berlin_km
FROM stores
WHERE ST_DWithin(location, ST_Point(13.405, 52.520), 0.5); -- ~50 kmThe Delta Lake and Iceberg extensions use the same open-source kernels as the production writers — delta-rs for Delta and PyIceberg for Iceberg — so they correctly read transaction logs, apply deletion vectors, and respect schema evolution. This is discussed more in our guide on choosing between Delta Lake, Iceberg, and Hudi.
dbt-duckdb: Fast Local Development and CI Profiles
The dbt-duckdb adapter makes DuckDB a first-class dbt target, enabling full dbt model runs — including incremental models, snapshots, and tests — with zero infrastructure. A developer who previously needed a Snowflake account to run dbt run locally can now work entirely offline. CI pipelines that previously needed cloud warehouse credentials can run dbt builds in 30 seconds against DuckDB instead of 10 minutes against Snowflake.
# profiles.yml — DuckDB as a dbt target
local_dev:
target: dev
outputs:
dev:
type: duckdb
path: "{{ env_var('DBT_DUCKDB_PATH', 'dev.duckdb') }}"
threads: 8
extensions:
- httpfs
- delta
settings:
memory_limit: "8GB"
s3_region: eu-west-1
s3_use_ssl: "true"
# CI profile: read from S3 Parquet, write to in-memory DuckDB
ci:
type: duckdb
path: ":memory:"
threads: 4
extensions:
- httpfs
settings:
memory_limit: "4GB"-- models/mart/customer_lifetime_value.sql
-- Runs in DuckDB locally; promoted to Snowflake/BigQuery in production
{{
config(
materialized = 'incremental',
unique_key = 'customer_id',
on_schema_change = 'sync_all_columns',
)
}}
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE order_date > (SELECT MAX(last_order_date) FROM {{ this }})
{% endif %}
),
cohort AS (
SELECT
customer_id,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date,
COUNT(*) AS order_count,
SUM(amount) AS lifetime_value,
AVG(amount) AS avg_order_value,
DATE_DIFF('day', MIN(order_date), MAX(order_date)) AS customer_age_days
FROM orders
GROUP BY customer_id
)
SELECT
customer_id,
first_order_date,
last_order_date,
order_count,
lifetime_value,
avg_order_value,
customer_age_days,
NTILE(10) OVER (ORDER BY lifetime_value) AS ltv_decile
FROM cohortNote
:memory:). The in-memory profile is faster for CI — no disk I/O — but you lose intermediate state between runs. For debugging a CI failure locally, switch to the persistent profile and inspect the DuckDB file with the DuckDB shell.MotherDuck: Cloud Collaboration and SQL Federation
MotherDuck is a managed DuckDB cloud service. It persists DuckDB databases in the cloud, enables collaborative access for teams, and adds transparent SQL federation — a single query can JOIN a local table on your laptop with a remote MotherDuck table without moving data, using hybrid execution where the local DuckDB instance handles local data and the MotherDuck cloud handles remote data.
# Connect to MotherDuck — token from MOTHERDUCK_TOKEN env variable
import duckdb, os
con = duckdb.connect(f"md:my_database?motherduck_token={os.environ['MOTHERDUCK_TOKEN']}")
# Tables in MotherDuck are prefixed with md:
con.execute("""
CREATE TABLE IF NOT EXISTS md:my_database.orders AS
SELECT * FROM read_parquet('local_data/orders.parquet')
""")
# Hybrid execution: JOIN local CSV with remote MotherDuck table
# DuckDB automatically routes parts of the query to the right engine
result = con.execute("""
SELECT
l.customer_id,
l.email,
m.lifetime_value,
m.ltv_decile
FROM read_csv_auto('local_customers.csv') AS l
JOIN md:my_database.customer_lifetime_value AS m
ON l.customer_id = m.customer_id
WHERE m.ltv_decile = 10
""").df()
# Share a query result as a MotherDuck share link
con.execute("""
CREATE OR REPLACE VIEW md:my_database.executive_summary AS
SELECT region, SUM(revenue) AS total_revenue
FROM md:my_database.orders
GROUP BY region
""")Production Pipeline Integration: Dagster and CLI
DuckDB integrates naturally into orchestration frameworks. In Dagster, you can define a DuckDBResource that provides a managed connection to each asset, with automatic cleanup. In the CLI, the duckdb command accepts SQL scripts with -c or pipes, enabling shell-script pipelines without a Python wrapper.
# dagster_duckdb_assets.py — DuckDB as a Dagster resource
from dagster import asset, Definitions, EnvVar
from dagster_duckdb import DuckDBResource
@asset(
group_name="analytics",
required_resource_keys={"duckdb"},
)
def customer_segments(context, duckdb: DuckDBResource):
with duckdb.get_connection() as con:
con.execute("""
CREATE OR REPLACE TABLE customer_segments AS
SELECT
customer_id,
CASE
WHEN lifetime_value > 10000 THEN 'platinum'
WHEN lifetime_value > 1000 THEN 'gold'
WHEN lifetime_value > 100 THEN 'silver'
ELSE 'bronze'
END AS segment,
lifetime_value
FROM read_parquet('s3://bucket/ltv/latest/*.parquet')
""")
row_count = con.execute(
"SELECT COUNT(*) FROM customer_segments"
).fetchone()[0]
context.add_output_metadata({"row_count": row_count})
defs = Definitions(
assets=[customer_segments],
resources={
"duckdb": DuckDBResource(
database=EnvVar("DUCKDB_DATABASE"),
)
},
)#!/bin/bash
# shell pipeline: DuckDB as a CLI ETL tool
# Export aggregated Parquet from S3 to CSV for a downstream system
duckdb :memory: -c "
INSTALL httpfs; LOAD httpfs;
SET s3_region = 'eu-west-1';
COPY (
SELECT
customer_id,
SUM(amount) AS total_spend,
COUNT(*) AS order_count,
MAX(order_date) AS last_order_date
FROM read_parquet('s3://bucket/orders/year=2026/**/*.parquet',
hive_partitioning = true)
GROUP BY customer_id
HAVING total_spend > 100
)
TO 'output/high_value_customers.csv' (HEADER, DELIMITER ',');
"
echo "Export complete: output/high_value_customers.csv"When to Use DuckDB vs pandas vs Spark
DuckDB is not a universal replacement. Understanding where it excels and where other tools are the right choice prevents tool-chasing and keeps pipelines maintainable.
| Scenario | Best Tool | Reason |
|---|---|---|
| Exploratory analysis, <2 GB data | pandas | Rich ecosystem, familiar API, Jupyter integration |
| SQL aggregations, 1 GB – 1 TB, single machine | DuckDB | Columnar execution, predicate pushdown, Arrow interop |
| Complex Python transformations, lazy execution | Polars + DuckDB | DuckDB for SQL aggregation, Polars for expression API |
| Distributed datasets >1 TB, multi-node | Spark / Trino | DuckDB is single-node only; no horizontal scale |
| OLTP write-heavy workloads | PostgreSQL | DuckDB is OLAP-optimised; not suited for concurrent writes |
| dbt local dev and CI without a cloud warehouse | dbt-duckdb | No credentials needed; fast builds; portable |
| High-concurrency multi-user analytics server | ClickHouse | Server-based columnar engine built for concurrent query serving; DuckDB is single-process |
| Collaborative analytics, cloud persistence | MotherDuck | Managed DuckDB with team access and hybrid execution |
DuckDB Production Checklist
Before deploying DuckDB in a production pipeline, verify each of the following:
Memory limit is explicitly set — default is 80% of total RAM which may starve co-located processes
Temp directory points to fast NVMe storage when queries spill to disk (datasets > memory_limit)
S3 credentials are passed via secrets or IAM roles, never hardcoded in SQL or Python
Persistent .duckdb files are excluded from version control (.gitignore) — they can be gigabytes
Concurrent writers are serialised — only one writer connection per DuckDB file at a time
Arrow zero-copy paths are verified with duckdb.explain() to confirm no unexpected materialisation
Extension versions are pinned in CI to prevent silent API changes across DuckDB upgrades
COPY TO Parquet output uses COMPRESSION 'zstd' and ROW_GROUP_SIZE 100000 for downstream compatibility
Running analytical queries on large local or S3 Parquet files and hitting pandas memory limits without wanting the overhead of a Spark cluster?
We design and implement DuckDB-based data pipelines — from httpfs extension configuration for direct S3 Parquet scanning with predicate pushdown and Apache Arrow zero-copy integration with Pandas and Polars, to dbt-duckdb profile setup for fast local development and CI without cloud warehouse credentials, Delta Lake and Iceberg extension configuration, MotherDuck cloud deployment with hybrid execution, and DuckDB integration with Dagster assets and shell-based ETL pipelines. Let’s talk.
Let's Talk