What industries do you work with?

We work across a wide range of industries including finance, healthcare, e-commerce, logistics, and telecommunications. Our solutions are tailored to each client’s specific domain requirements and regulatory environment.

How long does a typical engagement take?

It depends on the scope. A focused observability deployment or automation workflow can be delivered in 4-6 weeks. Larger initiatives like full-scale LLM integration or platform builds typically run 2-4 months. We always start with a discovery phase to align on timelines.

Do you offer ongoing support after project delivery?

Yes. We offer flexible support and maintenance plans to ensure your systems stay healthy, updated, and optimized. We can also embed with your team on a part-time basis for continuous improvement.

Can you work with our existing tech stack?

Absolutely. We integrate with your current infrastructure and tools rather than forcing a rip-and-replace. Whether you’re on AWS, GCP, Azure, or on-prem, we adapt our approach to what works best for your environment.

What is your pricing model?

We offer both fixed-price project engagements and time-and-materials contracts depending on the nature of the work. Reach out through our contact form and we’ll provide a tailored estimate within 24 hours.

How do you handle data security and compliance?

Security is built into every engagement. We follow industry best practices for data handling, support GDPR and SOC 2 compliance requirements, and can work within your existing security policies and access controls.

DuckDB for Analytical Workloads — Columnar SQL, Arrow Integration, and In-Process Analytics

The Gap DuckDB Fills

Analytics workloads occupy an awkward position in the data tooling landscape. pandas is fast to write but runs out of RAM on datasets larger than a few gigabytes. Apache Spark handles arbitrary scale but demands a cluster, JVM tuning, a 10-minute cold-start, and an operator. SQL databases like PostgreSQL are optimised for OLTP row-level operations, not full-table scans and aggregations. For years, the only middle ground was spinning up a cloud warehouse — BigQuery, Snowflake, Redshift — and paying query costs for every exploration.

DuckDB fills this gap. It is an in-process analytical SQL engine — embedded in your Python process, no server, no network round-trip — that executes columnar vectorised queries using all available CPU cores. A laptop query over a 50 GB Parquet file on S3 that would time out in pandas and cost dollars in BigQuery completes in DuckDB in 8 seconds on local NVMe or 45 seconds over the network, using 2 GB of peak RAM.

The design choice that makes this possible is late materialisation: DuckDB reads only the columns referenced in a query and applies filters before expanding compressed columnar blocks, so scanning a 100-column Parquet file for a three-column aggregation reads roughly 3% of the bytes that a row-oriented engine would read.

pandas

Great DX, Python-native. Hits RAM ceiling at ~10 GB. Single-threaded by default.

DuckDB

Columnar SQL, multi-core, streaming execution. Sweet spot: 1 GB – 1 TB on one machine.

Apache Spark

Distributed scale beyond a single machine. High operational overhead; overkill under 500 GB.

Installation and First Queries

DuckDB ships as a single Python wheel with no native dependencies beyond the operating system standard library. Install it with pip and you have a full analytical SQL engine.

# Install DuckDB
pip install duckdb

# Optional: Arrow and pandas integration
pip install duckdb pyarrow pandas polars

The Python API exposes a connection object that accepts SQL strings. Queries can reference Python variables directly via FROM read_parquet(), in-memory pandas DataFrames by name, or local DuckDB table objects. There is no ORM — you write SQL.

import duckdb

# In-memory connection (default)
con = duckdb.connect()

# Query a Parquet file directly — no load step
result = con.execute("""
    SELECT
        event_date,
        COUNT(*)            AS events,
        SUM(revenue_usd)    AS total_revenue,
        AVG(latency_ms)     AS avg_latency
    FROM read_parquet('data/events/*.parquet')
    WHERE event_date >= '2026-01-01'
    GROUP BY event_date
    ORDER BY event_date
""").df()          # .df() returns a pandas DataFrame

print(result.head())

# Persistent database file — survives process restarts
con2 = duckdb.connect("analytics.duckdb")
con2.execute("CREATE TABLE IF NOT EXISTS events AS SELECT * FROM result")
con2.close()

Note

DuckDB connections are not thread-safe when writing, but multiple read-only connections can open the same .duckdb file concurrently. For multi-threaded pipelines, create one connection per thread or use the connection pool pattern.

Reading Parquet, CSV, and JSON Without a Load Step

DuckDB's table functions scan files directly without requiring an explicit LOAD or COPY step. Glob patterns, lists of files, and nested directory trees are all supported. The planner applies predicate and projection pushdown at the file level — it reads the Parquet row group statistics and skips entire row groups whose min/max ranges exclude the WHERE clause predicates.

-- Query Parquet with glob pattern (hive-partitioned layout)
SELECT region, product_category, SUM(quantity) AS units_sold
FROM read_parquet('s3://data-lake/sales/year=2026/month=*/day=*/*.parquet',
                  hive_partitioning = true)
WHERE region = 'EU' AND month BETWEEN 1 AND 3
GROUP BY 1, 2
ORDER BY units_sold DESC
LIMIT 20;

-- Auto-detect CSV schema — DuckDB samples the file and infers types
SELECT * FROM read_csv_auto('reports/monthly_summary.csv', header = true);

-- JSON lines (NDJSON) — one object per line
SELECT
    json_extract_string(payload, '$.user_id')     AS user_id,
    json_extract_string(payload, '$.event_type')  AS event_type,
    epoch_ms(CAST(json_extract(payload, '$.ts') AS BIGINT)) AS event_time
FROM read_json('logs/2026-06-*.jsonl', format = 'newline_delimited');

-- Union multiple file types in one query
SELECT 'parquet' AS source, event_id, ts FROM read_parquet('archive/*.parquet')
UNION ALL
SELECT 'csv'     AS source, event_id, ts FROM read_csv_auto('incoming/*.csv');

For CSV files, read_csv_auto() samples up to 10,000 rows to infer column names, data types, delimiters, and quote characters. You can override any inference with explicit parameters: sep='|', columns={'id': 'INTEGER', ...}, dateformat='%d/%m/%Y'.

Reading from S3 and Remote Storage with httpfs

The httpfs extension adds S3-compatible object storage support. Once loaded, any s3:// path in a table function is transparently fetched over HTTPS. DuckDB requests only the byte ranges it needs for a specific query thanks to HTTP range requests combined with Parquet footer pre-fetching — it reads the file footer to get row group metadata, then fetches only the row groups that pass the pushed-down predicate filters.

-- Install and load the httpfs extension (bundled since DuckDB 0.10)
INSTALL httpfs;
LOAD httpfs;

-- Configure S3 credentials
CREATE SECRET aws_creds (
    TYPE s3,
    KEY_ID     'AKIAIOSFODNN7EXAMPLE',
    SECRET     'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    REGION     'eu-west-1'
);

-- Or use environment variable credentials (IAM role / instance profile)
SET s3_use_ssl = true;
SET s3_url_style = 'path';

-- Query a partitioned Delta Lake table stored in S3
-- (requires the delta extension — see below)
SELECT COUNT(*) FROM delta_scan('s3://my-bucket/delta/orders/');

-- Direct Parquet scan — DuckDB fetches only needed row groups
SELECT
    customer_id,
    SUM(amount) AS lifetime_value
FROM read_parquet('s3://my-bucket/parquet/transactions/year=2026/**/*.parquet',
                  hive_partitioning = true)
GROUP BY customer_id
HAVING lifetime_value > 10000;

Note

When scanning S3 from a laptop, network throughput is usually the bottleneck, not CPU. Enable parallel S3 fetches with SET threads = 8 and SET s3_max_parallel_requests = 16 to saturate your uplink and reduce wall-clock time significantly.

Apache Arrow Zero-Copy Integration with Pandas and Polars

DuckDB's internal execution engine is built on Apache Arrow columnar memory. This means that converting DuckDB query results to pandas or Polars DataFrames — and vice versa — involves zero data copying when the memory layouts are compatible. There is no serialisation, no CSV round-trip, no pickle protocol: the Arrow buffers produced by DuckDB are handed directly to the receiving library via the C Data Interface.

This zero-copy interop is especially valuable when combining DuckDB with Polars for large-scale processing — DuckDB handles the SQL aggregation and filter pushdown against Parquet files, then hands the result Arrow table directly to Polars for complex window logic or string transformations that are more expressive in the Polars expression API.

import duckdb
import pandas as pd
import polars as pl

con = duckdb.connect()

# ── Pandas ──────────────────────────────────────────────────────────────
# .df() returns pandas DataFrame (Arrow-backed via pyarrow)
df_pd: pd.DataFrame = con.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('transactions.parquet')
    GROUP BY customer_id
""").df()

# Register a pandas DataFrame as a DuckDB virtual table
con.register("customers_df", df_pd)
result = con.execute("SELECT * FROM customers_df WHERE total > 500").df()

# ── Polars ───────────────────────────────────────────────────────────────
# .pl() returns a Polars DataFrame (zero-copy Arrow exchange)
df_pl: pl.DataFrame = con.execute("""
    SELECT year, month, SUM(revenue) AS revenue
    FROM read_parquet('sales/**/*.parquet', hive_partitioning = true)
    GROUP BY year, month
    ORDER BY year, month
""").pl()

# Apply Polars window expression on the DuckDB result
df_enriched = df_pl.with_columns(
    pl.col("revenue").rolling_sum(window_size=3).over("year").alias("rolling_3m")
)

# ── Arrow ────────────────────────────────────────────────────────────────
# .arrow() returns a pyarrow.Table — pass to any Arrow-compatible system
arrow_table = con.execute("SELECT * FROM events LIMIT 1000000").arrow()

# Feed an Arrow table back into DuckDB (zero-copy)
con.register("arrow_events", arrow_table)
con.execute("SELECT COUNT(*) FROM arrow_events WHERE status = 'error'").fetchone()

Advanced SQL: Window Functions, QUALIFY, and PIVOT

DuckDB implements the full SQL:2023 standard including window functions, common table expressions, lateral joins, QUALIFY for post-window filtering, and PIVOT/UNPIVOT for reshaping. These SQL features replace complex Python reshaping code and execute as vectorised columnar operations.

-- QUALIFY: filter rows based on window function results
-- Find the most recent order per customer (no subquery needed)
SELECT
    customer_id,
    order_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM orders
QUALIFY rn = 1;

-- Running totals with percent-of-total using window functions
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        AS cumulative_revenue,
    daily_revenue / SUM(daily_revenue) OVER () * 100 AS pct_of_total
FROM (
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM orders
    GROUP BY order_date
);

-- PIVOT: reshape metrics from rows to columns
PIVOT (
    SELECT product, month, revenue
    FROM monthly_revenue
    WHERE year = 2026
)
ON month
USING SUM(revenue)
GROUP BY product;

-- UNNEST: expand array columns into rows
SELECT
    user_id,
    UNNEST(purchase_history) AS purchased_sku
FROM user_profiles
WHERE array_length(purchase_history) > 0;

-- Approximate COUNT DISTINCT with HLL sketch (huge performance win on big data)
SELECT
    event_date,
    APPROX_COUNT_DISTINCT(user_id) AS dau_approx,
    COUNT(DISTINCT user_id)        AS dau_exact   -- slower
FROM events
GROUP BY event_date;

Performance Tuning: Memory, Threads, and Persistent Storage

DuckDB uses streaming execution with a configurable memory budget. When query intermediate results exceed the memory limit, it spills to disk automatically — enabling out-of-core processing without changing query syntax. The key performance levers are thread count, memory limit, and storage format.

Unlike Apache Spark where performance tuning involves executor memory topology and shuffle partitions, DuckDB tuning is a handful of PRAGMA settings. Most production workloads need only memory_limit and threads.

import duckdb

con = duckdb.connect("analytics.duckdb", config={
    "threads": 16,          # default: number of CPU cores
    "memory_limit": "24GB", # default: 80% of available RAM
    "temp_directory": "/mnt/fast-nvme/duckdb-tmp",  # spill location
})

# These can also be SET at query time
con.execute("SET threads = 8")
con.execute("SET memory_limit = '16GB'")

# ── Persistent storage format ─────────────────────────────────────────
# DuckDB's native .duckdb file uses its own columnar storage (ART index,
# dictionary encoding, RLE, bit-packing). For repeated queries on the
# same dataset, materialise to a persistent table:

con.execute("""
    CREATE TABLE IF NOT EXISTS orders_2026 AS
    SELECT * FROM read_parquet('s3://bucket/orders/year=2026/**/*.parquet',
                               hive_partitioning = true)
""")

# Create a zone-map friendly sort order for range scans
con.execute("CREATE INDEX ON orders_2026 (order_date)")

# ── Profiling: EXPLAIN ANALYZE ────────────────────────────────────────
plan = con.execute("""
    EXPLAIN ANALYZE
    SELECT customer_id, SUM(amount)
    FROM orders_2026
    WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
    GROUP BY customer_id
""").fetchall()

for row in plan:
    print(row[1])  # prints physical plan with timing per operator

Note

For datasets that fit in memory, set memory_limithigh enough to avoid spills — spilling to disk reduces throughput by 5-10×. For datasets that don't fit, benchmark with NVMe storage for the temp directory; DuckDB's spill I/O pattern is sequential and benefits enormously from NVMe over spinning disks.

Extension Ecosystem: Delta Lake, Iceberg, Spatial, and More

DuckDB's extension system loads additional table functions, data types, and SQL syntax at runtime without recompiling the engine. Extensions are downloaded from a signed repository and cached locally. The most useful production extensions are:

-- Install and load extensions (auto-loaded since DuckDB 1.0 for bundled extensions)
INSTALL delta;   LOAD delta;    -- Read Delta Lake tables (Rust kernel)
INSTALL iceberg; LOAD iceberg;  -- Read Apache Iceberg tables
INSTALL httpfs;  LOAD httpfs;   -- S3, GCS, Azure Blob, HTTP
INSTALL spatial; LOAD spatial;  -- 50+ geospatial functions (PostGIS-compatible)
INSTALL json;    LOAD json;     -- Enhanced JSON functions
INSTALL fts;     LOAD fts;      -- Full-text search index

-- ── Delta Lake reader ─────────────────────────────────────────────────
-- Read a Delta table with time travel
SELECT * FROM delta_scan('s3://bucket/delta/orders/')
LIMIT 100;

-- Read a specific version (snapshot)
SELECT COUNT(*) FROM delta_scan('s3://bucket/delta/orders/',
                                version = 42);

-- ── Apache Iceberg reader ─────────────────────────────────────────────
SELECT * FROM iceberg_scan('s3://bucket/iceberg/events/metadata/v3.metadata.json')
WHERE event_date = '2026-06-01';

-- ── Spatial extension ─────────────────────────────────────────────────
SELECT
    store_id,
    ST_AsText(location)                          AS geom_wkt,
    ST_Distance(location,
                ST_Point(13.405, 52.520))        AS dist_to_berlin_km
FROM stores
WHERE ST_DWithin(location, ST_Point(13.405, 52.520), 0.5);  -- ~50 km

The Delta Lake and Iceberg extensions use the same open-source kernels as the production writers — delta-rs for Delta and PyIceberg for Iceberg — so they correctly read transaction logs, apply deletion vectors, and respect schema evolution. This is discussed more in our guide on choosing between Delta Lake, Iceberg, and Hudi.

dbt-duckdb: Fast Local Development and CI Profiles

The dbt-duckdb adapter makes DuckDB a first-class dbt target, enabling full dbt model runs — including incremental models, snapshots, and tests — with zero infrastructure. A developer who previously needed a Snowflake account to run dbt run locally can now work entirely offline. CI pipelines that previously needed cloud warehouse credentials can run dbt builds in 30 seconds against DuckDB instead of 10 minutes against Snowflake.

# profiles.yml — DuckDB as a dbt target
local_dev:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "{{ env_var('DBT_DUCKDB_PATH', 'dev.duckdb') }}"
      threads: 8
      extensions:
        - httpfs
        - delta
      settings:
        memory_limit: "8GB"
        s3_region: eu-west-1
        s3_use_ssl: "true"

    # CI profile: read from S3 Parquet, write to in-memory DuckDB
    ci:
      type: duckdb
      path: ":memory:"
      threads: 4
      extensions:
        - httpfs
      settings:
        memory_limit: "4GB"

-- models/mart/customer_lifetime_value.sql
-- Runs in DuckDB locally; promoted to Snowflake/BigQuery in production
{{
  config(
    materialized = 'incremental',
    unique_key   = 'customer_id',
    on_schema_change = 'sync_all_columns',
  )
}}

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
    {% if is_incremental() %}
    WHERE order_date > (SELECT MAX(last_order_date) FROM {{ this }})
    {% endif %}
),

cohort AS (
    SELECT
        customer_id,
        MIN(order_date)                              AS first_order_date,
        MAX(order_date)                              AS last_order_date,
        COUNT(*)                                     AS order_count,
        SUM(amount)                                  AS lifetime_value,
        AVG(amount)                                  AS avg_order_value,
        DATE_DIFF('day', MIN(order_date), MAX(order_date)) AS customer_age_days
    FROM orders
    GROUP BY customer_id
)

SELECT
    customer_id,
    first_order_date,
    last_order_date,
    order_count,
    lifetime_value,
    avg_order_value,
    customer_age_days,
    NTILE(10) OVER (ORDER BY lifetime_value) AS ltv_decile
FROM cohort

Note

Use separate dbt profiles for local development (persistent DuckDB file on disk) and CI (in-memory :memory:). The in-memory profile is faster for CI — no disk I/O — but you lose intermediate state between runs. For debugging a CI failure locally, switch to the persistent profile and inspect the DuckDB file with the DuckDB shell.

MotherDuck: Cloud Collaboration and SQL Federation

MotherDuck is a managed DuckDB cloud service. It persists DuckDB databases in the cloud, enables collaborative access for teams, and adds transparent SQL federation — a single query can JOIN a local table on your laptop with a remote MotherDuck table without moving data, using hybrid execution where the local DuckDB instance handles local data and the MotherDuck cloud handles remote data.

# Connect to MotherDuck — token from MOTHERDUCK_TOKEN env variable
import duckdb, os

con = duckdb.connect(f"md:my_database?motherduck_token={os.environ['MOTHERDUCK_TOKEN']}")

# Tables in MotherDuck are prefixed with md:
con.execute("""
    CREATE TABLE IF NOT EXISTS md:my_database.orders AS
    SELECT * FROM read_parquet('local_data/orders.parquet')
""")

# Hybrid execution: JOIN local CSV with remote MotherDuck table
# DuckDB automatically routes parts of the query to the right engine
result = con.execute("""
    SELECT
        l.customer_id,
        l.email,
        m.lifetime_value,
        m.ltv_decile
    FROM read_csv_auto('local_customers.csv') AS l
    JOIN md:my_database.customer_lifetime_value AS m
      ON l.customer_id = m.customer_id
    WHERE m.ltv_decile = 10
""").df()

# Share a query result as a MotherDuck share link
con.execute("""
    CREATE OR REPLACE VIEW md:my_database.executive_summary AS
    SELECT region, SUM(revenue) AS total_revenue
    FROM md:my_database.orders
    GROUP BY region
""")

Production Pipeline Integration: Dagster and CLI

DuckDB integrates naturally into orchestration frameworks. In Dagster, you can define a DuckDBResource that provides a managed connection to each asset, with automatic cleanup. In the CLI, the duckdb command accepts SQL scripts with -c or pipes, enabling shell-script pipelines without a Python wrapper.

# dagster_duckdb_assets.py — DuckDB as a Dagster resource
from dagster import asset, Definitions, EnvVar
from dagster_duckdb import DuckDBResource

@asset(
    group_name="analytics",
    required_resource_keys={"duckdb"},
)
def customer_segments(context, duckdb: DuckDBResource):
    with duckdb.get_connection() as con:
        con.execute("""
            CREATE OR REPLACE TABLE customer_segments AS
            SELECT
                customer_id,
                CASE
                    WHEN lifetime_value > 10000 THEN 'platinum'
                    WHEN lifetime_value > 1000  THEN 'gold'
                    WHEN lifetime_value > 100   THEN 'silver'
                    ELSE                             'bronze'
                END AS segment,
                lifetime_value
            FROM read_parquet('s3://bucket/ltv/latest/*.parquet')
        """)
        row_count = con.execute(
            "SELECT COUNT(*) FROM customer_segments"
        ).fetchone()[0]
    context.add_output_metadata({"row_count": row_count})


defs = Definitions(
    assets=[customer_segments],
    resources={
        "duckdb": DuckDBResource(
            database=EnvVar("DUCKDB_DATABASE"),
        )
    },
)

#!/bin/bash
# shell pipeline: DuckDB as a CLI ETL tool

# Export aggregated Parquet from S3 to CSV for a downstream system
duckdb :memory: -c "
INSTALL httpfs; LOAD httpfs;
SET s3_region = 'eu-west-1';
COPY (
    SELECT
        customer_id,
        SUM(amount)      AS total_spend,
        COUNT(*)         AS order_count,
        MAX(order_date)  AS last_order_date
    FROM read_parquet('s3://bucket/orders/year=2026/**/*.parquet',
                      hive_partitioning = true)
    GROUP BY customer_id
    HAVING total_spend > 100
)
TO 'output/high_value_customers.csv' (HEADER, DELIMITER ',');
"

echo "Export complete: output/high_value_customers.csv"

When to Use DuckDB vs pandas vs Spark

DuckDB is not a universal replacement. Understanding where it excels and where other tools are the right choice prevents tool-chasing and keeps pipelines maintainable.

Scenario	Best Tool	Reason
Exploratory analysis, <2 GB data	pandas	Rich ecosystem, familiar API, Jupyter integration
SQL aggregations, 1 GB – 1 TB, single machine	DuckDB	Columnar execution, predicate pushdown, Arrow interop
Complex Python transformations, lazy execution	Polars + DuckDB	DuckDB for SQL aggregation, Polars for expression API
Distributed datasets >1 TB, multi-node	Spark / Trino	DuckDB is single-node only; no horizontal scale
OLTP write-heavy workloads	PostgreSQL	DuckDB is OLAP-optimised; not suited for concurrent writes
dbt local dev and CI without a cloud warehouse	dbt-duckdb	No credentials needed; fast builds; portable
High-concurrency multi-user analytics server	ClickHouse	Server-based columnar engine built for concurrent query serving; DuckDB is single-process
Cross-source query federation across Iceberg, PostgreSQL, Kafka	Trino	Federated SQL engine with connectors for 20+ sources; DuckDB reads local files only
Collaborative analytics, cloud persistence	MotherDuck	Managed DuckDB with team access and hybrid execution

DuckDB Production Checklist

Before deploying DuckDB in a production pipeline, verify each of the following:

Memory limit is explicitly set — default is 80% of total RAM which may starve co-located processes

Temp directory points to fast NVMe storage when queries spill to disk (datasets > memory_limit)

S3 credentials are passed via secrets or IAM roles, never hardcoded in SQL or Python

Persistent .duckdb files are excluded from version control (.gitignore) — they can be gigabytes

Concurrent writers are serialised — only one writer connection per DuckDB file at a time

Arrow zero-copy paths are verified with duckdb.explain() to confirm no unexpected materialisation

Extension versions are pinned in CI to prevent silent API changes across DuckDB upgrades

COPY TO Parquet output uses COMPRESSION 'zstd' and ROW_GROUP_SIZE 100000 for downstream compatibility

For streaming analytics that DuckDB's batch model cannot handle — sub-second latency, stateful windowing, exactly-once delivery — see streaming analytics with Apache Flink. When DuckDB queries grow beyond single-machine memory, the natural next step is Trino distributed SQL — same SQL dialect, federated execution across Iceberg, PostgreSQL, and Kafka without data movement. For data engineering pipelines that use DuckDB for local development and testing before pushing to production Spark or Trino, see data pipeline testing patterns — DuckDB's in-process execution makes it the fastest integration test substrate for pipeline logic. When DuckDB is used alongside pandas DataFrames and Arrow zero-copy transfers are not enough to overcome pandas memory overhead, see Pandas 2 performance — Copy-on-Write semantics and the new ArrowDtype backend cut peak memory by 30-50% on the same workloads. For building interactive data applications and dashboards directly on DuckDB query results, see Streamlit data apps — its native DuckDB connector reads query results as DataFrames with zero serialisation overhead, making it the fastest path from a DuckDB query to a shareable dashboard. For SQL-native business intelligence where reports are version-controlled Markdown files and every chart is a SQL query compiled at build time, see Evidence data apps — Evidence uses DuckDB as its default query engine and ships to a static site with no BI server required.

Running analytical queries on large local or S3 Parquet files and hitting pandas memory limits without wanting the overhead of a Spark cluster?

We design and implement DuckDB-based data pipelines — from httpfs extension configuration for direct S3 Parquet scanning with predicate pushdown and Apache Arrow zero-copy integration with Pandas and Polars, to dbt-duckdb profile setup for fast local development and CI without cloud warehouse credentials, Delta Lake and Iceberg extension configuration, MotherDuck cloud deployment with hybrid execution, and DuckDB integration with Dagster assets and shell-based ETL pipelines. Let’s talk.

Let's Talk

DuckDB for Analytical Workloads — Columnar SQL, Arrow Integration, and In-Process Analytics

The Gap DuckDB Fills

Installation and First Queries

Reading Parquet, CSV, and JSON Without a Load Step

Reading from S3 and Remote Storage with httpfs

Apache Arrow Zero-Copy Integration with Pandas and Polars

Advanced SQL: Window Functions, QUALIFY, and PIVOT

Performance Tuning: Memory, Threads, and Persistent Storage

Extension Ecosystem: Delta Lake, Iceberg, Spatial, and More

dbt-duckdb: Fast Local Development and CI Profiles

MotherDuck: Cloud Collaboration and SQL Federation

Production Pipeline Integration: Dagster and CLI

When to Use DuckDB vs pandas vs Spark

DuckDB Production Checklist

Running analytical queries on large local or S3 Parquet files and hitting pandas memory limits without wanting the overhead of a Spark cluster?

Related Articles

Need help implementing this in production?