DuckDB Guide

In-Process SQL OLAP Database


Overview

DuckDB is an in-process SQL OLAP database, similar to SQLite but for analytics. It’s single-node, columnar, and optimized for analytical queries. DuckDB is ideal for local data analysis, notebooks, and workloads that don’t require distributed processing.


Core Concepts

DuckDB Architecture

Key Characteristics

Characteristic | Description | Benefit
In-process | Runs in the application process | No separate database server
Columnar | Column-oriented storage | Fast analytical queries
Vectorized | Processes data in batches | CPU efficiency
Zero-copy | No data serialization | Fast integration with Pandas/Arrow
SQL | ANSI SQL compliant | Familiar query language
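
The in-process model is the key difference from a client-server warehouse: DuckDB runs inside the Python (or R, or CLI) process that calls it, so there is nothing to provision. A minimal sketch of that, using the module-level default in-memory database:

import duckdb
# duckdb.sql() uses a default in-memory database that lives inside this process;
# no server, no connection string, no setup
duckdb.sql("SELECT 42 AS answer").show()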

Getting Started

Installation

# Python
pip install duckdb
# R
install.packages("duckdb")
# CLI
# Download from https://duckdb.org/

Basic Usage

import duckdb
# In-memory database
con = duckdb.connect(database=':memory:')
# Persistent database
con = duckdb.connect(database='local.duckdb')
# Execute SQL
result = con.execute("SELECT 1").fetchall()
print(result) # [(1,)]
# Create table
con.execute("""
CREATE TABLE users (
id INTEGER PRIMARY KEY,
name VARCHAR,
email VARCHAR,
created_at TIMESTAMP
)
""")
# Insert data
con.execute("""
INSERT INTO users VALUES
(1, 'Alice', 'alice@example.com', '2025-01-27 10:00:00'),
(2, 'Bob', 'bob@example.com', '2025-01-27 11:00:00')
""")
# Query data
result = con.execute("SELECT * FROM users").fetchdf()
print(result)

Data Import/Export

Reading Data

# Read CSV
con.execute("""
CREATE TABLE events AS
SELECT * FROM read_csv('events.csv')
""")
# Read Parquet
con.execute("""
CREATE TABLE events AS
SELECT * FROM read_parquet('events.parquet')
""")
# Read JSON (auto-detect schema)
con.execute("""
CREATE TABLE events AS
SELECT * FROM read_json('events.json')
""")
# Read multiple files with a glob pattern (S3 paths also require the httpfs extension; see Cloud Storage below)
con.execute("""
CREATE TABLE events AS
SELECT * FROM read_parquet('s3://bucket/events/*.parquet')
""")
# Query a Pandas DataFrame directly (DuckDB reads the local variable, no copy needed)
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con.execute("CREATE TABLE pandas_data AS SELECT * FROM df")
# Query a Polars DataFrame the same way
import polars as pl
df_pl = pl.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con.execute("CREATE TABLE polars_data AS SELECT * FROM df_pl")

Exporting Data

# Export to Parquet
con.execute("""
COPY users TO 'users.parquet' (FORMAT PARQUET)
""")
# Export to CSV
con.execute("""
COPY users TO 'users.csv' (FORMAT CSV, HEADER)
""")
# Export to Pandas
df = con.execute("SELECT * FROM users").fetchdf()
# Export to Arrow
import pyarrow as pa
table = con.execute("SELECT * FROM users").arrow()

SQL Features

Window Functions

# Row number
con.execute("""
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) as row_num
FROM events
""").fetchdf()
# Moving average
con.execute("""
SELECT
date,
revenue,
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7day
FROM daily_revenue
""").fetchdf()

Arrays and Structs

# Array operations
con.execute("""
SELECT
user_id,
array_length(event_types) as event_count,
array_distinct(event_types) as unique_events
FROM user_events
""").fetchdf()
# Struct operations
con.execute("""
SELECT
user_id,
address.city,
address.zipcode
FROM users
""").fetchdf()

CTEs

con.execute("""
WITH user_stats AS (
SELECT
user_id,
COUNT(*) as event_count,
SUM(amount) as total_amount
FROM events
GROUP BY user_id
)
SELECT
u.user_id,
u.name,
s.event_count,
s.total_amount
FROM users u
JOIN user_stats s ON u.user_id = s.user_id
WHERE s.event_count > 100
""").fetchdf()

Performance Optimization

Parallel Execution

# Set worker threads (default: CPU count)
con.execute("SET threads = 8")
# Cap total memory usage
con.execute("SET max_memory = '8GB'")

Indexing

# Create an index for point lookups and highly selective filters
# (DuckDB indexes are ART indexes; there is no separate index type to choose)
con.execute("CREATE INDEX idx_users_email ON users(email)")
con.execute("CREATE INDEX idx_users_created ON users(created_at)")

Pragmas (Configuration)

# Memory limit
con.execute("SET memory_limit='8GB'")
# Temp directory
con.execute("SET temp_directory='./temp'")
# Note: the vector size (the number of values processed per batch, 2048 by default)
# is fixed when DuckDB is built and is not a runtime setting

Pandas Integration

Zero-Copy Integration

import pandas as pd
import duckdb
# Create Pandas DataFrame
df = pd.DataFrame({
'a': [1, 2, 3, 4, 5],
'b': ['x', 'y', 'z', 'w', 'v'],
'c': [1.1, 2.2, 3.3, 4.4, 5.5]
})
# Query Pandas DataFrame (zero-copy)
result = duckdb.query("SELECT a, c FROM df WHERE a > 2").to_df()
# Aggregate Pandas DataFrame
result = duckdb.query("""
SELECT
b,
COUNT(*) as count,
AVG(c) as avg_c
FROM df
GROUP BY b
""").to_df()
# Join multiple DataFrames
df1 = pd.DataFrame({'key': [1, 2, 3], 'value1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [1, 2, 4], 'value2': ['x', 'y', 'z']})
result = duckdb.query("""
SELECT df1.key, df1.value1, df2.value2
FROM df1
LEFT JOIN df2 ON df1.key = df2.key
""").to_df()

Replacing Pandas Operations

# Pandas: Slow for large data
# df.groupby('category').agg({'amount': ['sum', 'mean', 'count']})
# DuckDB: Fast for large data
result = duckdb.query("""
SELECT
category,
SUM(amount) as total,
AVG(amount) as average,
COUNT(*) as count
FROM df
GROUP BY category
""").to_df()

Polars Integration

import polars as pl
import duckdb
# Create Polars DataFrame
df_pl = pl.DataFrame({
'a': [1, 2, 3, 4, 5],
'b': ['x', 'y', 'z', 'w', 'v'],
'c': [1.1, 2.2, 3.3, 4.4, 5.5]
})
# Query Polars DataFrame (zero-copy via Arrow)
result = duckdb.query("SELECT a, c FROM df_pl WHERE a > 2").pl()
# result is already a Polars DataFrame; convert to Pandas if needed
df_pd = result.to_pandas()

Working with Cloud Storage

S3 Integration

# Configure S3
con.execute("""
INSTALL httpfs;
LOAD httpfs;
""")
# Set S3 credentials
con.execute("SET s3_region='us-west-2'")
con.execute("SET s3_access_key_id='YOUR_ACCESS_KEY'")
con.execute("SET s3_secret_access_key='YOUR_SECRET_KEY'")
# Read from S3
con.execute("""
CREATE TABLE s3_data AS
SELECT * FROM read_parquet('s3://bucket/path/to/data/*.parquet')
""")
# Query S3 directly
result = con.execute("""
SELECT * FROM read_parquet('s3://bucket/path/to/data/*.parquet')
WHERE date >= '2025-01-01'
""").fetchdf()

GCS Integration

# Configure GCS
con.execute("""
INSTALL httpfs;
LOAD httpfs;
""")
# Read from GCS
con.execute("""
CREATE TABLE gcs_data AS
SELECT * FROM read_parquet('gs://bucket/path/to/data/*.parquet')
""")

Advanced Features

Full-Text Search

# Create FTS index (the pragma takes the table name, a document id column, and the columns to index;
# this example assumes the articles table has an article_id identifier column)
con.execute("INSTALL fts; LOAD fts;")
con.execute("""
PRAGMA create_fts_index('articles', 'article_id', 'content');
""")
# Search using the BM25 scoring function generated for the indexed table
result = con.execute("""
SELECT *, fts_main_articles.match_bm25(article_id, 'search query') AS score
FROM articles
WHERE score IS NOT NULL
ORDER BY score DESC
""").fetchdf()

Spatial Support

# Install spatial extension
con.execute("INSTALL spatial; LOAD spatial;")
# Load spatial data
con.execute("""
CREATE TABLE places AS
SELECT * FROM ST_Read('places.geojson')
""")
# Spatial query
result = con.execute("""
SELECT *
FROM places
WHERE ST_Within(geom, ST_GeomFromText('POLYGON((...))'))
""").fetchdf()

Use Cases

Use Case 1: Notebooks

# Jupyter/Colab notebook
import duckdb
import pandas as pd
con = duckdb.connect()
# Read data from multiple sources
df_csv = pd.read_csv('data.csv')
con.execute("CREATE TABLE csv_data AS SELECT * FROM df_csv")
con.execute("CREATE TABLE parquet_data AS SELECT * FROM read_parquet('data.parquet')")
# Join and analyze
result = con.execute("""
SELECT
csv_data.id,
csv_data.value,
parquet_data.metric
FROM csv_data
LEFT JOIN parquet_data ON csv_data.id = parquet_data.id
""").to_df()
# Visualize
result.plot(kind='bar', x='id', y='value')

Use Case 2: Data Validation

# Validate data before loading to warehouse
import duckdb
con = duckdb.connect()
con.execute("""
-- Check for nulls
SELECT
COUNT(*) as total_rows,
SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) as null_ids,
SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) as negative_amounts
FROM staging_data
""").fetchdf()
# Data quality checks
con.execute("""
-- Check referential integrity
SELECT
COUNT(*) as orphaned_records
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
""").fetchdf()

Use Case 3: Local ML Feature Engineering

# Feature engineering for ML
import duckdb
import pandas as pd
con = duckdb.connect()
con.execute("""
-- Aggregate per-user features
WITH user_features AS (
SELECT
user_id,
COUNT(*) as total_events,
SUM(amount) as total_amount,
AVG(amount) as avg_amount,
STDDEV(amount) as stddev_amount,
MIN(timestamp) as first_event,
MAX(timestamp) as last_event
FROM events
GROUP BY user_id
),
time_features AS (
SELECT
user_id,
COUNT(DISTINCT DATE_PART('dayofweek', timestamp)) as active_days_of_week,
COUNT(DISTINCT DATE_PART('hour', timestamp)) as active_hours
FROM events
GROUP BY user_id
)
SELECT
u.*,
t.active_days_of_week,
t.active_hours
FROM user_features u
JOIN time_features t ON u.user_id = t.user_id
""").fetchdf()

Senior Level Gotchas

Gotcha 1: Single-Node Limitation

Problem: DuckDB is single-node, limited to one machine’s resources.

Solution: Use DuckDB for < 1TB data, use Spark/Trino for larger.

# Bad: Try to process 10TB in DuckDB
# Result: OOM or extremely slow
# Good: Use DuckDB for sampling/prototyping, Spark for full data
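
A practical pattern for the prototyping workflow is to pull a sample of the full dataset into DuckDB, iterate locally, and only then run the full job on a distributed engine. A minimal sketch, assuming the httpfs extension is available, S3 credentials are configured as shown earlier, and 's3://bucket/events/*.parquet' is a placeholder path:

import duckdb
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# Prototype on a 5% sample instead of scanning the full dataset
sample = con.execute("""
SELECT * FROM read_parquet('s3://bucket/events/*.parquet')
USING SAMPLE 5%
""").fetchdf()
print(len(sample))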

Gotcha 2: Limited Write Concurrency

Problem: A DuckDB database file can be opened by one read-write process or by several read-only processes, but not both at once, so there is no multi-process concurrent writing.

Solution: Design write-once, read-many patterns, or funnel all writes through a single process.
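
One way to structure this is to do all writes from one process and open the file read-only everywhere else. A minimal sketch using the read_only connection flag (note that a read-write process and read-only processes cannot have the same file open at the same time):

import duckdb
# Writer step: the only connection that modifies the file
writer = duckdb.connect('local.duckdb')
writer.execute("CREATE TABLE IF NOT EXISTS metrics (ts TIMESTAMP, value DOUBLE)")
writer.close()
# Reader step: open the same file read-only (several read-only processes can coexist)
reader = duckdb.connect('local.duckdb', read_only=True)
print(reader.execute("SELECT COUNT(*) FROM metrics").fetchall())
reader.close()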

Gotcha 3: Memory Pressure

Problem: DuckDB's default memory limit is a large share of system RAM, so heavy queries can starve everything else on the machine.

Solution: Set an explicit memory limit.

con.execute("SET memory_limit='4GB'")

Best Practices

DO

  • Use DuckDB for local data analysis (< 1TB)
  • Use DuckDB in notebooks for fast iteration
  • Use DuckDB for data validation and QA
  • Use DuckDB with Pandas/Polars for zero-copy
  • Use DuckDB for prototyping before scaling to Spark

DON’T

  • Use DuckDB for > 1TB data (use Spark/Trino)
  • Use DuckDB for concurrent writes (not supported)
  • Forget to set memory limits (can OOM your machine)
  • Use DuckDB for production workloads at scale (use managed services)

Key Takeaways

  1. In-process SQL: Runs in your application, no separate database
  2. Zero-copy with Pandas: Direct integration, no serialization overhead
  3. Fast for analytics: Columnar, vectorized execution
  4. Single-node: Limited to one machine’s resources
  5. Perfect for notebooks: Fast iteration, familiar SQL
  6. Free and open source: No licensing costs
  7. Not for production at scale: Use Spark/Trino for large data
