DuckDB Guide
In-Process SQL OLAP Database
Overview
DuckDB is an in-process SQL OLAP database, similar to SQLite but for analytics. It’s single-node, columnar, and optimized for analytical queries. DuckDB is ideal for local data analysis, notebooks, and workloads that don’t require distributed processing.
Core Concepts
DuckDB Architecture
Key Characteristics
| Characteristic | Description | Benefit |
|---|---|---|
| In-process | Runs in application process | No separate database server |
| Columnar | Column-oriented storage | Fast analytical queries |
| Vectorized | Process data in batches | CPU efficiency |
| Zero-copy | No data serialization | Fast integration with Pandas/Arrow |
| SQL | ANSI SQL compliant | Familiar query language |
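To make the "in-process" and "zero-copy" rows concrete, here is a minimal sketch (the DataFrame and column names are illustrative): DuckDB runs inside the Python process and queries a Pandas DataFrame in place, with no server and no import step.

```python
import duckdb
import pandas as pd

# An ordinary in-memory DataFrame
sales = pd.DataFrame({'region': ['EU', 'US', 'EU'], 'amount': [10.0, 20.0, 5.0]})

# DuckDB scans the DataFrame directly from the calling scope -- no server, no copy into a database
totals = duckdb.query("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""").to_df()
print(totals)
```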
Getting Started
Installation
```
# Python
pip install duckdb

# R
install.packages("duckdb")

# CLI
# Download a prebuilt binary from https://duckdb.org/
```
Basic Usage
```python
import duckdb

# In-memory database
con = duckdb.connect(database=':memory:')

# Persistent database
con = duckdb.connect(database='local.duckdb')

# Execute SQL
result = con.execute("SELECT 1").fetchall()
print(result)  # [(1,)]

# Create table
con.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name VARCHAR,
        email VARCHAR,
        created_at TIMESTAMP
    )
""")

# Insert data
con.execute("""
    INSERT INTO users VALUES
        (1, 'Alice', 'alice@example.com', '2025-01-27 10:00:00'),
        (2, 'Bob', 'bob@example.com', '2025-01-27 11:00:00')
""")

# Query data into a Pandas DataFrame
result = con.execute("SELECT * FROM users").fetchdf()
print(result)
```
Data Import/Export
Reading Data
```python
# Read CSV
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_csv('events.csv')
""")

# Read Parquet
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('events.parquet')
""")

# Read JSON (auto-detect schema)
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_json('events.json')
""")

# Read multiple files (glob pattern)
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('s3://bucket/events/*.parquet')
""")

# Read a Pandas DataFrame (referenced by variable name)
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con.execute("CREATE TABLE df AS SELECT * FROM df")

# Read a Polars DataFrame (referenced by variable name)
import polars as pl
df_pl = pl.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con.execute("CREATE TABLE df_pl AS SELECT * FROM df_pl")
```
Exporting Data
```python
# Export to Parquet
con.execute("""
    COPY users TO 'users.parquet' (FORMAT PARQUET)
""")

# Export to CSV
con.execute("""
    COPY users TO 'users.csv' (FORMAT CSV, HEADER)
""")

# Export to a Pandas DataFrame
df = con.execute("SELECT * FROM users").fetchdf()

# Export to an Arrow table
import pyarrow as pa
table = con.execute("SELECT * FROM users").arrow()
```
SQL Features
Window Functions
```python
# Row number per user, most recent event first
con.execute("""
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) AS row_num
    FROM events
""").fetchdf()

# 7-day moving average
con.execute("""
    SELECT
        date,
        revenue,
        AVG(revenue) OVER (
            ORDER BY date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS moving_avg_7day
    FROM daily_revenue
""").fetchdf()
```
Arrays and Structs
```python
# Array (list) operations
con.execute("""
    SELECT
        user_id,
        array_length(event_types) AS event_count,
        array_distinct(event_types) AS unique_events
    FROM user_events
""").fetchdf()

# Struct operations (dot notation for field access)
con.execute("""
    SELECT
        user_id,
        address.city,
        address.zipcode
    FROM users
""").fetchdf()
```
CTEs
```python
con.execute("""
    WITH user_stats AS (
        SELECT
            user_id,
            COUNT(*) AS event_count,
            SUM(amount) AS total_amount
        FROM events
        GROUP BY user_id
    )
    SELECT
        u.user_id,
        u.name,
        s.event_count,
        s.total_amount
    FROM users u
    JOIN user_stats s ON u.user_id = s.user_id
    WHERE s.event_count > 100
""").fetchdf()
```
Performance Optimization
Parallel Execution
```python
# Set worker threads (default: number of CPU cores)
con.execute("SET threads = 8")

# Raise the memory limit so parallel operators have room to work
con.execute("SET memory_limit = '8GB'")
```
Indexing
```python
# Create an index; DuckDB indexes are ART indexes, useful for point lookups
# and highly selective filters, less so for full analytical scans
con.execute("CREATE INDEX idx_users_email ON users(email)")

# Index a timestamp column for selective range predicates
con.execute("CREATE INDEX idx_users_created ON users(created_at)")
```
Pragmas (Configuration)
```python
# Memory limit
con.execute("SET memory_limit='8GB'")

# Temp directory for data that spills to disk
con.execute("SET temp_directory='./temp'")
```
The vector size (the batch size used by the vectorized engine, 2048 values by default) is fixed when DuckDB is built and is not a runtime setting.
Pandas Integration
Zero-Copy Integration
```python
import pandas as pd
import duckdb

# Create a Pandas DataFrame
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['x', 'y', 'z', 'w', 'v'],
    'c': [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Query the DataFrame directly (zero-copy)
result = duckdb.query("SELECT a, c FROM df WHERE a > 2").to_df()

# Aggregate the DataFrame
result = duckdb.query("""
    SELECT b, COUNT(*) AS count, AVG(c) AS avg_c
    FROM df
    GROUP BY b
""").to_df()

# Join multiple DataFrames
df1 = pd.DataFrame({'key': [1, 2, 3], 'value1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [1, 2, 4], 'value2': ['x', 'y', 'z']})

result = duckdb.query("""
    SELECT df1.key, df1.value1, df2.value2
    FROM df1
    LEFT JOIN df2 ON df1.key = df2.key
""").to_df()
```
Replacing Pandas Operations
```python
# Pandas: slow for large data
# df.groupby('category').agg({'amount': ['sum', 'mean', 'count']})

# DuckDB: fast for large data
result = duckdb.query("""
    SELECT
        category,
        SUM(amount) AS total,
        AVG(amount) AS average,
        COUNT(*) AS count
    FROM df
    GROUP BY category
""").to_df()
```
Polars Integration
```python
import polars as pl
import duckdb

# Create a Polars DataFrame
df_pl = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['x', 'y', 'z', 'w', 'v'],
    'c': [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Query the Polars DataFrame (zero-copy via Arrow); .pl() already returns a Polars DataFrame
result = duckdb.query("SELECT a, c FROM df_pl WHERE a > 2").pl()

# Or fetch the result as a Pandas DataFrame instead
df_pd = duckdb.query("SELECT a, c FROM df_pl WHERE a > 2").to_df()
```
Working with Cloud Storage
S3 Integration
```python
# Install and load the httpfs extension
con.execute("""
    INSTALL httpfs;
    LOAD httpfs;
""")

# Set S3 credentials
con.execute("SET s3_region='us-west-2'")
con.execute("SET s3_access_key_id='YOUR_ACCESS_KEY'")
con.execute("SET s3_secret_access_key='YOUR_SECRET_KEY'")

# Read from S3 into a local table
con.execute("""
    CREATE TABLE s3_data AS
    SELECT * FROM read_parquet('s3://bucket/path/to/data/*.parquet')
""")

# Query S3 directly
result = con.execute("""
    SELECT *
    FROM read_parquet('s3://bucket/path/to/data/*.parquet')
    WHERE date >= '2025-01-01'
""").fetchdf()
```
GCS Integration
```python
# GCS also goes through the httpfs extension (private buckets need HMAC keys
# via GCS's S3-compatible API)
con.execute("""
    INSTALL httpfs;
    LOAD httpfs;
""")

# Read from GCS
con.execute("""
    CREATE TABLE gcs_data AS
    SELECT * FROM read_parquet('gs://bucket/path/to/data/*.parquet')
""")
```
Advanced Features
Full-Text Search
```python
# Create an FTS index (the index needs a document id column; article_id here)
con.execute("INSTALL fts; LOAD fts;")

con.execute("""
    PRAGMA create_fts_index('articles', 'article_id', 'content');
""")

# Search with the generated BM25 macro
result = con.execute("""
    SELECT *, fts_main_articles.match_bm25(article_id, 'search query') AS score
    FROM articles
    WHERE score IS NOT NULL
    ORDER BY score DESC
""").fetchdf()
```
Spatial Support
```python
# Install the spatial extension
con.execute("INSTALL spatial; LOAD spatial;")

# Load spatial data
con.execute("""
    CREATE TABLE places AS
    SELECT * FROM ST_Read('places.geojson')
""")

# Spatial query
result = con.execute("""
    SELECT *
    FROM places
    WHERE ST_Within(geom, ST_GeomFromText('POLYGON((...))'))
""").fetchdf()
```
Use Cases
Use Case 1: Notebooks
```python
# Jupyter/Colab notebook
import duckdb
import pandas as pd

con = duckdb.connect()

# Read data from multiple sources
df_csv = pd.read_csv('data.csv')
con.execute("CREATE TABLE csv_data AS SELECT * FROM df_csv")

con.execute("CREATE TABLE parquet_data AS SELECT * FROM read_parquet('data.parquet')")

# Join and analyze
result = con.execute("""
    SELECT
        csv_data.id,
        csv_data.value,
        parquet_data.metric
    FROM csv_data
    LEFT JOIN parquet_data ON csv_data.id = parquet_data.id
""").fetchdf()

# Visualize
result.plot(kind='bar', x='id', y='value')
```
Use Case 2: Data Validation
```python
# Validate data before loading it into the warehouse
import duckdb

con = duckdb.connect()  # staging_data, orders, customers assumed to be loaded already

# Check for nulls and invalid values
con.execute("""
    SELECT
        COUNT(*) AS total_rows,
        SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS null_ids,
        SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) AS negative_amounts
    FROM staging_data
""").fetchdf()

# Data quality checks
con.execute("""
    -- Check referential integrity
    SELECT COUNT(*) AS orphaned_records
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE c.id IS NULL
""").fetchdf()
```
Use Case 3: Local ML Feature Engineering
```python
# Feature engineering for ML
import duckdb
import pandas as pd

con = duckdb.connect()

con.execute("""
    -- Create per-user features
    WITH user_features AS (
        SELECT
            user_id,
            COUNT(*) AS total_events,
            SUM(amount) AS total_amount,
            AVG(amount) AS avg_amount,
            STDDEV(amount) AS stddev_amount,
            MIN(timestamp) AS first_event,
            MAX(timestamp) AS last_event
        FROM events
        GROUP BY user_id
    ),
    time_features AS (
        SELECT
            user_id,
            COUNT(DISTINCT DATE_PART('dayofweek', timestamp)) AS active_days_of_week,
            COUNT(DISTINCT DATE_PART('hour', timestamp)) AS active_hours
        FROM events
        GROUP BY user_id
    )
    SELECT *
    FROM user_features
    JOIN time_features USING (user_id)
""").fetchdf()
```
Senior Level Gotchas
Gotcha 1: Single-Node Limitation
Problem: DuckDB is single-node, limited to one machine’s resources.
Solution: Use DuckDB for < 1TB data, use Spark/Trino for larger.
```python
# Bad: trying to process 10 TB in DuckDB
# Result: OOM or extremely slow

# Good: use DuckDB for sampling/prototyping, Spark for the full dataset
```
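A minimal sketch of the "sample locally, scale later" approach using DuckDB's USING SAMPLE clause; the S3 path and the 1% rate are illustrative, not from the original:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Pull a small random sample of a large Parquet dataset for local prototyping;
# the full-scale job would run on Spark/Trino instead.
sample = con.execute("""
    SELECT *
    FROM read_parquet('s3://bucket/events/*.parquet')
    USING SAMPLE 1 PERCENT
""").fetchdf()
```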
Gotcha 2: No Concurrent Writes
Problem: A DuckDB database file can be opened read-write by only one process at a time; other processes can attach it only in read-only mode.
Solution: Design write-once, read-many patterns: a single writer produces the file, any number of readers open it read-only (see the sketch below).
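A minimal sketch of that pattern using the Python API's read_only flag; the file name and table are illustrative:

```python
import duckdb

# Writer process: the only connection allowed to modify the file
writer = duckdb.connect('analytics.duckdb')
writer.execute("CREATE OR REPLACE TABLE metrics AS SELECT 1 AS id, 42 AS value")
writer.close()

# Reader processes: any number may open the same file read-only
reader = duckdb.connect('analytics.duckdb', read_only=True)
print(reader.execute("SELECT * FROM metrics").fetchall())
reader.close()
```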
Gotcha 3: Memory Pressure
Problem: By default, DuckDB's memory limit is a large fraction of system RAM (around 80%), which can starve other processes on the machine.
Solution: Set memory limit.
```python
con.execute("SET memory_limit='4GB'")
```
Best Practices
DO
- Use DuckDB for local data analysis (< 1TB)
- Use DuckDB in notebooks for fast iteration
- Use DuckDB for data validation and QA
- Use DuckDB with Pandas/Polars for zero-copy
- Use DuckDB for prototyping before scaling to Spark
DON’T
- Use DuckDB for > 1TB data (use Spark/Trino)
- Use DuckDB for concurrent writes (not supported)
- Forget to set memory limits (can OOM your machine)
- Use DuckDB for production workloads at scale (use managed services)
Key Takeaways
- In-process SQL: Runs in your application, no separate database
- Zero-copy with Pandas: Direct integration, no serialization overhead
- Fast for analytics: Columnar, vectorized execution
- Single-node: Limited to one machine’s resources
- Perfect for notebooks: Fast iteration, familiar SQL
- Free and open source: No licensing costs
- Not for production at scale: Use Spark/Trino for large data
Back to Compute Engines