Compute Engines
Spark, Trino, DuckDB, ClickHouse - Selection and Usage
Overview
Compute engines are the processing layer in modern data architectures. Selecting the right engine for each workload is critical for performance and cost optimization. This document provides a comparison of major compute engines and selection criteria.
Quick Comparison Matrix
| Engine | Type | Best For | Scale | Cost | Ecosystem |
|---|---|---|---|---|---|
| Apache Spark | Distributed Batch | Heavy ETL, ML | TB-PB | Medium | Universal |
| Trino/Presto | Distributed SQL | Interactive BI | TB-PB | Medium | Broad |
| DuckDB | Single-Node | Local analytics, < 100GB | < 1TB | Very Low | Growing |
| ClickHouse | Columnar DB | Real-time analytics | TB-PB | Low | Specialized |
| BigQuery | Serverless | Ad-hoc analytics | TB-PB | High | GCP-native |
| Snowflake | Cloud Warehouse | Enterprise analytics | TB-PB | High | Broad |
Selection Framework
Match each workload to the engine whose strengths fit it: the deep dives below lay out the trade-offs, and the Selection Decision Tree later in this document condenses them into rules of thumb.
Engine Deep Dives
Apache Spark
Overview: Unified analytics engine for batch processing, ML, and streaming.
Strengths:
- Unified platform (batch + streaming + ML)
- Mature ecosystem (2,000+ contributors)
- Excellent for complex ETL
- Strong ML integration (MLlib, MLflow)
- Broad language support (Scala, Python, SQL, R, Java)
Weaknesses:
- High latency (not ideal for interactive queries)
- Cluster overhead (not efficient for small workloads)
- Complex tuning (many configuration options)
- Memory-intensive (requires careful JVM tuning)
Best for:
- Batch ETL at scale
- Complex transformations
- Machine learning pipelines
- Data processing > 100GB
Cost: $0.50-2.00 per TB processed (with spot instances)
See: Spark Guide
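A minimal PySpark batch-ETL sketch of the pattern described above; the bucket, paths, and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-etl-sketch").getOrCreate()

# Typical heavy-ETL shape: read raw data, filter, derive, aggregate,
# then write back partitioned for downstream engines.
events = spark.read.parquet("s3://my-bucket/raw/events/")

daily = (
    events
    .filter(F.col("status") == "ok")
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "user_id")
    .agg(F.count("*").alias("event_count"))
)

daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://my-bucket/curated/daily_events/"
)
```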
Trino (formerly PrestoSQL)
Overview: Distributed SQL engine for interactive analytics.
Strengths:
- Low latency (sub-second to seconds)
- ANSI SQL standard
- Federation (query multiple sources)
- No data movement (query in place)
- Broad connector support
Weaknesses:
- Not for complex ETL (SQL-only)
- Smaller ecosystem than Spark
- Limited stateful operations
- Less mature for ML
Best for:
- Interactive BI queries
- Data federation
- Ad-hoc analytics
- SQL-only workloads
Cost: $0.50-1.50 per TB processed
See: Trino Guide
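A hedged sketch of federation through the Trino Python client (the `trino` package); the coordinator host, catalogs, and tables are hypothetical:

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",   # hypothetical coordinator
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One query joins a Hive-catalog table to a PostgreSQL-catalog table,
# reading both sources in place with no data movement.
cur.execute("""
    SELECT o.order_date, c.region, sum(o.amount) AS revenue
    FROM hive.sales.orders AS o
    JOIN postgresql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY o.order_date, c.region
""")
for row in cur.fetchall():
    print(row)
```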
DuckDB
Overview: In-process SQL OLAP database (single-node).
Strengths:
- Fast (columnar, vectorized)
- Simple (no cluster, single binary)
- Portable (runs in notebooks)
- Free and open source
- Pandas/Polars integration
Weaknesses:
- Single-node only (scales vertically, not across machines)
- Limited to ~1TB data
- No distributed processing
- Smaller connector ecosystem
Best for:
- Local data analysis
- Notebooks and prototyping
- Data < 100GB
- Single-machine workloads
Cost: Free (use your own machine)
See: DuckDB Guide
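A minimal notebook-style sketch, assuming `pip install duckdb` and a hypothetical local Parquet file:

```python
import duckdb

con = duckdb.connect()  # in-memory database, no server to run

# DuckDB scans Parquet in place: no load step, and results come back
# as a pandas DataFrame for further analysis.
top_users = con.execute("""
    SELECT user_id, count(*) AS events
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").df()

print(top_users)
```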
ClickHouse
Overview: Real-time columnar database for analytics.
Strengths:
- Extremely fast ingestion
- Excellent compression
- Real-time query performance
- SQL support
- Efficient for time-series
Weaknesses:
- Weak at complex multi-table joins
- Limited ecosystem
- Operational complexity
- Learning curve (SQL dialect and table engines differ from traditional databases)
Best for:
- Real-time analytics
- Time-series data
- High-ingestion workloads
- Event data analysis
Cost: $0.30-1.00 per TB processed
See: ClickHouse Guide
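A hedged sketch using the `clickhouse-connect` driver against a hypothetical local server; the table illustrates the typical time-series MergeTree layout:

```python
from datetime import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # hypothetical server

# MergeTree ordered by (sensor, ts): the usual time-series layout,
# which is what gives ClickHouse its compression and scan speed.
client.command("""
    CREATE TABLE IF NOT EXISTS readings (
        ts     DateTime,
        sensor LowCardinality(String),
        value  Float64
    ) ENGINE = MergeTree
    ORDER BY (sensor, ts)
""")

client.insert(
    "readings",
    [[datetime(2024, 1, 1), "s1", 0.5]],
    column_names=["ts", "sensor", "value"],
)

print(client.query(
    "SELECT sensor, avg(value) FROM readings GROUP BY sensor"
).result_rows)
```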
Comparison by Dimension
Performance
| Scenario | Spark | Trino | DuckDB | ClickHouse |
|---|---|---|---|---|
| Full table scan | Good | Good | Excellent | Excellent |
| Complex ETL | Excellent | Poor | Good | Poor |
| Interactive query | Poor | Excellent | Excellent | Good |
| Streaming ingest | Good | Poor | N/A | Excellent |
| ML training | Excellent | Poor | Good | Poor |
Scalability
| Data Size | Spark | Trino | DuckDB | ClickHouse |
|---|---|---|---|---|
| < 10GB | Overkill | Overkill | Optimal | Overkill |
| 10-100GB | Good | Good | Optimal | Good |
| 100GB-10TB | Excellent | Excellent | Not applicable | Excellent |
| > 10TB | Excellent | Excellent | Not applicable | Good |
Cost Efficiency
| Workload | Most Cost-Effective |
|---|---|
| Local analysis | DuckDB (free) |
| Batch ETL | Spark with spot instances |
| Interactive BI | Trino with auto-scaling |
| Real-time analytics | ClickHouse |
| Occasional queries | Serverless (BigQuery) |
Architecture Patterns
Pattern 1: Multi-Engine Lakehouse
Several engines share one open-format storage layer: Spark writes the heavy curated tables, Trino serves interactive BI on them, and DuckDB covers local exploration (see the sketch below).
Cost optimization: Right-size each engine for its workload.
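An illustrative sketch of the shared-storage handoff, assuming Parquet on a hypothetical S3 bucket and that S3 credentials are already configured for both engines:

```python
import duckdb
from pyspark.sql import SparkSession

CURATED = "s3://my-bucket/curated/daily_events/"  # hypothetical shared path

# Heavy engine: Spark produces the curated layer.
spark = SparkSession.builder.appName("lakehouse-writer").getOrCreate()
spark.read.parquet("s3://my-bucket/raw/").write.mode("overwrite").parquet(CURATED)

# Light engine: DuckDB queries the very same files, no copy, no cluster.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables s3:// paths in DuckDB
print(con.execute(f"SELECT count(*) FROM '{CURATED}*.parquet'").fetchone())
```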
Pattern 2: Tiered Processing
Data moves through tiers matched to engine strengths: raw-to-curated transformation runs in Spark, then interactive and real-time consumers hit the curated tier through Trino or ClickHouse.
Use case: Match engine to processing requirements.
Selection Decision Tree
For New Workloads
- Data under ~100GB on a single machine: DuckDB
- Real-time or high-ingest analytics: ClickHouse
- Interactive SQL, BI, or cross-source federation: Trino
- Heavy batch ETL or ML pipelines: Spark
The sketch after this list encodes the same rules.
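A rough encoding of those heuristics as a helper function; the thresholds mirror the tables above and are rules of thumb, not hard limits:

```python
def pick_engine(data_gb: float, workload: str) -> str:
    """Rule-of-thumb selection; workload is one of
    'etl', 'ml', 'interactive', 'realtime'."""
    if workload == "realtime":
        return "ClickHouse"          # real-time / high-ingest analytics
    if data_gb < 100:
        return "DuckDB"              # fits comfortably on one machine
    if workload in ("etl", "ml"):
        return "Spark"               # heavy batch / training pipelines
    return "Trino"                   # interactive SQL, BI, federation

print(pick_engine(10, "interactive"))   # -> DuckDB
print(pick_engine(5000, "etl"))         # -> Spark
```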
For Migration
| From | To | When |
|---|---|---|
| Traditional warehouse | Trino | For BI workloads |
| Hadoop MapReduce | Spark | Direct replacement |
| Pandas on large data | DuckDB | When data doesn’t fit in memory (sketch below) |
| Real-time DB | ClickHouse | For time-series analytics |
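For the pandas row above: DuckDB can query a pandas DataFrame in place via replacement scans and, unlike pandas, can stream aggregations over Parquet files larger than memory. A minimal sketch with hypothetical data:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# DuckDB resolves `df` by name from the local scope (replacement scan),
# so existing DataFrames can be queried without copying them.
print(duckdb.sql("SELECT user_id, sum(amount) FROM df GROUP BY user_id"))

# For datasets that exceed RAM, skip pandas and scan the files directly;
# DuckDB streams the aggregation instead of materializing the input:
# duckdb.sql("SELECT count(*) FROM 'big/*.parquet'")
```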
Cost Comparison
Per-TB Processing Cost
| Engine | On-Demand | Spot/Preemptible | Notes |
|---|---|---|---|
| Spark (EMR) | $1.50 | $0.45 (70% off) | With spot instances |
| Spark (Databricks) | $2.00 | N/A | Premium pricing |
| Trino (EMR) | $1.20 | $0.36 (70% off) | With spot instances |
| DuckDB | $0 | $0 | Use own machine |
| ClickHouse | $0.80 | $0.24 (70% off) | With spot instances |
| BigQuery | $5.00 | N/A | On-demand pricing |
| Snowflake | $3.00 | N/A | On-demand pricing |
TCO Comparison
Scenario: 1TB data, 100 queries/day (10GB average scan)
| Engine | Compute | Storage | Monthly |
|---|---|---|---|
| DuckDB (local) | $0 | $0 | $0 |
| Spark (spot) | $45 | $23 | $68 |
| Trino (spot) | $36 | $23 | $59 |
| BigQuery | $150 | $23 | $173 |
Note: DuckDB is free but bounded by a single machine’s resources; the cloud engines scale out elastically as data and query volume grow.
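As a sanity check on the BigQuery line, the monthly figure follows directly from the $5/TB on-demand rate, assuming roughly $0.023/GB-month object storage:

```python
queries_per_day = 100
scan_gb_per_query = 10
price_per_tb = 5.00                 # BigQuery on-demand rate from the table

tb_per_month = queries_per_day * scan_gb_per_query * 30 / 1000   # 30 TB
compute = tb_per_month * price_per_tb                            # $150
storage = 1000 * 0.023                                           # ~$23 for 1 TB
print(f"${compute:.0f} + ${storage:.0f} = ${compute + storage:.0f}/month")
```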
Implementation Best Practices
DO
- Use Spark for heavy ETL
- Use Trino for interactive BI
- Use DuckDB for local development
- Use ClickHouse for real-time analytics
- Use spot instances for batch workloads
- Right-size clusters for the workload (see the sketch below)
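A hedged sketch of right-sizing in Spark: enable dynamic allocation but cap executors so a small job cannot balloon the cluster. The keys are standard Spark settings; the values are illustrative, and spot/preemptible capacity is configured at the cluster level, not here:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("right-sized-job")
    # Scale executors with demand, but never beyond a hard cap.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Needed for dynamic allocation without an external shuffle service.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Modest executors; tune to the data volume rather than over-provisioning.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```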
DON’T
- Use Spark for interactive queries (too slow)
- Use DuckDB for > 1TB data (won’t scale)
- Use a cloud warehouse (BigQuery, Snowflake) for batch ETL (too expensive)
- Ignore spot instance savings (60-80% discount)
- Over-provision clusters (waste money)
Key Takeaways
- Spark for ETL: Best for complex batch processing
- Trino for BI: Best for interactive SQL queries
- DuckDB for local: Best for single-machine analytics
- ClickHouse for real-time: Best for streaming analytics
- Match engine to workload: Right tool for the job
- Use spot instances: 60-80% savings for batch workloads
- Multi-engine is normal: Lakehouse supports multiple engines