Compute Engines
Spark, Trino, DuckDB, ClickHouse - Selection and Usage
Overview
Compute engines are the processing layer in modern data architectures. Selecting the right engine for each workload is critical for performance and cost optimization. This document provides a comparison of major compute engines and selection criteria.
Quick Comparison Matrix
| Engine | Type | Best For | Scale | Cost | Ecosystem |
|---|---|---|---|---|---|
| Apache Spark | Distributed Batch | Heavy ETL, ML | TB-PB | Medium | Universal |
| Trino/Presto | Distributed SQL | Interactive BI | TB-PB | Medium | Broad |
| DuckDB | Single-Node | Local analytics, < 100GB | < 1TB | Very Low | Growing |
| ClickHouse | Columnar DB | Real-time analytics | TB-PB | Low | Specialized |
| BigQuery | Serverless | Ad-hoc analytics | TB-PB | High | GCP-native |
| Snowflake | Cloud Warehouse | Enterprise analytics | TB-PB | High | Broad |
Selection Framework
Match each workload to the engine whose strengths fit it: the deep dives below lay out the trade-offs, and the Selection Decision Tree later in this document condenses them into rules of thumb.
Engine Deep Dives
Apache Spark
Overview: Unified analytics engine for batch processing, ML, and streaming.
Strengths:
- Unified platform (batch + streaming + ML)
- Mature ecosystem (2,000+ contributors)
- Excellent for complex ETL
- Strong ML integration (MLlib, MLflow)
- Broad language support (Scala, Python, SQL, R, Java)
Weaknesses:
- High latency (not ideal for interactive queries)
- Cluster overhead (not efficient for small workloads)
- Complex tuning (many configuration options)
- Memory-intensive (requires careful JVM tuning)
Best for:
- Batch ETL at scale
- Complex transformations
- Machine learning pipelines
- Data processing > 100GB
Cost: $0.50-2.00 per TB processed (with spot instances)
See: Spark Guide
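A minimal PySpark batch-ETL sketch of the pattern described above; the bucket, paths, and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-etl-sketch").getOrCreate()

# Typical heavy-ETL shape: read raw data, filter, derive, aggregate,
# then write back partitioned for downstream engines.
events = spark.read.parquet("s3://my-bucket/raw/events/")

daily = (
    events
    .filter(F.col("status") == "ok")
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "user_id")
    .agg(F.count("*").alias("event_count"))
)

daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://my-bucket/curated/daily_events/"
)
```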
Trino (formerly PrestoSQL)
Overview: Distributed SQL engine for interactive analytics.
Strengths:
- Low latency (sub-second to seconds)
- ANSI SQL standard
- Federation (query multiple sources)
- No data movement (query in place)
- Broad connector support
Weaknesses:
- Not for complex ETL (SQL-only)
- Smaller ecosystem than Spark
- Limited stateful operations
- Less mature for ML
Best for:
- Interactive BI queries
- Data federation
- Ad-hoc analytics
- SQL-only workloads
Cost: $0.50-1.50 per TB processed
See: Trino Guide
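A hedged sketch of federation through the Trino Python client (the `trino` package); the coordinator host, catalogs, and tables are hypothetical:

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",   # hypothetical coordinator
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One query joins a Hive-catalog table to a PostgreSQL-catalog table,
# reading both sources in place with no data movement.
cur.execute("""
    SELECT o.order_date, c.region, sum(o.amount) AS revenue
    FROM hive.sales.orders AS o
    JOIN postgresql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY o.order_date, c.region
""")
for row in cur.fetchall():
    print(row)
```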
DuckDB
Overview: In-process SQL OLAP database (single-node).
Strengths:
- Fast (columnar, vectorized)
- Simple (no cluster, single binary)
- Portable (runs in notebooks)
- Free and open source
- Pandas/Polars integration
Weaknesses:
- Single-node only (scales vertically, not across machines)
- Limited to ~1TB data
- No distributed processing
- Smaller connector ecosystem
Best for:
- Local data analysis
- Notebooks and prototyping
- Data < 100GB
- Single-machine workloads
Cost: Free (use your own machine)
See: DuckDB Guide
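A minimal notebook-style sketch, assuming `pip install duckdb` and a hypothetical local Parquet file:

```python
import duckdb

con = duckdb.connect()  # in-memory database, no server to run

# DuckDB scans Parquet in place: no load step, and results come back
# as a pandas DataFrame for further analysis.
top_users = con.execute("""
    SELECT user_id, count(*) AS events
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").df()

print(top_users)
```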
ClickHouse
Overview: Real-time columnar database for analytics.
Strengths:
- Extremely fast ingestion
- Excellent compression
- Real-time query performance
- SQL support
- Efficient for time-series
Weaknesses:
- Weak at complex multi-table joins
- Limited ecosystem
- Operational complexity
- Learning curve (SQL dialect and table engines differ from traditional databases)
Best for:
- Real-time analytics
- Time-series data
- High-ingestion workloads
- Event data analysis
Cost: $0.30-1.00 per TB processed
See: ClickHouse Guide
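A hedged sketch using the `clickhouse-connect` driver against a hypothetical local server; the table illustrates the typical time-series MergeTree layout:

```python
from datetime import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # hypothetical server

# MergeTree ordered by (sensor, ts): the usual time-series layout,
# which is what gives ClickHouse its compression and scan speed.
client.command("""
    CREATE TABLE IF NOT EXISTS readings (
        ts     DateTime,
        sensor LowCardinality(String),
        value  Float64
    ) ENGINE = MergeTree
    ORDER BY (sensor, ts)
""")

client.insert(
    "readings",
    [[datetime(2024, 1, 1), "s1", 0.5]],
    column_names=["ts", "sensor", "value"],
)

print(client.query(
    "SELECT sensor, avg(value) FROM readings GROUP BY sensor"
).result_rows)
```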
Comparison by Dimension
Performance
| Scenario | Spark | Trino | DuckDB | ClickHouse |
|---|---|---|---|---|
| Full table scan | Good | Good | Excellent | Excellent |
| Complex ETL | Excellent | Poor | Good | Poor |
| Interactive query | Poor | Excellent | Excellent | Good |
| Streaming ingest | Good | Poor | N/A | Excellent |
| ML training | Excellent | Poor | Good | Poor |
Scalability
| Data Size | Spark | Trino | DuckDB | ClickHouse |
|---|---|---|---|---|
| < 10GB | Overkill | Overkill | Optimal | Overkill |
| 10-100GB | Good | Good | Optimal | Good |
| 100GB-10TB | Excellent | Excellent | Not applicable | Excellent |
| > 10TB | Excellent | Excellent | Not applicable | Good |
Cost Efficiency
| Workload | Most Cost-Effective |
|---|---|
| Local analysis | DuckDB (free) |
| Batch ETL | Spark with spot instances |
| Interactive BI | Trino with auto-scaling |
| Real-time analytics | ClickHouse |
| Occasional queries | Serverless (BigQuery) |
Architecture Patterns
Pattern 1: Multi-Engine Lakehouse
Several engines share one open-format storage layer: Spark writes the heavy curated tables, Trino serves interactive BI on them, and DuckDB covers local exploration (see the sketch below).
Cost optimization: Right-size each engine for its workload.
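An illustrative sketch of the shared-storage handoff, assuming Parquet on a hypothetical S3 bucket and that S3 credentials are already configured for both engines:

```python
import duckdb
from pyspark.sql import SparkSession

CURATED = "s3://my-bucket/curated/daily_events/"  # hypothetical shared path

# Heavy engine: Spark produces the curated layer.
spark = SparkSession.builder.appName("lakehouse-writer").getOrCreate()
spark.read.parquet("s3://my-bucket/raw/").write.mode("overwrite").parquet(CURATED)

# Light engine: DuckDB queries the very same files, no copy, no cluster.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables s3:// paths in DuckDB
print(con.execute(f"SELECT count(*) FROM '{CURATED}*.parquet'").fetchone())
```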
Pattern 2: Tiered Processing
Data moves through tiers matched to engine strengths: raw-to-curated transformation runs in Spark, then interactive and real-time consumers hit the curated tier through Trino or ClickHouse.
Use case: Match engine to processing requirements.
Selection Decision Tree
For New Workloads
- Data under ~100GB on a single machine: DuckDB
- Real-time or high-ingest analytics: ClickHouse
- Interactive SQL, BI, or cross-source federation: Trino
- Heavy batch ETL or ML pipelines: Spark
The sketch after this list encodes the same rules.
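A rough encoding of those heuristics as a helper function; the thresholds mirror the tables above and are rules of thumb, not hard limits:

```python
def pick_engine(data_gb: float, workload: str) -> str:
    """Rule-of-thumb selection; workload is one of
    'etl', 'ml', 'interactive', 'realtime'."""
    if workload == "realtime":
        return "ClickHouse"          # real-time / high-ingest analytics
    if data_gb < 100:
        return "DuckDB"              # fits comfortably on one machine
    if workload in ("etl", "ml"):
        return "Spark"               # heavy batch / training pipelines
    return "Trino"                   # interactive SQL, BI, federation

print(pick_engine(10, "interactive"))   # -> DuckDB
print(pick_engine(5000, "etl"))         # -> Spark
```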
For Migration
| From | To | When |
|---|---|---|
| Traditional warehouse | Trino | For BI workloads |
| Hadoop MapReduce | Spark | Direct replacement |
| Pandas on large data | DuckDB | When data doesn’t fit in memory (sketch below) |
| Real-time DB | ClickHouse | For time-series analytics |
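For the pandas row above: DuckDB can query a pandas DataFrame in place via replacement scans and, unlike pandas, can stream aggregations over Parquet files larger than memory. A minimal sketch with hypothetical data:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# DuckDB resolves `df` by name from the local scope (replacement scan),
# so existing DataFrames can be queried without copying them.
print(duckdb.sql("SELECT user_id, sum(amount) FROM df GROUP BY user_id"))

# For datasets that exceed RAM, skip pandas and scan the files directly;
# DuckDB streams the aggregation instead of materializing the input:
# duckdb.sql("SELECT count(*) FROM 'big/*.parquet'")
```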
Cost Comparison
Per-TB Processing Cost
| Engine | On-Demand | Spot/Preemptible | Notes |
|---|---|---|---|
| Spark (EMR) | $1.50 | $0.45 (70% off) | With spot instances |
| Spark (Databricks) | $2.00 | N/A | Premium pricing |
| Trino (EMR) | $1.20 | $0.36 (70% off) | With spot instances |
| DuckDB | $0 | $0 | Use own machine |
| ClickHouse | $0.80 | $0.24 (70% off) | With spot instances |
| BigQuery | $5.00 | N/A | On-demand pricing |
| Snowflake | $3.00 | N/A | On-demand pricing |
TCO Comparison
Scenario: 1TB data, 100 queries/day (10GB average scan)
| Engine | Compute | Storage | Monthly |
|---|---|---|---|
| DuckDB (local) | $0 | $0 | $0 |
| Spark (spot) | $45 | $23 | $68 |
| Trino (spot) | $36 | $23 | $59 |
| BigQuery | $150 | $23 | $173 |
Note: DuckDB is free but bounded by a single machine’s resources; the cloud engines scale out elastically as data and query volume grow.
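As a sanity check on the BigQuery line, the monthly figure follows directly from the $5/TB on-demand rate, assuming roughly $0.023/GB-month object storage:

```python
queries_per_day = 100
scan_gb_per_query = 10
price_per_tb = 5.00                 # BigQuery on-demand rate from the table

tb_per_month = queries_per_day * scan_gb_per_query * 30 / 1000   # 30 TB
compute = tb_per_month * price_per_tb                            # $150
storage = 1000 * 0.023                                           # ~$23 for 1 TB
print(f"${compute:.0f} + ${storage:.0f} = ${compute + storage:.0f}/month")
```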
Implementation Best Practices
DO
- Use Spark for heavy ETL
- Use Trino for interactive BI
- Use DuckDB for local development
- Use ClickHouse for real-time analytics
- Use spot instances for batch workloads
- Right-size clusters for the workload (see the sketch below)
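A hedged sketch of right-sizing in Spark: enable dynamic allocation but cap executors so a small job cannot balloon the cluster. The keys are standard Spark settings; the values are illustrative, and spot/preemptible capacity is configured at the cluster level, not here:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("right-sized-job")
    # Scale executors with demand, but never beyond a hard cap.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Needed for dynamic allocation without an external shuffle service.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Modest executors; tune to the data volume rather than over-provisioning.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```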
DON’T
- Use Spark for interactive queries (too slow)
- Use DuckDB for > 1TB data (won’t scale)
- Use a cloud warehouse (BigQuery, Snowflake) for batch ETL (too expensive)
- Ignore spot instance savings (60-80% discount)
- Over-provision clusters (waste money)
Key Takeaways
- Spark for ETL: Best for complex batch processing
- Trino for BI: Best for interactive SQL queries
- DuckDB for local: Best for single-machine analytics
- ClickHouse for real-time: Best for streaming analytics
- Match engine to workload: Right tool for the job
- Use spot instances: 60-80% savings for batch workloads
- Multi-engine is normal: Lakehouse supports multiple engines