
Compute Engines

Spark, Trino, DuckDB, ClickHouse - Selection and Usage


Overview

Compute engines are the processing layer in modern data architectures. Selecting the right engine for each workload is critical for performance and cost optimization. This document provides a comparison of major compute engines and selection criteria.


Quick Comparison Matrix

| Engine | Type | Best For | Scale | Cost | Ecosystem |
| --- | --- | --- | --- | --- | --- |
| Apache Spark | Distributed batch | Heavy ETL, ML | TB-PB | Medium | Universal |
| Trino/Presto | Distributed SQL | Interactive BI | TB-PB | Medium | Broad |
| DuckDB | Single-node | Local analytics, < 100GB | < 1TB | Very low | Growing |
| ClickHouse | Columnar DB | Real-time analytics | TB-PB | Low | Specialized |
| BigQuery | Serverless | Ad-hoc analytics | TB-PB | High | Google |
| Snowflake | Serverless | Enterprise analytics | TB-PB | High | Broad |

Selection Framework

Match each workload to the engine whose strengths fit it; the deep dives and comparison tables that follow supply the criteria.


Engine Deep Dives

Apache Spark

Overview: Unified analytics engine for batch processing, ML, and streaming.

Strengths:

  • Unified platform (batch + streaming + ML)
  • Mature ecosystem (thousands of contributors)
  • Excellent for complex ETL
  • Strong ML integration (MLflow)
  • Broad language support (Scala, Python, SQL, R, Java)

Weaknesses:

  • High latency (not ideal for interactive queries)
  • Cluster overhead (not efficient for small workloads)
  • Complex tuning (many configuration options)
  • Memory-intensive (requires careful JVM tuning)

Best for:

  • Batch ETL at scale
  • Complex transformations
  • Machine learning pipelines
  • Data processing > 100GB

Cost: $0.50-2.00 per TB processed (with spot instances)
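
As a minimal illustration of the batch-ETL pattern Spark is built for, here is a PySpark sketch; the S3 paths and column names are illustrative assumptions, not a prescribed layout:

```python
# Minimal PySpark batch ETL sketch: read raw Parquet, aggregate, write back.
# Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-etl").getOrCreate()

orders = spark.read.parquet("s3://lake/raw/orders/")  # raw zone
daily = (
    orders
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)
daily.write.mode("overwrite").partitionBy("order_date") \
     .parquet("s3://lake/curated/daily_revenue/")
```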

See: Spark Guide


Trino (formerly PrestoSQL)

Overview: Distributed SQL engine for interactive analytics.

Strengths:

  • Low latency (sub-second to seconds)
  • ANSI SQL standard
  • Federation (query multiple sources)
  • No data movement (query in place)
  • Broad connector support

Weaknesses:

  • Not suited to complex, multi-stage ETL (SQL-only, no procedural code)
  • Smaller ecosystem than Spark
  • Limited stateful operations
  • Less mature for ML

Best for:

  • Interactive BI queries
  • Data federation
  • Ad-hoc analytics
  • SQL-only workloads

Cost: $0.50-1.50 per TB processed
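
A sketch of federation with the trino Python client: one statement joins a Hive-catalog table to a PostgreSQL-catalog table, with no data movement. The host, catalogs, and table names are assumptions:

```python
# Federated query sketch with the `trino` Python client: join a Hive-catalog
# table against a PostgreSQL-catalog table in a single statement, in place.
# Host, catalogs, and table names are illustrative assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com", port=8080,
    user="analyst", catalog="hive", schema="web",
)
cur = conn.cursor()
cur.execute("""
    SELECT u.plan, count(*) AS events
    FROM hive.web.events e
    JOIN postgresql.app.users u ON e.user_id = u.id
    GROUP BY u.plan
""")
for plan, events in cur.fetchall():
    print(plan, events)
```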

See: Trino Guide


DuckDB

Overview: In-process SQL OLAP database (single-node).

Strengths:

  • Fast (columnar, vectorized)
  • Simple (no cluster, single binary)
  • Portable (runs in notebooks)
  • Free and open source
  • Pandas/Polars integration

Weaknesses:

  • Single-node (limited to one machine's CPU and memory)
  • Practical ceiling around ~1TB of data
  • No distributed processing
  • Smaller connector ecosystem

Best for:

  • Local data analysis
  • Notebooks and prototyping
  • Data < 100GB
  • Single-machine workloads

Cost: Free (use your own machine)
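
A minimal DuckDB sketch of the local-analysis workflow: query Parquet files and an in-memory pandas DataFrame side by side, with no server to run. File and column names are illustrative:

```python
# DuckDB sketch: query local Parquet files and a pandas DataFrame directly,
# in-process, with no cluster. File and column names are illustrative.
import duckdb
import pandas as pd

segments = pd.DataFrame({"region": ["EU", "US"], "tier": ["a", "b"]})

result = duckdb.sql("""
    SELECT e.region, s.tier, count(*) AS n
    FROM 'events/*.parquet' e
    JOIN segments s USING (region)   -- the DataFrame is scanned in place
    GROUP BY ALL
""").df()
print(result)
```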

See: DuckDB Guide


ClickHouse

Overview: Real-time columnar database for analytics.

Strengths:

  • Extremely fast ingestion
  • Excellent compression
  • Real-time query performance
  • SQL support
  • Efficient for time-series

Weaknesses:

  • Weak at large, complex joins (favors wide, denormalized tables)
  • Limited ecosystem
  • Operational complexity
  • Learning curve (different from traditional SQL)

Best for:

  • Real-time analytics
  • Time-series data
  • High-ingestion workloads
  • Event data analysis

Cost: $0.30-1.00 per TB processed
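
A sketch of the time-series pattern using the clickhouse-connect Python client: a MergeTree table ordered for time-range scans, then an aggregate over the last hour. Host, table, and column names are assumptions:

```python
# ClickHouse sketch with the `clickhouse-connect` client: a MergeTree table
# ordered for time-range scans, then a fast aggregate over recent events.
# Host, table, and column names are illustrative assumptions.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS events (
        ts DateTime,
        user_id UInt64,
        event LowCardinality(String)
    ) ENGINE = MergeTree ORDER BY (event, ts)
""")

rows = client.query("""
    SELECT event, count() AS n
    FROM events
    WHERE ts >= now() - INTERVAL 1 HOUR
    GROUP BY event
    ORDER BY n DESC
""").result_rows
```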

See: ClickHouse Guide


Comparison by Dimension

Performance

| Scenario | Spark | Trino | DuckDB | ClickHouse |
| --- | --- | --- | --- | --- |
| Full table scan | Good | Good | Excellent | Excellent |
| Complex ETL | Excellent | Poor | Good | Poor |
| Interactive query | Poor | Excellent | Excellent | Good |
| Streaming ingest | Good | Poor | N/A | Excellent |
| ML training | Excellent | Poor | Good | Poor |

Scalability

| Data Size | Spark | Trino | DuckDB | ClickHouse |
| --- | --- | --- | --- | --- |
| < 10GB | Overkill | Overkill | Optimal | Overkill |
| 10-100GB | Good | Good | Optimal | Good |
| 100GB-10TB | Excellent | Excellent | N/A | Excellent |
| > 10TB | Excellent | Excellent | N/A | Good |

Cost Efficiency

| Workload | Most Cost-Effective |
| --- | --- |
| Local analysis | DuckDB (free) |
| Batch ETL | Spark with spot instances |
| Interactive BI | Trino with auto-scaling |
| Real-time analytics | ClickHouse |
| Occasional queries | Serverless (BigQuery) |

Architecture Patterns

Pattern 1: Multi-Engine Lakehouse

Cost optimization: Right-size each engine for its workload.
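
One way this pattern can look in practice, as a hedged sketch: Spark curates Parquet on shared object storage, and DuckDB queries the very same files for lightweight exploration. The s3://lake/... layout is an illustrative assumption, and S3 credential setup is omitted:

```python
# Multi-engine lakehouse sketch: Spark writes open-format Parquet to shared
# storage; DuckDB reads the same files -- no copies, no engine lock-in.
# The s3://lake/... layout is an illustrative assumption; S3 credential
# configuration is omitted for brevity.
from pyspark.sql import SparkSession
import duckdb

spark = SparkSession.builder.appName("lakehouse-writer").getOrCreate()
(
    spark.read.parquet("s3://lake/raw/orders/")
         .write.mode("overwrite")
         .parquet("s3://lake/curated/orders/")
)

# Lightweight exploration over the same files with DuckDB.
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
top_regions = duckdb.sql("""
    SELECT region, sum(amount) AS revenue
    FROM 's3://lake/curated/orders/*.parquet'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 10
""").df()
```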

Pattern 2: Tiered Processing

Use case: Match engine to processing requirements.


Selection Decision Tree

For New Workloads
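
The selection logic implied by the comparison tables above can be sketched as a small Python routine; the thresholds and workload labels are illustrative assumptions, not hard rules:

```python
# Rough encoding of the engine-selection logic from the tables above.
# Thresholds and workload categories are illustrative, not hard rules.
def pick_engine(data_gb: float, workload: str) -> str:
    if workload == "real_time":            # high-ingest, time-series
        return "ClickHouse"
    if data_gb < 100:                      # fits comfortably on one machine
        return "DuckDB"
    if workload == "interactive_bi":       # low-latency SQL, federation
        return "Trino"
    if workload in ("batch_etl", "ml"):    # heavy transforms, pipelines
        return "Spark"
    return "Trino"                         # default: ad-hoc SQL at scale

assert pick_engine(5, "ad_hoc") == "DuckDB"
assert pick_engine(500, "batch_etl") == "Spark"
```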

For Migration

| From | To | When |
| --- | --- | --- |
| Traditional warehouse | Trino | For BI workloads |
| Hadoop MapReduce | Spark | Direct replacement |
| Pandas on large data | DuckDB | When data doesn't fit in memory |
| Real-time DB | ClickHouse | For time-series analytics |

Cost Comparison

Per-TB Processing Cost

| Engine | On-Demand | Spot/Preemptible | Notes |
| --- | --- | --- | --- |
| Spark (EMR) | $1.50 | $0.45 (70% off) | With spot instances |
| Spark (Databricks) | $2.00 | N/A | Premium pricing |
| Trino (EMR) | $1.20 | $0.36 (70% off) | With spot instances |
| DuckDB | $0 | $0 | Use own machine |
| ClickHouse | $0.80 | $0.24 (70% off) | With spot instances |
| BigQuery | $5.00 | N/A | On-demand pricing |
| Snowflake | $3.00 | N/A | On-demand pricing |

TCO Comparison

Scenario: 1TB data, 100 queries/day (10GB average scan)

| Engine | Compute | Storage | Monthly Total |
| --- | --- | --- | --- |
| DuckDB (local) | $0 | $0 | $0 |
| Spark (spot) | $45 | $23 | $68 |
| Trino (spot) | $36 | $23 | $59 |
| BigQuery | $150 | $23 | $173 |
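
Sanity-checking the BigQuery row: 100 queries/day × 10GB ≈ 1TB scanned per day, i.e. ~30TB per month; at $5.00/TB on-demand that comes to ~$150/month of compute, which is where the figure above comes from.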

Note: DuckDB is free but limited to what a single local machine can hold; the cloud engines scale horizontally well beyond that.


Implementation Best Practices

DO

  • Use Spark for heavy ETL
  • Use Trino for interactive BI
  • Use DuckDB for local development
  • Use ClickHouse for real-time analytics
  • Use spot instances for batch workloads
  • Right-size clusters for workload

DON’T

  • Use Spark for interactive queries (too slow)
  • Use DuckDB for > 1TB data (won’t scale)
  • Use warehouse for batch ETL (too expensive)
  • Ignore spot instance savings (60-80% discount)
  • Over-provision clusters (waste money)

Key Takeaways

  1. Spark for ETL: Best for complex batch processing
  2. Trino for BI: Best for interactive SQL queries
  3. DuckDB for local: Best for single-machine analytics
  4. ClickHouse for real-time: Best for streaming analytics
  5. Match engine to workload: Right tool for the job
  6. Use spot instances: 60-80% savings for batch workloads
  7. Multi-engine is normal: Lakehouse supports multiple engines

Individual Engine Guides

  • Spark Guide
  • Trino Guide
  • DuckDB Guide
  • ClickHouse Guide


Next: Module 2: Computing & Processing