Trino Guide
Distributed SQL Engine for Interactive Analytics
Overview
Trino (formerly PrestoSQL) is a distributed SQL query engine designed for fast, interactive analytics. Unlike Spark, Trino is SQL-only and optimized for low-latency queries. It excels at federated queries across multiple data sources and BI workloads.
Core Concepts
Trino Architecture
Key Components
| Component | Description | Role |
|---|---|---|
| Coordinator | Query entry point | Parses, plans, schedules queries |
| Worker | Processing node | Executes tasks, scans data |
| Connector | Data source integration | Hive, Iceberg, Delta, MySQL, etc. |
| Catalog | Namespace organization | Organizes data sources |
| Discovery Service | Worker registration | Tracks active workers |
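A quick way to see the coordinator and the workers registered with the discovery service is the built-in system catalog. This is a minimal sketch; the exact columns returned can vary by Trino version.

```sql
-- List every node the cluster currently knows about,
-- including whether it is the coordinator and its current state
SELECT * FROM system.runtime.nodes;
```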
Connector Ecosystem
Major Connectors
Configuration Examples
```sql
-- Iceberg catalog (recommended)
-- Requires dynamic catalog management on the cluster; otherwise configure the
-- same properties in etc/catalog/iceberg.properties
CREATE CATALOG iceberg USING iceberg
WITH (
    "iceberg.catalog.type" = 'hive_metastore',
    "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);
```
```sql
-- Delta Lake catalog
CREATE CATALOG delta USING delta_lake
WITH (
    "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);
```
```sql
-- PostgreSQL catalog
CREATE CATALOG postgres USING postgresql
WITH (
    "connection-url" = 'jdbc:postgresql://postgres:5432/database',
    "connection-user" = 'user',
    "connection-password" = 'password'
);
```
```sql
-- BigQuery catalog
CREATE CATALOG bigquery USING bigquery
WITH (
    "bigquery.project-id" = 'my-project',
    "bigquery.credentials-file" = '/path/to/credentials.json'
);
```
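Once a catalog is defined it can be inspected and queried immediately. A quick sanity check might look like the following; the schema and table names here are placeholders.

```sql
-- Confirm the new catalogs are visible to the coordinator
SHOW CATALOGS;

-- Browse a catalog's schemas, then query with fully qualified names
SHOW SCHEMAS FROM postgres;
SELECT * FROM postgres.public.customers LIMIT 10;
```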
Query Optimization
Predicate Pushdown
```sql
-- Predicate pushed to the storage layer
SELECT customer_id, sum(amount)
FROM orders
WHERE order_date >= DATE '2025-01-01'
  AND customer_id IN (1, 2, 3, 4, 5)
GROUP BY customer_id;

-- Pushed predicates:
--   - order_date filter → partition pruning
--   - customer_id IN    → file statistics check
-- Result: 10-100x I/O reduction
```
What gets pushed (you can confirm pushdown with EXPLAIN, as sketched after this list):
- Filters (WHERE)
- Projections (SELECT columns)
- Limits (LIMIT)
- Aggregations (SUM, COUNT, MIN, MAX)
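To check what actually reaches the connector, EXPLAIN shows the query plan; filters that were pushed down typically appear as constraints on the table scan rather than as a separate filter operator above it. A minimal sketch:

```sql
-- Inspect the plan; pushed-down filters show up on the table scan node
EXPLAIN
SELECT customer_id, sum(amount)
FROM orders
WHERE order_date >= DATE '2025-01-01'
GROUP BY customer_id;
```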
Column Pruning
```sql
-- Only the required columns are read
SELECT customer_id, order_total
FROM orders;

-- Instead of: SELECT * (reads all columns)
-- Benefit: 5-50x I/O reduction for wide tables
```
Join Optimization
Broadcast join threshold:
```sql
-- Let the optimizer choose between broadcast and partitioned joins
SET SESSION join_distribution_type = 'AUTOMATIC';

-- Raise the limit on how large a broadcast-side table may be
SET SESSION join_max_broadcast_table_size = '100MB';
```
```sql
-- Forcing a broadcast (replicated) join
-- (Trino has no per-query join hints; the distribution type is a session property)
SET SESSION join_distribution_type = 'BROADCAST';

SELECT large_table.key, large_table.value, small_table.attribute
FROM large_table
JOIN small_table ON large_table.key = small_table.key;
```
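To see which join distribution the planner actually chose, the distributed plan reports it on the join node; a broadcast join is typically shown with a replicated distribution. A minimal sketch:

```sql
-- The fragment containing the join reports its distribution
-- (REPLICATED for broadcast, PARTITIONED otherwise)
EXPLAIN (TYPE DISTRIBUTED)
SELECT l.key, s.attribute
FROM large_table l
JOIN small_table s ON l.key = s.key;
```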
Cost-Based Optimization (CBO)
```sql
-- CBO is enabled by default and relies on table statistics;
-- the cost-based choices are governed by these (default) session properties
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION join_distribution_type = 'AUTOMATIC';

-- Collect statistics
ANALYZE orders;

-- View statistics
SHOW STATS FOR orders;
```
CBO improves:
- Join order
- Join strategy
- Aggregate pushdown
- Predicate reordering
Performance Tuning
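Before adjusting memory or concurrency settings, it usually helps to see where a query actually spends its time. EXPLAIN ANALYZE executes the statement and annotates each plan node with runtime metrics; a minimal sketch:

```sql
-- Runs the query and reports per-operator CPU, input rows, and timing
EXPLAIN ANALYZE
SELECT customer_id, sum(amount)
FROM orders
GROUP BY customer_id;
```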
Memory Configuration
Query memory limits and spill settings are cluster-level configuration in etc/config.properties on the coordinator and workers:

```properties
# Total user memory a single query may use across the cluster
query.max-memory=50GB

# User memory a single query may use on any one worker
query.max-memory-per-node=10GB

# Spill to disk instead of failing when a query exceeds its memory limit
spill-enabled=true
spiller-spill-path=/tmp/trino-spill
```

```sql
-- Limits can also be tightened per session for an individual query
SET SESSION query_max_memory = '20GB';
```
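To spot runaway memory consumers while they run, the system catalog exposes runtime tables for queries and tasks; the exact columns vary by version, so this is a minimal sketch:

```sql
-- Queries currently running on the cluster
SELECT * FROM system.runtime.queries WHERE state = 'RUNNING';
```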
Concurrency Tuning
```sql
-- Local concurrency for parallel operators on each worker (default: 16)
SET SESSION task_concurrency = 16;
```
Per-user query queueing is not a session property; concurrency and queue limits are configured with resource groups (for example, hardConcurrencyLimit and maxQueued in etc/resource-groups.json).
Worker Sizing
| Configuration | Small Cluster | Medium Cluster | Large Cluster |
|---|---|---|---|
| Workers | 5-10 | 10-50 | 50-500 |
| Memory per worker | 16-32GB | 32-64GB | 64-256GB |
| Cores per worker | 8-16 | 16-32 | 32-64 |
SQL Features
Advanced SQL Support
```sql
-- Window functions
SELECT
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS order_rank,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM orders;
```
```sql
-- CTEs
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT ct.customer_id, ct.total_spent, c.customer_name
FROM customer_totals ct
JOIN customers c ON ct.customer_id = c.customer_id
WHERE ct.total_spent > 1000;
```
```sql
-- Arrays and maps
SELECT
    customer_id,
    cardinality(order_ids) AS order_count,
    element_at(order_ids, 1) AS first_order_id,
    coalesce(element_at(customer_attributes, 'tier'), 'unknown') AS customer_tier
FROM customer_orders;
```
Table Functions
```sql
-- UNNEST arrays
SELECT customer_id, order_id
FROM customers
CROSS JOIN UNNEST(order_ids) AS t(order_id);
```
```sql
-- Lateral joins: five most recent orders per customer
SELECT c.customer_id, o.order_id, o.amount
FROM customers c
JOIN LATERAL (
    SELECT order_id, amount
    FROM orders
    WHERE customer_id = c.customer_id
    ORDER BY order_date DESC
    LIMIT 5
) o ON true;
```
Federation (Multi-Source Queries)
Cross-Catalog Queries
```sql
-- Join data from multiple sources in a single query
SELECT
    s.user_id,
    s.event_count,
    c.customer_tier,
    r.revenue
FROM delta.analytics.events s        -- Delta Lake table (written by Spark)
JOIN postgres.public.customers c     -- PostgreSQL database
    ON s.user_id = c.customer_id
JOIN bigquery.sales.revenue r        -- BigQuery warehouse
    ON s.user_id = r.user_id
WHERE s.event_date >= DATE '2025-01-01';
```
Use cases:
- Data migration validation (see the sketch after this list)
- Cross-platform reporting
- Hybrid cloud scenarios
- Data mesh (domain-specific data products)
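For the data migration validation case, a single Trino query can compare the source and target systems directly. The catalog, schema, and table names below are placeholders:

```sql
-- Compare row counts between the operational source and the migrated lake table
SELECT
    (SELECT count(*) FROM postgres.public.customers)  AS source_rows,
    (SELECT count(*) FROM iceberg.analytics.customers) AS target_rows;
```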
Federation Performance
Optimization:
- Filter at source (push down predicates)
- Minimize data transfer
- Use materialized views for remote sources
Cost Optimization
Use Spot Instances
Configuration:
- Deploy Trino on spot instances (60-80% discount)
- Use auto-scaling for variable workloads
- Spot interruption handling (automatic in managed services)
Right-Size Workers
Calculate worker requirements:
- Query concurrency: 10 concurrent queries
- Memory per query: 10GB
- Desired per-node memory: 50GB

Workers needed = (10 × 10GB) / 50GB = 2 workers minimum; add a 50% buffer for 3 workers, or use 4-5 workers for high availability.

Use Result Caching
```sql
-- Materialized view (support depends on the connector, e.g. Iceberg)
CREATE MATERIALIZED VIEW customer_daily_summary AS
SELECT
    customer_id,
    DATE(order_date) AS order_date,
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id, DATE(order_date);

-- Populate or refresh the stored results
REFRESH MATERIALIZED VIEW customer_daily_summary;
```
```sql
-- Queries hit the precomputed materialized view
SELECT *
FROM customer_daily_summary
WHERE order_date >= DATE '2025-01-01';
```
Senior Level Gotchas
Gotcha 1: No Transactions
Problem: Trino does not support multi-statement transactions (BEGIN ... COMMIT); each statement executes and commits on its own.
Solution: Perform transactional writes in the underlying source system, or use an open table format (Delta/Iceberg).
```sql
-- Bad: attempting a transaction in Trino
BEGIN;  -- ERROR

-- Good: use the open table format for transactional writes
-- (via Spark or another engine)
```
Gotcha 2: Slow Query on Small Data
Problem: Query overhead dominates for small datasets.
Solution: Use DuckDB for small data (< 100GB), Trino for large.
```sql
-- Bad: Trino for a 1GB table
SELECT * FROM small_table;

-- Good: DuckDB for small data,
-- Trino for large-scale joins and federation
```
Gotcha 3: Federation Costs
Problem: Cross-region data transfer is expensive.
Solution: Colocate Trino workers with data sources.
```sql
-- Bad: Trino in us-east-1 querying data in eu-west-1
-- Cost: $0.02/GB × 1000GB = $20 per query

-- Good: deploy Trino in each region and query local data
```
Best Practices
DO
- Use Trino for interactive BI queries
- Leverage predicate pushdown (filter early)
- Use broadcast joins for small tables
- Collect statistics for CBO
- Use spot instances for cost savings
- Monitor query metrics via UI
DON’T
- Use Trino for complex ETL (use Spark)
- Query tiny datasets (< 10GB) from Trino
- Ignore statistics (CBO disabled without them)
- Cross-region federation without cost consideration
- Forget to set memory limits (runaway queries)
Key Takeaways
- SQL-only engine: No Python/Scala, optimized for SQL
- Federation strength: Query multiple sources in single query
- Low latency: Designed for interactive queries (seconds)
- Connector ecosystem: Broad support for data sources
- Spot instances: 60-80% savings for workers
- Statistics matter: Enable CBO for better plans
- Not for ETL: Use Spark for complex transformations