
Trino Guide

Distributed SQL Engine for Interactive Analytics


Overview

Trino (formerly PrestoSQL) is a distributed SQL query engine designed for fast, interactive analytics. Unlike Spark, Trino is SQL-only and optimized for low-latency queries. It excels at federated queries across multiple data sources and BI workloads.


Core Concepts

Trino Architecture

Key Components

Component         | Description              | Role
------------------|--------------------------|--------------------------------------
Coordinator       | Query entry point        | Parses, plans, and schedules queries
Worker            | Processing node          | Executes tasks, scans data
Connector         | Data source integration  | Hive, Iceberg, Delta, MySQL, etc.
Catalog           | Namespace organization   | Organizes data sources
Discovery Service | Worker registration      | Tracks active workers
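
The coordinator's view of the cluster is queryable at runtime through the built-in system catalog. A quick health check, using the standard system.runtime.nodes table:

-- List the coordinator and all registered workers with their current state
SELECT node_id, http_uri, node_version, coordinator, state
FROM system.runtime.nodes;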

Connector Ecosystem

Major Connectors

Trino ships connectors for Hive, Iceberg, Delta Lake, PostgreSQL, MySQL, BigQuery, Kafka, Elasticsearch, and many other systems. The examples below cover the most common catalog configurations.

Configuration Examples

-- Note: CREATE CATALOG requires catalog.management=dynamic on the coordinator
-- Iceberg catalog (recommended)
CREATE CATALOG iceberg USING iceberg
WITH (
  "iceberg.catalog.type" = 'hive_metastore',
  "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);

-- Delta Lake catalog
CREATE CATALOG delta USING delta_lake
WITH (
  "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);

-- PostgreSQL catalog
CREATE CATALOG postgres USING postgresql
WITH (
  "connection-url" = 'jdbc:postgresql://postgres:5432/database',
  "connection-user" = 'user',
  "connection-password" = 'password'
);

-- BigQuery catalog
CREATE CATALOG bigquery USING bigquery
WITH (
  "bigquery.project-id" = 'my-project',
  "bigquery.credentials-file" = '/path/to/credentials.json'
);

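Once created, catalogs can be verified from any client session with standard metadata statements:

-- Confirm the new catalogs are registered and browse their contents
SHOW CATALOGS;
SHOW SCHEMAS FROM iceberg;
SHOW TABLES FROM postgres.public;
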
Query Optimization

Predicate Pushdown

-- Predicates pushed to the storage layer
SELECT customer_id, SUM(amount)
FROM orders
WHERE order_date >= DATE '2025-01-01'
  AND customer_id IN (1, 2, 3, 4, 5)
GROUP BY customer_id;
-- Pushed predicates:
-- - order_date filter → partition pruning
-- - customer_id IN → file statistics check
-- Result: often a 10-100x I/O reduction

What gets pushed:

  • Filters (WHERE)
  • Projections (SELECT columns)
  • Limits (LIMIT)
  • Aggregations (SUM, COUNT, MIN, MAX)
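
Pushdown can be verified in the query plan: when a predicate appears inside the connector's table scan rather than in a separate filter operator, it was pushed down. A check with standard EXPLAIN on the example table above:

-- Pushed-down predicates show up inside the TableScan node
EXPLAIN (TYPE DISTRIBUTED)
SELECT customer_id, SUM(amount)
FROM orders
WHERE order_date >= DATE '2025-01-01'
GROUP BY customer_id;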

Column Pruning

-- Only required columns read
SELECT customer_id, order_total
FROM orders;
-- Instead of: SELECT * (reads all columns)
-- Benefit: 5-50x I/O reduction for wide tables

Join Optimization

Broadcast join threshold:

-- Let the optimizer pick the distribution type (default behavior)
SET SESSION join_distribution_type = 'AUTOMATIC';
-- Maximum estimated size of a table eligible for broadcast (default: 100MB)
SET SESSION join_max_broadcast_table_size = '100MB';
-- Trino has no per-query join hints; force broadcast via the session property
SET SESSION join_distribution_type = 'BROADCAST';
SELECT
  large_table.key,
  large_table.value,
  small_table.attribute
FROM large_table
JOIN small_table ON large_table.key = small_table.key;

Cost-Based Optimization (CBO)

-- CBO is enabled by default, but it needs table statistics to be effective
SET SESSION join_reordering_strategy = 'AUTOMATIC';
-- Collect statistics (supported by the Hive, Iceberg, and Delta connectors)
ANALYZE orders;
-- View statistics
SHOW STATS FOR orders;
-- With statistics available, the CBO improves:
-- - Join order
-- - Join strategy (broadcast vs. partitioned)
-- - Aggregate pushdown
-- - Predicate reordering
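
To see whether the estimates are close to reality, EXPLAIN ANALYZE runs the query and reports per-operator actuals, which makes stale statistics easy to spot:

-- Compare estimated vs. actual row counts per operator
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id;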

Performance Tuning

Memory Configuration

-- Per-session limits (capped by the cluster-wide configuration)
SET SESSION query_max_memory = '50GB';
-- Spill memory-intensive operators to disk instead of failing the query
SET SESSION spill_enabled = true;
-- Cluster-wide limits belong in config.properties, not session properties:
--   query.max-memory=50GB
--   query.max-memory-per-node=10GB
--   spill-enabled=true
--   spiller-spill-path=/tmp/trino-spill

Concurrency Tuning

-- Per-node task parallelism (must be a power of two)
SET SESSION task_concurrency = 16;
-- Per-user query concurrency and queueing are managed by resource groups,
-- not by session properties

Worker Sizing

Configuration     | Small Cluster | Medium Cluster | Large Cluster
------------------|---------------|----------------|--------------
Workers           | 5-10          | 10-50          | 50-500
Memory per worker | 16-32GB       | 32-64GB        | 64-256GB
Cores per worker  | 8-16          | 16-32          | 32-64

SQL Features

Advanced SQL Support

-- Window functions
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as order_rank,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total
FROM orders;
-- CTEs
WITH customer_totals AS (
SELECT customer_id, SUM(amount) as total_spent
FROM orders
GROUP BY customer_id
)
SELECT ct.customer_id, ct.total_spent, c.customer_name
FROM customer_totals ct
JOIN customers c ON ct.customer_id = c.customer_id
WHERE ct.total_spent > 1000;
-- Arrays and Maps
SELECT
customer_id,
CARDINALITY(order_ids) as order_count,
ELEMENT_AT(order_ids, 1) as first_order_id,
COALESCE(ELEMENT_AT(customer_attributes, 'tier'), 'unknown') as customer_tier
FROM customer_orders;

Table Functions

-- UNNEST arrays
SELECT
customer_id,
order_id
FROM customers,
UNNEST(order_ids) AS t(order_id);
-- Lateral joins
SELECT
c.customer_id,
o.order_id,
o.amount
FROM customers c
CROSS JOIN LATERAL (
SELECT order_id, amount
FROM orders
WHERE customer_id = c.customer_id
ORDER BY order_date DESC
LIMIT 5
) o;

Federation (Multi-Source Queries)

Cross-Catalog Queries

-- Join data from multiple sources (fully qualified: catalog.schema.table)
SELECT
s.user_id,
s.event_count,
c.customer_tier,
r.revenue
FROM delta.analytics.events s      -- Delta Lake table
JOIN postgres.public.customers c   -- PostgreSQL database
ON s.user_id = c.customer_id
JOIN bigquery.sales.revenue r      -- BigQuery warehouse
ON s.user_id = r.user_id
WHERE s.event_date >= DATE '2025-01-01';

Use cases:

  • Data migration validation (see the sketch after this list)
  • Cross-platform reporting
  • Hybrid cloud scenarios
  • Data mesh (domain-specific data products)
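
For migration validation, a single federated query can compare both systems directly. A minimal sketch, assuming the legacy table lives in the postgres catalog and the migrated copy in iceberg (schema and table names are illustrative):

-- Row-count parity check across source and target in one statement
SELECT
(SELECT COUNT(*) FROM postgres.public.customers) AS source_rows,
(SELECT COUNT(*) FROM iceberg.analytics.customers) AS target_rows;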

Federation Performance

Optimization:

  • Filter at source (push down predicates)
  • Minimize data transfer
  • Use materialized views or scheduled local copies for remote sources (sketch below)
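
One way to cut repeated remote scans is to snapshot small remote tables into a local catalog on a schedule. A sketch assuming an iceberg catalog with a staging schema (names are illustrative):

-- Local copy of a frequently joined remote dimension table
CREATE TABLE iceberg.staging.customers AS
SELECT customer_id, customer_tier
FROM postgres.public.customers;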

Cost Optimization

Use Spot Instances

Configuration:

  • Deploy Trino on spot instances (60-80% discount)
  • Use auto-scaling for variable workloads
  • Spot interruption handling (automatic in managed services)

Right-Size Workers

# Calculate worker requirements:
# - Query concurrency: 10 concurrent queries
# - Memory per query: 10GB
# - Desired per-node memory: 50GB
# Workers needed = (10 × 10GB) / 50GB = 2 workers minimum
# Add 50% buffer = 3 workers
# For high availability: Use 4-5 workers

Use Result Caching

-- Materialized views (support depends on the connector, e.g. Iceberg)
CREATE MATERIALIZED VIEW customer_daily_summary AS
SELECT
customer_id,
DATE(order_date) as order_date,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM orders
GROUP BY customer_id, DATE(order_date);
-- Query hits cached materialized view
SELECT * FROM customer_daily_summary
WHERE order_date >= DATE '2025-01-01';

Senior Level Gotchas

Gotcha 1: No Transactions

Problem: Trino doesn’t support multi-statement transactions; BEGIN is not even valid syntax.

Solution: Do transactional writes in the underlying source, or use an open table format (OTF) such as Delta or Iceberg.

-- Bad: Try transaction in Trino
BEGIN; -- ERROR
-- Good: Use OTF for transactional writes
-- (via Spark or other engine)

Gotcha 2: Slow Query on Small Data

Problem: Query overhead dominates for small datasets.

Solution: Use DuckDB for small data (< 100GB), Trino for large.

-- Bad: Trino for 1GB table
SELECT * FROM small_table;
-- Good: DuckDB for small data
-- Trino for large-scale joins

Gotcha 3: Federation Costs

Problem: Cross-region data transfer is expensive.

Solution: Colocate Trino workers with data sources.

-- Bad: Trino in us-east-1 querying data in eu-west-1
-- Cost: $0.02/GB × 1000GB = $20 per query
-- Good: Deploy Trino in each region, query local data

Best Practices

DO

  • Use Trino for interactive BI queries
  • Leverage predicate pushdown (filter early)
  • Use broadcast joins for small tables
  • Collect statistics for CBO
  • Use spot instances for cost savings
  • Monitor query metrics via UI

DON’T

  • Use Trino for complex ETL (use Spark)
  • Query tiny datasets (< 10GB) from Trino
  • Ignore statistics (CBO disabled without them)
  • Run cross-region federation without weighing data transfer costs
  • Forget to set memory limits (runaway queries)

Key Takeaways

  1. SQL-only engine: No Python/Scala, optimized for SQL
  2. Federation strength: Query multiple sources in single query
  3. Low latency: Designed for interactive queries (seconds)
  4. Connector ecosystem: Broad support for data sources
  5. Spot instances: 60-80% savings for workers
  6. Statistics matter: Enable CBO for better plans
  7. Not for ETL: Use Spark for complex transformations
