Trino Guide
Distributed SQL Engine for Interactive Analytics
Overview
Trino (formerly PrestoSQL) is a distributed SQL query engine designed for fast, interactive analytics. Unlike Spark, Trino is SQL-only and optimized for low-latency queries. It excels at federated queries across multiple data sources and BI workloads.
Core Concepts
Trino Architecture
Key Components
| Component | Description | Role |
|---|---|---|
| Coordinator | Query entry point | Parses, plans, schedules queries |
| Worker | Processing node | Executes tasks, scans data |
| Connector | Data source integration | Hive, Iceberg, Delta, MySQL, etc. |
| Catalog | Namespace organization | Organizes data sources |
| Discovery Service | Worker registration | Tracks active workers |
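A quick way to see the coordinator and the workers registered with the discovery service is the built-in system catalog. This is a minimal sketch; the exact columns returned can vary by Trino version.

```sql
-- List every node the cluster currently knows about,
-- including whether it is the coordinator and its current state
SELECT * FROM system.runtime.nodes;
```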
Connector Ecosystem
Major Connectors
Configuration Examples
```sql
-- Iceberg catalog (recommended)
-- Requires dynamic catalog management on the cluster; otherwise configure the
-- same properties in etc/catalog/iceberg.properties
CREATE CATALOG iceberg USING iceberg
WITH (
    "iceberg.catalog.type" = 'hive_metastore',
    "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);
```
```sql
-- Delta Lake catalog
CREATE CATALOG delta USING delta_lake
WITH (
    "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);
```
```sql
-- PostgreSQL catalog
CREATE CATALOG postgres USING postgresql
WITH (
    "connection-url" = 'jdbc:postgresql://postgres:5432/database',
    "connection-user" = 'user',
    "connection-password" = 'password'
);
```
```sql
-- BigQuery catalog
CREATE CATALOG bigquery USING bigquery
WITH (
    "bigquery.project-id" = 'my-project',
    "bigquery.credentials-file" = '/path/to/credentials.json'
);
```
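Once a catalog is defined it can be inspected and queried immediately. A quick sanity check might look like the following; the schema and table names here are placeholders.

```sql
-- Confirm the new catalogs are visible to the coordinator
SHOW CATALOGS;

-- Browse a catalog's schemas, then query with fully qualified names
SHOW SCHEMAS FROM postgres;
SELECT * FROM postgres.public.customers LIMIT 10;
```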
Query Optimization
Predicate Pushdown
```sql
-- Predicate pushed to the storage layer
SELECT customer_id, sum(amount)
FROM orders
WHERE order_date >= DATE '2025-01-01'
  AND customer_id IN (1, 2, 3, 4, 5)
GROUP BY customer_id;

-- Pushed predicates:
--   - order_date filter → partition pruning
--   - customer_id IN    → file statistics check
-- Result: 10-100x I/O reduction
```
What gets pushed (you can confirm pushdown with EXPLAIN, as sketched after this list):
- Filters (WHERE)
- Projections (SELECT columns)
- Limits (LIMIT)
- Aggregations (SUM, COUNT, MIN, MAX)
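To check what actually reaches the connector, EXPLAIN shows the query plan; filters that were pushed down typically appear as constraints on the table scan rather than as a separate filter operator above it. A minimal sketch:

```sql
-- Inspect the plan; pushed-down filters show up on the table scan node
EXPLAIN
SELECT customer_id, sum(amount)
FROM orders
WHERE order_date >= DATE '2025-01-01'
GROUP BY customer_id;
```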
Column Pruning
```sql
-- Only the required columns are read
SELECT customer_id, order_total
FROM orders;

-- Instead of: SELECT * (reads all columns)
-- Benefit: 5-50x I/O reduction for wide tables
```
Join Optimization
Broadcast join threshold:
```sql
-- Let the optimizer choose between broadcast and partitioned joins
SET SESSION join_distribution_type = 'AUTOMATIC';

-- Raise the limit on how large a broadcast-side table may be
SET SESSION join_max_broadcast_table_size = '100MB';
```
```sql
-- Forcing a broadcast (replicated) join
-- (Trino has no per-query join hints; the distribution type is a session property)
SET SESSION join_distribution_type = 'BROADCAST';

SELECT large_table.key, large_table.value, small_table.attribute
FROM large_table
JOIN small_table ON large_table.key = small_table.key;
```
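To see which join distribution the planner actually chose, the distributed plan reports it on the join node; a broadcast join is typically shown with a replicated distribution. A minimal sketch:

```sql
-- The fragment containing the join reports its distribution
-- (REPLICATED for broadcast, PARTITIONED otherwise)
EXPLAIN (TYPE DISTRIBUTED)
SELECT l.key, s.attribute
FROM large_table l
JOIN small_table s ON l.key = s.key;
```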
Cost-Based Optimization (CBO)
```sql
-- CBO is enabled by default and relies on table statistics;
-- the cost-based choices are governed by these (default) session properties
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION join_distribution_type = 'AUTOMATIC';

-- Collect statistics
ANALYZE orders;

-- View statistics
SHOW STATS FOR orders;
```
CBO improves:
- Join order
- Join strategy
- Aggregate pushdown
- Predicate reordering
Performance Tuning
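Before adjusting memory or concurrency settings, it usually helps to see where a query actually spends its time. EXPLAIN ANALYZE executes the statement and annotates each plan node with runtime metrics; a minimal sketch:

```sql
-- Runs the query and reports per-operator CPU, input rows, and timing
EXPLAIN ANALYZE
SELECT customer_id, sum(amount)
FROM orders
GROUP BY customer_id;
```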
Memory Configuration
Query memory limits and spill settings are cluster-level configuration in etc/config.properties on the coordinator and workers:

```properties
# Total user memory a single query may use across the cluster
query.max-memory=50GB

# User memory a single query may use on any one worker
query.max-memory-per-node=10GB

# Spill to disk instead of failing when a query exceeds its memory limit
spill-enabled=true
spiller-spill-path=/tmp/trino-spill
```

```sql
-- Limits can also be tightened per session for an individual query
SET SESSION query_max_memory = '20GB';
```
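To spot runaway memory consumers while they run, the system catalog exposes runtime tables for queries and tasks; the exact columns vary by version, so this is a minimal sketch:

```sql
-- Queries currently running on the cluster
SELECT * FROM system.runtime.queries WHERE state = 'RUNNING';
```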
Concurrency Tuning
```sql
-- Local concurrency for parallel operators on each worker (default: 16)
SET SESSION task_concurrency = 16;
```
Per-user query queueing is not a session property; concurrency and queue limits are configured with resource groups (for example, hardConcurrencyLimit and maxQueued in etc/resource-groups.json).
Worker Sizing
| Configuration | Small Cluster | Medium Cluster | Large Cluster |
|---|---|---|---|
| Workers | 5-10 | 10-50 | 50-500 |
| Memory per worker | 16-32GB | 32-64GB | 64-256GB |
| Cores per worker | 8-16 | 16-32 | 32-64 |
SQL Features
Advanced SQL Support
```sql
-- Window functions
SELECT
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS order_rank,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM orders;
```
```sql
-- CTEs
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT ct.customer_id, ct.total_spent, c.customer_name
FROM customer_totals ct
JOIN customers c ON ct.customer_id = c.customer_id
WHERE ct.total_spent > 1000;
```
```sql
-- Arrays and maps
SELECT
    customer_id,
    cardinality(order_ids) AS order_count,
    element_at(order_ids, 1) AS first_order_id,
    coalesce(element_at(customer_attributes, 'tier'), 'unknown') AS customer_tier
FROM customer_orders;
```
Table Functions
```sql
-- UNNEST arrays
SELECT customer_id, order_id
FROM customers
CROSS JOIN UNNEST(order_ids) AS t(order_id);
```
```sql
-- Lateral joins: five most recent orders per customer
SELECT c.customer_id, o.order_id, o.amount
FROM customers c
JOIN LATERAL (
    SELECT order_id, amount
    FROM orders
    WHERE customer_id = c.customer_id
    ORDER BY order_date DESC
    LIMIT 5
) o ON true;
```
Federation (Multi-Source Queries)
Cross-Catalog Queries
```sql
-- Join data from multiple sources in a single query
SELECT
    s.user_id,
    s.event_count,
    c.customer_tier,
    r.revenue
FROM delta.analytics.events s        -- Delta Lake table (written by Spark)
JOIN postgres.public.customers c     -- PostgreSQL database
    ON s.user_id = c.customer_id
JOIN bigquery.sales.revenue r        -- BigQuery warehouse
    ON s.user_id = r.user_id
WHERE s.event_date >= DATE '2025-01-01';
```
Use cases:
- Data migration validation (see the sketch after this list)
- Cross-platform reporting
- Hybrid cloud scenarios
- Data mesh (domain-specific data products)
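For the data migration validation case, a single Trino query can compare the source and target systems directly. The catalog, schema, and table names below are placeholders:

```sql
-- Compare row counts between the operational source and the migrated lake table
SELECT
    (SELECT count(*) FROM postgres.public.customers)  AS source_rows,
    (SELECT count(*) FROM iceberg.analytics.customers) AS target_rows;
```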
Federation Performance
Optimization:
- Filter at source (push down predicates)
- Minimize data transfer
- Use materialized views for remote sources
Cost Optimization
Use Spot Instances
Configuration:
- Deploy Trino on spot instances (60-80% discount)
- Use auto-scaling for variable workloads
- Spot interruption handling (automatic in managed services)
Right-Size Workers
Calculate worker requirements:
- Query concurrency: 10 concurrent queries
- Memory per query: 10GB
- Desired per-node memory: 50GB

Workers needed = (10 × 10GB) / 50GB = 2 workers minimum; add a 50% buffer for 3 workers, or use 4-5 workers for high availability.

Use Result Caching
```sql
-- Materialized view (support depends on the connector, e.g. Iceberg)
CREATE MATERIALIZED VIEW customer_daily_summary AS
SELECT
    customer_id,
    DATE(order_date) AS order_date,
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id, DATE(order_date);

-- Populate or refresh the stored results
REFRESH MATERIALIZED VIEW customer_daily_summary;
```
```sql
-- Queries hit the precomputed materialized view
SELECT *
FROM customer_daily_summary
WHERE order_date >= DATE '2025-01-01';
```
Senior Level Gotchas
Gotcha 1: No Transactions
Problem: Trino does not support multi-statement transactions (BEGIN ... COMMIT); each statement executes and commits on its own.
Solution: Perform transactional writes in the underlying source system, or use an open table format (Delta/Iceberg).
```sql
-- Bad: attempting a transaction in Trino
BEGIN;  -- ERROR

-- Good: use the open table format for transactional writes
-- (via Spark or another engine)
```
Gotcha 2: Slow Query on Small Data
Problem: Query overhead dominates for small datasets.
Solution: Use DuckDB for small data (< 100GB), Trino for large.
```sql
-- Bad: Trino for a 1GB table
SELECT * FROM small_table;

-- Good: DuckDB for small data,
-- Trino for large-scale joins and federation
```
Gotcha 3: Federation Costs
Problem: Cross-region data transfer is expensive.
Solution: Colocate Trino workers with data sources.
```sql
-- Bad: Trino in us-east-1 querying data in eu-west-1
-- Cost: $0.02/GB × 1000GB = $20 per query

-- Good: deploy Trino in each region and query local data
```
Best Practices
DO
- Use Trino for interactive BI queries
- Leverage predicate pushdown (filter early)
- Use broadcast joins for small tables
- Collect statistics for CBO
- Use spot instances for cost savings
- Monitor query metrics via UI
DON’T
- Use Trino for complex ETL (use Spark)
- Query tiny datasets (< 10GB) from Trino
- Ignore statistics (CBO disabled without them)
- Cross-region federation without cost consideration
- Forget to set memory limits (runaway queries)
Key Takeaways
- SQL-only engine: No Python/Scala, optimized for SQL
- Federation strength: Query multiple sources in single query
- Low latency: Designed for interactive queries (seconds)
- Connector ecosystem: Broad support for data sources
- Spot instances: 60-80% savings for workers
- Statistics matter: Enable CBO for better plans
- Not for ETL: Use Spark for complex transformations