Trino Performance Anti-Patterns are among the most critical challenges facing data engineers running Trino (formerly PrestoSQL) — the powerful distributed SQL query engine designed for fast analytic queries against data of any size. Even experienced engineers can inadvertently introduce performance bottlenecks in production environments. In this comprehensive guide, we'll expose the top 10
Trino performance anti-patterns — what they are, why they hurt, and how to fix them — so your production workloads run at peak efficiency. Each of these
Trino performance anti-patterns has been observed in real production deployments.
What Are Trino Performance Anti-Patterns?
Trino performance anti-patterns are query design choices, configuration settings, or architectural decisions that degrade query performance, increase resource consumption, or cause cluster instability. Unlike bugs,
Trino performance anti-patterns often produce "correct" results but at a significant and avoidable cost. Identifying and eliminating these patterns is essential for any team running Trino at scale.
Anti-Pattern #1: Using SELECT * in Production Queries
One of the most common and damaging Trino anti-patterns is using
SELECT * indiscriminately. Trino is typically deployed on top of columnar storage formats like ORC or Parquet, which allow column pruning — reading only the columns you actually need. When you use
SELECT *, Trino must read every column, dramatically increasing I/O and network transfer costs.
Anti-Pattern Example
-- ANTI-PATTERN: Reading all columns unnecessarily
SELECT *
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-01-01'
AND region = 'APAC';
Recommended Fix
-- BEST PRACTICE: Select only required columns
SELECT
transaction_id,
customer_id,
amount,
currency,
transaction_date
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-01-01'
AND region = 'APAC';
By specifying columns explicitly, Trino leverages Parquet/ORC column projection, potentially reducing I/O by 60–90% for wide tables. Always audit your queries to ensure every column in the
SELECT list is genuinely needed downstream.
Anti-Pattern #2: Ignoring Partition Pruning
Partition pruning is one of Trino's most powerful optimization mechanisms. When tables are partitioned (e.g., by date or region), Trino can skip entire partitions that don't match the query predicate. The anti-pattern occurs when filter conditions are applied on non-partition columns, or when partition column filters are buried inside functions that prevent pushdown.
Anti-Pattern Example
-- ANTI-PATTERN: Wrapping partition column in a function prevents pruning
SELECT customer_id, amount
FROM hive.sales.transactions
WHERE YEAR(transaction_date) = 2024
AND MONTH(transaction_date) = 6;
Recommended Fix
-- BEST PRACTICE: Use range predicates directly on partition columns
SELECT customer_id, amount
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-06-01'
AND transaction_date < DATE '2024-07-01';
Wrapping partition columns in functions like
YEAR(),
DATE_FORMAT(), or
CAST() can prevent the Trino optimizer from pushing predicates to the storage layer. Always use direct range comparisons on partition columns to enable full partition pruning.
Anti-Pattern #3: Poorly Ordered JOIN Operations
Trino uses cost-based optimization (CBO) to reorder joins, but it depends on accurate table statistics. When statistics are stale or absent, Trino may place a large table on the build side of a hash join — causing excessive memory pressure and potential out-of-memory (OOM) errors. The fundamental rule is: the smaller table should always be on the build (right) side of a join.
Anti-Pattern Example
-- ANTI-PATTERN: Large table on the build side
SELECT
f.order_id,
f.amount,
d.customer_name
FROM hive.sales.fact_orders f -- 10 billion rows (probe side ✓)
JOIN hive.sales.dim_customers d -- 1 million rows (build side ✓)
ON f.customer_id = d.customer_id;
-- This is actually fine IF statistics are current.
-- The problem arises when statistics are stale and Trino misestimates sizes:
-- ANTI-PATTERN: Forcing a large result set as build side with a subquery
SELECT f.order_id, f.amount, sub.segment
FROM (
SELECT customer_id, segment
FROM hive.sales.dim_customers
WHERE country IN ('US', 'CA', 'UK', 'AU', 'DE', 'FR') -- returns 900K rows
) sub
JOIN hive.sales.fact_orders f ON sub.customer_id = f.customer_id;
Recommended Fix
-- BEST PRACTICE: Keep small lookup tables on the build side and update statistics
-- First, analyze tables to keep statistics fresh:
ANALYZE hive.sales.fact_orders;
ANALYZE hive.sales.dim_customers;
-- Use explicit join hints when statistics are unreliable:
SELECT /*+ broadcast(d) */
f.order_id,
f.amount,
d.customer_name
FROM hive.sales.fact_orders f
JOIN hive.sales.dim_customers d
ON f.customer_id = d.customer_id
WHERE d.country IN ('US', 'CA', 'UK');
Anti-Pattern #4: Data Skew in GROUP BY and JOIN Operations
Data skew — where one or a few values account for a disproportionate share of rows — is one of the most insidious Trino performance anti-patterns. A skewed key forces a single worker node to handle far more data than others, creating a bottleneck that stalls the entire query even if all other workers finish quickly.
Diagnosing Data Skew
-- Identify skew: check value distribution for join/group-by keys
SELECT
customer_segment,
COUNT(*) AS row_count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM hive.sales.fact_orders
GROUP BY customer_segment
ORDER BY row_count DESC;
Remediation: Salting Technique
-- BEST PRACTICE: Salt skewed keys to distribute load
-- Step 1: Add a random salt to the large/skewed table
WITH salted_orders AS (
SELECT
order_id,
customer_id || '_' || CAST(FLOOR(RAND() * 8) AS VARCHAR) AS salted_key,
amount
FROM hive.sales.fact_orders
WHERE customer_segment = 'ENTERPRISE' -- heavily skewed segment
),
-- Step 2: Explode the small dimension table to match all salt values
exploded_customers AS (
SELECT
customer_id || '_' || CAST(salt AS VARCHAR) AS salted_key,
customer_name,
segment
FROM hive.sales.dim_customers
CROSS JOIN UNNEST(SEQUENCE(0, 7)) AS t(salt)
)
SELECT
s.order_id,
s.amount,
e.customer_name
FROM salted_orders s
JOIN exploded_customers e ON s.salted_key = e.salted_key;
The salting technique distributes skewed key processing across multiple workers by artificially diversifying the join key. This trades a small increase in data volume for dramatically better parallelism.
Anti-Pattern #5: Running Queries Without Table Statistics
Trino's Cost-Based Optimizer (CBO) relies heavily on table and column statistics to make intelligent decisions about join ordering, join type selection, and aggregation strategies. Operating without current statistics forces the optimizer to fall back on heuristics, which frequently produce suboptimal plans.
-- BEST PRACTICE: Analyze tables after significant data changes
-- Analyze entire table
ANALYZE hive.sales.fact_orders;
-- Analyze with column-level statistics for join/filter columns
ANALYZE hive.sales.fact_orders
WITH (columns = ARRAY['customer_id', 'transaction_date', 'region', 'amount']);
-- Verify statistics are present
SHOW STATS FOR hive.sales.fact_orders;
Run
ANALYZE as part of your data pipeline after each significant load. For partitioned tables, you can analyze individual partitions to reduce the overhead of full-table analysis.
Anti-Pattern #6: Excessive Use of DISTINCT Without Purpose
Using
DISTINCT adds a full deduplication pass to your query plan. Many engineers add it "just to be safe" when they suspect duplicates, without actually verifying whether duplicates exist. This forces Trino to sort and hash all output rows, which is expensive at scale.
Anti-Pattern Example
-- ANTI-PATTERN: Unnecessary DISTINCT causing full deduplication
SELECT DISTINCT
o.customer_id,
o.order_date,
c.customer_name
FROM hive.sales.orders o
JOIN hive.sales.customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01';
-- If orders.customer_id is already unique per order, DISTINCT adds no value
Recommended Fix
-- BEST PRACTICE: Understand your data model; use DISTINCT only when necessary
-- Validate if duplicates actually exist before adding DISTINCT:
SELECT customer_id, order_date, COUNT(*) AS cnt
FROM hive.sales.orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1;
-- If no rows returned, DISTINCT is unnecessary
Anti-Pattern #7: Suboptimal Use of Window Functions
Window functions are powerful, but running multiple window functions with different
PARTITION BY and
ORDER BY clauses in a single query forces Trino to shuffle and sort the data multiple times. Each unique window specification creates a separate exchange and sort operation.
Anti-Pattern Example
-- ANTI-PATTERN: Multiple window functions with different partitions
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
AVG(amount) OVER (PARTITION BY region ORDER BY order_date) AS region_avg,
RANK() OVER (PARTITION BY product_category ORDER BY amount DESC) AS category_rank
FROM hive.sales.orders;
-- Creates 3 separate sort + exchange operations
Recommended Fix
-- BEST PRACTICE: Group window functions with the same specification together
-- Pre-aggregate where possible and split into separate CTEs
WITH customer_metrics AS (
SELECT
customer_id,
order_date,
amount,
region,
product_category,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM hive.sales.orders
),
region_metrics AS (
SELECT
customer_id,
order_date,
AVG(amount) OVER (PARTITION BY region ORDER BY order_date) AS region_avg
FROM hive.sales.orders
)
SELECT
cm.customer_id,
cm.order_date,
cm.amount,
cm.running_total,
rm.region_avg
FROM customer_metrics cm
JOIN region_metrics rm
ON cm.customer_id = rm.customer_id
AND cm.order_date = rm.order_date;
Anti-Pattern #8: Not Leveraging Table Format Optimizations (Iceberg/Delta)
Many production Trino deployments still use plain Hive tables without exploiting the advanced features available in modern table formats like
Apache Iceberg or
Delta Lake. These formats offer partition evolution, hidden partitioning, file skipping via metadata, and time travel — all of which can dramatically improve query performance.
-- BEST PRACTICE: Use Iceberg for production tables with Trino
-- Create an Iceberg table with hidden partitioning
CREATE TABLE iceberg.sales.transactions (
transaction_id BIGINT,
customer_id BIGINT,
amount DECIMAL(18, 2),
currency VARCHAR(3),
region VARCHAR(50),
transaction_date DATE
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['region', 'day(transaction_date)'],
sorted_by = ARRAY['customer_id']
);
-- Use Iceberg metadata tables for fast file-level statistics
SELECT
partition,
record_count,
file_count,
total_size
FROM iceberg.sales."transactions$partitions"
WHERE partition.transaction_date >= DATE '2024-06-01';
Iceberg's metadata-driven file skipping means Trino can skip entire data files based on column statistics without reading them — a massive performance advantage over traditional Hive tables.
Anti-Pattern #9: Misconfigured Memory and Spill Settings
Allowing queries to spill to disk without proper configuration is a double-edged sword. While spill prevents OOM failures, excessive spilling dramatically degrades query performance. Conversely, setting memory limits too aggressively causes frequent query failures in production.
# BEST PRACTICE: Balanced Trino memory configuration (config.properties)
# Coordinator heap
-Xmx16G
# Worker JVM heap (jvm.config)
# -Xmx128G (set to ~80% of available RAM on worker nodes)
# config.properties on workers
query.max-memory=50GB
query.max-memory-per-node=10GB
query.max-total-memory-per-node=12GB
# Enable spill only as a safety net, not a primary strategy
experimental.spill-enabled=true
experimental.spill-order-by=true
experimental.spill-window-operator=true
spiller-spill-path=/data/trino-spill
spiller-max-used-space-threshold=0.7
# Limit concurrent queries to prevent memory contention
query.max-concurrent-queries=50
Monitor your Trino UI at
http://<coordinator>:8080 to track spill statistics. If you see significant spill activity, investigate whether query rewrites or increased memory allocation are more appropriate solutions.
Anti-Pattern #10: Bypassing the Query Result Cache
For dashboards and recurring analytical queries that hit the same data repeatedly, not using Trino's result cache (available via connector-level caching or Alluxio-based file caching) is a significant missed optimization. File caching reduces repeated remote storage I/O by orders of magnitude.
# BEST PRACTICE: Enable native file system cache (Trino 400+)
# catalog/hive.properties
hive.cache.enabled=true
hive.cache.location=/data/trino-cache
hive.cache.disk-quota=500GB
hive.cache.ttl=24h
# For Iceberg catalogs
iceberg.file-format=PARQUET
iceberg.cache.enabled=true
Trino Performance Anti-Patterns: Quick Reference Table
| Anti-Pattern |
Impact |
Fix |
| SELECT * |
High I/O, excess network transfer |
Specify required columns only |
| No partition pruning |
Full table scans |
Direct range predicates on partition columns |
| Wrong JOIN order |
OOM, slow hash builds |
ANALYZE tables; use broadcast hints |
| Data skew |
Single-node bottleneck |
Key salting; pre-aggregation |
| Missing statistics |
Poor CBO decisions |
Run ANALYZE regularly |
| Unnecessary DISTINCT |
Expensive deduplication |
Validate need; use GROUP BY |
| Multiple window specs |
Repeated sort + exchange |
Consolidate window functions |
| Plain Hive tables |
No file skipping/metadata |
Migrate to Iceberg/Delta |
| Misconfigured memory |
Excessive spill / OOM |
Tune per-node memory settings |
| No result/file cache |
Repeated remote I/O |
Enable native file cache |
Monitoring Trino Performance Anti-Patterns with Observability Tools
Preventing anti-patterns requires continuous observability. The Trino Web UI (
/ui) provides query plans, stage timelines, and operator statistics. For production environments, integrate Trino's JMX metrics with Prometheus and Grafana. Key metrics to monitor include
trino.execution:QueryManager.RunningQueries, operator input/output row counts for skew detection, and
SpilledBytes per query for memory pressure signals.
For deeper query plan analysis, always use
EXPLAIN ANALYZE on slow queries:
-- Use EXPLAIN ANALYZE to identify bottleneck operators
EXPLAIN ANALYZE
SELECT
region,
SUM(amount) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-01-01'
GROUP BY region;
The output will show actual row counts, CPU time, and memory usage per operator — invaluable for pinpointing which anti-pattern is hurting your specific query.
Conclusion: Mastering Trino Performance Anti-Patterns
Avoiding Trino performance anti-patterns requires a combination of query design discipline, proactive statistics management, and thoughtful cluster configuration. The ten anti-patterns covered in this post — from the deceptively simple
SELECT * to the complex challenges of data skew — represent the most impactful issues we see in production Trino deployments at
MinervaDB.
Start by auditing your top 10 most expensive queries using the Trino Web UI, apply
EXPLAIN ANALYZE to each, and look for the patterns described above. Even fixing two or three of these anti-patterns can yield dramatic, measurable improvements in query latency and cluster resource utilization.
Related Resources