Top 10 Trino Performance Anti-Patterns to Avoid in Production

Trino Performance Anti-Patterns are among the most critical challenges facing data engineers running Trino (formerly PrestoSQL) — the powerful distributed SQL query engine designed for fast analytic queries against data of any size. Even experienced engineers can inadvertently introduce performance bottlenecks in production environments. In this comprehensive guide, we'll expose the top 10 Trino performance anti-patterns — what they are, why they hurt, and how to fix them — so your production workloads run at peak efficiency. Each of these Trino performance anti-patterns has been observed in real production deployments.

What Are Trino Performance Anti-Patterns?

Trino performance anti-patterns are query design choices, configuration settings, or architectural decisions that degrade query performance, increase resource consumption, or cause cluster instability. Unlike bugs, Trino performance anti-patterns often produce "correct" results but at a significant and avoidable cost. Identifying and eliminating these patterns is essential for any team running Trino at scale.

Anti-Pattern #1: Using SELECT * in Production Queries

One of the most common and damaging Trino anti-patterns is using SELECT * indiscriminately. Trino is typically deployed on top of columnar storage formats like ORC or Parquet, which allow column pruning — reading only the columns you actually need. When you use SELECT *, Trino must read every column, dramatically increasing I/O and network transfer costs.

Anti-Pattern Example

-- ANTI-PATTERN: Reading all columns unnecessarily
SELECT *
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-01-01'
  AND region = 'APAC';

Recommended Fix

--  BEST PRACTICE: Select only required columns
SELECT
    transaction_id,
    customer_id,
    amount,
    currency,
    transaction_date
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-01-01'
  AND region = 'APAC';
By specifying columns explicitly, Trino leverages Parquet/ORC column projection, potentially reducing I/O by 60–90% for wide tables. Always audit your queries to ensure every column in the SELECT list is genuinely needed downstream.

Anti-Pattern #2: Ignoring Partition Pruning

Partition pruning is one of Trino's most powerful optimization mechanisms. When tables are partitioned (e.g., by date or region), Trino can skip entire partitions that don't match the query predicate. The anti-pattern occurs when filter conditions are applied on non-partition columns, or when partition column filters are buried inside functions that prevent pushdown.

Anti-Pattern Example

--  ANTI-PATTERN: Wrapping partition column in a function prevents pruning
SELECT customer_id, amount
FROM hive.sales.transactions
WHERE YEAR(transaction_date) = 2024
  AND MONTH(transaction_date) = 6;

Recommended Fix

--  BEST PRACTICE: Use range predicates directly on partition columns
SELECT customer_id, amount
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-06-01'
  AND transaction_date < DATE '2024-07-01';
Wrapping partition columns in functions like YEAR(), DATE_FORMAT(), or CAST() can prevent the Trino optimizer from pushing predicates to the storage layer. Always use direct range comparisons on partition columns to enable full partition pruning.

Anti-Pattern #3: Poorly Ordered JOIN Operations

Trino uses cost-based optimization (CBO) to reorder joins, but it depends on accurate table statistics. When statistics are stale or absent, Trino may place a large table on the build side of a hash join — causing excessive memory pressure and potential out-of-memory (OOM) errors. The fundamental rule is: the smaller table should always be on the build (right) side of a join.

Anti-Pattern Example

--  ANTI-PATTERN: Large table on the build side
SELECT
    f.order_id,
    f.amount,
    d.customer_name
FROM hive.sales.fact_orders f          -- 10 billion rows (probe side ✓)
JOIN hive.sales.dim_customers d        -- 1 million rows  (build side ✓)
    ON f.customer_id = d.customer_id;

-- This is actually fine IF statistics are current.
-- The problem arises when statistics are stale and Trino misestimates sizes:

--  ANTI-PATTERN: Forcing a large result set as build side with a subquery
SELECT f.order_id, f.amount, sub.segment
FROM (
    SELECT customer_id, segment
    FROM hive.sales.dim_customers
    WHERE country IN ('US', 'CA', 'UK', 'AU', 'DE', 'FR')  -- returns 900K rows
) sub
JOIN hive.sales.fact_orders f ON sub.customer_id = f.customer_id;

Recommended Fix

--  BEST PRACTICE: Keep small lookup tables on the build side and update statistics
-- First, analyze tables to keep statistics fresh:
ANALYZE hive.sales.fact_orders;
ANALYZE hive.sales.dim_customers;

-- Use explicit join hints when statistics are unreliable:
SELECT /*+ broadcast(d) */
    f.order_id,
    f.amount,
    d.customer_name
FROM hive.sales.fact_orders f
JOIN hive.sales.dim_customers d
    ON f.customer_id = d.customer_id
WHERE d.country IN ('US', 'CA', 'UK');

Anti-Pattern #4: Data Skew in GROUP BY and JOIN Operations

Data skew — where one or a few values account for a disproportionate share of rows — is one of the most insidious Trino performance anti-patterns. A skewed key forces a single worker node to handle far more data than others, creating a bottleneck that stalls the entire query even if all other workers finish quickly.

Diagnosing Data Skew

-- Identify skew: check value distribution for join/group-by keys
SELECT
    customer_segment,
    COUNT(*) AS row_count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM hive.sales.fact_orders
GROUP BY customer_segment
ORDER BY row_count DESC;

Remediation: Salting Technique

--  BEST PRACTICE: Salt skewed keys to distribute load
-- Step 1: Add a random salt to the large/skewed table
WITH salted_orders AS (
    SELECT
        order_id,
        customer_id || '_' || CAST(FLOOR(RAND() * 8) AS VARCHAR) AS salted_key,
        amount
    FROM hive.sales.fact_orders
    WHERE customer_segment = 'ENTERPRISE'   -- heavily skewed segment
),
-- Step 2: Explode the small dimension table to match all salt values
exploded_customers AS (
    SELECT
        customer_id || '_' || CAST(salt AS VARCHAR) AS salted_key,
        customer_name,
        segment
    FROM hive.sales.dim_customers
    CROSS JOIN UNNEST(SEQUENCE(0, 7)) AS t(salt)
)
SELECT
    s.order_id,
    s.amount,
    e.customer_name
FROM salted_orders s
JOIN exploded_customers e ON s.salted_key = e.salted_key;
The salting technique distributes skewed key processing across multiple workers by artificially diversifying the join key. This trades a small increase in data volume for dramatically better parallelism.

Anti-Pattern #5: Running Queries Without Table Statistics

Trino's Cost-Based Optimizer (CBO) relies heavily on table and column statistics to make intelligent decisions about join ordering, join type selection, and aggregation strategies. Operating without current statistics forces the optimizer to fall back on heuristics, which frequently produce suboptimal plans.
--  BEST PRACTICE: Analyze tables after significant data changes
-- Analyze entire table
ANALYZE hive.sales.fact_orders;

-- Analyze with column-level statistics for join/filter columns
ANALYZE hive.sales.fact_orders
WITH (columns = ARRAY['customer_id', 'transaction_date', 'region', 'amount']);

-- Verify statistics are present
SHOW STATS FOR hive.sales.fact_orders;
Run ANALYZE as part of your data pipeline after each significant load. For partitioned tables, you can analyze individual partitions to reduce the overhead of full-table analysis.

Anti-Pattern #6: Excessive Use of DISTINCT Without Purpose

Using DISTINCT adds a full deduplication pass to your query plan. Many engineers add it "just to be safe" when they suspect duplicates, without actually verifying whether duplicates exist. This forces Trino to sort and hash all output rows, which is expensive at scale.

Anti-Pattern Example

--  ANTI-PATTERN: Unnecessary DISTINCT causing full deduplication
SELECT DISTINCT
    o.customer_id,
    o.order_date,
    c.customer_name
FROM hive.sales.orders o
JOIN hive.sales.customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01';
-- If orders.customer_id is already unique per order, DISTINCT adds no value

Recommended Fix

--  BEST PRACTICE: Understand your data model; use DISTINCT only when necessary
-- Validate if duplicates actually exist before adding DISTINCT:
SELECT customer_id, order_date, COUNT(*) AS cnt
FROM hive.sales.orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1;
-- If no rows returned, DISTINCT is unnecessary

Anti-Pattern #7: Suboptimal Use of Window Functions

Window functions are powerful, but running multiple window functions with different PARTITION BY and ORDER BY clauses in a single query forces Trino to shuffle and sort the data multiple times. Each unique window specification creates a separate exchange and sort operation.

Anti-Pattern Example

--  ANTI-PATTERN: Multiple window functions with different partitions
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
    AVG(amount) OVER (PARTITION BY region ORDER BY order_date) AS region_avg,
    RANK() OVER (PARTITION BY product_category ORDER BY amount DESC) AS category_rank
FROM hive.sales.orders;
-- Creates 3 separate sort + exchange operations

Recommended Fix

--  BEST PRACTICE: Group window functions with the same specification together
-- Pre-aggregate where possible and split into separate CTEs
WITH customer_metrics AS (
    SELECT
        customer_id,
        order_date,
        amount,
        region,
        product_category,
        SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
    FROM hive.sales.orders
),
region_metrics AS (
    SELECT
        customer_id,
        order_date,
        AVG(amount) OVER (PARTITION BY region ORDER BY order_date) AS region_avg
    FROM hive.sales.orders
)
SELECT
    cm.customer_id,
    cm.order_date,
    cm.amount,
    cm.running_total,
    rm.region_avg
FROM customer_metrics cm
JOIN region_metrics rm
    ON cm.customer_id = rm.customer_id
    AND cm.order_date = rm.order_date;

Anti-Pattern #8: Not Leveraging Table Format Optimizations (Iceberg/Delta)

Many production Trino deployments still use plain Hive tables without exploiting the advanced features available in modern table formats like Apache Iceberg or Delta Lake. These formats offer partition evolution, hidden partitioning, file skipping via metadata, and time travel — all of which can dramatically improve query performance.
--  BEST PRACTICE: Use Iceberg for production tables with Trino
-- Create an Iceberg table with hidden partitioning
CREATE TABLE iceberg.sales.transactions (
    transaction_id    BIGINT,
    customer_id       BIGINT,
    amount            DECIMAL(18, 2),
    currency          VARCHAR(3),
    region            VARCHAR(50),
    transaction_date  DATE
)
WITH (
    format           = 'PARQUET',
    partitioning     = ARRAY['region', 'day(transaction_date)'],
    sorted_by        = ARRAY['customer_id']
);

-- Use Iceberg metadata tables for fast file-level statistics
SELECT
    partition,
    record_count,
    file_count,
    total_size
FROM iceberg.sales."transactions$partitions"
WHERE partition.transaction_date >= DATE '2024-06-01';
Iceberg's metadata-driven file skipping means Trino can skip entire data files based on column statistics without reading them — a massive performance advantage over traditional Hive tables.

Anti-Pattern #9: Misconfigured Memory and Spill Settings

Allowing queries to spill to disk without proper configuration is a double-edged sword. While spill prevents OOM failures, excessive spilling dramatically degrades query performance. Conversely, setting memory limits too aggressively causes frequent query failures in production.
#  BEST PRACTICE: Balanced Trino memory configuration (config.properties)

# Coordinator heap
-Xmx16G

# Worker JVM heap (jvm.config)
# -Xmx128G  (set to ~80% of available RAM on worker nodes)

# config.properties on workers
query.max-memory=50GB
query.max-memory-per-node=10GB
query.max-total-memory-per-node=12GB

# Enable spill only as a safety net, not a primary strategy
experimental.spill-enabled=true
experimental.spill-order-by=true
experimental.spill-window-operator=true
spiller-spill-path=/data/trino-spill
spiller-max-used-space-threshold=0.7

# Limit concurrent queries to prevent memory contention
query.max-concurrent-queries=50
Monitor your Trino UI at http://<coordinator>:8080 to track spill statistics. If you see significant spill activity, investigate whether query rewrites or increased memory allocation are more appropriate solutions.

Anti-Pattern #10: Bypassing the Query Result Cache

For dashboards and recurring analytical queries that hit the same data repeatedly, not using Trino's result cache (available via connector-level caching or Alluxio-based file caching) is a significant missed optimization. File caching reduces repeated remote storage I/O by orders of magnitude.
#  BEST PRACTICE: Enable native file system cache (Trino 400+)
# catalog/hive.properties
hive.cache.enabled=true
hive.cache.location=/data/trino-cache
hive.cache.disk-quota=500GB
hive.cache.ttl=24h

# For Iceberg catalogs
iceberg.file-format=PARQUET
iceberg.cache.enabled=true

Trino Performance Anti-Patterns: Quick Reference Table

Anti-Pattern Impact Fix
SELECT * High I/O, excess network transfer Specify required columns only
No partition pruning Full table scans Direct range predicates on partition columns
Wrong JOIN order OOM, slow hash builds ANALYZE tables; use broadcast hints
Data skew Single-node bottleneck Key salting; pre-aggregation
Missing statistics Poor CBO decisions Run ANALYZE regularly
Unnecessary DISTINCT Expensive deduplication Validate need; use GROUP BY
Multiple window specs Repeated sort + exchange Consolidate window functions
Plain Hive tables No file skipping/metadata Migrate to Iceberg/Delta
Misconfigured memory Excessive spill / OOM Tune per-node memory settings
No result/file cache Repeated remote I/O Enable native file cache

Monitoring Trino Performance Anti-Patterns with Observability Tools

Preventing anti-patterns requires continuous observability. The Trino Web UI (/ui) provides query plans, stage timelines, and operator statistics. For production environments, integrate Trino's JMX metrics with Prometheus and Grafana. Key metrics to monitor include trino.execution:QueryManager.RunningQueries, operator input/output row counts for skew detection, and SpilledBytes per query for memory pressure signals. For deeper query plan analysis, always use EXPLAIN ANALYZE on slow queries:
-- Use EXPLAIN ANALYZE to identify bottleneck operators
EXPLAIN ANALYZE
SELECT
    region,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM hive.sales.transactions
WHERE transaction_date >= DATE '2024-01-01'
GROUP BY region;
The output will show actual row counts, CPU time, and memory usage per operator — invaluable for pinpointing which anti-pattern is hurting your specific query.

Conclusion: Mastering Trino Performance Anti-Patterns

Avoiding Trino performance anti-patterns requires a combination of query design discipline, proactive statistics management, and thoughtful cluster configuration. The ten anti-patterns covered in this post — from the deceptively simple SELECT * to the complex challenges of data skew — represent the most impactful issues we see in production Trino deployments at MinervaDB. Start by auditing your top 10 most expensive queries using the Trino Web UI, apply EXPLAIN ANALYZE to each, and look for the patterns described above. Even fixing two or three of these anti-patterns can yield dramatic, measurable improvements in query latency and cluster resource utilization.

Related Resources

About MinervaDB Corporation 292 Articles
Full-stack Database Infrastructure Architecture, Engineering and Operations Consultative Support(24*7) Provider for PostgreSQL, MySQL, MariaDB, MongoDB, ClickHouse, Trino, SQL Server, Cassandra, CockroachDB, Yugabyte, Couchbase, Redis, Valkey, NoSQL, NewSQL, SAP HANA, Databricks, Amazon Resdhift, Amazon Aurora, CloudSQL, Snowflake and AzureSQL with core expertize in Performance, Scalability, High Availability, Database Reliability Engineering, Database Upgrades/Migration, and Data Security.