
Trino Series: Optimizing and Expanding the Sample Project

This entry is part 3 of 5 in the series Trino Series

Introduction: Building on the Basics

In our last blog, we set up a local Trino project for a sample use case—Unified Sales Analytics—allowing us to query across PostgreSQL and MySQL databases. Now, we’ll build on this project by introducing optimizations for query performance, configuring advanced settings, and adding a new data source to broaden the project’s capabilities. These enhancements will simulate a real-world scenario where data is frequently queried, requiring efficient processing and additional flexibility.

Project Enhancement Overview

Goals for This Blog Post

  1. Optimize Existing Queries: Improve query performance by using Trino’s advanced optimization features.
  2. Add a New Data Source: Integrate Apache Kafka to simulate real-time data streaming, adding a layer of dynamic, continuously updating sales data.
  3. Implement Partitioning: Set up data partitioning for more efficient storage and access, especially useful for large datasets in analytics.

Step 1: Optimizing Queries for Performance

1. Using Dynamic Filtering for Joins

Dynamic filtering is a powerful optimization feature in Trino that applies filter conditions from the “build” side of a join to the “probe” side, thereby reducing the amount of data scanned and processed. It is enabled by default in current Trino releases; to make sure it is active, confirm the following setting in the config.properties file of your project:

plaintext
enable-dynamic-filtering=true
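
Dynamic filtering can also be toggled per session, which is convenient for comparing query behavior without touching the coordinator configuration. A minimal sketch using Trino’s session property, run from the Trino CLI or any SQL client:

sql
-- Enable dynamic filtering for the current session only
SET SESSION enable_dynamic_filtering = true;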

Example of an Optimized Query Using Dynamic Filtering:

sql
SELECT
  c.customer_id,
  c.region,
  c.age_group,
  SUM(t.amount) AS total_spent
FROM
  mysql.customer_db.customers AS c
JOIN
  postgresql.transactions_db.orders AS t
  ON c.customer_id = t.customer_id
WHERE
  t.order_date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
GROUP BY
  c.customer_id, c.region, c.age_group
ORDER BY
  total_spent DESC;

With dynamic filtering, the join-key values from the date-restricted side of the join are pushed down to the scan on the other side, so Trino reads only rows that can actually match, reducing memory usage and processing time.
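
To check that dynamic filtering is applied to a given query, inspect its plan. A minimal sketch; when the feature is active, the join and table scan nodes in the plan output typically reference the dynamic filters assigned to them:

sql
-- Inspect the plan; look for dynamic filter references on the join and scan nodes
EXPLAIN
SELECT c.customer_id, SUM(t.amount) AS total_spent
FROM mysql.customer_db.customers AS c
JOIN postgresql.transactions_db.orders AS t
  ON c.customer_id = t.customer_id
WHERE t.order_date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
GROUP BY c.customer_id;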


Step 2: Adding Real-Time Data from Apache Kafka

Configure Kafka as a Data Source

To simulate real-time sales data, let’s integrate Apache Kafka as an additional source for new transactions. Create a kafka.properties file in the etc/catalog directory:

plaintext
connector.name=kafka
kafka.nodes=localhost:9092
kafka.table-names=sales_transactions
kafka.default-schema=default
kafka.table-description-dir=etc/kafka

In this setup:

  • Kafka acts as a streaming data source; the sales_transactions table is backed by a topic named sales_transactions_topic, with the mapping defined in a topic description file (sketched below).
  • Trino can query Kafka topics as tables, making it easy to incorporate streaming data into SQL queries.
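
The Kafka connector learns each table’s message layout from a topic description file in the directory referenced by kafka.table-description-dir. A minimal sketch, assuming JSON-encoded messages carrying the fields used in the queries below (adjust names, types, and the timestamp format to your actual payload), saved as etc/kafka/sales_transactions.json:

json
{
  "tableName": "sales_transactions",
  "schemaName": "default",
  "topicName": "sales_transactions_topic",
  "message": {
    "dataFormat": "json",
    "fields": [
      { "name": "order_id", "mapping": "order_id", "type": "BIGINT" },
      { "name": "customer_id", "mapping": "customer_id", "type": "BIGINT" },
      { "name": "amount", "mapping": "amount", "type": "DOUBLE" },
      { "name": "transaction_time", "mapping": "transaction_time", "type": "TIMESTAMP", "dataFormat": "iso8601" }
    ]
  }
}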

Querying Real-Time Sales Data from Kafka

Now we can run queries that combine data from both the static orders table and the live Kafka stream:

sql
SELECT
  t.order_id,
  t.customer_id,
  t.amount,
  k.transaction_time
FROM
  postgresql.transactions_db.orders AS t
JOIN
  kafka.default.sales_transactions AS k
  ON t.order_id = k.order_id
WHERE
  k.transaction_time >= current_date - interval '7' day
ORDER BY
  k.transaction_time DESC;

This setup enables real-time sales analytics by including the latest transactions streamed from Kafka. For instance, businesses can analyze orders from the last seven days dynamically.
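
Before joining against the stream, it can help to verify that messages are arriving and decoding correctly. A minimal sketch using the Kafka connector’s internal columns, which remain queryable even though they are hidden from SELECT * by default:

sql
-- Peek at raw messages and their offsets to sanity-check the stream
SELECT _partition_id, _partition_offset, _message
FROM kafka.default.sales_transactions
LIMIT 10;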


Step 3: Implementing Partitioning for Efficient Data Access

Partitioning data by date or region can significantly improve query speed by narrowing down the amount of data processed for each query. For the orders table in PostgreSQL, let’s partition the data by order_date to speed up time-based analytics. Note that the statements below use PostgreSQL’s native range partitioning and are run directly against PostgreSQL (for example via psql), not through Trino; once the partitions exist, we’ll copy the existing rows over, as shown in the sketch after these steps.

  1. Modify the Table with Partitioning:
sql
CREATE TABLE IF NOT EXISTS orders_partitioned (
  order_id bigint,
  customer_id bigint,
  amount double precision,
  order_date date
) PARTITION BY RANGE (order_date);
  2. Add Partitions by Month:

For optimal query performance, add one partition per month. The upper bound of a PostgreSQL range partition is exclusive, so each partition runs from the first day of one month to the first day of the next:

sql
CREATE TABLE orders_2022_01 PARTITION OF orders_partitioned FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');
CREATE TABLE orders_2022_02 PARTITION OF orders_partitioned FOR VALUES FROM ('2022-02-01') TO ('2022-03-01');
-- Continue this pattern for each month or automate it based on your needs.
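
With the partitions in place, existing rows can be copied over and PostgreSQL will route each row to the matching monthly partition automatically. A minimal sketch, assuming the original table is named orders and has the same four columns:

sql
-- Copy historical rows; PostgreSQL routes each row to its partition by order_date
INSERT INTO orders_partitioned (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date
FROM orders;

Rows whose order_date falls outside every defined partition are rejected, so make sure the monthly partitions cover the full date range (or add a DEFAULT partition) before copying.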

Conclusion and Next Steps

In this post, we enhanced our initial Trino project by optimizing query performance, adding real-time data through Kafka, and implementing partitioning for more efficient data handling. These configurations make our setup more representative of a production-ready analytics system capable of handling large and dynamic datasets.

In the upcoming posts, we’ll explore additional integrations, such as connecting Trino with cloud storage for scalable data warehousing, and dive into caching strategies for frequently accessed data. These improvements will round out our Trino project and offer even more insights into practical uses of distributed SQL.

Stay tuned for the next post, where we’ll explore data caching techniques to reduce latency and further optimize our analytics workflow.

