
Trino Series: Optimizing and Expanding the Sample Project

This entry is part 3 of 5 in the series Trino Series

Introduction: Building on the Basics

In our last blog, we set up a local Trino project for a sample use case—Unified Sales Analytics—allowing us to query across PostgreSQL and MySQL databases. Now, we’ll build on this project by introducing optimizations for query performance, configuring advanced settings, and adding a new data source to broaden the project’s capabilities. These enhancements will simulate a real-world scenario where data is frequently queried, requiring efficient processing and additional flexibility.

Project Enhancement Overview

Goals for This Blog Post

  1. Optimize Existing Queries: Improve query performance by using Trino’s advanced optimization features.
  2. Add a New Data Source: Integrate Apache Kafka to simulate real-time data streaming, adding a layer of dynamic, continuously updating sales data.
  3. Implement Partitioning: Set up data partitioning for more efficient storage and access, especially useful for large datasets in analytics.

Step 1: Optimizing Queries for Performance

1. Using Dynamic Filtering for Joins

Dynamic filtering is a powerful optimization feature in Trino that applies filter conditions from the “build” side of a join to the “probe” side, thereby reducing the amount of data scanned and processed. It is enabled by default in current Trino releases; to make sure it is active, confirm the following setting in the config.properties file of your project:

plaintext
enable-dynamic-filtering=true
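
Dynamic filtering can also be toggled per session, which is convenient for comparing query behavior without touching the coordinator configuration. A minimal sketch using Trino’s session property, run from the Trino CLI or any SQL client:

sql
-- Enable dynamic filtering for the current session only
SET SESSION enable_dynamic_filtering = true;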

Example of an Optimized Query Using Dynamic Filtering:

sql
SELECT
  c.customer_id,
  c.region,
  c.age_group,
  SUM(t.amount) AS total_spent
FROM
  mysql.customer_db.customers AS c
JOIN
  postgresql.transactions_db.orders AS t
  ON c.customer_id = t.customer_id
WHERE
  t.order_date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
GROUP BY
  c.customer_id, c.region, c.age_group
ORDER BY
  total_spent DESC;

With dynamic filtering, the join-key values from the date-restricted side of the join are pushed down to the scan on the other side, so Trino reads only rows that can actually match, reducing memory usage and processing time.
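
To check that dynamic filtering is applied to a given query, inspect its plan. A minimal sketch; when the feature is active, the join and table scan nodes in the plan output typically reference the dynamic filters assigned to them:

sql
-- Inspect the plan; look for dynamic filter references on the join and scan nodes
EXPLAIN
SELECT c.customer_id, SUM(t.amount) AS total_spent
FROM mysql.customer_db.customers AS c
JOIN postgresql.transactions_db.orders AS t
  ON c.customer_id = t.customer_id
WHERE t.order_date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
GROUP BY c.customer_id;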


Step 2: Adding Real-Time Data from Apache Kafka

Configure Kafka as a Data Source

To simulate real-time sales data, let’s integrate Apache Kafka as an additional source for new transactions. Create a kafka.properties file in the etc/catalog directory:

plaintext
connector.name=kafka
kafka.nodes=localhost:9092
kafka.table-names=sales_transactions
kafka.default-schema=default
kafka.table-description-dir=etc/kafka

In this setup:

  • Kafka acts as a streaming data source; the sales_transactions table is backed by a topic named sales_transactions_topic, with the mapping defined in a topic description file (sketched below).
  • Trino can query Kafka topics as tables, making it easy to incorporate streaming data into SQL queries.
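
The Kafka connector learns each table’s message layout from a topic description file in the directory referenced by kafka.table-description-dir. A minimal sketch, assuming JSON-encoded messages carrying the fields used in the queries below (adjust names, types, and the timestamp format to your actual payload), saved as etc/kafka/sales_transactions.json:

json
{
  "tableName": "sales_transactions",
  "schemaName": "default",
  "topicName": "sales_transactions_topic",
  "message": {
    "dataFormat": "json",
    "fields": [
      { "name": "order_id", "mapping": "order_id", "type": "BIGINT" },
      { "name": "customer_id", "mapping": "customer_id", "type": "BIGINT" },
      { "name": "amount", "mapping": "amount", "type": "DOUBLE" },
      { "name": "transaction_time", "mapping": "transaction_time", "type": "TIMESTAMP", "dataFormat": "iso8601" }
    ]
  }
}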

Querying Real-Time Sales Data from Kafka

Now we can run queries that combine data from both the static orders table and the live Kafka stream:

sql
SELECT
  t.order_id,
  t.customer_id,
  t.amount,
  k.transaction_time
FROM
  postgresql.transactions_db.orders AS t
JOIN
  kafka.default.sales_transactions AS k
  ON t.order_id = k.order_id
WHERE
  k.transaction_time >= current_date - interval '7' day
ORDER BY
  k.transaction_time DESC;

This setup enables real-time sales analytics by including the latest transactions streamed from Kafka. For instance, businesses can analyze orders from the last seven days dynamically.
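
Before joining against the stream, it can help to verify that messages are arriving and decoding correctly. A minimal sketch using the Kafka connector’s internal columns, which remain queryable even though they are hidden from SELECT * by default:

sql
-- Peek at raw messages and their offsets to sanity-check the stream
SELECT _partition_id, _partition_offset, _message
FROM kafka.default.sales_transactions
LIMIT 10;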


Step 3: Implementing Partitioning for Efficient Data Access

Partitioning data by date or region can significantly improve query speed by narrowing down the amount of data processed for each query. For the orders table in PostgreSQL, let’s partition the data by order_date to speed up time-based analytics. Note that the statements below use PostgreSQL’s native range partitioning and are run directly against PostgreSQL (for example via psql), not through Trino; once the partitions exist, we’ll copy the existing rows over, as shown in the sketch after these steps.

  1. Modify the Table with Partitioning:
sql
CREATE TABLE IF NOT EXISTS orders_partitioned (
  order_id bigint,
  customer_id bigint,
  amount double precision,
  order_date date
) PARTITION BY RANGE (order_date);
  2. Add Partitions by Month:

For optimal query performance, add one partition per month. The upper bound of a PostgreSQL range partition is exclusive, so each partition runs from the first day of one month to the first day of the next:

sql
CREATE TABLE orders_2022_01 PARTITION OF orders_partitioned FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');
CREATE TABLE orders_2022_02 PARTITION OF orders_partitioned FOR VALUES FROM ('2022-02-01') TO ('2022-03-01');
-- Continue this pattern for each month or automate it based on your needs.
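
With the partitions in place, existing rows can be copied over and PostgreSQL will route each row to the matching monthly partition automatically. A minimal sketch, assuming the original table is named orders and has the same four columns:

sql
-- Copy historical rows; PostgreSQL routes each row to its partition by order_date
INSERT INTO orders_partitioned (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date
FROM orders;

Rows whose order_date falls outside every defined partition are rejected, so make sure the monthly partitions cover the full date range (or add a DEFAULT partition) before copying.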

Conclusion and Next Steps

In this post, we enhanced our initial Trino project by optimizing query performance, adding real-time data through Kafka, and implementing partitioning for more efficient data handling. These configurations make our setup more representative of a production-ready analytics system capable of handling large and dynamic datasets.

In the upcoming posts, we’ll explore additional integrations, such as connecting Trino with cloud storage for scalable data warehousing, and dive into caching strategies for frequently accessed data. These improvements will round out our Trino project and offer even more insights into practical uses of distributed SQL.

Stay tuned for the next post, where we’ll explore data caching techniques to reduce latency and further optimize our analytics workflow.

