Trino Series: Caching Strategies and Query Performance Tuning

This entry is part 5 of 5 in the series TRINO Series

Introduction: Enhancing Trino Performance

In our journey with Trino, we've explored its setup, integrated it with multiple data sources, added real-time data, and expanded to cloud storage. To wrap up, we'll focus on strategies to improve query performance. Specifically, we'll implement caching techniques and apply performance tuning to optimize queries for frequent data access. This final post aims to equip you with tools for building a highly responsive and efficient Trino-powered analytics environment.

Goals for This Post

- Implement Caching for Frequent Queries: Set up a local cache for repeated queries to reduce data retrieval times and resource consumption.
- Tune Query Performance: Explore query optimizations, including resource management and query rewriting, for faster processing.
- Optimize Resource Allocation: Configure Trino's memory and CPU settings for optimal utilization in a high-demand environment.

Step 1: Setting Up Caching for Frequent Queries

1.1 Configure a Local Cache

Trino's caching feature allows frequently accessed data to be stored temporarily, reducing repeated calls to external data sources. This is particularly beneficial when working with large datasets or slow data sources like S3.

Add the Local Cache Plugin: Begin by configuring Trino to use a caching plugin. In etc/config.properties, enable local caching:

cache.enabled=true
cache.location=/path/to/cache
cache.max-size=1GB
cache.expiration-time=10m

- cache.location: Specifies the directory where cached files are stored.
- cache.max-size: Defines the maximum cache size.
- cache.expiration-time: Sets the duration cached data remains before it's refreshed.

Specify Cacheable Tables: For specific tables that are frequently queried, configure Trino to cache the results temporarily.
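The effect of cache.expiration-time is easier to see with a toy model. The sketch below is not Trino's implementation — just a minimal time-to-live cache illustrating the semantics: a repeated query inside the window is served locally and never touches the data source.

```python
import time

class TTLCache:
    """Minimal query-result cache with a time-to-live, mirroring
    the cache.expiration-time idea from the config above."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # query -> (timestamp, result)

    def get_or_fetch(self, query: str, fetch):
        now = time.monotonic()
        hit = self._store.get(query)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1], True          # served from cache
        result = fetch()                  # expensive source call
        self._store[query] = (now, result)
        return result, False

cache = TTLCache(ttl_seconds=600)  # roughly "10m"
calls = []

def fetch_products():
    calls.append(1)  # count round-trips to the "data source"
    return [("p1", 9.99), ("p2", 19.99)]

first, hit1 = cache.get_or_fetch("SELECT * FROM products", fetch_products)
second, hit2 = cache.get_or_fetch("SELECT * FROM products", fetch_products)
print(hit1, hit2, len(calls))  # the second lookup skips the source entirely
```

Once the TTL elapses, the next lookup falls through to the source again, which is exactly the refresh behavior cache.expiration-time describes.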
For instance, if queries often access a product table stored in S3, caching will improve response times:

CREATE TABLE cached_products
WITH (
    format = 'ORC',
    external_location = 's3a://your_bucket_name/product_data.orc'
);

After creating the cached table, frequently accessed data will be stored locally, reducing dependency on external data retrieval.

Step 2: Tuning Query Performance with Optimization Techniques

2.1 Using Predicate Pushdown

Predicate pushdown pushes filter conditions down to the data source, minimizing the amount of data fetched and processed. This feature is especially useful for querying large datasets in cloud storage.

Enable predicate pushdown in config.properties:

optimizer.pushdown-enabled=true

Example Query with Predicate Pushdown:

SELECT order_id, customer_id, amount
FROM hive.default.sales_data
WHERE order_date >= DATE '2023-01-01';

With pushdown enabled, only records with order_date >= DATE '2023-01-01' are fetched from the data source, speeding up query execution.

2.2 Leveraging Data Partitioning

Partitioning tables by fields like order_date or region allows Trino to read only the necessary partitions, further enhancing performance. Note that the Hive connector requires partition columns to come last in a CTAS statement, so we list order_date at the end. For example:

CREATE TABLE hive.default.sales_data_partitioned
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['order_date']
)
AS SELECT order_id, customer_id, amount, region, order_date
FROM hive.default.sales_data;

This setup minimizes data scanning for date-based queries.

Step 3: Configuring Resource Allocation for High Demand

Optimizing Trino's resource allocation ensures your queries run smoothly under heavy workloads. We'll focus on memory and CPU settings that balance performance and resource usage.

Memory Settings: In jvm.config, allocate sufficient memory to handle large datasets:

-Xmx16G
-Xms8G
-XX:+UseG1GC

- -Xmx: Maximum memory allocation.
- -Xms: Initial memory allocation.

CPU Configuration: Trino allows you to limit CPU usage for each query.
Configure cpu.cores-per-node in config.properties to manage parallelism:

cpu.cores-per-node=4

Session-Specific Settings: For specific sessions that require additional memory, use session settings to allocate memory directly in your queries:

SET SESSION query_max_memory = '4GB';

Conclusion: Building an Optimized Trino Environment

This final post has introduced performance-enhancing strategies to make Trino faster, more efficient, and scalable. With caching, predicate pushdown, partitioning, and resource tuning, you now have a highly optimized setup capable of handling complex and frequent analytics queries with ease.

By following this series, you've gained hands-on experience with Trino, from setting up a local project to integrating real-time and cloud data, and now to fine-tuning performance. The knowledge here not only applies to Trino but also serves as a foundation for understanding best practices in distributed SQL engines.

Thank you for following along, and I hope this series has been a valuable resource for exploring Trino's potential in modern data analytics!

Trino Series: Advanced Integrations with Cloud Storage

This entry is part 4 of 5 in the series TRINO Series

Introduction: Scaling Data with Cloud Storage

In the previous blogs, we explored building a sample project locally, optimizing queries, and adding real-time data streaming. Now, let's take our Trino project a step further by connecting it to cloud storage, specifically Amazon S3. This integration will showcase how Trino can handle large datasets beyond local storage, making it suitable for scalable, cloud-based data warehousing. By connecting Trino to S3, we can expand our data analytics project to manage vast datasets with flexibility and efficiency.

Goals for This Blog Post

- Integrate Amazon S3 with Trino: Configure Trino to access data stored in S3 buckets, setting up Trino to function as a data lake query engine.
- Query Data Directly from S3: Execute SQL queries on large datasets in S3 without moving data locally.
- Use Cases for Cloud Data Warehousing: Explore how cloud storage integration supports diverse analytics use cases.

Step 1: Configuring Trino to Connect to Amazon S3

To enable Trino to read from S3, we'll set up an S3-compatible Hive catalog that allows querying data in S3 as if it were a database. This configuration will create a bridge between Trino and the cloud storage layer, giving us access to S3-stored files.

1.1 Setting Up Hive Catalog for S3

In the etc/catalog directory, create a new file called hive.properties with the following details:

connector.name=hive
hive.s3.aws-access-key=your_access_key
hive.s3.aws-secret-key=your_secret_key
hive.s3.endpoint=https://s3.amazonaws.com
hive.metastore=file
hive.metastore.catalog.dir=s3a://your_bucket_name/hive/

- hive.s3.aws-access-key and hive.s3.aws-secret-key: Your AWS credentials for accessing S3.
- hive.s3.endpoint: The endpoint for S3; for most users, this will be https://s3.amazonaws.com.
- hive.metastore.catalog.dir: This points to the S3 path that will store the Hive metastore data. Replace your_bucket_name with the name of your S3 bucket.
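Rather than hardcoding the credentials into hive.properties, you can generate the file from environment variables. This is a sketch, not an official tool — the environment variable names follow AWS conventions, and render_catalog is a helper name of my own:

```python
import os

# Render hive.properties from environment variables so credentials
# never land in version control. Variable names follow AWS conventions;
# the bucket variable and defaults are illustrative placeholders.
TEMPLATE = """connector.name=hive
hive.s3.aws-access-key={access_key}
hive.s3.aws-secret-key={secret_key}
hive.s3.endpoint=https://s3.amazonaws.com
hive.metastore=file
hive.metastore.catalog.dir=s3a://{bucket}/hive/
"""

def render_catalog() -> str:
    return TEMPLATE.format(
        access_key=os.environ.get("AWS_ACCESS_KEY_ID", "your_access_key"),
        secret_key=os.environ.get("AWS_SECRET_ACCESS_KEY", "your_secret_key"),
        bucket=os.environ.get("TRINO_S3_BUCKET", "your_bucket_name"),
    )

# Write the rendered file into etc/catalog/ before starting the server.
print(render_catalog())
```

A deployment script would write this output to etc/catalog/hive.properties at startup, keeping secrets out of the repository.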
Tip: Use environment variables or AWS IAM roles for secure access instead of hardcoding credentials.

Step 2: Creating and Querying an S3 Dataset

2.1 Preparing Data in S3

To follow along, prepare a sample dataset in S3. Let's assume we're using a CSV file named sales_data.csv stored in the S3 bucket, with a schema including order_id, customer_id, amount, order_date, and region. Upload this file to your S3 bucket under a directory called sales.

2.2 Creating an External Table in Trino

With the data in place, create an external table in Trino that points to the S3-stored CSV. Note that external_location must point to the directory containing the file, not the file itself. Use the following SQL query in the Trino CLI:

CREATE TABLE hive.default.sales_data (
    order_id bigint,
    customer_id bigint,
    amount double,
    order_date date,
    region varchar
)
WITH (
    format = 'CSV',
    external_location = 's3a://your_bucket_name/sales/'
);

This command registers the CSV in S3 as an external table in Trino. Now, you can query sales_data just as if it were stored in a traditional database.

2.3 Running Queries on the S3 Data

With the sales_data table available in Trino, you can now run complex analytics queries directly on the data in S3. For example, to analyze total sales by region, use:

SELECT region,
       SUM(amount) AS total_revenue,
       COUNT(order_id) AS total_orders
FROM hive.default.sales_data
GROUP BY region
ORDER BY total_revenue DESC;

Since Trino queries the data directly in S3, there's no need to move data locally, making it highly efficient for large-scale datasets.

Use Cases for Cloud Data Warehousing with Trino

By connecting to S3, Trino functions as a data lake query engine, enabling several powerful use cases:

- Scalable Data Analytics: Analyze large datasets without data duplication or complex ETL pipelines.
- Unified Data Access: Trino allows querying of S3 data alongside local databases, providing a unified view for analytics.
- Cost Efficiency: Storing large datasets in S3 and querying them with Trino can be more cost-effective than maintaining a traditional data warehouse.

Best Practices for Using Trino with Cloud Storage

- Partitioning: Organize data in S3 by date, region, or another category to improve query performance.
- Columnar Formats: Consider converting CSV data to columnar formats like Parquet or ORC for more efficient storage and faster queries.
- Caching: Use Trino's caching feature to store frequently accessed data locally, reducing latency and S3 access costs.

Conclusion and Next Steps

Integrating Trino with cloud storage like Amazon S3 unlocks a new level of scalability and flexibility for data warehousing. With this setup, you can query and analyze large, diverse datasets in a cost-efficient manner, making Trino a powerful tool for modern data architectures.

In the next blog, we'll explore caching strategies and further optimize query performance for frequently accessed datasets. Stay tuned as we continue to enhance our Trino project, building a robust analytics solution that meets the demands of big data.

Trino Series: Optimizing and Expanding the Sample Project

This entry is part 3 of 5 in the series TRINO Series

Introduction: Building on the Basics

In our last blog, we set up a local Trino project for a sample use case—Unified Sales Analytics—allowing us to query across PostgreSQL and MySQL databases. Now, we'll build on this project by introducing optimizations for query performance, configuring advanced settings, and adding a new data source to broaden the project's capabilities. These enhancements will simulate a real-world scenario where data is frequently queried, requiring efficient processing and additional flexibility.

Goals for This Blog Post

- Optimize Existing Queries: Improve query performance by using Trino's advanced optimization features.
- Add a New Data Source: Integrate Apache Kafka to simulate real-time data streaming, adding a layer of dynamic, continuously updating sales data.
- Implement Partitioning: Set up data partitioning for more efficient storage and access, especially useful for large datasets in analytics.

Step 1: Optimizing Queries for Performance

1.1 Using Dynamic Filtering for Joins

Dynamic filtering is a powerful optimization feature in Trino that applies filter conditions from the "build" side of a join to the "probe" side, thereby reducing the data scanned and processed. To enable dynamic filtering, update the config.properties file in your project:

dynamic-filtering.enabled=true

Example of an Optimized Query Using Dynamic Filtering:

SELECT c.customer_id,
       c.region,
       c.age_group,
       SUM(t.amount) AS total_spent
FROM mysql.customer_db.customers AS c
JOIN postgresql.transactions_db.orders AS t
  ON c.customer_id = t.customer_id
WHERE t.order_date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
GROUP BY c.customer_id, c.region, c.age_group
ORDER BY total_spent DESC;

This query now scans only the necessary data for orders within the specified date range, thanks to dynamic filtering, reducing memory usage and processing time.
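Conceptually, dynamic filtering collects the join keys that survive the build side of the join and uses them to prune the probe-side scan before rows are shipped to the engine. This toy model (not Trino internals, data is made up) shows the idea:

```python
customers = [  # build side: small table, already filtered
    {"customer_id": 1, "region": "east"},
    {"customer_id": 3, "region": "west"},
]
orders = [  # probe side: pretend this is the big remote table
    {"order_id": 10, "customer_id": 1, "amount": 50.0},
    {"order_id": 11, "customer_id": 2, "amount": 75.0},
    {"order_id": 12, "customer_id": 3, "amount": 20.0},
    {"order_id": 13, "customer_id": 4, "amount": 90.0},
]

# 1. The build side finishes first; collect its join keys.
dynamic_filter = {c["customer_id"] for c in customers}

# 2. The probe scan keeps only rows whose key is in the filter,
#    so non-matching rows are never read into the join at all.
probed = [o for o in orders if o["customer_id"] in dynamic_filter]
print(len(probed), "of", len(orders), "probe rows survive")
```

Without the filter, all four order rows would flow into the join only to be discarded; with it, half the probe rows are skipped at the scan.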
Step 2: Adding Real-Time Data from Apache Kafka

2.1 Configure Kafka as a Data Source

To simulate real-time sales data, let's integrate Apache Kafka as an additional source for new transactions. Create a kafka.properties file in the etc/catalog directory:

connector.name=kafka
kafka.nodes=localhost:9092
kafka.table-names=sales_transactions
kafka.default-schema=default
kafka.topic-description=sales_transactions_topic

In this setup:

- Kafka acts as a streaming data source, with a topic named sales_transactions_topic.
- Trino can query Kafka topics as tables, making it easy to incorporate streaming data into SQL queries.

2.2 Querying Real-Time Sales Data from Kafka

Now we can run queries that combine data from both the static orders table and the live Kafka stream:

SELECT t.order_id,
       t.customer_id,
       t.amount,
       k.transaction_time
FROM postgresql.transactions_db.orders AS t
JOIN kafka.default.sales_transactions AS k
  ON t.order_id = k.order_id
WHERE k.transaction_time >= current_date - interval '7' day
ORDER BY k.transaction_time DESC;

This setup enables real-time sales analytics by including the latest transactions streamed from Kafka. For instance, businesses can analyze orders from the last seven days dynamically.

Step 3: Implementing Partitioning for Efficient Data Access

Partitioning data by date or region can significantly improve query speed by narrowing down the amount of data processed for each query. For the orders table in PostgreSQL, let's partition the data by order_date to speed up time-based analytics.
Modify the Table with Partitioning:

CREATE TABLE IF NOT EXISTS orders_partitioned (
    order_id bigint,
    customer_id bigint,
    amount double precision,
    order_date date
) PARTITION BY RANGE (order_date);

Add Partitions by Month: For optimal query performance, add partitions based on months. Note that the upper bound of a PostgreSQL range partition is exclusive, so each partition ends at the first day of the following month:

CREATE TABLE orders_2022_01 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');
CREATE TABLE orders_2022_02 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2022-02-01') TO ('2022-03-01');
-- Continue this pattern for each month or automate it based on your needs.

Conclusion and Next Steps

In this post, we enhanced our initial Trino project by optimizing query performance, adding real-time data through Kafka, and implementing partitioning for more efficient data handling. These configurations make our setup more representative of a production-ready analytics system capable of handling large and dynamic datasets.

In the upcoming posts, we'll explore additional integrations, such as connecting Trino with cloud storage for scalable data warehousing, and dive into caching strategies for frequently accessed data. These improvements will round out our Trino project and offer even more insights into practical uses of distributed SQL.

Stay tuned for the next post, where we'll explore data caching techniques to reduce latency and further optimize our analytics workflow.

Trino Series: Building a Sample Project on Local Installation

This entry is part 2 of 5 in the series TRINO Series

Why a Trino Series Instead of Presto?

If you followed the initial post in this series, you may recall we discussed the history of Presto and its recent transformation into what is now known as Trino. Originally developed as Presto at Facebook, this powerful SQL query engine has seen an incredible journey. The transition to Trino represents the evolution of PrestoSQL into a more robust, community-driven platform focused on advanced distributed SQL features.

The rebranding to Trino wasn't merely a name change—it reflects a shift toward greater community collaboration, improved flexibility, and extended support for analytics across a wide variety of data sources. By starting this series with Trino, I aim to introduce readers to the latest advancements while retaining the familiar, high-performance querying engine that initially made Presto so impactful.

Timeline: Presto to Trino – Key Milestones

Here's a brief look at the major milestones in Presto's evolution to Trino:

- 2012: Presto was created by engineers Martin Traverso, David Phillips, Dain Sundstrom, and Eric Hwang at Facebook to address the need for fast, distributed SQL analytics on large-scale data.
- 2013: Presto was released as an open-source project, quickly gaining popularity due to its high performance and compatibility with various data sources.
- 2018: The original creators of Presto (Martin, Dain, and David) left Facebook to focus on an open-source community version, which they named PrestoSQL. This version evolved separately from the original PrestoDB, which remained under Facebook's oversight.
- 2020: To reduce confusion and clarify its independence, PrestoSQL was rebranded as Trino. This rebranding marked a shift in governance under the newly established Trino Software Foundation, aiming for a fully open and community-driven model.
- 2021: Trino continues to evolve, with a vibrant and active community developing new features, supporting a broader range of data sources, and enhancing performance for modern analytics.

By choosing Trino as the focal point of this series, we're delving into a platform that has fully embraced open-source principles and is actively evolving to meet the complex needs of today's data environments.

Introduction: Expanding on Trino with a Local Project

In the previous blog, we covered the basics of Trino and its setup. Now, we'll dive deeper by building a sample project that highlights Trino's power to query multiple data sources within a unified SQL interface. This project will guide you through the setup, project structure, and business goals for a hypothetical use case, showcasing how Trino enables fast, interactive data analytics.

Project Overview: Unified Sales Analytics

Business Goal

Our sample project will focus on Unified Sales Analytics. Imagine a scenario where a company's sales data is stored across different data sources—one database contains historical transactions, while another houses customer demographic information. By using Trino, we'll enable efficient cross-database querying to generate insights like customer purchase trends and sales by demographic.

Data Sources

We'll simulate two data sources:

- Transactions Database (PostgreSQL): Stores transaction data such as order ID, customer ID, amount, and order date.
- Customer Database (MySQL): Contains customer information, including demographics, region, and signup date.

These databases will run locally, allowing us to create a fully functioning Trino environment on a single machine.

Setting Up the Project

Step 1: Create Project Directory Structure

Let's begin by creating a structured directory for our project, following the convention used in the Trino community for organizing configuration and catalog files.
trino_sales_analytics/
├── etc/
│   ├── catalog/
│   │   ├── mysql.properties
│   │   └── postgresql.properties
│   ├── config.properties
│   ├── jvm.config
│   └── log.properties
└── queries/
    ├── customer_insights.sql
    └── sales_analysis.sql

- etc/catalog: Stores configuration files for each data source.
- queries: Contains SQL queries for analysis tasks.

Step 2: Configure Trino Catalogs

We'll configure Trino to connect to both databases by creating catalog files in the etc/catalog directory.

MySQL Catalog Configuration (mysql.properties):

connector.name=mysql
connection-url=jdbc:mysql://localhost:3306/customer_db
connection-user=root
connection-password=root_password

PostgreSQL Catalog Configuration (postgresql.properties):

connector.name=postgresql
connection-url=jdbc:postgresql://localhost:5432/transactions_db
connection-user=postgres
connection-password=postgres_password

These configurations allow Trino to access both data sources, enabling cross-database queries.

Sample Queries for Analytics

Use Case 1: Customer Purchase Insights

To analyze customer purchasing trends, we'll join data from the Transactions and Customer tables. This query will return insights such as the total spending per customer along with demographic information.

Customer Purchase Insights (customer_insights.sql):

SELECT c.customer_id,
       c.region,
       c.age_group,
       SUM(t.amount) AS total_spent,
       COUNT(t.order_id) AS total_orders
FROM mysql.customer_db.customers AS c
JOIN postgresql.transactions_db.orders AS t
  ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.region, c.age_group
ORDER BY total_spent DESC;

This query highlights Trino's capability to perform joins across databases seamlessly, providing actionable insights on customer segments with the highest purchase volume.

Use Case 2: Sales Analysis by Region

For business planning, we may want to see sales broken down by region to understand which areas are performing best.
Sales Analysis by Region (sales_analysis.sql):

SELECT c.region,
       COUNT(t.order_id) AS orders_count,
       SUM(t.amount) AS total_revenue
FROM postgresql.transactions_db.orders AS t
JOIN mysql.customer_db.customers AS c
  ON t.customer_id = c.customer_id
GROUP BY c.region
ORDER BY total_revenue DESC;

This report allows businesses to identify top-performing regions, helping to guide decisions on resource allocation or targeted marketing.

Running the Queries

With Trino running, you can execute these queries through the Trino CLI.

Start the Trino server:

presto-server run

Execute a query in the CLI:

./presto-cli-0.247-executable.jar --server localhost:8080 --catalog mysql --schema customer_db

Use the queries/customer_insights.sql and queries/sales_analysis.sql files to analyze data across databases directly from the Trino interface.

Conclusion and Next Steps

This sample project demonstrates Trino's power in a local setup, querying multiple data sources seamlessly. In the next blog, we'll dive into optimizing these queries for better performance and exploring advanced Trino features for real-time analytics. Stay tuned as we explore more about Trino's capabilities in modern data architectures!

PRESTO / Trino Basics

This entry is part 1 of 5 in the series TRINO Series

Introduction: My Journey into Presto

My interest in Presto was sparked in early 2021 after an enriching conversation with Brian Luisi, PreSales Manager at Starburst. His insights into distributed SQL query engines opened my eyes to the unique capabilities and performance advantages of Presto. Eager to dive deeper, I joined the Presto community on Slack to keep up with developments and collaborate with like-minded professionals. This blog series is an extension of that journey, aiming to demystify Presto and share my learnings with others curious about distributed analytics solutions.

What is PRESTO?

Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, PostgreSQL, Cassandra, Kafka, MongoDB, and Teradata. One can even query data from multiple data sources within a single query. (Wikipedia)

"Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes." (Presto)

Often people are confused between PrestoDB and PrestoSQL; follow the link to understand the difference between the two. To remove this confusion, in December 2020 PrestoSQL was rebranded as Trino.

Presto DB vs. Presto SQL

- Martin Traverso, David Phillips, Dain Sundstrom, and Eric Hwang created PrestoDB in 2012 while at Facebook.
- Martin, Dain, and David left Facebook in 2018 to pursue building the Presto open-source community full-time, under the new name PrestoSQL.
- The Presto project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®.
- The Trino project is a community-driven project under the non-profit Trino Software Foundation.

While both Presto and Trino serve similar purposes, the Trino community has emerged as particularly vibrant, attracting a diverse array of contributors and user scenarios.
I've found the activity on the Trino community Slack invaluable for staying informed about the latest improvements and real-world applications.

Why Use PRESTO?

Presto offers a unique advantage by allowing you to query structured and unstructured data sources with high performance. This makes it an excellent choice for enterprises needing flexibility in handling mixed data formats while retaining SQL's familiar syntax. Moreover, its distributed nature and scalability mean it can handle massive datasets efficiently, even with complex joins and aggregations.

Installation Process: Getting PrestoDB on Your Mac

You can follow the step-by-step installation process as described in the Presto 0.247 documentation. As you are all aware, I use Homebrew for all my installations wherever possible, and I will be doing the same here. If PrestoDB is already installed, you can either reinstall it or upgrade it.

As part of the installation process, the configuration files are created under the home directory, and a sample JMX connector is created under the etc/catalog folder. Next, add a connector from the list of available ones. For PostgreSQL, create /usr/local/Cellar/prestodb/0.247/libexec/etc/postgresql.properties.

Starting PrestoDB and Troubleshooting

Following the solution from the linked issue, I updated postgresql.properties by removing the <databaseName> and restarted the server. But this time, I was facing a different error in the terminal.

By default, most Java web application servers listen on port 8080, which can easily cause the popular 8080 port conflict error. On macOS, you can use sudo lsof -i :8080 | grep LISTEN to find out what program is listening on port 8080.

The name "java" doesn't tell you anything; to get the detail, ps the Java PID (46655 in my case). It turned out ZooKeeper was using port 8080, so I stopped both Kafka and ZooKeeper.
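The same port check can be scripted. This sketch simply attempts a TCP connection to decide whether anything is listening on a given port; the helper name is mine, not part of any Presto tooling:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising.
        return s.connect_ex((host, port)) == 0

# Before starting Presto, check its default port:
if port_in_use(8080):
    print("port 8080 is taken -- find the culprit with: sudo lsof -i :8080")
```

Running this before launching the server turns the port conflict from a cryptic startup error into an explicit message.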
Restarted PrestoDB, and this time I got the success message in the terminal.

Accessing the Web UI

After starting the Presto server, you can access the web UI at http://localhost:8080.

Command-Line Interface (CLI)

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable, and we can use it to connect to the Presto server. As a best practice, take a backup of the original JAR, then make it executable with chmod +x and run it from the terminal.

Basic Operations

Use Case 1: Transferring selective records from one table to another.

Create an orders table, then create another table for active orders. Using pgAdmin 4, I inserted data into the orders table; some orders are active, some are inactive. The goal is to fetch the active orders and push them into the active orders table.
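The transfer itself is a single INSERT ... SELECT with a filter. The original post shows this in screenshots, so here is a self-contained Python sketch using an in-memory SQLite database to demonstrate the same pattern; the table and column names (orders, active_orders, status) are assumptions for illustration:

```python
import sqlite3

# In-memory stand-in for the two tables; schema is illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
con.execute("CREATE TABLE active_orders (order_id INTEGER, status TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "active"), (2, "inactive"), (3, "active")])

# The selective transfer: one INSERT ... SELECT with a WHERE filter.
con.execute("INSERT INTO active_orders "
            "SELECT order_id, status FROM orders WHERE status = 'active'")

moved = con.execute("SELECT COUNT(*) FROM active_orders").fetchone()[0]
print(moved, "active orders copied")
```

The same INSERT ... SELECT shape works through the Presto CLI against the PostgreSQL catalog; only the table qualifiers change.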