Trino Series: Building a Sample Project on Local Installation
- Presto / Trino Basics
- Trino Series: Building a Sample Project on Local Installation
- Trino Series: Optimizing and Expanding the Sample Project
- Trino Series: Advanced Integrations with Cloud Storage
- Trino Series: Caching Strategies and Query Performance Tuning
Why a Trino Series Instead of Presto?
If you followed the initial post in this series, you may recall we discussed the history of Presto and its recent transformation into what is now known as Trino. Originally developed as Presto at Facebook, this powerful SQL query engine has seen an incredible journey. The transition to Trino represents the evolution of PrestoSQL into a more robust, community-driven platform focused on advanced distributed SQL features.
The rebranding to Trino wasn’t merely a name change—it reflects a shift toward greater community collaboration, improved flexibility, and extended support for analytics across a wide variety of data sources. By starting this series with Trino, I aim to introduce readers to the latest advancements while retaining the familiar, high-performance querying engine that initially made Presto so impactful.
Timeline: Presto to Trino – Key Milestones
Here’s a brief look at the major milestones in Presto’s evolution to Trino:
- 2012: Presto was created by engineers Martin Traverso, David Phillips, Dain Sundstrom, and Eric Hwang at Facebook to address the need for fast, distributed SQL analytics on large-scale data.
- 2013: Presto was released as an open-source project, quickly gaining popularity due to its high performance and compatibility with various data sources.
- 2018: The original creators of Presto (Martin, Dain, and David) left Facebook to focus on an open-source community version, which they named PrestoSQL. This version evolved separately from the original PrestoDB, which remained under Facebook’s oversight.
- 2020: To reduce confusion and clarify its independence, PrestoSQL was rebranded as Trino. This rebranding marked a shift in governance under the newly established Trino Software Foundation, aiming for a fully open and community-driven model.
- 2021: Trino continues to evolve, with a vibrant and active community developing new features, supporting a broader range of data sources, and enhancing performance for modern analytics.
By choosing Trino as the focal point of this series, we’re delving into a platform that has fully embraced open-source principles and is actively evolving to meet the complex needs of today’s data environments.
Introduction: Expanding on Trino with a Local Project
In the previous blog, we covered the basics of Trino and its setup. Now, we’ll dive deeper by building a sample project that highlights Trino’s power to query multiple data sources within a unified SQL interface. This project will guide you through the setup, project structure, and business goals for a hypothetical use case, showcasing how Trino enables fast, interactive data analytics.
Project Overview: Unified Sales Analytics
Business Goal
Our sample project will focus on Unified Sales Analytics. Imagine a scenario where a company’s sales data is stored across different data sources—one database contains historical transactions, while another houses customer demographic information. By using Trino, we’ll enable efficient cross-database querying to generate insights like customer purchase trends and sales by demographic.
Data Sources
We’ll simulate two data sources:
- Transactions Database (PostgreSQL): Stores transaction data such as order ID, customer ID, amount, and order date.
- Customer Database (MySQL): Contains customer information, including demographics, region, and signup date.
These databases will run locally, allowing us to create a fully functioning Trino environment on a single machine.
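To make the scenario concrete, the two source tables might look like the following. This is a minimal sketch: the table names (orders, customers) and column definitions are illustrative assumptions based on the fields described above, not a prescribed schema.

```sql
-- Hypothetical PostgreSQL table holding historical transactions
CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    amount      DECIMAL(10, 2) NOT NULL,
    order_date  DATE NOT NULL
);

-- Hypothetical MySQL table holding customer demographics
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    age_group   VARCHAR(20),
    region      VARCHAR(50),
    signup_date DATE
);
```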
Setting Up the Project
Step 1: Create Project Directory Structure
Let’s begin by creating a structured directory for our project, following the convention used in the Trino community for organizing configuration and catalog files.
- etc/catalog: Stores configuration files for each data source.
- queries: Contains SQL queries for analysis tasks.
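Assuming a project root named trino-sales-project (the name itself is arbitrary), this layout can be created in one step:

```shell
# Create the catalog and query directories, including parents
mkdir -p trino-sales-project/etc/catalog trino-sales-project/queries
```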
Step 2: Configure Trino Catalogs
We’ll configure Trino to connect to both databases by creating catalog files in the etc/catalog directory.
- MySQL Catalog Configuration (mysql.properties)
- PostgreSQL Catalog Configuration (postgresql.properties)
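As a sketch, the two catalog files might look like the following. The hostnames, ports, database name, and credentials are placeholder assumptions for a local setup; adjust them to match your own installation.

```properties
# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://localhost:3306
connection-user=trino_user
connection-password=trino_password
```

```properties
# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://localhost:5432/sales
connection-user=trino_user
connection-password=trino_password
```

Note that the PostgreSQL connection URL includes a database name, while the MySQL URL does not; this follows how each JDBC-based connector addresses schemas.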
These configurations allow Trino to access both data sources, enabling cross-database queries.
Sample Queries for Analytics
Use Case 1: Customer Purchase Insights
To analyze customer purchasing trends, we’ll join data from the Transactions and Customer tables. This query will return insights such as the total spending per customer along with demographic information.
- Customer Purchase Insights (customer_insights.sql)
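A query along these lines could live in queries/customer_insights.sql. The catalog, schema, table, and column names here are assumptions for a local setup like the one sketched earlier, not fixed identifiers:

```sql
-- Join PostgreSQL transactions with MySQL customer demographics
SELECT
    c.customer_id,
    c.age_group,
    c.region,
    COUNT(t.order_id) AS order_count,
    SUM(t.amount)     AS total_spent
FROM postgresql.public.orders AS t
JOIN mysql.sales.customers AS c
    ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.age_group, c.region
ORDER BY total_spent DESC;
```

Each table is addressed by its fully qualified catalog.schema.table name, which is what lets a single Trino query span both databases.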
This query highlights Trino’s capability to perform joins across databases seamlessly, providing actionable insights on customer segments with the highest purchase volume.
Use Case 2: Sales Analysis by Region
For business planning, we may want to see sales broken down by region to understand which areas are performing best.
- Sales Analysis by Region (sales_analysis.sql)
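A possible version of queries/sales_analysis.sql, under the same assumed schema as above:

```sql
-- Aggregate sales totals per customer region
SELECT
    c.region,
    COUNT(t.order_id) AS order_count,
    SUM(t.amount)     AS total_sales
FROM postgresql.public.orders AS t
JOIN mysql.sales.customers AS c
    ON t.customer_id = c.customer_id
GROUP BY c.region
ORDER BY total_sales DESC;
```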
This report allows businesses to identify top-performing regions, helping to guide decisions on resource allocation or targeted marketing.
Running the Queries
With Trino running, you can execute these queries through the Trino CLI:
- Start Trino Server
- Execute Query in CLI
Use the queries/customer_insights.sql and queries/sales_analysis.sql files to analyze data across databases directly from the Trino interface.
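For a standard local installation, the commands might look like this (the server address assumes a default single-node setup listening on port 8080):

```shell
# Start the Trino server (run from the installation directory)
bin/launcher start

# Run the saved query files against the local coordinator
trino --server http://localhost:8080 -f queries/customer_insights.sql
trino --server http://localhost:8080 -f queries/sales_analysis.sql
```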
Conclusion and Next Steps
This sample project demonstrates Trino’s power in a local setup, querying multiple data sources seamlessly. In the next blog, we’ll dive into optimizing these queries for better performance and explore advanced Trino features for real-time analytics.
Stay tuned as we explore more about Trino’s capabilities in modern data architectures!