
Trino Series: Building a Sample Project on Local Installation

This entry is part 2 of 5 in the series TRINO Series

Why a Trino Series Instead of Presto?

If you followed the initial post in this series, you may recall we discussed the history of Presto and its recent transformation into what is now known as Trino. Originally developed as Presto at Facebook, this powerful SQL query engine has seen an incredible journey. The transition to Trino represents the evolution of PrestoSQL into a more robust, community-driven platform focused on advanced distributed SQL features.

The rebranding to Trino wasn’t merely a name change—it reflects a shift toward greater community collaboration, improved flexibility, and extended support for analytics across a wide variety of data sources. By starting this series with Trino, I aim to introduce readers to the latest advancements while retaining the familiar, high-performance querying engine that initially made Presto so impactful.

Timeline: Presto to Trino – Key Milestones

Here’s a brief look at the major milestones in Presto’s evolution to Trino:

  • 2012: Presto was created by engineers Martin Traverso, David Phillips, Dain Sundstrom, and Eric Hwang at Facebook to address the need for fast, distributed SQL analytics on large-scale data.
  • 2013: Presto was released as an open-source project, quickly gaining popularity due to its high performance and compatibility with various data sources.
  • 2018: The original creators of Presto (Martin, Dain, and David) left Facebook to focus on an open-source community version, which they named PrestoSQL. This version evolved separately from the original PrestoDB, which remained under Facebook’s oversight.
  • 2020: To reduce confusion and clarify its independence, PrestoSQL was rebranded as Trino. This rebranding marked a shift in governance under the newly established Trino Software Foundation, aiming for a fully open and community-driven model.
  • 2021: Trino continues to evolve, with a vibrant and active community developing new features, supporting a broader range of data sources, and enhancing performance for modern analytics.

By choosing Trino as the focal point of this series, we’re delving into a platform that has fully embraced open-source principles and is actively evolving to meet the complex needs of today’s data environments.

 


Introduction: Expanding on Trino with a Local Project

In the previous blog, we covered the basics of Trino and its setup. Now, we’ll dive deeper by building a sample project that highlights Trino’s power to query multiple data sources within a unified SQL interface. This project will guide you through the setup, project structure, and business goals for a hypothetical use case, showcasing how Trino enables fast, interactive data analytics.

Project Overview: Unified Sales Analytics

Business Goal

Our sample project will focus on Unified Sales Analytics. Imagine a scenario where a company’s sales data is stored across different data sources—one database contains historical transactions, while another houses customer demographic information. By using Trino, we’ll enable efficient cross-database querying to generate insights like customer purchase trends and sales by demographic.

Data Sources

We’ll simulate two data sources:

  • Transactions Database (PostgreSQL): Stores transaction data such as order ID, customer ID, amount, and order date.
  • Customer Database (MySQL): Contains customer information, including demographics, region, and signup date.

These databases will run locally, allowing us to create a fully functioning Trino environment on a single machine.
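If you want to follow along end to end, the two tables can be created locally. The column names below are assumptions inferred from the queries used later in this post, not a prescribed schema — adjust names and types to match your own data:

```sql
-- Hypothetical schemas inferred from the queries later in this post.
-- Run the first statement in MySQL (customer_db) and the second in
-- PostgreSQL (transactions_db).

-- MySQL: customer_db
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    region      VARCHAR(50),
    age_group   VARCHAR(20),
    signup_date DATE
);

-- PostgreSQL: transactions_db
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    amount      DECIMAL(10, 2),
    order_date  DATE
);
```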


Setting Up the Project

Step 1: Create Project Directory Structure

Let’s begin by creating a structured directory for our project, following the convention used in the Trino community for organizing configuration and catalog files.

plaintext
trino_sales_analytics/
├── etc/
│   ├── catalog/
│   │   ├── mysql.properties
│   │   └── postgresql.properties
│   ├── config.properties
│   ├── jvm.config
│   └── log.properties
└── queries/
    ├── customer_insights.sql
    └── sales_analysis.sql
  • etc/catalog: Stores configuration files for each data source.
  • queries: Contains SQL queries for analysis tasks.
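The layout above can be created in one go from a terminal — a minimal sketch using standard POSIX commands:

```shell
# Create the project skeleton: config dir, catalog dir, and query dir
mkdir -p trino_sales_analytics/etc/catalog trino_sales_analytics/queries

# Placeholder files, to be filled in during the following steps
touch trino_sales_analytics/etc/catalog/mysql.properties \
      trino_sales_analytics/etc/catalog/postgresql.properties \
      trino_sales_analytics/etc/config.properties \
      trino_sales_analytics/etc/jvm.config \
      trino_sales_analytics/etc/log.properties \
      trino_sales_analytics/queries/customer_insights.sql \
      trino_sales_analytics/queries/sales_analysis.sql
```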

Step 2: Configure Trino Catalogs

We’ll configure Trino to connect to both databases by creating catalog files in the etc/catalog directory.

  1. MySQL Catalog Configuration (mysql.properties):
    plaintext
    connector.name=mysql
    connection-url=jdbc:mysql://localhost:3306/customer_db
    connection-user=root
    connection-password=root_password
  2. PostgreSQL Catalog Configuration (postgresql.properties):
    plaintext
    connector.name=postgresql
    connection-url=jdbc:postgresql://localhost:5432/transactions_db
    connection-user=postgres
    connection-password=postgres_password

These configurations allow Trino to access both data sources, enabling cross-database queries.
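The directory tree also lists config.properties, jvm.config, and log.properties, which are needed before Trino will start. Below is a minimal single-node sketch — the values are typical starting points, not tuned settings, so adjust the port and heap size for your machine (a standard Trino install additionally expects an etc/node.properties file with node.environment, node.id, and node.data-dir):

```plaintext
# etc/config.properties — single node acting as both coordinator and worker
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://localhost:8080

# etc/jvm.config — minimal JVM options
-server
-Xmx2G

# etc/log.properties — default log level
io.trino=INFO
```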


Sample Queries for Analytics

Use Case 1: Customer Purchase Insights

To analyze customer purchasing trends, we’ll join data from the Transactions and Customer tables. This query will return insights such as the total spending per customer along with demographic information.

  1. Customer Purchase Insights (customer_insights.sql):
    sql
    SELECT
      c.customer_id,
      c.region,
      c.age_group,
      SUM(t.amount) AS total_spent,
      COUNT(t.order_id) AS total_orders
    FROM mysql.customer_db.customers AS c
    JOIN postgresql.transactions_db.orders AS t
      ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.region, c.age_group
    ORDER BY total_spent DESC;

This query highlights Trino’s capability to perform joins across databases seamlessly, providing actionable insights on customer segments with the highest purchase volume.

Use Case 2: Sales Analysis by Region

For business planning, we may want to see sales broken down by region to understand which areas are performing best.

  1. Sales Analysis by Region (sales_analysis.sql):
    sql
    SELECT
      c.region,
      COUNT(t.order_id) AS orders_count,
      SUM(t.amount) AS total_revenue
    FROM postgresql.transactions_db.orders AS t
    JOIN mysql.customer_db.customers AS c
      ON t.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY total_revenue DESC;

This report allows businesses to identify top-performing regions, helping to guide decisions on resource allocation or targeted marketing.


Running the Queries

With Trino running, you can execute these queries through the Trino CLI (note that since the rebranding, the server and CLI binaries are named trino, not presto):

  1. Start the Trino Server (from the Trino installation directory):
    bash
    bin/launcher run
  2. Execute Queries in the CLI (assuming the Trino CLI executable JAR has been downloaded and renamed to trino):
    bash
    ./trino --server localhost:8080 --catalog mysql --schema customer_db

You can also run the saved query files directly, for example with ./trino --server localhost:8080 --file queries/customer_insights.sql, to analyze data across databases from the Trino interface.


Conclusion and Next Steps

This sample project demonstrates Trino’s power in a local setup, querying multiple data sources seamlessly. In the next blog, we’ll dive into optimizing these queries for better performance and exploring advanced Trino features for real-time analytics.

Stay tuned as we explore more about Trino’s capabilities in modern data architectures!
