PRESTO / Trino Basics
Introduction: My Journey into Presto
My interest in Presto was sparked in early 2021 after an enriching conversation with Brian Luisi, PreSales Manager at Starburst. His insights into distributed SQL query engines opened my eyes to the unique capabilities and performance advantages of Presto. Eager to dive deeper, I joined the Presto community on Slack to keep up with developments and collaborate with like-minded professionals. This blog series is an extension of that journey, aiming to demystify Presto and share my learnings with others curious about distributed analytics solutions.
What is PRESTO
Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, PostgreSQL, Cassandra, Kafka, MongoDB, and Teradata. One can even query data from multiple data sources within a single query. (Wikipedia)
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. (prestodb.io)
People are often confused about the difference between PrestoDB and PrestoSQL; follow the link to understand the distinction. To remove this confusion, PrestoSQL was rebranded as Trino in December 2020.
| PrestoDB | PrestoSQL |
| --- | --- |
| Martin Traverso, David Phillips, Dain Sundstrom, and Eric Hwang created PrestoDB in 2012 while at Facebook. | Martin, Dain, and David left Facebook to pursue building the Presto open-source community full-time, under the new name PrestoSQL, in 2018. |
| The Presto project is supported by the Presto Foundation, an independent nonprofit organization with open and neutral governance, hosted under the Linux Foundation®. | The Trino project is a community-driven project under the non-profit Trino Software Foundation. |
While both Presto and Trino serve similar purposes, the Trino community has emerged as particularly vibrant, attracting a diverse array of contributors and use cases. I’ve found the activity on the Trino community Slack invaluable for staying informed about the latest improvements and real-world applications.
Why Use PRESTO?
Presto offers a unique advantage by allowing you to query structured and unstructured data sources with high performance. This makes it an excellent choice for enterprises needing flexibility in handling mixed data formats while retaining SQL’s familiar syntax. Moreover, its distributed nature and scalability mean it can handle massive datasets efficiently, even with complex joins and aggregations.
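The federation story is easiest to see in a query. The sketch below prints a cross-catalog join that mixes a PostgreSQL table with Presto's built-in TPCH sample catalog; the catalog, schema, and table names are illustrative assumptions rather than a real setup, and the commented CLI call shows how you would run it against a live server.

```shell
# Illustrative federated query: join a table from a PostgreSQL catalog
# with Presto's built-in TPCH sample catalog in a single statement.
# Catalog/schema/table names here are assumptions, not from a real setup.
QUERY='SELECT o.orderkey, n.name AS nation
FROM postgresql.public.orders AS o
JOIN tpch.tiny.nation AS n ON o.orderkey % 25 = n.nationkey'
printf 'Federated query:\n%s\n' "$QUERY"
# With a running Presto server you would execute it via the CLI, e.g.:
#   presto --server localhost:8080 --execute "$QUERY"
```

The point is that both sides of the join are addressed with the same `catalog.schema.table` naming, so data in two entirely different systems can be combined in one SQL statement.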
Installation Process: Getting PrestoDB on Your Mac
One can follow the step-by-step installation process described in the Presto 0.247 documentation.
As you are all aware, I use Homebrew for my installations wherever possible, and I will do the same here.
brew install presto
If PrestoDB is already installed, you can either reinstall it with
brew reinstall prestodb
or upgrade it by
brew upgrade prestodb
Installation log:
==> Downloading https://search.maven.org/remotecontent?filepath=com/facebook/presto/presto-cli/0.247/presto-cli-0.247-executable.jar
==> Downloading from https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.247/presto-cli-0.247-executable.jar
######################################################################## 100.0%
==> Downloading https://search.maven.org/remotecontent?filepath=com/facebook/presto/presto-server/0.247/presto-server-0.247.tar.gz
==> Downloading from https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.247/presto-server-0.247.tar.gz
######################################################################## 100.0%
==> Reinstalling prestodb
==> Caveats
Add connectors to /usr/local/opt/prestodb/libexec/etc/catalog/. See:
https://prestodb.io/docs/current/connector.html
To have launchd start prestodb now and restart at login:
brew services start prestodb
Or, if you don't want/need a background service you can just run:
presto-server run
==> Summary
🍺 /usr/local/Cellar/prestodb/0.247: 1,537 files, 940.2MB, built in 32 seconds
As part of the installation process, the following files are created in the
/usr/local/opt/prestodb/libexec/etc
folder:
kinshukdutta@Kinshuks-MacBook-Pro-15 ~ % cd /usr/local/opt/prestodb/libexec/etc
kinshukdutta@Kinshuks-MacBook-Pro-15 etc % ls -l
total 32
drwxr-xr-x 3 kinshukdutta admin 96 Feb 16 21:38 catalog
-rw-r--r-- 1 kinshukdutta admin 201 Feb 16 21:38 config.properties
-rw-r--r-- 1 kinshukdutta admin 205 Feb 16 21:38 jvm.config
-rw-r--r-- 1 kinshukdutta admin 24 Feb 16 21:38 log.properties
-rw-r--r-- 1 kinshukdutta admin 114 Feb 16 21:38 node.properties
It also creates a sample JMX connector under the etc/catalog folder.
kinshukdutta@Kinshuks-MacBook-Pro-15 catalog % ls -l
total 8
-rw-r--r-- 1 kinshukdutta admin 18 Feb 16 21:38 jmx.properties
Next, add a connector; the Presto connector documentation lists the available ones.
For PostgreSQL, create
/usr/local/Cellar/prestodb/0.247/libexec/etc/catalog/postgresql.properties
with:
connector.name=postgresql
connection-url=jdbc:postgresql://localhost:5432/externalDatabase
connection-user=postgres
connection-password=postgres
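If you prefer to script this step, the catalog file can be generated from the shell. The sketch below writes to a temporary directory purely for illustration; in a real install you would target the `etc/catalog` directory mentioned in the Homebrew caveats.

```shell
# Generate the PostgreSQL catalog file from the shell.
# Using a temp directory for illustration; in a real install the target is
# the Presto etc/catalog directory (see the Homebrew caveats above).
CATALOG_DIR=$(mktemp -d)
cat > "$CATALOG_DIR/postgresql.properties" <<'EOF'
connector.name=postgresql
connection-url=jdbc:postgresql://localhost:5432/externalDatabase
connection-user=postgres
connection-password=postgres
EOF
cat "$CATALOG_DIR/postgresql.properties"   # sanity-check the contents
```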
Starting PrestoDB
brew services start prestodb
==> Successfully started `prestodb` (label: homebrew.mxcl.prestodb)
Troubleshoot
Console log
JMX agent started and listening on Kinshuks-MacBook-Pro-15.local:64827
2021-02-17T11:53:37.234-0500 ERROR main com.facebook.presto.server.PrestoServer Unable to create injector, see the following errors:
1) Error in custom provider, java.lang.NullPointerException: Cannot invoke "com.facebook.airlift.discovery.client.AnnouncementHttpServerInfo.getHttpUri()" because "this.httpServerInfo" is null
at com.facebook.airlift.discovery.client.DiscoveryBinder.bindServiceAnnouncement(DiscoveryBinder.java:79)
while locating com.facebook.airlift.discovery.client.ServiceAnnouncement annotated with @com.google.inject.internal.Element(setName=,uniqueId=365, type=MULTIBINDER, keyType=)
while locating java.util.Set<com.facebook.airlift.discovery.client.ServiceAnnouncement>
for the 2nd parameter of com.facebook.airlift.discovery.client.Announcer.<init>(Announcer.java:68)
at com.facebook.airlift.discovery.client.DiscoveryModule.configure(DiscoveryModule.java:64)
while locating com.facebook.airlift.discovery.client.Announcer
for the 2nd parameter of com.facebook.airlift.discovery.client.DiscoveryModule.createMergingServiceSelectorFactory(DiscoveryModule.java:122)
at com.facebook.airlift.discovery.client.DiscoveryModule.createMergingServiceSelectorFactory(DiscoveryModule.java:122)
while locating com.facebook.airlift.discovery.client.MergingServiceSelectorFactory
at com.facebook.airlift.discovery.client.DiscoveryModule.configure(DiscoveryModule.java:73)
while locating com.facebook.airlift.discovery.client.ServiceSelectorFactory
for the 1st parameter of com.facebook.airlift.discovery.client.ServiceSelectorProvider.setServiceSelectorFactory(ServiceSelectorProvider.java:50)
at com.facebook.airlift.event.client.HttpEventModule.configure(HttpEventModule.java:43)
Caused by: java.lang.NullPointerException: Cannot invoke "com.facebook.airlift.discovery.client.AnnouncementHttpServerInfo.getHttpUri()" because "this.httpServerInfo" is null
at com.facebook.airlift.discovery.client.DiscoveryBinder$HttpAnnouncementProvider.get(DiscoveryBinder.java:128)
at com.facebook.airlift.discovery.client.DiscoveryBinder$HttpAnnouncementProvider.get(DiscoveryBinder.java:108)
I followed the solution from the linked thread and updated postgresql.properties by removing the <databaseName> from the connection URL.
I restarted the server, but this time faced a different error in the terminal:
Caused by: java.net.BindException: Address already in use
at java.base/sun.nio.ch.Net.bind0(Native Method)
at java.base/sun.nio.ch.Net.bind(Net.java:550)
at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:249)
at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:88)
at com.facebook.airlift.http.server.HttpServerInfo.createChannel(HttpServerInfo.java:160)
... 43 more
By default, most Java web application servers listen on port 8080, which easily causes the common 8080 port-conflict error. On macOS, you can use lsof to find out which program is listening on port 8080. In terminal:
lsof -i :8080 | grep LISTEN
java 46655 kinshukdutta 45u IPv6 0x451da3ea396614ab 0t0 TCP *:http-alt (LISTEN)
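The check can be wrapped into a small helper that works whether or not the port is busy (a sketch; the lsof flags shown behave the same on macOS and Linux):

```shell
# Report which process, if any, is listening on a given TCP port.
PORT=8080
PID=$(lsof -ti tcp:"$PORT" 2>/dev/null | head -n1)   # -t prints PIDs only
if [ -n "$PID" ]; then
  echo "Port $PORT is in use by PID $PID:"
  ps -p "$PID" -o pid,comm                           # show the owning command
else
  echo "Port $PORT is free"
fi
```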
The name “java” alone doesn’t tell you much; to get the details, run ps on the Java PID (46655) like this:
% ps -ef 46655
UID PID PPID C STIME TTY TIME CMD
501 46655 1 0 11:53AM ?? 0:05.14 /usr/local/opt/openjdk/bin/java -Dzookeeper.log.dir=/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../logs -Dzookeeper.log.file=zookeeper-kinshukdutta-server-Kinshuks-MacBook-Pro-15.local.log -Dzookeeper.root.logger=INFO,CONSOLE -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p -cp /usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../zookeeper-server/target/classes:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../build/classes:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../zookeeper-server/target/lib/*.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../build/lib/*.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/zookeeper-prometheus-metrics-3.6.2.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/zookeeper-jute-3.6.2.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/zookeeper-3.6.2.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/snappy-java-1.1.7.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/slf4j-log4j12-1.7.25.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/slf4j-api-1.7.25.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/simpleclient_servlet-0.6.0.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/simpleclient_hotspot-0.6.0.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/simpleclient_common-0.6.0.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/simpleclient-0.6.0.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-transport-native-unix-common-4.1.50.Final.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-transport-native-epoll-4.1.50.Final.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-transport-4.1.50.Final.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-resolver-4.1.50.Final.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-handler-4.1.50.Final.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-common-4.1.50.Final.
jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-codec-4.1.50.Final.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/netty-buffer-4.1.50.Final.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/metrics-core-3.2.5.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/log4j-1.2.17.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/json-simple-1.1.1.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jline-2.14.6.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jetty-util-9.4.24.v20191120.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jetty-servlet-9.4.24.v20191120.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jetty-server-9.4.24.v20191120.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jetty-security-9.4.24.v20191120.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jetty-io-9.4.24.v20191120.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jetty-http-9.4.24.v20191120.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/javax.servlet-api-3.1.0.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jackson-databind-2.10.3.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jackson-core-2.10.3.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/jackson-annotations-2.10.3.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/commons-lang-2.6.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/commons-cli-1.2.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../lib/audience-annotations-0.5.0.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../zookeeper-*.jar:/usr/local/Cellar/zookeeper/3.6.2_1/libexec/bin/../zookeeper-server/src/main/resources/lib/*.jar:/usr/local/etc/zookeeper: -Xmx1000m -Dapple.awt.UIElement=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/local/etc/zookeeper/zoo.cfg
ZooKeeper was using port 8080, so I stopped both Kafka and ZooKeeper.
brew services stop kafka
Stopping `kafka`... (might take a while)
==> Successfully stopped `kafka` (label: homebrew.mxcl.kafka)
kinshukdutta@Kinshuks-MacBook-Pro-15 catalog % brew services stop zookeeper
Stopping `zookeeper`... (might take a while)
==> Successfully stopped `zookeeper` (label: homebrew.mxcl.zookeeper)
I then restarted PrestoDB:
presto-server run
This time, the terminal showed a success message:
2021-02-23T12:20:18.426-0500 INFO main com.facebook.presto.server.PluginManager Registering connector tpch
2021-02-23T12:20:18.429-0500 INFO main com.facebook.presto.server.PluginManager -- Finished loading plugin /usr/local/var/presto/data/plugin/tpch --
2021-02-23T12:20:18.450-0500 INFO main com.facebook.presto.metadata.StaticCatalogStore -- Loading catalog properties etc/catalog/postgresql.properties --
2021-02-23T12:20:18.452-0500 INFO main com.facebook.presto.metadata.StaticCatalogStore -- Loading catalog postgresql --
2021-02-23T12:20:19.135-0500 INFO main Bootstrap PROPERTY DEFAULT RUNTIME DESCRIPTION
2021-02-23T12:20:19.135-0500 INFO main Bootstrap case-insensitive-name-matching false false
2021-02-23T12:20:19.135-0500 INFO main Bootstrap case-insensitive-name-matching.cache-ttl 1.00m 1.00m
2021-02-23T12:20:19.135-0500 INFO main Bootstrap connection-password [REDACTED] [REDACTED]
2021-02-23T12:20:19.135-0500 INFO main Bootstrap connection-url ---- jdbc:postgresql://localhost:5432
2021-02-23T12:20:19.135-0500 INFO main Bootstrap connection-user ---- postgres
2021-02-23T12:20:19.135-0500 INFO main Bootstrap allow-drop-table false false Allow connector to drop tables
2021-02-23T12:20:19.475-0500 INFO main com.facebook.airlift.bootstrap.LifeCycleManager Life cycle starting...
2021-02-23T12:20:19.475-0500 INFO main com.facebook.airlift.bootstrap.LifeCycleManager Life cycle startup complete. System ready.
2021-02-23T12:20:19.477-0500 INFO main com.facebook.presto.metadata.StaticCatalogStore -- Added catalog postgresql using connector postgresql --
2021-02-23T12:20:19.477-0500 INFO main com.facebook.presto.metadata.StaticCatalogStore -- Loading catalog properties etc/catalog/jmx.properties --
2021-02-23T12:20:19.478-0500 INFO main com.facebook.presto.metadata.StaticCatalogStore -- Loading catalog jmx --
2021-02-23T12:20:20.041-0500 INFO main Bootstrap PROPERTY DEFAULT RUNTIME DESCRIPTION
2021-02-23T12:20:20.041-0500 INFO main Bootstrap jmx.dump-period 10.00s 10.00s
2021-02-23T12:20:20.041-0500 INFO main Bootstrap jmx.dump-tables [] []
2021-02-23T12:20:20.041-0500 INFO main Bootstrap jmx.max-entries 86400 86400
2021-02-23T12:20:20.371-0500 INFO main com.facebook.presto.metadata.StaticCatalogStore -- Added catalog jmx using connector jmx --
2021-02-23T12:20:20.393-0500 INFO main com.facebook.presto.security.AccessControlManager -- Loading system access control --
2021-02-23T12:20:20.395-0500 INFO main com.facebook.presto.security.AccessControlManager -- Loaded system access control allow-all --
2021-02-23T12:20:20.395-0500 INFO main com.facebook.presto.storage.TempStorageManager -- Loading temp storage --
2021-02-23T12:20:20.399-0500 INFO main com.facebook.presto.storage.TempStorageManager -- Loaded temp storage local --
2021-02-23T12:20:20.434-0500 INFO main com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========
Accessing the Web UI:
After starting the Presto server, you can access the web UI at http://localhost:8080 (the same HTTP port the CLI connects to).
Command-line interface CLI
The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.
We can use the CLI to connect to the Presto server.
cd /usr/local/Cellar/prestodb/0.247/libexec
kinshukdutta@Kinshuks-MacBook-Pro-15 libexec % ls -l
total 29960
drwxr-xr-x 6 kinshukdutta admin 192 Feb 9 16:18 bin
drwxr-xr-x 7 kinshukdutta admin 224 Feb 23 09:28 etc
drwxr-xr-x 169 kinshukdutta admin 5408 Feb 23 09:27 lib
drwxr-xr-x 33 kinshukdutta admin 1056 Feb 23 09:28 plugin
-rw-r--r-- 1 kinshukdutta staff 15335794 Feb 9 16:26 presto-cli-0.247-executable.jar
As a best practice, take a backup of the original JAR, then make it executable with chmod +x and run it.
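The backup-and-chmod step can be sketched as follows. A placeholder file in a temp directory stands in for the real JAR here; in practice you would run this inside /usr/local/Cellar/prestodb/0.247/libexec against the actual CLI jar.

```shell
# Back up the CLI jar, then make the working copy executable.
# A placeholder file in a temp directory stands in for the real jar;
# substitute the real libexec path in practice.
WORKDIR=$(mktemp -d)   # in practice: /usr/local/Cellar/prestodb/0.247/libexec
JAR="$WORKDIR/presto-cli-0.247-executable.jar"
printf 'placeholder\n' > "$JAR"                            # stand-in jar
cp "$JAR" "$WORKDIR/presto-cli-0.247-executable-ori.jar"   # keep the original
chmod +x "$JAR"                                            # make it executable
ls -l "$WORKDIR"
```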
In terminal:
ls -l
total 59920
drwxr-xr-x 6 kinshukdutta admin 192 Feb 9 16:18 bin
drwxr-xr-x 7 kinshukdutta admin 224 Feb 23 09:28 etc
drwxr-xr-x 169 kinshukdutta admin 5408 Feb 23 09:27 lib
drwxr-xr-x 33 kinshukdutta admin 1056 Feb 23 09:28 plugin
-rw-r--r-- 1 kinshukdutta admin 15335794 Feb 23 14:23 presto-cli-0.247-executable-ori.jar
-rwxr-xr-x 1 kinshukdutta staff 15335794 Feb 9 16:26 presto-cli-0.247-executable.jar
Execute the CLI:
./presto-cli-0.247-executable.jar --server localhost:8080 --catalog postgresql --schema public
presto:public>
Basic Operations
Use Case 1: Transferring selective records from one table to another.
Create a Table
presto:public> CREATE TABLE orders ( orderkey bigint, orderstatus varchar, totalprice double, orderdate date ) ;
CREATE TABLE
presto:public>
Create another table for active orders.
presto:public> CREATE TABLE activeorders ( orderkey bigint, orderstatus varchar, totalprice double, orderdate date );
CREATE TABLE
Using pgAdmin 4, I inserted data into the orders table; some orders are active, some inactive.
The goal now is to fetch the active orders and push them into the activeorders table.
presto:public> INSERT INTO activeorders SELECT * FROM orders WHERE orders.orderstatus='active';
INSERT: 25 rows
Query 20210223_224621_00018_g3pg4, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [25 rows, 0B] [22 rows/s, 0B/s]
presto:public>
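To sanity-check the transfer, a row-count comparison works well. The sketch below prints the verification query; the commented CLI call shows how you would run it non-interactively with --execute, assuming the server is up and the table names used above.

```shell
# Verification query for the transfer; table name matches the example above.
QUERY='SELECT count(*) AS active_rows FROM activeorders'
printf 'Verification query:\n%s\n' "$QUERY"
# With the server running:
#   ./presto-cli-0.247-executable.jar --server localhost:8080 \
#     --catalog postgresql --schema public --execute "$QUERY"
```

The count returned should match the 25 rows reported by the INSERT.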
For more specific guidance, check out Presto’s troubleshooting documentation.
Interactive Queries and Sample Use Case
One of the most compelling aspects of Presto is its ability to seamlessly join data from different sources. Use Case 1 above, transferring selective records between tables in a PostgreSQL database, illustrates this in its simplest form: the ease with which Presto allows users to move and analyze data makes it a powerful tool for interactive analytics.
Expanding on Community Resources
To deepen your knowledge and stay connected with others in the Presto ecosystem, I recommend joining the Presto community on Slack. Here you’ll find active discussions, support from fellow users, and updates on new developments.
As I continue this series, I aim to cover more advanced topics, including optimizing queries for performance and integrating Presto with other tools like Kafka for real-time analytics. By sharing these insights, I hope to make Presto more accessible and useful for both newcomers and seasoned data engineers.
Conclusion
Presto is a versatile and high-performance choice for big data analytics, especially in environments where querying multiple data sources in real-time is crucial. This series will continue to explore advanced functionalities and practical use cases to help you harness the full potential of Presto.
Stay tuned for the next post, where I’ll delve into query optimizations and performance tuning strategies. Join me on this journey as we uncover the power of distributed SQL with Presto!
Note on Terminology: Presto and Trino
In this series, you may see us use “Presto” and “Trino” interchangeably. This is because, in December 2020, PrestoSQL—the version most commonly used by the open-source community—was officially rebranded as Trino. This rebranding clarified the distinction between PrestoDB, which is managed by the Presto Foundation and rooted in the original Facebook project, and Trino, the community-driven evolution of PrestoSQL, led by the same key developers who initially built Presto.
We’ll primarily refer to Trino going forward, as it reflects the ongoing work and advancements made by the broader open-source community. However, the fundamentals remain the same, and much of the knowledge applies to both PrestoDB and Trino. This approach allows us to keep the series relevant while acknowledging Trino as the most up-to-date and actively maintained branch of the technology.