Pinot™ Basics
Basic Concepts
Pinot is designed to deliver low-latency queries on large datasets. To achieve this, it stores data in a columnar format and adds indexes that speed up filtering, aggregation, and group-by operations.
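To make the idea concrete, here is a toy Python sketch of a columnar layout with an inverted index. It is not how Pinot is implemented internally; the class name ColumnarTable, its methods, and the sample rows are all made up for illustration, and every row is assumed to have the same keys.

```python
from collections import defaultdict


class ColumnarTable:
    """Toy column store: one list per column plus optional inverted indexes."""

    def __init__(self, rows, indexed_columns=()):
        # Store each column contiguously instead of storing whole rows.
        self.columns = defaultdict(list)
        for row in rows:
            for name, value in row.items():
                self.columns[name].append(value)
        # Inverted index: column value -> list of row ids holding that value.
        self.inverted = {}
        for name in indexed_columns:
            index = defaultdict(list)
            for doc_id, value in enumerate(self.columns[name]):
                index[value].append(doc_id)
            self.inverted[name] = index

    def sum_where(self, sum_column, filter_column, filter_value):
        """Compute SUM(sum_column) WHERE filter_column = filter_value."""
        if filter_column in self.inverted:
            # Index lookup: jump straight to the matching row ids.
            doc_ids = self.inverted[filter_column].get(filter_value, [])
        else:
            # No index: scan only the filter column, never whole rows.
            doc_ids = [
                i for i, v in enumerate(self.columns[filter_column])
                if v == filter_value
            ]
        values = self.columns[sum_column]
        return sum(values[i] for i in doc_ids)


# Hypothetical sample rows, loosely shaped like a player statistics table.
rows = [
    {"playerName": "A", "yearID": 1999, "runs": 10},
    {"playerName": "B", "yearID": 2000, "runs": 7},
    {"playerName": "A", "yearID": 2000, "runs": 5},
]
table = ColumnarTable(rows, indexed_columns=["yearID"])
total_runs_2000 = table.sum_where("runs", "yearID", 2000)
```

The point of the sketch is that a filter on an indexed column touches only the matching row ids, and the aggregation reads only the one column it needs.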
Raw data is broken into small data shards and each shard is converted into a unit known as a segment. One or more segments together form a table, which is the logical container for querying Pinot using SQL/PQL.
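The relationship between shards, segments, and a table can be sketched in a few lines of Python. This is a simplified model only: real Pinot segments are columnar files with indexes and metadata, and they are not necessarily cut by a fixed row count. The function names and the rows_per_segment parameter are hypothetical.

```python
def build_segments(rows, rows_per_segment):
    """Split raw rows into fixed-size shards; each shard stands in for a segment."""
    return [
        rows[i:i + rows_per_segment]
        for i in range(0, len(rows), rows_per_segment)
    ]


def count_rows(table_segments):
    """COUNT(*) over the table: aggregate per segment, then merge the partials."""
    partial_counts = [len(segment) for segment in table_segments]
    return sum(partial_counts)


# Ten made-up rows split into segments of at most four rows each.
raw_rows = [{"id": i} for i in range(10)]
table_segments = build_segments(raw_rows, rows_per_segment=4)
segment_sizes = [len(segment) for segment in table_segments]
total = count_rows(table_segments)
```

Because each segment can be processed independently, a query over the table reduces to per-segment work followed by a merge step.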
Pinot Storage Model
Pinot uses a variety of terms which can refer to either abstractions that model the storage of data or infrastructure components that drive the functionality of the system.
Pinot Components
A Pinot cluster consists of several distributed system components. Understanding them is useful for operators who monitor system usage or debug an issue with a cluster deployment. The components are:
- Controller
- Server
- Broker
- Minion (optional)
Pinot scales horizontally as nodes are added to the cluster. This is made possible by its integration with Apache Zookeeper and Apache Helix.
Architecture
Pinot uses Apache Helix for cluster management. Helix is embedded as an agent within the different components and uses Apache Zookeeper for coordination and maintaining the overall cluster state and health.
Setting up a Pinot cluster
We’ll use the quick-start scripts that ship with the Pinot distribution. They do the following:
- Set up the Pinot cluster QuickStartCluster
- Create a sample table and load sample data
The following quick-start scripts are available:
Batch
The batch quick start creates the Pinot cluster, creates an offline table named baseballStats, and pushes sample offline data to the table.
Run the Quick Demo
cd pinot-distribution/target/apache-pinot-incubating-*-SNAPSHOT-bin/apache-pinot-incubating-*-SNAPSHOT-bin
bin/quick-start-batch.sh
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/kinshukdutta/PINOT/apache-pinot-incubating-0.6.0-bin/lib/pinot-all-0.6.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/kinshukdutta/PINOT/apache-pinot-incubating-0.6.0-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.6.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.pinot.spi.plugin.PluginClassLoader (file:/Users/kinshukdutta/PINOT/apache-pinot-incubating-0.6.0-bin/lib/pinot-all-0.6.0-jar-with-dependencies.jar) to method java.net.URLClassLoader.addURL(java.net.URL)
WARNING: Please consider reporting this to the maintainers of org.apache.pinot.spi.plugin.PluginClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
***** Starting Zookeeper, controller, broker and server *****
Executing command: StartZookeeper -zkPort 2123 -dataDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotZkDir
Start zookeeper at localhost:2123 in thread main
Executing command: StartController -clusterName QuickStartCluster -controllerHost null -controllerPort 9000 -dataDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotControllerDir0 -zkAddress localhost:2123
Executing command: StartServiceManager -clusterName QuickStartCluster -zkAddress localhost:2123 -port -1 -bootstrapServices []
Starting a Pinot [SERVICE_MANAGER] at 0.021s since launch
Started Pinot [SERVICE_MANAGER] instance [ServiceManager_10.0.0.47_-1] at 0.025s since launch
Starting a Pinot [CONTROLLER] at 0.026s since launch
Feb 27, 2021 1:07:02 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:9000]
Feb 27, 2021 1:07:02 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Started Pinot [CONTROLLER] instance [Controller_10.0.0.47_9000] at 10.177s since launch
Executing command: StartBroker -brokerHost null -brokerPort 8000 -zkAddress localhost:2123
Executing command: StartServiceManager -clusterName QuickStartCluster -zkAddress localhost:2123 -port -1 -bootstrapServices []
Starting a Pinot [SERVICE_MANAGER] at 10.178s since launch
Started Pinot [SERVICE_MANAGER] instance [ServiceManager_10.0.0.47_-1] at 10.178s since launch
Starting a Pinot [BROKER] at 10.179s since launch
Feb 27, 2021 1:07:08 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8000]
Feb 27, 2021 1:07:08 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer-1] Started.
Started Pinot [BROKER] instance [Broker_10.0.0.47_8000] at 15.568s since launch
Executing command: StartServer -clusterName QuickStartCluster -serverHost null -serverPort 7000 -serverAdminPort 7500 -dataDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotServerDataDir0 -segmentDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotServerSegmentDir0 -zkAddress localhost:2123
Executing command: StartServiceManager -clusterName QuickStartCluster -zkAddress localhost:2123 -port -1 -bootstrapServices []
Starting a Pinot [SERVICE_MANAGER] at 15.57s since launch
Started Pinot [SERVICE_MANAGER] instance [ServiceManager_10.0.0.47_-1] at 15.571s since launch
Starting a Pinot [SERVER] at 15.571s since launch
Feb 27, 2021 1:07:14 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:7500]
Feb 27, 2021 1:07:14 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer-2] Started.
Started Pinot [SERVER] instance [Server_10.0.0.47_7000] at 20.869s since launch
***** Adding baseballStats table *****
Executing command: AddTable -tableConfigFile /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/configs/baseballStats_offline_table_config.json -schemaFile /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/configs/baseballStats_schema.json -controllerHost 10.0.0.47 -controllerPort 9000 -exec
{"status":"Table baseballStats_OFFLINE succesfully added"}
***** Launch data ingestion job to build index segment for baseballStats and push to controller *****
***** Waiting for 5 seconds for the server to fetch the assigned segment *****
***** Offline quickstart setup complete *****
Total number of documents in the table
Query : select count(*) from baseballStats limit 1
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select count(*) from baseballStats limit 1
count(*)
97889
***************************************************
Top 5 run scorers of all time
Query : select playerName, sum(runs) from baseballStats group by playerName order by sum(runs) desc limit 5
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, sum(runs) from baseballStats group by playerName order by sum(runs) desc limit 5
playerName sum(runs)
John Joseph 11581.0
Michael Joseph 7981.0
James Edward 6083.0
William Henry 5933.0
William Joseph 5547.0
***************************************************
Top 5 run scorers of the year 2000
Query : select playerName, sum(runs) from baseballStats where yearID=2000 group by playerName order by sum(runs) desc limit 5
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, sum(runs) from baseballStats where yearID=2000 group by playerName order by sum(runs) desc limit 5
playerName sum(runs)
Jose Antonio 231.0
Mark David 205.0
Rafael 189.0
Jeffrey Robert 152.0
Fernando 140.0
***************************************************
Top 10 run scorers after 2000
Query : select playerName, sum(runs) from baseballStats where yearID>=2000 group by playerName order by sum(runs) desc limit 10
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, sum(runs) from baseballStats where yearID>=2000 group by playerName order by sum(runs) desc limit 10
playerName sum(runs)
Adrian 1820.0
Jose Antonio 1692.0
Rafael 1565.0
Brian Michael 1500.0
Alexander Emmanuel 1426.0
Jose Alberto 1426.0
Derek Sanderson 1390.0
Carlos 1314.0
Johnny David 1300.0
Ichiro 1261.0
***************************************************
Print playerName,runs,homeRuns for 10 records from the table and order them by yearID
Query : select playerName, runs, homeRuns from baseballStats order by yearID limit 10
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, runs, homeRuns from baseballStats order by yearID limit 10
playerName runs homeRuns
Alfred L. 0 0
Charles Roscoe 66 0
Adrian Constantine 29 0
Robert 9 0
Arthur Algernon 28 0
Douglas L. 28 2
Francis Patterson 0 0
Robert Edward 30 0
Franklin Lee 13 0
William 1 0
***************************************************
You can always go to http://localhost:9000 to play around in the query console
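The queries printed in the quick-start output can also be sent to the broker from a script. The sketch below is a minimal Python example, assuming the quick-start broker from the log is still listening on localhost:8000 and that it accepts a JSON body with an "sql" field on the /query/sql endpoint. The helper names are hypothetical, and the assumed response layout (resultTable with dataSchema.columnNames and rows) should be checked against the Pinot version in use.

```python
import json
import urllib.request

# Assumption: the quick-start broker from the log, listening on port 8000.
BROKER_URL = "http://localhost:8000/query/sql"


def build_query_request(sql, broker_url=BROKER_URL):
    """Build (but do not send) an HTTP POST carrying the SQL as a JSON body."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        broker_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_from_response(response):
    """Turn a broker response dict into a list of {column: value} dicts.

    Assumes the response carries resultTable.dataSchema.columnNames and
    resultTable.rows; returns an empty list when either is missing.
    """
    result_table = response.get("resultTable", {})
    columns = result_table.get("dataSchema", {}).get("columnNames", [])
    return [dict(zip(columns, row)) for row in result_table.get("rows", [])]


def run_query(sql, broker_url=BROKER_URL):
    """Send the query to a running broker and return the parsed rows."""
    with urllib.request.urlopen(build_query_request(sql, broker_url)) as resp:
        return rows_from_response(json.load(resp))


# Building the request needs no running cluster; run_query() does.
request = build_query_request("select count(*) from baseballStats limit 1")
```

With the quick-start cluster running, calling run_query with any of the queries from the log should return the same rows the script printed.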
Pinot Console
The Pinot console has four main sections.
Cluster Manager UI