Pinot™ Basics

This entry is part 6 of 6 in the series Pinot Series

Basic Concepts

Pinot is designed to deliver low-latency queries on large datasets. To achieve this, it stores data in a columnar format and adds indices that speed up filtering, aggregation, and group-by operations.
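To make the idea concrete, here is a minimal Python sketch of a columnar layout with an inverted index. It is a toy illustration, not Pinot's actual storage code: the helper names (`to_columns`, `build_inverted_index`, `sum_runs_by_player`) and the four sample rows are made up for this example, and only the column names are borrowed from the baseballStats sample table used later in this post.

```python
from collections import defaultdict

# Toy rows (made-up values); column names borrowed from the baseballStats table.
rows = [
    {"playerName": "A", "yearID": 2000, "runs": 10},
    {"playerName": "B", "yearID": 2001, "runs": 7},
    {"playerName": "A", "yearID": 2000, "runs": 5},
    {"playerName": "C", "yearID": 2000, "runs": 3},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def build_inverted_index(values):
    """Map each distinct value to the row positions (doc ids) that hold it."""
    index = defaultdict(list)
    for doc_id, value in enumerate(values):
        index[value].append(doc_id)
    return dict(index)


columns = to_columns(rows)
year_index = build_inverted_index(columns["yearID"])


def sum_runs_by_player(year):
    """Filter through the index, then aggregate while reading only two columns."""
    totals = defaultdict(int)
    for doc_id in year_index.get(year, []):
        totals[columns["playerName"][doc_id]] += columns["runs"][doc_id]
    return dict(totals)


print(sum_runs_by_player(2000))
```

The filter never scans the yearID column, and the aggregation touches only the playerName and runs columns. That is the intuition behind combining a columnar layout with indices.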

Raw data is broken into small data shards and each shard is converted into a unit known as a segment. One or more segments together form a table, which is the logical container for querying Pinot using SQL/PQL.
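The relationship between shards, segments, and tables can be sketched in a few lines of Python. This is a simplified model under stated assumptions: real segments are built by an ingestion job and carry columnar data and indices, whereas here a segment is just a list of rows. The function names, the `toyTable` name, and the fixed shard size are hypothetical.

```python
def make_segments(rows, rows_per_segment):
    """Split raw rows into fixed-size shards; each shard stands in for a segment."""
    return [
        rows[start:start + rows_per_segment]
        for start in range(0, len(rows), rows_per_segment)
    ]


def count_where(segment, predicate):
    """Compute a partial result over a single segment."""
    return sum(1 for row in segment if predicate(row))


def query_table(table, predicate):
    """Evaluate the query on every segment of the table and merge the partial counts."""
    return sum(count_where(segment, predicate) for segment in table["segments"])


# 25 made-up rows covering the years 1990 through 2014.
raw_rows = [{"yearID": 1990 + i} for i in range(25)]

# A table is modeled as a named collection of segments.
table = {"name": "toyTable", "segments": make_segments(raw_rows, 10)}

print(len(table["segments"]))
print(query_table(table, lambda row: row["yearID"] >= 2000))
```

Because each segment is queried independently, the per-segment work can in principle be spread across machines, with only the small partial results merged at the end.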

Pinot Storage Model

Pinot uses a variety of terms that refer either to abstractions that model the storage of data or to infrastructure components that drive the functionality of the system.

Pinot Storage Model Abstraction

Pinot Components

A Pinot cluster is composed of multiple distributed system components. Understanding these components is useful for operators who monitor system usage or debug issues with a cluster deployment.

  • Controller
  • Server
  • Broker
  • Minion (optional)

Pinot's ability to scale linearly as nodes are added is made possible through its integration with Apache Zookeeper and Apache Helix.

Architecture

Pinot uses Apache Helix for cluster management. Helix is embedded as an agent within the different components and uses Apache Zookeeper for coordination and maintaining the overall cluster state and health.

Core components

Setting up a Pinot cluster

We’ll use the quick-start scripts that ship with the Pinot distribution, which do the following:

  1. Set up the Pinot cluster QuickStartCluster
  2. Create a sample table and load sample data

The following quick-start scripts are available:

Batch

The batch quick start creates the Pinot cluster, creates an offline table named baseballStats, and pushes sample offline data to the table.

Run the Quick Demo

cd pinot-distribution/target/apache-pinot-incubating-*-SNAPSHOT-bin/apache-pinot-incubating-*-SNAPSHOT-bin
bin/quick-start-batch.sh
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/kinshukdutta/PINOT/apache-pinot-incubating-0.6.0-bin/lib/pinot-all-0.6.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/kinshukdutta/PINOT/apache-pinot-incubating-0.6.0-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.6.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.pinot.spi.plugin.PluginClassLoader (file:/Users/kinshukdutta/PINOT/apache-pinot-incubating-0.6.0-bin/lib/pinot-all-0.6.0-jar-with-dependencies.jar) to method java.net.URLClassLoader.addURL(java.net.URL)
WARNING: Please consider reporting this to the maintainers of org.apache.pinot.spi.plugin.PluginClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
***** Starting Zookeeper, controller, broker and server *****
Executing command: StartZookeeper -zkPort 2123 -dataDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotZkDir
Start zookeeper at localhost:2123 in thread main
Executing command: StartController -clusterName QuickStartCluster -controllerHost null -controllerPort 9000 -dataDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotControllerDir0 -zkAddress localhost:2123
Executing command: StartServiceManager -clusterName QuickStartCluster -zkAddress localhost:2123 -port -1 -bootstrapServices []
Starting a Pinot [SERVICE_MANAGER] at 0.021s since launch
Started Pinot [SERVICE_MANAGER] instance [ServiceManager_10.0.0.47_-1] at 0.025s since launch
Starting a Pinot [CONTROLLER] at 0.026s since launch
Feb 27, 2021 1:07:02 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:9000]
Feb 27, 2021 1:07:02 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Started Pinot [CONTROLLER] instance [Controller_10.0.0.47_9000] at 10.177s since launch
Executing command: StartBroker -brokerHost null -brokerPort 8000 -zkAddress localhost:2123
Executing command: StartServiceManager -clusterName QuickStartCluster -zkAddress localhost:2123 -port -1 -bootstrapServices []
Starting a Pinot [SERVICE_MANAGER] at 10.178s since launch
Started Pinot [SERVICE_MANAGER] instance [ServiceManager_10.0.0.47_-1] at 10.178s since launch
Starting a Pinot [BROKER] at 10.179s since launch
Feb 27, 2021 1:07:08 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8000]
Feb 27, 2021 1:07:08 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer-1] Started.
Started Pinot [BROKER] instance [Broker_10.0.0.47_8000] at 15.568s since launch
Executing command: StartServer -clusterName QuickStartCluster -serverHost null -serverPort 7000 -serverAdminPort 7500 -dataDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotServerDataDir0 -segmentDir /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/data/PinotServerSegmentDir0 -zkAddress localhost:2123
Executing command: StartServiceManager -clusterName QuickStartCluster -zkAddress localhost:2123 -port -1 -bootstrapServices []
Starting a Pinot [SERVICE_MANAGER] at 15.57s since launch
Started Pinot [SERVICE_MANAGER] instance [ServiceManager_10.0.0.47_-1] at 15.571s since launch
Starting a Pinot [SERVER] at 15.571s since launch
Feb 27, 2021 1:07:14 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:7500]
Feb 27, 2021 1:07:14 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer-2] Started.
Started Pinot [SERVER] instance [Server_10.0.0.47_7000] at 20.869s since launch
***** Adding baseballStats table *****
Executing command: AddTable -tableConfigFile /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/configs/baseballStats_offline_table_config.json -schemaFile /var/folders/h_/w09jw5d53sl_70rc8q5tfdxw0000gn/T/1614449215222/configs/baseballStats_schema.json -controllerHost 10.0.0.47 -controllerPort 9000 -exec
{"status":"Table baseballStats_OFFLINE succesfully added"}
***** Launch data ingestion job to build index segment for baseballStats and push to controller *****
***** Waiting for 5 seconds for the server to fetch the assigned segment *****
***** Offline quickstart setup complete *****
Total number of documents in the table
Query : select count(*) from baseballStats limit 1
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select count(*) from baseballStats limit 1
count(*)		
97889		

***************************************************
Top 5 run scorers of all time 
Query : select playerName, sum(runs) from baseballStats group by playerName order by sum(runs) desc limit 5
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, sum(runs) from baseballStats group by playerName order by sum(runs) desc limit 5
playerName		sum(runs)		
John Joseph		11581.0		
Michael Joseph		7981.0		
James Edward		6083.0		
William Henry		5933.0		
William Joseph		5547.0		

***************************************************
Top 5 run scorers of the year 2000
Query : select playerName, sum(runs) from baseballStats where yearID=2000 group by playerName order by sum(runs) desc limit 5
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, sum(runs) from baseballStats where yearID=2000 group by playerName order by sum(runs) desc limit 5
playerName		sum(runs)		
Jose Antonio		231.0		
Mark David		205.0		
Rafael		189.0		
Jeffrey Robert		152.0		
Fernando		140.0		

***************************************************
Top 10 run scorers after 2000
Query : select playerName, sum(runs) from baseballStats where yearID>=2000 group by playerName order by sum(runs) desc limit 10
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, sum(runs) from baseballStats where yearID>=2000 group by playerName order by sum(runs) desc limit 10
playerName		sum(runs)		
Adrian		1820.0		
Jose Antonio		1692.0		
Rafael		1565.0		
Brian Michael		1500.0		
Alexander Emmanuel		1426.0		
Jose Alberto		1426.0		
Derek Sanderson		1390.0		
Carlos		1314.0		
Johnny David		1300.0		
Ichiro		1261.0		

***************************************************
Print playerName,runs,homeRuns for 10 records from the table and order them by yearID
Query : select playerName, runs, homeRuns from baseballStats order by yearID limit 10
Executing command: PostQuery -brokerHost 10.0.0.47 -brokerPort 8000 -queryType sql -query select playerName, runs, homeRuns from baseballStats order by yearID limit 10
playerName		runs		homeRuns		
Alfred L.		0		0		
Charles Roscoe		66		0		
Adrian Constantine		29		0		
Robert		9		0		
Arthur Algernon		28		0		
Douglas L.		28		2		
Francis Patterson		0		0		
Robert Edward		30		0		
Franklin Lee		13		0		
William		1		0		

***************************************************
You can always go to http://localhost:9000 to play around in the query console
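
The queries that the quick-start script issues through PostQuery can also be sent to the broker over HTTP. The sketch below uses only the Python standard library. The broker host and port (localhost:8000) come from the log above. The `/query/sql` path and the `resultTable` response shape are assumptions based on Pinot's broker REST API rather than anything stated in this post, so check them against the release you are running. The function names are hypothetical.

```python
import json
import urllib.request


def build_query_request(sql, host="localhost", port=8000):
    """Build a POST request for the broker's SQL endpoint (path is an assumption)."""
    url = f"http://{host}:{port}/query/sql"
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_as_dicts(response):
    """Turn the assumed resultTable layout into a list of {column: value} dicts."""
    result_table = response["resultTable"]
    names = result_table["dataSchema"]["columnNames"]
    return [dict(zip(names, row)) for row in result_table["rows"]]


def run_query(sql, host="localhost", port=8000):
    """Send the query to a running broker and return its rows as dicts."""
    request = build_query_request(sql, host, port)
    with urllib.request.urlopen(request) as resp:
        return rows_as_dicts(json.load(resp))


# Requires the quick-start cluster to be running, so it is left commented out:
# print(run_query("select count(*) from baseballStats limit 1"))
```

With the quick-start cluster up, uncommenting the last line should report the same document count that the script printed, provided the endpoint and response layout match the assumptions above.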

Pinot Console

The Pinot console has four main sections.

Cluster Manager UI

Incubator Pinot Console
