Apache Pinot Series Summary: Real-Time Analytics for Modern Business Needs

This entry is part 1 of 6 in the series Pinot Series

Over the past few months, we’ve explored the capabilities of Apache Pinot as a powerful real-time analytics engine. From basic setup to advanced configurations, this series has covered the essential steps to building robust, low-latency analytics solutions. Below is a summary of each blog post in the series, along with some real-world use cases demonstrating how companies use Pinot to address critical business challenges.

Series Overview and Links

Here’s a quick recap of the posts in this series, with links and publication dates:

- Pinot™ Basics (published February 27, 2021): An introduction to Apache Pinot’s core features and initial setup, with guidance on using Pinot for basic analytics use cases.
- Advanced Apache Pinot: Sample Project and Industry Use Cases (published November 16, 2023): Building a sample project for real-time analytics, with industry scenarios and a step-by-step guide to configuring Pinot with a Kafka-powered data ingestion pipeline.
- Production-Ready Apache Pinot: Deployment and Integration with Iceberg (published November 30, 2023): Transitioning to a production deployment, leveraging Iceberg for long-term storage, and handling complex data retention and security needs.
- Advanced Apache Pinot: Custom Aggregations, Transformations, and Real-Time Enrichment (published December 28, 2023): Finalizing the project with advanced aggregations, custom transformations, and real-time data enrichment to support deeper analytics and faster querying.

Each post builds on the previous one, offering a roadmap for using Apache Pinot in real-world, production-grade scenarios.

Real-Life Business Use Cases of Apache Pinot

Many organizations rely on Apache Pinot to power real-time analytics applications. Here are a few standout examples, along with links to the stories shared by the companies themselves:

1. LinkedIn: Real-Time Insights into User Activity. LinkedIn originally developed Pinot to provide real-time analytics on user interactions, powering features like Who Viewed My Profile and Job Recommendations. Pinot’s low-latency query capabilities allow LinkedIn to offer up-to-the-second insights into user engagement on a massive scale. Learn more: LinkedIn’s Real-Time Analytics with Apache Pinot.
2. Uber: Real-Time Monitoring and Decision-Making. Uber uses Apache Pinot to power real-time monitoring dashboards, helping the team stay informed of system health and performance metrics across its global operations. Pinot allows Uber to handle millions of events per second, ensuring the company’s monitoring systems are both fast and scalable. Learn more: Uber Engineering’s Use of Pinot for Real-Time Monitoring.
3. Stripe: Financial Reporting and Transaction Insights. Stripe relies on Pinot to provide real-time insights into transaction data, enabling businesses to monitor their financial health with up-to-the-minute accuracy. With Pinot, Stripe can generate complex reports on massive data sets with minimal latency, allowing customers to make informed business decisions. Learn more: How Stripe Uses Pinot for Real-Time Financial Analytics.
4. Walmart: Enhancing the E-Commerce Experience. Walmart uses Pinot to analyze customer behavior on its e-commerce platform in real time, ensuring product recommendations and search results are always relevant. This has improved the customer experience by delivering tailored recommendations based on customer interaction data as it happens. Learn more: Walmart’s Use of Pinot for Real-Time E-Commerce Analytics.
5. WePay: Fraud Detection and Transaction Monitoring. WePay leverages Pinot to monitor transactions in real time, helping detect and prevent fraud. Pinot’s low-latency capabilities enable WePay to analyze patterns as they emerge, offering a vital layer of protection against financial fraud. Learn more: WePay’s Real-Time Analytics for Fraud Detection.

These examples illustrate the versatility and robustness of Apache Pinot across industries, from e-commerce and finance to technology and transportation.

Conclusion and What’s Next

As outlined in the Druid summary blog, our next post will provide a comprehensive comparison between Apache Druid and Apache Pinot. Both tools are highly effective for real-time analytics, yet each offers distinct advantages and trade-offs. The upcoming blog will explore:

- Strengths and weaknesses of Druid and Pinot
- When to choose one over the other, based on data needs and query types
- Specific scenarios where each shines in real-world applications

Stay tuned as we delve into Druid vs. Pinot, offering a practical guide on when to use each for optimal performance in analytics workloads. Thank you for joining us on this journey through Apache Pinot, and we look forward to diving deeper into real-time analytics in the next comparison post!

Advanced Apache Pinot: Custom Aggregations, Transformations, and Real-Time Enrichment

This entry is part 2 of 6 in the series Pinot Series

Originally published on December 28, 2023

In this concluding post of the Apache Pinot series, we’ll explore advanced data processing techniques in Apache Pinot, such as custom aggregations, real-time transformations, and data enrichment. These techniques help us build a more intelligent and insightful analytics solution. As we finalize this series, we’ll also look ahead to how Apache Pinot could evolve with advancements in AI and ModelOps, laying a foundation for future exploration.

Sample Project Enhancements for Real-Time Enrichment

We’ll take our social media analytics project to the next level with real-time data transformations, custom aggregations, and enrichment. These advanced techniques help preprocess and structure data during ingestion, making it ready for faster querying and deeper insights.

Final project structure:

- data: Sample enriched data with aggregations and transformations.
- config: Schema and table configurations with transform functions.
- scripts: Automated scripts to manage data transformations.
- enrichment: Real-time enrichment scripts for joining external data sources, such as demographic information or location data.

Implementing Custom Aggregations in Pinot

Custom aggregations allow us to compute more complex metrics during ingestion, helping reduce query load and making results available immediately.

Adding a Moving Average Metric

1. Define the metric in the schema: Add fields for moving averages of likes and shares to capture trends over time.

```json
{
  "metricFieldSpecs": [
    {"name": "likes", "dataType": "INT"},
    {"name": "shares", "dataType": "INT"},
    {"name": "moving_avg_likes", "dataType": "FLOAT"},
    {"name": "moving_avg_shares", "dataType": "FLOAT"}
  ]
}
```

2. Transformation function for the moving average: Configure a transformation function to calculate a 7-day moving average:

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "moving_avg_likes",
      "transformFunction": "AVG_OVER('likes', 7)"
    },
    {
      "columnName": "moving_avg_shares",
      "transformFunction": "AVG_OVER('shares', 7)"
    }
  ]
}
```

Real-Time Percentile Calculation

Define a percentile metric: Pinot’s percentile estimation function can calculate real-time engagement levels or popularity for posts, based on likes.

```json
{
  "metricFieldSpecs": [
    {"name": "percentile_likes", "dataType": "FLOAT"}
  ],
  "transformConfigs": [
    {
      "columnName": "percentile_likes",
      "transformFunction": "PERCENTILEEST(likes, 90)"
    }
  ]
}
```

These aggregations are useful for quickly querying complex metrics without overloading Pinot with computation.

Real-Time Data Transformation: Adding Enrichment Layers

For applications requiring contextual insights, real-time data enrichment adds valuable dimensions to incoming data.

Enriching Data with External Sources

To illustrate, we’ll enhance user interactions with location-based data for region-focused analysis.

Join with a location dataset: Using a real-time Spark job, join events with an external location dataset based on geo_location, and feed both raw and enriched data points into Pinot.

```json
"ingestionConfig": {
  "streamIngestionConfig": {
    "type": "kafka",
    "config": {
      "stream.kafka.topic.name": "social_media_enriched_events"
    }
  },
  "transformConfigs": [
    {
      "columnName": "country",
      "transformFunction": "LOOKUP('geo_location', 'location_data.csv', 'country')"
    },
    {
      "columnName": "city",
      "transformFunction": "LOOKUP('geo_location', 'location_data.csv', 'city')"
    }
  ]
}
```

Calculating Engagement Scores with Custom Formulas

Custom fields can calculate engagement scores that weigh various interactions.

```json
"metricFieldSpecs": [
  {"name": "engagement_score", "dataType": "FLOAT"}
],
"transformConfigs": [
  {
    "columnName": "engagement_score",
    "transformFunction": "SUM(likes, shares) * 0.7 + SUM(comments) * 1.3"
  }
]
```

With these custom metrics, you gain flexibility in calculating a composite engagement score, which can offer deeper insights into the types of content users find most engaging.

Data Flow and Architecture Diagram for Real-Time Enrichment and Aggregation

With real-time enrichment and custom aggregations in place, the data flow evolves to accommodate additional transformations:

1. Data ingestion: Events are ingested from Kafka and enriched with auxiliary datasets, such as demographic or location information.
2. Aggregation and transformation: Custom aggregations, like moving averages and percentiles, are calculated.
3. Storage and querying: Enriched data is stored in Pinot segments, ready for fast querying.
4. Query execution: With precomputed metrics available, Pinot can return complex insights instantly.

Example Queries for Enriched and Aggregated Data

These advanced queries leverage custom metrics for enriched data analysis.

Engagement analysis by location:

```sql
SELECT country, AVG(engagement_score) AS avg_engagement
FROM social_media_v4
GROUP BY country
ORDER BY avg_engagement DESC
LIMIT 5;
```

Moving average of likes over time:

```sql
SELECT timestamp, moving_avg_likes
FROM social_media_v4
ORDER BY timestamp DESC
LIMIT 10;
```

Top percentile of most-liked posts:

```sql
SELECT post_id, percentile_likes
FROM social_media_v4
WHERE percentile_likes > 90
ORDER BY percentile_likes DESC
LIMIT 10;
```

Looking to the Future: AI, ModelOps, and Apache Pinot

With these enhancements, our Apache Pinot project is ready for production-grade analytics. However, as the fields of AI and ModelOps continue to evolve, there are exciting possibilities for integrating machine learning and predictive analytics directly into Pinot. Some potential future directions include:

- Real-time ML model scoring: Integrate machine learning models within Pinot’s data pipeline to score events in real time, enabling predictive analytics on the fly.
- Automated anomaly detection: With AI-driven anomaly detection, Pinot could flag unusual patterns in data as they occur.
- Integration with ModelOps platforms: As ModelOps tools and pipelines mature, Pinot could serve as the foundational layer for deploying and serving machine learning models in analytics environments.

While this series concludes here, we may revisit it in the future as advancements in AI and ModelOps unfold, bringing new capabilities to real-time analytics with Apache Pinot.

Conclusion

In this series, we’ve journeyed from the fundamentals of Apache Pinot to advanced configurations, custom aggregations, real-time data transformations, and production deployments. As data analytics demands evolve, Apache Pinot stands ready to handle scalable, low-latency querying for both traditional and advanced analytics. Thank you for following along, and stay tuned for future updates as we explore emerging opportunities in AI-powered real-time analytics with Apache Pinot!
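To make the two derived metrics above concrete, here is a minimal pure-Python sketch of what a trailing 7-value moving average and the weighted engagement score compute. This is an illustration only: the function names are ours, not Pinot transform functions, and Pinot evaluates its transforms during ingestion rather than in application code.

```python
from collections import deque

def moving_average(values, window=7):
    """Trailing moving average over at most `window` values."""
    out, buf, total = [], deque(), 0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the value that fell out of the window
        out.append(total / len(buf))
    return out

def engagement_score(likes, shares, comments):
    """Weighted composite score: (likes + shares) * 0.7 + comments * 1.3."""
    return (likes + shares) * 0.7 + comments * 1.3

daily_likes = [10, 20, 30, 40, 50, 60, 70, 80]
print(moving_average(daily_likes))  # partial averages until 7 values are seen
print(engagement_score(likes=100, shares=40, comments=10))
```

Sanity-checking formulas like this outside the cluster is a cheap way to catch weighting mistakes before they are baked into ingested segments.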

Apache Pinot for Production: Deployment and Integration with Apache Iceberg

This entry is part 3 of 6 in the series Pinot Series

Originally published on December 14, 2023

In this installment of the Apache Pinot series, we’ll guide you through deploying Pinot in a production environment, integrating with Apache Iceberg for efficient data management and archival, and ensuring that the system can handle real-world, large-scale datasets. With Iceberg as the long-term storage layer and Pinot handling real-time analytics, you’ll have a powerful combination for managing both recent and historical data.

For those interested in brushing up on Presto concepts, check out my detailed Presto Basics blog post. If you’re new to Apache Iceberg, you can find an introductory guide in my Apache Iceberg Basics blog post.

Sample Project Enhancements for Production-Readiness

To make our social media analytics project production-ready, we’ll add Iceberg as an archival solution for storing large datasets efficiently. This setup allows us to offload historical data from Pinot to Iceberg, which can be queried when needed while keeping Pinot lean and responsive for real-time analytics.

Updated project structure:

- data: Simulated large-scale datasets for testing production performance.
- config: Production-ready schema and table configurations, with an Iceberg data sink.
- scripts: Automated scripts for setting up the Iceberg table and managing data movement.
- monitoring: Metrics and monitoring configurations to track data flow between Pinot and Iceberg.

Deploying Apache Pinot with Iceberg for Data Archival

This setup involves deploying Iceberg on a data lake (e.g., S3, HDFS, or ADLS) and configuring Pinot to store recent data, with older data regularly offloaded to Iceberg for cost-efficient storage.

1. Setting up Zookeeper and Kafka clusters: Use Kubernetes for high availability in Zookeeper and Kafka deployments, as explained in the previous blog. Zookeeper coordinates Pinot nodes, while Kafka handles real-time data ingestion.

2. Deploying Pinot and Iceberg on Kubernetes: We’ll deploy Pinot in the same way, but with the addition of Iceberg, we’ll create a new workflow for data archival and retrieval.
- Deploy Pinot: Follow the same configurations as in the previous blog for deploying Pinot components (Controller, Broker, and Server) on Kubernetes.
- Deploy Iceberg: Set up Iceberg on an object storage system like Amazon S3, HDFS, or a local file system (for testing).

3. Configuring Iceberg as the archival layer: To configure Iceberg as a storage layer, we’ll use a batch job to move historical data from Pinot to Iceberg regularly (e.g., every 90 days).
- Configure an archival job: Write a Spark job to query historical segments from Pinot and move them to Iceberg, using the Iceberg Spark connector to write Pinot data into an Iceberg table.
- Iceberg table schema: Ensure that the schema in Iceberg matches the schema in Pinot, allowing seamless data transfer and querying.

Using Apache Iceberg for Data Retention and Cost-Effective Storage

With Iceberg in the data lake, you can define data retention policies directly within Iceberg, which allows for schema evolution, partitioning, and management of large volumes of historical data.

Defining a Retention Policy in Iceberg

To manage historical data in Iceberg:

- Time-based partitioning: Partition data by date in Iceberg, making it easy to manage and query data based on time.

```sql
ALTER TABLE social_media ADD PARTITION FIELD date_trunc('day', timestamp) AS day;
```

- Automated data archival: Schedule a batch Spark job to archive Pinot segments older than 90 days into Iceberg.
- Optimize Iceberg storage: Use data compaction and metadata pruning in Iceberg to improve query performance and storage efficiency over time.

Querying Iceberg and Pinot Together

In this setup, Pinot will handle real-time data queries, while Iceberg serves as the historical data store. You can use Trino (formerly PrestoSQL) to perform federated queries that span both Pinot and Iceberg.

Real-time analytics in Pinot: Query recent data in Pinot, leveraging its low-latency capabilities for immediate insights.

```sql
SELECT geo_location, SUM(likes) AS total_likes
FROM social_media_v3
WHERE timestamp > CURRENT_TIMESTAMP - INTERVAL '90' DAY
GROUP BY geo_location;
```

Historical analytics in Iceberg: Query archived data from Iceberg directly for long-term trends.

```sql
SELECT geo_location, SUM(likes) AS total_likes
FROM iceberg_social_media
WHERE timestamp < CURRENT_TIMESTAMP - INTERVAL '90' DAY
GROUP BY geo_location;
```

Federated query with Trino: Combine results from Pinot and Iceberg in a single query with Trino to get a unified view across real-time and historical data.

```sql
SELECT geo_location, SUM(total_likes) AS total_likes
FROM (
  SELECT geo_location, SUM(likes) AS total_likes
  FROM social_media_v3
  WHERE timestamp > CURRENT_TIMESTAMP - INTERVAL '90' DAY
  GROUP BY geo_location
  UNION ALL
  SELECT geo_location, SUM(likes) AS total_likes
  FROM iceberg_social_media
  WHERE timestamp < CURRENT_TIMESTAMP - INTERVAL '90' DAY
  GROUP BY geo_location
) AS combined
GROUP BY geo_location;
```

Enhanced Data Flow and Architecture with Iceberg Integration

In a production environment with Iceberg, the data flow supports seamless transitions from real-time analytics in Pinot to archival storage in Iceberg:

1. Data ingestion: Real-time events are streamed into Kafka and ingested by Pinot.
2. Real-time querying: Pinot provides low-latency responses for recent data (e.g., the last 90 days).
3. Archival with Iceberg: Historical data is regularly moved from Pinot to Iceberg for cost-effective storage and long-term querying.
4. Unified querying with Trino: Using Trino, you can query across both real-time data in Pinot and historical data in Iceberg.

Conclusion

In this post, we covered how to deploy Apache Pinot in production with Apache Iceberg for managing historical data. This setup allows you to maintain efficient, cost-effective data storage while still benefiting from Pinot’s real-time capabilities. In the next post, we’ll explore advanced data processing techniques with Pinot, including custom aggregations, transformations, and more complex data flows.
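The 90-day hot/cold boundary that drives both the archival job and the federated query above can be sketched in plain Python, with no Spark, Trino, or Iceberg involved. The helper names and event layout are hypothetical, chosen only to mirror the queries in this post.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # recent data stays in Pinot; older goes to Iceberg

def split_hot_cold(events, now):
    """Route events newer than the retention window to the hot store (Pinot)
    and older events to the cold archive (Iceberg)."""
    hot, cold = [], []
    for e in events:
        (hot if now - e["timestamp"] <= RETENTION else cold).append(e)
    return hot, cold

def total_likes_by_location(*stores):
    """Union rows from several stores and aggregate, like the federated
    UNION ALL + GROUP BY query run through Trino."""
    totals = defaultdict(int)
    for store in stores:
        for e in store:
            totals[e["geo_location"]] += e["likes"]
    return dict(totals)

now = datetime(2023, 12, 1, tzinfo=timezone.utc)
events = [
    {"geo_location": "US", "likes": 5, "timestamp": now - timedelta(days=10)},
    {"geo_location": "US", "likes": 7, "timestamp": now - timedelta(days=200)},
    {"geo_location": "IN", "likes": 3, "timestamp": now - timedelta(days=95)},
]
hot, cold = split_hot_cold(events, now)
print(len(hot), len(cold))                 # 1 2
print(total_likes_by_location(hot, cold))  # {'US': 12, 'IN': 3}
```

The point of the federated layer is exactly this last step: the aggregation is indifferent to which store a row came from, so the cutoff can move without changing query results.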

Advanced Apache Pinot: Optimizing Performance and Querying with Enhanced Project Setup

This entry is part 4 of 6 in the series Pinot Series

Originally published on November 30, 2023

In this third part of our Apache Pinot series, we’ll focus on performance optimization and query enhancements within our sample project. Now that we have a foundational setup, we’ll add new features for monitoring real-time data effectively, introducing optimizations that make queries faster and more efficient.

Enhancing the Sample Project: Real-Time Analytics with Aggregations and Filtering

In this version of the sample project, we’ll continue with our social media analytics setup, adding fields and optimizing tables to support complex aggregations and filtering on geo-location for more detailed insights.

New project structure enhancements:

- data: Additional sample data for testing optimizations.
- config: Revised schema and table configurations with new indices.
- scripts: New scripts for testing optimized queries and monitoring.
- monitoring: Configuration for setting up Pinot’s monitoring tools, tracking query times, and server metrics.

Advanced Table Configuration for Better Performance

Updated Data Schema with Additional Fields

We’ll add fields such as interaction_type (like, comment, share), app_version, and device_type to allow more granular insights into user interactions. Update social_media_schema.json as follows:

```json
{
  "schemaName": "social_media_v2",
  "dimensionFieldSpecs": [
    {"name": "user_id", "dataType": "STRING"},
    {"name": "post_id", "dataType": "STRING"},
    {"name": "geo_location", "dataType": "STRING"},
    {"name": "interaction_type", "dataType": "STRING"},
    {"name": "app_version", "dataType": "STRING"},
    {"name": "device_type", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "likes", "dataType": "INT"},
    {"name": "shares", "dataType": "INT"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "timestamp", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}
```

Optimized Table Configuration with Star-Tree Index

To support faster queries, we’ll enable Star-Tree indexing on frequently queried fields, such as interaction_type and geo_location. Update social_media_table.json as follows:

```json
{
  "tableName": "social_media_v2",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "replication": "1",
    "segmentAssignmentStrategy": "ReplicaGroupSegmentAssignmentStrategy"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "noDictionaryColumns": ["likes", "shares"],
    "invertedIndexColumns": ["geo_location", "interaction_type"],
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["geo_location", "interaction_type"],
        "skipStarNodeCreationForDimensions": ["user_id"],
        "functionColumnPairs": ["SUM__likes", "SUM__shares"]
      }
    ]
  },
  "ingestionConfig": {
    "streamIngestionConfig": {
      "type": "kafka",
      "config": {
        "stream.kafka.broker.list": "localhost:9092",
        "stream.kafka.topic.name": "social_media_events_v2",
        "stream.kafka.consumer.type": "simple",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
      }
    }
  }
}
```

New Project Feature: Query Monitoring and Performance Tracking

An essential aspect of working with real-time analytics is the ability to monitor query performance and track server health.

Setting Up Pinot Monitoring

- Enable metrics collection: Use Pinot’s built-in monitoring capabilities to collect metrics for query latency, memory usage, and segment health.
- Use third-party monitoring: Connect Pinot’s metrics with Prometheus or Grafana for an intuitive visualization of query performance. Configure Pinot’s metrics to send data to a Prometheus instance, then visualize them with Grafana dashboards.

Monitoring Setup Commands

To run Prometheus and Grafana alongside Pinot:

```bash
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 3000:3000 grafana/grafana
```

Enhanced Data Flow and Architecture with Star-Tree Indexing

The graph below shows the performance gain and space cost for the different techniques. On the left side, we have indexing techniques that improve search time with a limited increase in space but do not guarantee a hard upper bound on query latencies because of the aggregation cost. On the right side, we have pre-aggregation techniques that offer a hard upper bound on query latencies but suffer from an exponential explosion of storage space. Hence the Star-Tree data structure, inspired by the star-cubing paper (Xin, Han, Li, & Wah, 2003), is the preferred indexing technique: it offers a configurable trade-off between space and latency and allows us to achieve a hard upper bound on query latencies for a given use case.

Data Flow with Star-Tree Index Optimizations

The updated architecture now includes the Star-Tree index, a powerful way to pre-aggregate data and reduce response times for complex queries. Here’s how the data flows with this index enhancement:

1. Data generation: As users interact with the application, data events are pushed to Kafka, tagged with attributes like interaction_type and geo_location.
2. Ingestion and indexing: Pinot’s real-time ingestion pulls events from Kafka and pre-aggregates data into Star-Tree nodes, creating partial aggregations that drastically speed up queries on high-cardinality fields.
3. Optimized querying: The Star-Tree index enables faster aggregations, making queries for metrics such as likes and shares by geo_location or interaction_type almost instantaneous.

Advanced Querying with Aggregations and Filtering

With our schema optimized, let’s test more complex queries that illustrate Pinot’s capabilities.

Examples of Optimized Queries

Top geo-locations by total interactions:

```sql
SELECT geo_location, SUM(likes + shares) AS total_interactions
FROM social_media_v2
GROUP BY geo_location
ORDER BY total_interactions DESC
LIMIT 5;
```

Most active device types in the past week:

```sql
SELECT device_type, COUNT(*) AS activity_count
FROM social_media_v2
WHERE timestamp > CURRENT_TIMESTAMP - 7 * 86400000
GROUP BY device_type
ORDER BY activity_count DESC
LIMIT 3;
```

Version-based engagement:

```sql
SELECT app_version, AVG(likes + shares) AS avg_engagement
FROM social_media_v2
GROUP BY app_version
ORDER BY avg_engagement DESC
LIMIT 5;
```

Conclusion

In this post, we explored advanced configurations in Apache Pinot, focusing on Star-Tree indexing and performance monitoring. These optimizations enhance our project’s ability to handle real-time analytics at scale, providing actionable insights almost instantly. By understanding these optimizations, you’re well on your way to building a robust, real-time analytics platform with Pinot. Stay tuned for the next post, where we’ll explore deploying Pinot in a production environment and integrating it with real-world data sources.
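As a toy model of the pre-aggregation idea behind the Star-Tree index (a deliberate simplification, not Pinot’s actual data structure), the sketch below materializes SUM(likes) for every combination of geo_location and interaction_type, including a star value that stands for "any value of this dimension". Group-by aggregations then reduce to dictionary lookups, which is the intuition behind the bounded query latency.

```python
from collections import defaultdict
from itertools import product

STAR = "*"  # stands for "aggregated across all values of this dimension"

def build_star_aggregates(events, dims=("geo_location", "interaction_type")):
    """Materialize SUM(likes) for every (value-or-star) combination of dims."""
    table = defaultdict(int)
    for e in events:
        # Each event contributes to its exact key and to every star-collapsed key.
        choices = [(e[d], STAR) for d in dims]
        for key in product(*choices):
            table[key] += e["likes"]
    return table

events = [
    {"geo_location": "US", "interaction_type": "like", "likes": 4},
    {"geo_location": "US", "interaction_type": "share", "likes": 2},
    {"geo_location": "IN", "interaction_type": "like", "likes": 5},
]
agg = build_star_aggregates(events)
print(agg[("US", STAR)])    # total likes in US: 6
print(agg[(STAR, "like")])  # total likes on 'like' events: 9
print(agg[(STAR, STAR)])    # grand total: 11
```

Materializing every combination is the exponential-space extreme described above; Pinot’s real index prunes this space via dimensionsSplitOrder and a split threshold, which is what makes the trade-off configurable.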

Advanced Apache Pinot: Sample Project and Industry Use Cases

This entry is part 5 of 6 in the series Pinot Series

As we dive deeper into Apache Pinot, this post will guide you through setting up a sample project. This hands-on project aims to demonstrate Pinot’s real-time data ingestion and query capabilities and provide insights into its application in industry scenarios. Whether you’re looking to power recommendation engines, enhance user analytics, or build custom BI dashboards, this blog will help you establish a foundation with Apache Pinot.

Introduction to the Sample Project

The sample project will simulate a real-time analytics dashboard for a social media application. We’ll analyze user interactions in near-real-time, covering a setup from data ingestion through to visualization.

Project structure:

- data: Contains sample data files for bulk ingestion.
- ingestion: Configuration for real-time data ingestion via a Kafka stream.
- config: JSON/YAML configuration files for schema and table definitions in Pinot.
- scripts: Shell scripts for setting up the Pinot cluster and loading data.
- visualization: Dashboard setup files for BI tool integration, e.g., Superset.

Architecture Overview of the Sample Project

To give you a visual understanding of how components interact in this setup, here’s an architecture diagram showcasing data ingestion, storage, and querying in Apache Pinot.

Diagram: Architecture of Real-Time Analytics with Apache Pinot

- Kafka: Receives real-time events from simulated user interactions.
- Pinot ingestion: Configured with real-time ingestion settings to pull data directly from the Kafka stream.
- Pinot storage: Organizes incoming data in segments, stored in a columnar format to enable high-speed analytics.
- Superset: Connects to Pinot as the analytics and visualization layer.

Data Flow in Apache Pinot: From Ingestion to Querying

Let’s break down the data flow, highlighting how a user interaction (e.g., a post like) becomes actionable insight in near-real-time.

1. Data generation: User interactions (clicks, views, likes, etc.) are sent as events to Kafka. These events are structured as JSON payloads containing fields such as user_id, post_id, likes, and timestamp.
2. Ingestion into Pinot:
- Kafka to Pinot: Pinot’s stream ingestion process is configured to read events directly from the social_media_events topic in Kafka. Each event is parsed, indexed, and stored in columnar segments.
- Segment management: Pinot automatically manages segment creation and optimizes data for query performance.
3. Real-time querying:
- Query layer (Broker): Superset sends SQL queries to Pinot’s broker, which directs them to the appropriate servers.
- Aggregation and filtering: The servers perform filtering and aggregation operations on the segments, leveraging Pinot’s indexing capabilities to ensure low latency.
- Result delivery: Query results are returned to Superset, where they are visualized in real time.

Setting Up Your Environment

If you followed the setup in Pinot Basics, you should already have a local Pinot instance running. Now, we’ll extend this setup to ingest real-time data using Kafka and visualize it with Superset.

1. Define the Data Schema

We’ll be working with a simplified schema for tracking social media metrics. This schema will include fields like user_id, post_id, likes, shares, timestamp, and geo_location. Create a social_media_schema.json file in the config folder:

```json
{
  "schemaName": "social_media",
  "dimensionFieldSpecs": [
    {"name": "user_id", "dataType": "STRING"},
    {"name": "post_id", "dataType": "STRING"},
    {"name": "geo_location", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "likes", "dataType": "INT"},
    {"name": "shares", "dataType": "INT"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "timestamp", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}
```

2. Configure the Table in Pinot

Next, create a social_media_table.json file in the config directory to define the table and ingestion type (in this case, REALTIME).

```json
{
  "tableName": "social_media",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "noDictionaryColumns": ["likes", "shares"],
    "invertedIndexColumns": ["geo_location"]
  },
  "ingestionConfig": {
    "streamIngestionConfig": {
      "type": "kafka",
      "config": {
        "stream.kafka.broker.list": "localhost:9092",
        "stream.kafka.topic.name": "social_media_events",
        "stream.kafka.consumer.type": "simple",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
      }
    }
  }
}
```

3. Ingest Data with Kafka

For real-time ingestion, we’ll simulate data streaming through a Kafka topic. You can set up a Kafka producer to publish events to the social_media_events topic, using a simple Python script to simulate the event stream:

```python
from kafka import KafkaProducer
import json
import random
import time

producer = KafkaProducer(bootstrap_servers='localhost:9092')

user_ids = ["user1", "user2", "user3"]
post_ids = ["post1", "post2", "post3"]

def generate_event():
    event = {
        "user_id": random.choice(user_ids),
        "post_id": random.choice(post_ids),
        "likes": random.randint(0, 100),
        "shares": random.randint(0, 50),
        "timestamp": int(time.time() * 1000),
        "geo_location": "US",
    }
    return json.dumps(event).encode('utf-8')

while True:
    producer.send('social_media_events', generate_event())
    time.sleep(1)
```

4. Deploy and Verify the Pinot Table

With Pinot running, deploy the social_media table by uploading the schema and table configurations:

```bash
bin/pinot-admin.sh AddTable -tableConfigFile config/social_media_table.json -schemaFile config/social_media_schema.json
```

Confirm data ingestion by querying the Pinot console:

```sql
SELECT COUNT(*) FROM social_media;
```

Visualizing the Data in Superset

Pinot integrates seamlessly with Superset, allowing us to create interactive dashboards.

1. Configure the Pinot connection in Superset, as shown in the previous blog.
2. Create a dashboard with metrics like total likes, shares, and active users over time.

This visual dashboard allows you to track real-time user engagement on the platform.

Industry Scenarios and Use Cases

Apache Pinot is a great fit for applications that need high-throughput, low-latency analytics. A few scenarios include:

- User engagement analytics: Real-time insights on user interactions for social media, gaming, and streaming platforms.
- Anomaly detection: Monitoring financial transactions or system logs for unusual patterns.
- Log analysis and monitoring: Centralized log aggregation and querying for IT operations.

Conclusion

In this blog, we covered a comprehensive setup for a sample project with Pinot, from schema configuration to real-time data ingestion with Kafka. With these skills, you can explore broader applications of Pinot in industry scenarios where real-time analytics are critical. Stay tuned for the next post, where we’ll go deeper into optimizing queries and managing Pinot clusters.
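To see what the dashboard metrics look like without a running Pinot or Superset instance, the sketch below generates events with the same shape as the Kafka producer above and computes the suggested metrics (total likes, shares, and active users) in memory. The function names are ours; in the real setup these numbers would come from SQL queries against the social_media table.

```python
import random
import time

def generate_event(rng):
    # Same payload shape as the Kafka producer script above.
    return {
        "user_id": rng.choice(["user1", "user2", "user3"]),
        "post_id": rng.choice(["post1", "post2", "post3"]),
        "likes": rng.randint(0, 100),
        "shares": rng.randint(0, 50),
        "timestamp": int(time.time() * 1000),
        "geo_location": "US",
    }

def dashboard_metrics(events):
    # What the Superset dashboard would chart, computed over raw events.
    return {
        "event_count": len(events),
        "total_likes": sum(e["likes"] for e in events),
        "total_shares": sum(e["shares"] for e in events),
        "active_users": len({e["user_id"] for e in events}),
    }

rng = random.Random(42)  # seeded so the run is reproducible
events = [generate_event(rng) for _ in range(100)]
print(dashboard_metrics(events)["event_count"])  # 100
```

A quick in-memory model like this is also handy for validating event payloads before wiring up the full Kafka-to-Pinot-to-Superset path.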

Pinot™ Basics

This entry is part 6 of 6 in the series Pinot Series

Weekend started, poured myself a glass of Long Meadow Ranch Anderson Valley Pinot Noir. It smelled like cherry cola, cinnamon, and a forest in autumn. Probably not the right time to think or even blog about OLAP. – Kinshuk Dutta

Online analytical processing, or OLAP, is an approach to answering multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing, and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting, and similar areas, with new applications emerging, such as agriculture.

The most popular open-source OLAP systems for big data that run analytical queries over large volumes (terabyte-scale) of data with interactive latencies are:

- ClickHouse
- Druid & Pinot

In this blog we will try to cover the basics of Pinot. Pinot supports full SQL. My main interest is to try out the Presto-Pinot connector developed by Uber Engineering, which lets users perform joins on data in Pinot. Ultimately, I would like to create a dashboard using Superset as the front end (with Pinot as the backend).

Apache Pinot™ (Incubating)

Pinot was first developed by LinkedIn in 2014 as an internal analytics infrastructure. It originated from the demand to scale out OLAP systems to support low-latency real-time queries on huge volumes of data. It was later open-sourced in 2015 and entered the Apache Incubator in 2018. At the time of writing this blog, Pinot™ is still in the incubating stage under the Apache projects. It is a real-time distributed OLAP datastore, designed to answer OLAP queries with low latency. It is proven at scale at LinkedIn, where it powers 50+ user-facing apps and serves 100k+ queries. If you are interested in knowing more about the Pinot story, I highly recommend reading the Introducing Apache Pinot 0.3.0 blog, published in April 2020 by Mayank Srivastava from LinkedIn.

Installation

There is more than one way to run Pinot on your Mac. You can choose from either of the following:

- Running Pinot locally: Download the latest Apache release and install it locally.
- Build from source: Download the source, then build and run.
- Docker

Running Pinot Locally

This quick start guide will help you bootstrap a Pinot standalone instance on your Mac.

1. Download Apache Pinot.
2. Once you have the tar file, untar it and navigate to the directory containing the launcher scripts.

Build from Source

This one did not work for me at the moment, ever since I updated to macOS Big Sur. At the end of the section I have the errors. I will be fixing them and will update the post. Click here to skip this and go to the demo section.

1. Clone the repo.
2. Change the working directory to the downloaded repo.
3. Build Pinot.

Issues & Resolution