Data Storage, OLAP

Advanced Apache Pinot: Optimizing Performance and Querying with Enhanced Project Setup

This entry is part 4 of 6 in the series Pinot Series

Originally published on November 30, 2023

In this third part of our Apache Pinot series, we’ll focus on performance optimization and query enhancements within our sample project. Now that we have a foundational setup, we’ll add new features for monitoring real-time data effectively, introducing optimizations that make queries faster and more efficient.


Enhancing the Sample Project: Real-Time Analytics with Aggregations and Filtering

In this version of the sample project, we’ll continue with our social media analytics setup, adding fields and optimizing tables to support complex aggregations and filtering on geo-location for more detailed insights.

New Project Structure Enhancements:

  • data: Additional sample data for testing optimizations.
  • config: Revised schema and table configurations with new indices.
  • scripts: New scripts for testing optimized queries and monitoring.
  • monitoring: Configuration for setting up Pinot’s monitoring tools, tracking query times, and server metrics.

Advanced Table Configuration for Better Performance

Updated Data Schema with Additional Fields

We’ll add fields such as interaction_type (like, comment, share), app_version, and device_type to allow more granular insights into user interactions. Update social_media_schema.json as follows:

json
{
"schemaName": "social_media_v2",
"dimensionFieldSpecs": [
{"name": "user_id", "dataType": "STRING"},
{"name": "post_id", "dataType": "STRING"},
{"name": "geo_location", "dataType": "STRING"},
{"name": "interaction_type", "dataType": "STRING"},
{"name": "app_version", "dataType": "STRING"},
{"name": "device_type", "dataType": "STRING"}
],
"metricFieldSpecs": [
{"name": "likes", "dataType": "INT"},
{"name": "shares", "dataType": "INT"}
],
"dateTimeFieldSpecs": [
{"name": "timestamp", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
]
}

Optimized Table Configuration with Star-Tree Index

To support faster queries, we’ll enable Star-Tree indexing on frequently queried fields, such as interaction_type and geo_location. Update social_media_table.json as follows:

json
{
"tableName": "social_media_v2",
"tableType": "REALTIME",
"segmentsConfig": {
"timeColumnName": "timestamp",
"replication": "1",
"segmentAssignmentStrategy": "ReplicaGroupSegmentAssignmentStrategy"
},
"tableIndexConfig": {
"loadMode": "MMAP",
"noDictionaryColumns": ["likes", "shares"],
"invertedIndexColumns": ["geo_location", "interaction_type"],
"starTreeIndexConfigs": [
{
"dimensionsSplitOrder": ["geo_location", "interaction_type"],
"skipStarNodeCreationForDimensions": ["user_id"],
"functionColumnPairs": ["SUM__likes", "SUM__shares"]
}
]
},
"ingestionConfig": {
"streamIngestionConfig": {
"type": "kafka",
"config": {
"stream.kafka.broker.list": "localhost:9092",
"stream.kafka.topic.name": "social_media_events_v2",
"stream.kafka.consumer.type": "simple",
"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
}
}
}
}

New Project Feature: Query Monitoring and Performance Tracking

An essential aspect of working with real-time analytics is the ability to monitor query performance and track server health.

Setting Up Pinot Monitoring

  1. Enable Metrics Collection: Add Pinot’s built-in monitoring capabilities to collect metrics for query latency, memory usage, and segment health.
  2. Use Third-Party Monitoring: Connect Pinot’s metrics with Prometheus or Grafana for an intuitive visualization of query performance. Configure Pinot’s metrics to send data to a Prometheus instance, then visualize it with Grafana dashboards.

Monitoring Setup Commands

To run Prometheus and Grafana alongside Pinot:

bash
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 3000:3000 grafana/grafana

Enhanced Data Flow and Architecture with Star-Tree Indexing

The graph below shows the performance gain and space cost for the different techniques. On the left side, we have indexing techniques that improve search time with limited increase in space, but do not guarantee a hard upper bound on query latencies because of the aggregation cost. On the right side, we have pre-aggregation techniques that offer hard upper bound on query latencies, but suffer from exponential explosion of storage space.

Hence the Star-Tree data structure inspired by the star-cubing paper (Xin, Han, Li, & Wah, 2003) that offers a configurable trade-off between space and latency and allows us to achieve a hard upper bound for query latencies for a given use case is the preferred indexing technique.

Data Flow with Star-Tree Index Optimizations

The updated architecture now includes the Star-Tree index, a powerful way to pre-aggregate data and reduce response times for complex queries. Here’s how the data flows with this index enhancement:

  1. Data Generation: As users interact with the application, data events are pushed to Kafka, tagged with attributes like interaction_type and geo_location.
  2. Ingestion & Indexing: Pinot’s real-time ingestion pulls events from Kafka and pre-aggregates data into Star-Tree nodes, creating partial aggregations that drastically speed up queries on high-cardinality fields.
  3. Optimized Querying: The Star-Tree index enables faster aggregations, making queries for metrics such as likes, shares by geo_location or interaction_type almost instantaneous.


Advanced Querying with Aggregations and Filtering

With our schema optimized, let’s test more complex queries that illustrate Pinot’s capabilities.

Examples of Optimized Queries

  • Top Geo-Locations by Total Interactions:
    sql
    SELECT geo_location, SUM(likes + shares) AS total_interactions
    FROM social_media_v2
    GROUP BY geo_location
    ORDER BY total_interactions DESC
    LIMIT 5;
  • Most Active Device Types in the Past Week:
    sql
    SELECT device_type, COUNT(*) AS activity_count
    FROM social_media_v2
    WHERE timestamp > CURRENT_TIMESTAMP - 7 * 86400000
    GROUP BY device_type
    ORDER BY activity_count DESC
    LIMIT 3;
  • Version-Based Engagement:
    sql
    SELECT app_version, AVG(likes + shares) AS avg_engagement
    FROM social_media_v2
    GROUP BY app_version
    ORDER BY avg_engagement DESC
    LIMIT 5;

Conclusion

In this post, we explored advanced configurations in Apache Pinot, focusing on Star-Tree indexing and performance monitoring. These optimizations enhance our project’s ability to handle real-time analytics at scale, providing actionable insights almost instantly. By understanding these optimizations, you’re well on your way to building a robust, real-time analytics platform with Pinot.

Stay tuned for the next post, where we’ll explore deploying Pinot in a production environment and integrating it with real-world data sources.

Series Navigation<< Apache Pinot for Production: Deployment and Integration with Apache IcebergAdvanced Apache Pinot: Sample Project and Industry Use Cases >>