Last year around this time, over a cup of hot cocoa, I reflected on the NotPetya cyberattack, a global catastrophe that reshaped how we perceive cybersecurity threats. My detailed insights into the incident, shared in my post “NotPetya: Unmasking the World’s Most Devastating Cyberattack”, explored its massive economic, political, and technological impact. Fast forward a year, and here I am again, contemplating another major cyber incident—the Kaseya VSA ransomware attack. While the actors and the methods have evolved, the lessons remain eerily similar: we live in an increasingly connected world where vulnerabilities in software and systems can cascade into…
-
-
Introduction Back in 2013, I began blogging about Big Data, diving into the ways massive data volumes and new technologies were transforming industries. Over the years, I’ve explored various aspects of data management, from data storage to processing frameworks, as these technologies have evolved. Today, the conversation has shifted towards decentralized data architectures, with Data Fabric and Data Mesh emerging as powerful approaches for enabling agility, scalability, and data-driven insights. In this blog, I’ll discuss the core concepts of Data Fabric and Data Mesh, their key differences, and their roles in modern applications. I’ll also share a bit of my…
-
In 2019, we explored the foundations of neural networks—how layers of interconnected nodes mimic the human brain to extract patterns from data. Since then, one area where neural networks have truly transformed the landscape is Natural Language Processing (NLP). What was once rule-based and statistical has now evolved into something more fluid, contextual, and surprisingly human-like—thanks to Large Language Models (LLMs) built atop deep neural architectures. We touched upon this topic in early 2020 in our blog 🧠 Understanding the Correlation Between NLP and LLMs; let’s keep the momentum going and explore how neural networks empower NLP and LLMs. The NLP Challenge:…
-
“Before machines can understand us, they need to know where one word ends and another begins.” 🧠 Introduction: Why Tokenization Matters Natural Language Processing (NLP) has made astounding progress—from spam filters to chatbots to sophisticated language models like GPT-3. But at the heart of every NLP system lies a deceptively simple preprocessing step: tokenization. Tokenization is how raw text is broken into tokens—units that an NLP model can actually understand and process. Without tokenization, words like “can’t”, “data-driven”, or even emoji 🧠 would remain indistinguishable gibberish to machines. This blog dives into what tokenization is, the types of tokenizers, the…
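To make the idea concrete, here is a minimal tokenizer sketch in Python. It is only an illustration of the preprocessing step described above (real NLP systems use far more sophisticated tokenizers); the function name and regex are my own, not from any particular library:

```python
import re

def simple_tokenize(text):
    # Keep apostrophe contractions ("can't") together as one token,
    # and emit each punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(simple_tokenize("I can't parse data-driven text!"))
# ['I', "can't", 'parse', 'data', '-', 'driven', 'text', '!']
```

Note how “data-driven” splits into three tokens here—exactly the kind of design decision (word-level vs. subword-level) that the different tokenizer types discussed in the post address.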
-
Introduction: Enhancing Trino Performance In our journey with Trino, we’ve explored its setup, integrated it with multiple data sources, added real-time data, and expanded to cloud storage. To wrap up, we’ll focus on strategies to improve query performance. Specifically, we’ll implement caching techniques and apply performance tuning to optimize queries for frequent data access. This final post aims to equip you with tools for building a highly responsive and efficient Trino-powered analytics environment. Goals for This Post Implement Caching for Frequent Queries: Set up a local cache for repeated queries to reduce data retrieval times and resource consumption. Tune Query…
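As a taste of the caching setup, Trino’s Hive connector supports file-system read caching that can be enabled per catalog. A minimal sketch of the kind of configuration involved—the catalog name and cache path here are illustrative, not from the post:

```properties
# etc/catalog/hive.properties (illustrative catalog and path)
hive.cache.enabled=true
hive.cache.location=/opt/trino/cache
```

With this in place, frequently read data is served from local disk on the workers instead of being re-fetched from the underlying storage on every query.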
-
Introduction: Scaling Data with Cloud Storage In the previous blogs, we explored building a sample project locally, optimizing queries, and adding real-time data streaming. Now, let’s take our Trino project a step further by connecting it to cloud storage, specifically Amazon S3. This integration will showcase how Trino can handle large datasets beyond local storage, making it suitable for scalable, cloud-based data warehousing. By connecting Trino to S3, we can expand our data analytics project to manage vast datasets with flexibility and efficiency. Project Enhancement Overview Goals for This Blog Post Integrate Amazon S3 with Trino: Configure Trino to access…
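To preview the integration, connecting Trino to S3 typically goes through the Hive connector’s catalog properties. A hedged sketch—the metastore URI and credentials below are placeholders, not values from the post:

```properties
# etc/catalog/hive.properties — hypothetical values
connector.name=hive
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
```

Once the catalog is registered, S3-backed tables can be queried with the same SQL as any local data source.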
-
Introduction: Building on the Basics In our last blog, we set up a local Trino project for a sample use case—Unified Sales Analytics—allowing us to query across PostgreSQL and MySQL databases. Now, we’ll build on this project by introducing optimizations for query performance, configuring advanced settings, and adding a new data source to broaden the project’s capabilities. These enhancements will simulate a real-world scenario where data is frequently queried, requiring efficient processing and additional flexibility. Project Enhancement Overview Goals for This Blog Post Optimize Existing Queries: Improve query performance by using Trino’s advanced optimization features. Add a New Data Source:…
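As a preview of the optimization work, Trino lets you inspect query plans with EXPLAIN and steer the cost-based optimizer through session properties. A small sketch—the catalog, schema, and table names are illustrative stand-ins for the project’s Unified Sales Analytics tables:

```sql
-- Inspect the plan Trino chooses for a cross-database join
EXPLAIN
SELECT c.region, SUM(o.total)
FROM postgres.public.customers c
JOIN mysql.sales.orders o ON c.id = o.customer_id
GROUP BY c.region;

-- Let the cost-based optimizer pick the join distribution strategy
SET SESSION join_distribution_type = 'AUTOMATIC';
```

Reading the plan before and after tuning is the quickest way to confirm an optimization actually changed how Trino executes the query.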
-
Why a Trino Series Instead of Presto? If you followed the initial post in this series, you may recall we discussed the history of Presto and its recent transformation into what is now known as Trino. Originally developed as Presto at Facebook, this powerful SQL query engine has seen an incredible journey. The transition to Trino represents the evolution of PrestoSQL into a more robust, community-driven platform focused on advanced distributed SQL features. The rebranding to Trino wasn’t merely a name change—it reflects a shift toward greater community collaboration, improved flexibility, and extended support for analytics across a wide variety…
-
Solr is the popular, blazing-fast, open-source enterprise search platform built on Apache Lucene™. This blog offers a curated list of Solr packages and resources, along with an example of how Solr might be integrated into an application. It starts with installation and then shows some basic implementation and usage. Installing Solr To install Solr on my Mac, I typically use Homebrew. First, update your brew: brew update Updated Homebrew from 37714b5ce to 373a454ac. Then install Solr: brew install solr This time, however, I am going to show a step-by-step installation on macOS as explained in…
-
Kinshuk Dutta New York