Modern Data Lake
Data Lake
The modern enterprise runs on data. However, storing that data has always been challenging and expensive, and it often results in data silos. A data lake consists of a cost-effective and scalable storage system along with one or more compute engines. Data Lakes are consolidated, centralized storage areas for raw, unstructured, semi-structured, and structured data, taken from multiple sources and lacking a predefined schema.
Data Lakes were created to save data that “may have value.” They support a broad range of essential functions, from traditional decision support to business analytics to data science. The value of data and the insights that can be gained from it are unknown and can vary with the questions being asked and the research being done.
It should be noted that without a screening process, Data Lakes can support “data hoarding.” A poorly organized Data Lake is referred to as a Data Swamp.
Data Lakes allow Data Scientists to mine and analyze large amounts of Big Data. Big Data, which was used for years without an official name, was labeled by Roger Magoulas in 2005. He was describing a large amount of data that seemed impossible to manage or research using the traditional SQL tools available at the time. Hadoop (which became a top-level Apache project in 2008) provided the distributed storage and processing framework needed for working with unstructured data on a massive scale, opening the door for Big Data research.
In October 2010, James Dixon, founder and former CTO of Pentaho, coined the term “Data Lake.” Dixon argued that Data Marts come with several problems, ranging from size restrictions to narrow research parameters. In describing his concept of a Data Lake, he said:
“If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Data Marts
In the early 1970s, ACNielsen offered their clients a Data Mart to store information digitally and enhance their sales efforts. A “Data Mart” is an archive of stored, normally structured data, typically used and controlled by a specific community or department. It is normally smaller and more focused than a Data Warehouse and, currently, is often a subdivision of Data Warehouses. Data Marts were the first evolutionary step in the physical reality of Data Warehouses and Data Lakes.
At present, there are three basic types of Data Marts:
- Independent Data Marts are not part of a Data Warehouse and are very similar to the original Data Mart offered by ACNielsen. They are typically focused on one area of business or subject area. Data can be taken from both external and internal sources; it is then translated, processed, and loaded into the Data Mart, where it is stored until needed.
- Dependent Data Marts are built into an existing Data Warehouse. A top-down approach is used, supporting the storage of all data in a centralized location. A clearly defined section of data is then selected for purposes of research.
- Hybrid Data Marts combine data taken from a Data Warehouse with “other” data sources. This can be useful in a variety of situations, including providing ad hoc integration for a new group or product that has been added to an organization. Hybrid Data Marts are well suited to multiple-database environments and provide fast implementation turnaround. These systems make data cleansing easy and work well with smaller data-centric applications.
Data Silos
Data Silos are part of a Data Warehouse and similar to Data Marts, but much more isolated. Data Silos are insulated management systems that cannot work with other systems. A Data Silo contains fixed data that is controlled by one department and is cut off from other parts of the organization. They tend to form within large organizations due to the different goals and priorities of individual departments. Data Silos also form when departments compete with one another instead of working as a team toward common business goals.
A few decades ago, storing a customer’s data in a silo was considered a good idea. At the time (the late 1980s and early 1990s), silos were evolving alongside new customer-data technologies, and the additional security of near-total isolation seemed reasonable.
Data Silos often store “incompatible data” that is considered important enough to translate later. (Data Marts often only contain translated data.) For many organizations, a significant amount of data was stored for later translation. Eventually, Data Silos became useful as a data source for the processing of Big Data.
The Business Dictionary describes a “silo mentality” as a mindset that exists when departments or sectors within an organization decide they do not want to share their information with the rest of the organization. The results of this behavior are generally considered to have a negative impact on organizations. Two in-house silos storing the same data may hold differing content, raising questions about which version is accurate and how current the data in each silo is. While a silo mentality can provide excellent security, Data Silos have been criticized for impeding productivity and negatively impacting data integrity.
Data Warehouses
Though Bill Inmon presented the concept of Data Warehousing in the 1970s, the Data Warehouse’s architecture wasn’t developed until the 1980s. Data Warehouses are centralized repositories of information that can be researched for purposes of making better-informed decisions. The data comes from a wide range of sources and often arrives unstructured, being cleansed and organized before it is stored. Data is accessed through the use of business intelligence tools, SQL clients, and other analytics applications. A Data Warehouse is often built on an organization’s mainframe server or located in the Cloud.
The standard Extract, Transform, and Load-based Data Warehouse employs Data Integration, staging, and access layers in its key functions. The staging layer stores raw data taken from different data sources. The integration layer merges the data by translating it and moving it to an operational data store database. This data is then moved to the Data Warehouse database, where it is organized into hierarchical groups (called “dimensions”), facts, and aggregate facts. The access layer lets users retrieve the translated and organized data.
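To make those layers concrete, here is a deliberately simplified sketch in Python of raw records moving from staging, through integration, into dimension and fact structures. Every store in it is an in-memory stand-in invented for this example, not a real warehouse API.

```python
# Hypothetical ETL sketch: staging -> integration -> access layers.
# All "stores" are plain Python structures standing in for real systems.

# Staging layer: raw records land exactly as extracted from two sources,
# with inconsistent field names and string-typed amounts.
staging = [
    {"cust": "A-1", "amt": "19.99", "ccy": "USD"},
    {"customer_id": "A-2", "amount": "5.50", "currency": "usd"},
]

def integrate(record):
    """Translate a raw record into the operational data store's common shape."""
    return {
        "customer_id": record.get("cust") or record.get("customer_id"),
        "amount": float(record.get("amt") or record.get("amount")),
        "currency": (record.get("ccy") or record.get("currency")).upper(),
    }

# Integration layer: merge and translate into the operational data store.
ods = [integrate(r) for r in staging]

# Warehouse database: organize into a dimension, a fact table, and an
# aggregate fact that the access layer would expose to users.
dim_customer = sorted({r["customer_id"] for r in ods})
fact_sales = [(r["customer_id"], r["amount"], r["currency"]) for r in ods]
agg_total_sales = sum(r["amount"] for r in ods)

print(dim_customer, fact_sales, agg_total_sales)
```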
Data Lakes and the Cloud
“The Cloud” is a term describing hosted services available over the internet. The Cloud allows organizations to use computing resources as a utility, similar to electricity, rather than building and maintaining in-house computing infrastructure.
At present, Data Lakes can be used in a large variety of environments, including the Cloud. As the use of Cloud-based data services has grown, Cloud-based Data Lakes have begun to look very much like their in-house counterparts. The benefits of transferring an in-house Data Lake to the Cloud can include:
- Processing and storage services within the Cloud can easily be scaled up or down, allowing customers to expand storage without physically adding more hardware.
- A pay-per-use model combined with the ability to scale up and down means resources can be added as needed during peak loads, and then scaled back during slower times.
- Infrastructure management and maintenance costs are reduced dramatically by transferring to a Cloud-based service.
Most hosted Cloud storage uses an object-storage architecture. Examples include Amazon Web Services S3 (March 2006), Rackspace Cloud Files (whose code was donated to the OpenStack project in 2010 and released as OpenStack Swift), and Google Cloud Storage (May 2010).
Object stores are a decades-old technology, but have scalability advantages and are very effective for storing diverse data types. Object stores have been used traditionally for Big Data storage and are often used for storing unstructured data (pictures, movies, music).
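To show what the object-storage model looks like from code, here is a minimal sketch using boto3, the AWS SDK for Python, against S3. It assumes AWS credentials are already configured in the environment; the bucket and key names are hypothetical.

```python
# Minimal object-storage sketch with boto3 (assumes configured AWS
# credentials; the bucket and key names below are hypothetical).
import boto3

s3 = boto3.client("s3")

# Objects live in a flat namespace of keys within a bucket; the "/"
# characters are only a naming convention, not real directories.
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="raw/events/2021/03/events.json",
    Body=b'{"event": "page_view", "user": 42}',
)

# Retrieve the object by the same key and read its bytes back.
obj = s3.get_object(
    Bucket="my-data-lake-bucket",
    Key="raw/events/2021/03/events.json",
)
print(obj["Body"].read().decode("utf-8"))
```

This key-based model is what lets object stores scale so well: there is no directory tree to maintain, just uniquely named objects and their metadata.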
The Cloud’s storage and data services are continuously upgraded to meet the needs of modern Data Lake architecture, and it is quite reasonable to expect the number of Cloud-based Data Lakes to grow. The next challenge in Data Lake architecture will be finding new ways to gain insights from these Data Lakes.
Data Warehouses and the Cloud
There is an excellent post by Dave Mariana that answers the question “What is a cloud data warehouse?”:
“A cloud data warehouse is a database delivered in a public cloud as a managed service that is optimized for analytics, scale and ease of use.”
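As a small illustration of that managed-service model, the sketch below runs an analytics query against one such warehouse (Google BigQuery) from Python. It assumes the google-cloud-bigquery package is installed and default credentials are configured; the project, dataset, and table names are hypothetical.

```python
# Minimal cloud data warehouse sketch using Google BigQuery.
# Assumes `pip install google-cloud-bigquery` and application-default
# credentials; the table referenced below is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

query = """
    SELECT region, SUM(amount) AS total_sales
    FROM `my-project.sales.orders`
    GROUP BY region
    ORDER BY total_sales DESC
"""

# The service handles storage, scaling, and execution; the client just
# submits SQL and iterates over the finished result.
for row in client.query(query).result():
    print(row.region, row.total_sales)
```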
Modern Data Lakes
The usability challenges of querying raw object storage, along with data consistency requirements, led to a fresh category of software projects. These projects sit between the storage and analytical platforms and offer strong ACID guarantees to the end user while dealing with object storage platforms in a native manner. This space is now addressed by:
- Delta Lake
- Apache Iceberg
- Apache Hive
For an in-depth look at the differences, please check out the “Modern Data Lakes” blog published by developer.sh last year.
Data Lakehouse
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
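To show what that compatibility looks like in practice, here is a minimal PySpark sketch that writes and reads a Delta table. It assumes the delta-spark package is installed; the local path is hypothetical.

```python
# Minimal Delta Lake sketch with PySpark (assumes `pip install delta-spark`;
# the /tmp path below is hypothetical).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Each write is an ACID commit; readers never observe a partial write.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.range(5, 10).write.format("delta").mode("append").save("/tmp/delta/events")

# Reading uses the ordinary Spark data source API.
spark.read.format("delta").load("/tmp/delta/events").show()
```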
Databricks is a company founded by the original creators of Apache Spark. Databricks grew out of the AMPLab project at the University of California, Berkeley, which produced Apache Spark, an open-source distributed computing framework written in Scala. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. In addition to building the Databricks platform, the company co-organizes massive open online courses about Spark and runs the largest conference about Spark, Spark Summit.
Building on Delta Lake, Databricks came up with a new architecture which is now known as the Lakehouse.
Just this morning, Sydney Sawaya of SDxCentral published a blog about how Databricks drives Google Cloud to the Data Lakehouse.
Conclusion
Data lakes have emerged as essential infrastructure for the modern data-driven enterprise, offering scalable, flexible, and cost-effective solutions for managing a vast range of data types. By enabling organizations to store, process, and analyze raw data, data lakes unlock insights that drive better decision-making and fuel innovative applications in analytics, machine learning, and AI. However, without thoughtful organization and governance, data lakes risk devolving into “data swamps,” which underscores the importance of best practices in architecture, management, and security.
As data lake technology continues to evolve, so do the frameworks and architectures that support it. The advent of the Data Lakehouse is reshaping how organizations think about data storage, blending the strengths of traditional data warehouses with the flexibility of data lakes. In this evolving landscape, topics such as data governance, cost optimization, cloud integration, and real-world applications are becoming crucial areas of focus.
Coming Soon
In the coming months, I will delve deeper into the core components of modern data lake management and development, focusing on the following subtopics:
- Data Lake vs. Data Lakehouse: Evolution and Key Differences – An exploration of how Lakehouses are redefining data storage with features like ACID compliance and support for diverse workloads.
- Modern Data Lake Architecture: Essential Components and Best Practices – A guide to building and maintaining a robust, organized data lake environment, including strategies for data quality and accessibility.
- Data Governance and Security in Data Lakes – Best practices and tools to ensure your data lake remains a secure, compliant, and valuable asset.
- Data Lake Optimization: Enhancing Performance and Cost Efficiency – Tips on optimizing storage formats, data processing, and resource scaling to manage data lake costs effectively.
- Integrating Machine Learning and AI with Data Lakes – How data lakes can support machine learning workflows, including data preparation, model training, and real-time analytics.
- The Role of Cloud Providers in Data Lake Development – A look at how leading cloud providers are enhancing data lake capabilities and what organizations should consider in choosing a platform.
- Real-World Case Studies: Data Lake Implementation Successes and Challenges – Insights from successful data lake deployments and lessons learned across various industries.
Stay tuned for these in-depth discussions as we continue to explore the transformative potential of modern data lakes.