Big Data

Big Data Search

To understand why Big Data Search matters, we first need to grasp the sheer scale of today's data.

A terabyte is just over 1,000 gigabytes and is a label most of us are familiar with from our home computers. Scaling up from there, a petabyte is just over 1,000 terabytes. That may be far beyond the kind of data storage the average person needs, but the industry has been dealing with data in these sorts of quantities for quite some time. In fact, way back in 2008, Google was said to process around 20 petabytes of data a day (Google doesn’t release information on how much data it processes today). To put that in context, if you took all of the information from all US academic research libraries and lumped it all together, it would add up to 2 petabytes. Scaling up again, you have exabytes (roughly 1,000 petabytes) and zettabytes (a little over 1,000 exabytes). At this stage, it becomes hard to comprehend what any of this means in real terms. Try this: according to a Cisco estimate, the world’s collective internet usage reached one zettabyte in 2016. That’s a lot of cat videos being viewed!  So, as the world’s data has grown, we’re now talking about data in terms of zettabytes.

Bernard Marr

At the time of writing this blog, there are about 60 zettabytes of data in existence globally.

The volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2024 (in zettabytes)
https://www.statista.com/statistics/871513/worldwide-data-created/
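To get a feel for these magnitudes, here is a small, purely illustrative Python sketch (using the decimal units from the quote above; the helper names are my own) that expresses roughly 60 zettabytes in more familiar units.

```python
# A rough sketch of the storage-unit ladder described above (decimal units assumed).
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (each step is a factor of 1,000)."""
    return value * 1000 ** UNITS.index(unit)

# ~60 zettabytes of global data, expressed in more familiar units
global_data_zb = 60
print(to_bytes(global_data_zb, "ZB") / to_bytes(1, "TB"))  # 60,000,000,000 terabytes
print(to_bytes(global_data_zb, "ZB") / to_bytes(1, "PB"))  # 60,000,000 petabytes
```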

SOLR + Hadoop = Big Data Search

Hadoop, a framework and collection of tools for processing enormous data sets, was originally designed to work on clusters of physical machines. That has changed over time: today many people use the Hadoop open-source project to process large data sets with distributed analytic frameworks, because it is a great solution for scalable, reliable data-processing workflows. Hadoop is by far the most popular system for handling big data, with companies using massive clusters to store and process petabytes of data on thousands of servers.
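To make the processing model concrete, here is a minimal, local sketch of the MapReduce programming model that Hadoop popularized. It runs in a single process for clarity; on a real cluster the map and reduce phases would be distributed across many machines, and the sample documents are invented for illustration.

```python
# A minimal, local sketch of the MapReduce programming model (not a real Hadoop job).
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

docs = ["big data search", "big data processing", "search at scale"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'search': 2, 'processing': 1, 'at': 1, 'scale': 1}
```

On Hadoop, the same map and reduce logic would run in parallel over blocks of data stored across the cluster, which is what makes the approach scale to petabytes.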

Hadoop, which began in 2006 as a spin-off of Nutch, an open-source web crawler project, has since evolved into the quintessential Big Data platform. It has grown in every way imaginable – users, developers, and associated projects (aka the “Hadoop ecosystem”).

The Solr open-source project, which also started at roughly the same time, has become the most widely used search solution. Solr wraps the API-level indexing and search functionality of Lucene with a RESTful API, a GUI, and lots of useful administrative and data-integration functionality.
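As a small, hedged example of that RESTful interface, the sketch below queries a local Solr collection over plain HTTP. The host, port, collection name ("techproducts"), and field names are assumptions for illustration, not details from this post.

```python
# Minimal example of querying Solr's REST-style search API with plain HTTP.
# Assumes a local Solr instance with a collection named "techproducts".
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "title:hadoop",   # free-text query against the title field
    "rows": 5,             # return at most 5 documents
    "wt": "json",          # ask for a JSON response
})
url = f"http://localhost:8983/solr/techproducts/select?{params}"

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

for doc in results["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))
```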

By combining these two open-source projects, one can use Hadoop to crunch the data and then serve it up in Solr. And we're not talking about just free-text search: Solr can also be used as a key-value store (i.e. a NoSQL database) via its support for range queries. This blend of big data and compute power also allows analysts to investigate new kinds of behavioral data.
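To illustrate the key-value and range-query usage mentioned above, here is a hedged sketch against the same assumed local collection; the field names (id, price) and the example values are placeholders.

```python
# Sketch of using Solr like a key-value / NoSQL store:
# an exact lookup by unique key, and a range query over a numeric field.
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/techproducts/select"  # assumed local collection

def solr_query(q, **extra):
    """Run a query against the assumed Solr collection and return matching docs."""
    params = urllib.parse.urlencode({"q": q, "wt": "json", **extra})
    with urllib.request.urlopen(f"{SOLR}?{params}") as resp:
        return json.load(resp)["response"]["docs"]

# "Key-value" style lookup: fetch a single document by its unique key.
print(solr_query("id:SP2514N", rows=1))

# Range query: all documents whose price falls between 10 and 100 (inclusive).
print(solr_query("price:[10 TO 100]", rows=10))
```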

Even on a single server, Solr can easily handle many millions of records (“documents” in Lucene lingo). Better still, Solr now supports sharding and replication via the new, cutting-edge SolrCloud functionality.
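As a hedged sketch of that SolrCloud functionality, the call below uses Solr's Collections API to create a sharded, replicated collection. It assumes a SolrCloud cluster reachable on localhost; the collection name and the shard/replica counts are illustrative only.

```python
# Sketch: creating a sharded, replicated collection via the SolrCloud Collections API.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "action": "CREATE",
    "name": "bigdata_search",   # hypothetical collection name
    "numShards": 2,             # split the index across 2 shards
    "replicationFactor": 2,     # keep 2 copies of each shard
    "wt": "json",
})
url = f"http://localhost:8983/solr/admin/collections?{params}"

with urllib.request.urlopen(url) as resp:
    print(json.load(resp))      # Solr returns the status of the CREATE operation
```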

Back in 2012, at a conference, M.C. Srivas of MapR explained how MapReduce can be used to achieve this.

Evolution of Hadoop

I keep my Hadoop (Big Data) blogs updated with the latest changes. The following image, published by Data Flair, depicts the major milestones in Hadoop's timeline.