Big Data

Introduction to Hadoop, Hive, and HBase

Objective

By the end of this guide, you will have installed Hadoop, Hive, and HBase on your Mac, and you’ll be ready to start implementing Big Data projects. This blog covers installation steps, configuration instructions, a proposed architecture framework, sample projects, and suggestions for further learning.


Table of Contents

  1. Introduction to Hadoop, Hive, and HBase
  2. Setting Up Hadoop, Hive, and HBase on macOS Sierra
    • Prerequisites
    • Installing Hadoop
    • Installing Hive
    • Installing HBase
  3. Proposed Architecture Framework
  4. Sample Project: Log Analysis with Hadoop, Hive, and HBase
  5. Data Flow Architecture Diagram
  6. Project Folder Structure
  7. Next Steps in Learning Hadoop, Hive, and HBase

1. Introduction to Hadoop, Hive, and HBase

Hadoop, Hive, and HBase are core components of the Big Data ecosystem:

  • Hadoop: An open-source distributed framework for storing and processing large datasets.
  • Hive: A data warehousing and SQL-like query language layer on top of Hadoop.
  • HBase: A distributed, scalable, big data store built on top of Hadoop.

These tools work together to manage and analyze Big Data by storing, querying, and retrieving data in efficient, structured, and scalable ways.


2. Setting Up Hadoop, Hive, and HBase on macOS Sierra

Prerequisites

Hardware:

  • Model: MacBook Pro (MacBookPro12,1)
  • Processor: Intel Core i7, 3.1 GHz, 2 cores
  • Memory: 16 GB

Software:

  • OS: macOS Sierra (10.12)
  • Package Manager: Homebrew 1.18
  • Java: JDK 1.8 or later (required for Hadoop)

Installing Hadoop

  1. Set JAVA_HOME

    • Verify Java installation:
      bash
      $ which java
      $ java -version
    • Set JAVA_HOME (on newer macOS releases, prefer export JAVA_HOME=$(/usr/libexec/java_home)):
      bash
      export JAVA_HOME=/Library/Java/Home
      echo $JAVA_HOME
    • For persistence, add JAVA_HOME to your ~/.profile:
      bash
      echo "export JAVA_HOME=/Library/Java/Home" >> ~/.profile
      source ~/.profile
  2. Install Homebrew

    • Open Terminal and run:
      bash
      /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
  3. Install Hadoop

    bash
    brew install hadoop
  4. Configure Hadoop

    • Edit hadoop-env.sh (located at /usr/local/Cellar/hadoop/<version>/libexec/etc/hadoop/; substitute the version Homebrew actually installed).
    • Set the HDFS directories in core-site.xml, mapred-site.xml, and hdfs-site.xml (a sample configuration sketch follows this list).
  5. Start Hadoop

    bash
    hdfs namenode -format
    hstart  # 'hstart' is a common user-defined alias for start-dfs.sh && start-yarn.sh
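
For reference, here is a minimal sketch of the kind of properties these files hold for a single-node setup. The NameNode port and replication factor below are common tutorial defaults, not values mandated by Hadoop; adjust them to your environment.

xml
<!-- core-site.xml: point clients at the local NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value> <!-- port 9000 is an assumption -->
  </property>
</configuration>

<!-- hdfs-site.xml: single node, so no replication -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>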

Installing Hive

  1. Install Hive

    bash
    brew install hive
  2. Configure Hive

    • Set environment variables in ~/.bashrc, substituting the versions Homebrew actually installed:
      bash
      export HADOOP_HOME=/usr/local/Cellar/hadoop/<version>/libexec
      export HIVE_HOME=/usr/local/Cellar/hive/<version>/libexec
    • Configure hive-site.xml for a MySQL metastore and its JDBC driver (a sample sketch follows this list).
  3. Start Hive

    bash
    hive
    hive> show tables;
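
As a reference for the metastore step above, here is a minimal hive-site.xml sketch. The property names are Hive's standard metastore settings; the database name and credentials are placeholder assumptions.

xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- assumes a local MySQL database named 'metastore' -->
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value> <!-- placeholder credentials -->
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>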

Installing HBase

  1. Install Zookeeper (required by HBase)

    bash
    brew install zookeeper
    brew services start zookeeper
  2. Install HBase

    bash
    brew install hbase
    brew services start hbase
  3. Open the HBase Shell (the server was already started by brew services)

    bash
    hbase shell
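
To confirm the installation works end to end, you can create, populate, and scan a throwaway table from the shell; the table and column-family names here are arbitrary examples.

bash
hbase(main):001:0> create 'test_table', 'cf'
hbase(main):002:0> put 'test_table', 'row1', 'cf:msg', 'hello hbase'
hbase(main):003:0> scan 'test_table'
hbase(main):004:0> disable 'test_table'
hbase(main):005:0> drop 'test_table'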

3. Proposed Architecture Framework

Here’s a high-level architecture framework using Hadoop, Hive, and HBase.

  1. Data Ingestion

    • Source: Application logs or streaming data, ingested into HDFS.
  2. Data Processing with Hadoop

    • Tools: Hadoop MapReduce and Hive.
    • Function: Perform ETL, run SQL-like queries using Hive for data aggregation.
  3. Data Storage in HBase

    • Use Case: Real-time access for specific data subsets.
    • Integration: Hive queries pull HDFS data into HBase for fast querying.
  4. Data Analytics and Visualization

    • Tools: Hive for reporting, HBase for rapid data access.
    • Visualization: Connect to tools like Grafana for live dashboards.

4. Sample Project: Log Analysis with Hadoop, Hive, and HBase

Objective: Perform log analysis to monitor and flag unusual activity patterns in application logs.

Step 1: Set Up Data Ingestion with Hadoop

  1. Load Log Files to HDFS

    bash
    hdfs dfs -mkdir /logs
    hdfs dfs -put /path/to/log/files /logs
  2. Define Hive Table for Log Data

    sql
    CREATE EXTERNAL TABLE logs (
      `timestamp` STRING,  -- backticks: TIMESTAMP is a reserved type name in Hive
      log_level STRING,
      message STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE
    LOCATION '/logs';
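
To sanity-check the ingestion, list the HDFS directory and sample a few rows through Hive; this assumes the table above was created without errors.

bash
hdfs dfs -ls /logs
hive -e "SELECT * FROM logs LIMIT 10;"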

Step 2: Process Data with Hive

  1. Analyze Logs by Severity

    sql
    SELECT log_level, COUNT(*) AS log_count
    FROM logs
    GROUP BY log_level;
  2. Filter and Save Suspicious Logs to HBase (hbase_table is a Hive table mapped onto HBase; see the sketch after this list)

    sql
    INSERT INTO TABLE hbase_table
    SELECT * FROM logs WHERE log_level = 'ERROR';
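
For completeness, here is one way hbase_table could be defined so that Hive writes land in HBase. This sketch uses Hive's HBaseStorageHandler; the column family (cf) and the target HBase table name are illustrative assumptions.

sql
CREATE TABLE hbase_table (
  rowkey STRING,
  log_level STRING,
  message STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- :key maps the first column to the HBase row key
  "hbase.columns.mapping" = ":key,cf:log_level,cf:message"
)
TBLPROPERTIES ("hbase.table.name" = "suspicious_logs");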

Step 3: Visualize Data

Use Grafana or a similar tool to set up dashboards for monitoring log trends.
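
If your dashboard tool cannot query Hive directly, one simple pattern (a sketch, assuming comma-delimited files are acceptable input) is to export the aggregates from Hive and point the visualization tool at the output directory; the path below is an assumption.

sql
-- write per-level counts as CSV files under /output/log_counts
INSERT OVERWRITE DIRECTORY '/output/log_counts'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT log_level, COUNT(*) FROM logs GROUP BY log_level;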


5. Data Flow Architecture Diagram

(Diagram: log sources → HDFS ingestion → Hadoop/Hive processing → HBase storage → Grafana dashboards, mirroring the architecture in Section 3.)

6. Project Folder Structure

Below is a suggested folder structure for this project to keep code organized.

text
log-analysis/
│
├── src/
│   ├── main/
│   │   ├── hadoop/
│   │   │   └── logAnalysisJob.scala
│   │   ├── hive/
│   │   │   └── hiveQueries.sql
│   │   └── hbase/
│   │       └── hbaseStore.scala
├── resources/
│   └── config/
│       ├── hadoop-env.sh
│       ├── hive-site.xml
│       └── hbase-site.xml
├── data/
│   └── logs/
│       └── sample.log
├── docs/
│   └── architecture-diagram.png
└── README.md
  • src/hadoop: Hadoop job files.
  • src/hive: Hive query files.
  • src/hbase: Scripts to interact with HBase.
  • resources/config: Configuration files for Hadoop, Hive, and HBase.
  • data: Directory for sample log files.

7. Next Steps in Learning Hadoop, Hive, and HBase

  1. Deep Dive into Hadoop Ecosystem

    • Learn advanced MapReduce concepts, YARN, and HDFS optimizations.
  2. Mastering Hive

    • Practice writing complex Hive queries and working with partitions and bucketing (see the sketch after this list).
  3. Real-Time Applications with HBase

    • Study HBase architecture and explore schema design best practices.
  4. Build End-to-End Projects

    • Implement full data pipelines, integrating Hadoop, Hive, and HBase.
  5. Learn Visualization

    • Connect Hadoop/Hive/HBase with BI tools like Grafana or Power BI for real-time analytics.
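
For the Hive item above, here is a small sketch of a partitioned, bucketed table to experiment with; the table and column names are illustrative and not part of the log-analysis project.

sql
-- Partition by day so queries scan only the relevant directories;
-- bucket by user_id to speed up joins and sampling.
CREATE TABLE events (
  user_id BIGINT,
  action STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;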

Conclusion

With this guide, you now have Hadoop, Hive, and HBase installed and ready to use. You’ve also set up a sample project for analyzing log data, which showcases how to store and analyze Big Data using these tools. In future posts, we’ll cover deeper aspects of Hadoop, Hive, and HBase, including advanced configurations, optimization, and real-time applications. Happy Big Data journey!