
KAFKA Basics

This entry is part 5 of 5 in the series KAFKA Series

Apache Kafka has transformed the world of data streaming and event-driven architectures. In this blog, we’ll dive into Kafka’s fundamentals and build a step-by-step sample project to demonstrate its capabilities. This project will showcase Kafka’s distributed nature and streaming potential, giving you a practical approach to setting up, running, and testing a Kafka cluster on macOS.


Table of Contents

  1. Kafka Basics
  2. Installing Kafka on macOS
  3. Basic Operations
  4. Sample Project: Kafka Real-Time Data Pipeline
  5. Conclusion and Next Steps

Kafka Basics

What is Kafka

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue capable of handling high volumes of data. Built with scalability, reliability, and low-latency streaming in mind, Kafka can handle real-time data feeds in enterprise environments. Kafka’s unique storage layer makes it suitable for both offline and online message consumption by persisting messages on disk and replicating them within a cluster.

Kafka System Architecture

Kafka’s distributed architecture enables it to run as a cluster across multiple servers. It organizes messages into topics, where each message consists of a key, value, and timestamp. Kafka’s architecture includes:

  • ZooKeeper: For managing Kafka brokers.
  • Brokers: Handle data storage and distribution across the cluster.
  • Producers: Publish messages to Kafka topics.
  • Consumers: Subscribe to topics to process messages.

Kafka API

Kafka offers four core APIs:

  • Producer API: Publishes streams of records to topics.
  • Consumer API: Subscribes to topics and processes records.
  • Streams API: Processes and transforms data within Kafka (see the sketch after this list).
  • Connector API: Integrates Kafka with external systems.
Source: https://kafka.apache.org/intro
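
To make the Streams API more concrete, here is a minimal illustrative sketch that reads records from one topic, upper-cases each value, and writes the results to another. The class name UppercaseStream and the topic names input-topic and output-topic are placeholders, and the sketch assumes the kafka-streams library is on the classpath with a broker reachable at localhost:9092.

java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo"); // any unique application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Topology: input-topic -> toUpperCase -> output-topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close the topology cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}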

 

Why Kafka

Kafka’s popularity stems from its robustness:

  • Reliability: Kafka’s fault-tolerant, distributed nature makes it reliable for high-volume applications.
  • Scalability: Kafka can scale horizontally without downtime.
  • Durability: Kafka’s distributed commit log ensures data persistence.
  • Performance: Kafka achieves high throughput for publishing and subscribing, even with TBs of data.

Installing Kafka on macOS

To install Kafka on macOS using Homebrew:

bash
$ brew install kafka

If ZooKeeper is not already installed, Homebrew will pull it in automatically as a dependency of the kafka formula, so no separate installation is needed.

To upgrade Kafka later, refresh Homebrew's formulae and then upgrade the package:

bash
$ brew update
$ brew upgrade kafka

Basic Operations

Starting Kafka and ZooKeeper

Kafka requires ZooKeeper to coordinate between brokers. Start ZooKeeper:

bash
$ zkServer start

Now, start Kafka:

bash
$ brew services start kafka

Or, if you are working from a Kafka binary distribution, start it directly from the Kafka directory:

bash
$ bin/kafka-server-start.sh config/server.properties

Creating a Topic

To create a topic for storing messages, run:

bash
$ bin/kafka-topics.sh --create --topic my-kafka-topic --zookeeper localhost:2181 --partitions 3 --replication-factor 2

This command configures my-kafka-topic with 3 partitions and a replication factor of 2. Note that a replication factor of 2 requires at least two running brokers, and that on Kafka 2.2 and later the --zookeeper flag is replaced by --bootstrap-server (for example, --bootstrap-server localhost:9092). The same topic can also be created programmatically, as sketched below.
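
A minimal sketch of that programmatic route, using Kafka's Java AdminClient, is shown below. The class name TopicCreator is a placeholder; it assumes the kafka-clients library is on the classpath and a broker listening on localhost:9092.

java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class TopicCreator {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any reachable broker works as the bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Three partitions and replication factor 2, mirroring the CLI command above.
            NewTopic topic = new NewTopic("my-kafka-topic", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
            System.out.println("Created topic: " + topic.name());
        }
    }
}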


Sample Project: Kafka Real-Time Data Pipeline

In this project, we’ll set up a basic real-time data pipeline using Kafka producers and consumers to simulate data flow through a Kafka topic. The goal is to send streaming data from a producer and receive it with a consumer in real-time.

Project Structure

text
kafka-real-time-pipeline/
├── config/
│   ├── server.1.properties
│   ├── server.2.properties
│   └── server.3.properties
├── scripts/
│   ├── start-zookeeper.sh
│   ├── start-kafka-brokers.sh
│   └── create-topic.sh
├── src/
│   ├── producer/
│   │   └── DataProducer.java
│   └── consumer/
│       └── DataConsumer.java
└── README.md
  • config/: Configuration files for Kafka brokers.
  • scripts/: Scripts to start services and create topics.
  • src/: Source code for the producer and consumer applications.

Step 1: Setting Up Brokers

To demonstrate Kafka’s distributed nature, configure three brokers:

  1. Copy config/server.properties three times to server.1.properties, server.2.properties, and server.3.properties.

  2. Modify each file as follows:

    • server.1.properties

      properties
      broker.id=1
      listeners=PLAINTEXT://:9093
      log.dirs=/tmp/kafka-logs1
    • server.2.properties

      properties
      broker.id=2
      listeners=PLAINTEXT://:9094
      log.dirs=/tmp/kafka-logs2
    • server.3.properties

      properties
      broker.id=3
      listeners=PLAINTEXT://:9095
      log.dirs=/tmp/kafka-logs3
  3. Run each broker in a separate terminal (a quick cluster check is sketched after these commands):

    bash
    $ bin/kafka-server-start.sh config/server.1.properties
    $ bin/kafka-server-start.sh config/server.2.properties
    $ bin/kafka-server-start.sh config/server.3.properties
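
Before moving on, it can help to confirm that all three brokers have joined the cluster. The following optional sketch (the class name ClusterCheck is a placeholder, not part of the project files) asks the cluster for its broker list via the Java AdminClient, using the listener ports configured above; it should print three entries.

java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Properties;

public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "localhost:9093,localhost:9094,localhost:9095");

        try (AdminClient admin = AdminClient.create(props)) {
            // Ask the cluster for its current broker nodes; three are expected here.
            admin.describeCluster().nodes().get().forEach(node ->
                    System.out.println("Broker " + node.id() + " at " + node.host() + ":" + node.port()));
        }
    }
}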

Step 2: Creating a Producer

The producer application sends data to a Kafka topic.

DataProducer.java:

java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class DataProducer {
    public static void main(String[] args) {
        // Connect to the three brokers configured in Step 1 and use String serialization.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9093,localhost:9094,localhost:9095");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Send ten keyed messages to the topic created earlier.
        for (int i = 0; i < 10; i++) {
            producer.send(new ProducerRecord<>("my-kafka-topic", Integer.toString(i), "Message " + i));
        }

        // close() flushes any buffered records before shutting down.
        producer.close();
    }
}
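
If you want confirmation that each record actually reached the cluster, send() also accepts a callback that fires once the broker acknowledges (or rejects) the record. The variation below is an optional sketch of that pattern, not part of the original project; the class name DataProducerWithCallback is a placeholder.

java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class DataProducerWithCallback {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9093,localhost:9094,localhost:9095");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // The callback runs when the send completes, reporting where the record landed.
                producer.send(new ProducerRecord<>("my-kafka-topic", Integer.toString(i), "Message " + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            } else {
                                System.out.printf("Acked: partition=%d offset=%d%n",
                                        metadata.partition(), metadata.offset());
                            }
                        });
            }
        }
    }
}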

Step 3: Creating a Consumer

The consumer application reads data from the Kafka topic.

DataConsumer.java:

java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DataConsumer {
    public static void main(String[] args) {
        // Connect to the same three brokers and join the "test-group" consumer group.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9093,localhost:9094,localhost:9095");
        props.put("group.id", "test-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-kafka-topic"));

        // Poll in a loop; each call returns the records that arrived since the last poll.
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("Consumed message: %s%n", record.value());
            }
        }
    }
}
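
One practical detail: with the consumer's default auto.offset.reset of latest, a brand-new consumer group only sees records produced after it subscribes, so messages sent before the consumer starts are skipped. If you run the producer first (as in the test steps below), consider adding the following property to the props block above before creating the consumer; this is an optional tweak, not part of the original listing.

java
// Read from the beginning of each partition when this group has no committed offsets yet.
props.put("auto.offset.reset", "earliest");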

Testing the Kafka Pipeline

  1. Start Producer: Run DataProducer to send messages to my-kafka-topic.
  2. Start Consumer: In another terminal, run DataConsumer to consume messages.
  3. Verify Output: Observe the consumer receiving messages in real-time.

Conclusion and Next Steps

In this blog, we’ve covered Kafka’s architecture, installation, and a real-time data pipeline project. For advanced exploration, consider adding multiple consumers, experimenting with different partition counts, and testing real-time data transformations using Kafka Streams.

Apache Kafka remains a powerful, adaptable tool in data streaming, and this project demonstrates its potential in a scalable, distributed setup. Future blogs will dive deeper into advanced Kafka configurations and integrations with other data-processing frameworks.

 
