TLDR; this is a summary of a tech talk I did at Open Source North in 2021. It draws on a past project in which I created an event-driven architecture and patient portal to compare prescription prices by zip code. In the process I figured out an easier way to deploy and manage Kafka on Google Cloud.
The Kafka Revolution
There are many streaming platforms on the market, but at the top of the list is Apache Kafka, which was created to solve large-scale data processing and management problems. Kafka was originally developed by Jay Kreps, Neha Narkhede, and Jun Rao at LinkedIn around 2009, when the company needed a platform that could keep up with its growing data ingestion and processing needs.
LinkedIn needed a system that could process high volumes of event data (like user activity, system logs, etc.) in real-time and make it available to various consumer systems for purposes like analytics and monitoring. Seeing the potential for wider use, LinkedIn open-sourced Kafka in 2011. The project was then submitted to the Apache Software Foundation for incubation, where it quickly gained popularity due to its performance, scalability, and fault-tolerance features.
Tight adherence to Kafka’s core principles has made it a cornerstone technology in the field of real-time data processing and streaming for companies of all sizes.
Kafka’s Top Features
Here is a fairly comprehensive list of what Kafka does best:
- High-Volume Data Processing: Modern businesses and applications often generate vast amounts of data that need to be processed and analyzed quickly. Kafka is designed to handle high throughput of data, enabling real-time processing and analysis of large data streams.
- Real-Time Data Processing: Kafka provides a framework for processing data in real-time, as opposed to batch processing. This is crucial for applications that rely on timely data analysis, such as fraud detection, live recommendations, and monitoring systems.
- Scalability: Kafka’s distributed architecture allows it to scale out to handle more data and more consumers. It can be run across a cluster of machines to ensure that data processing can keep up with the growth in data volume.
- Fault Tolerance and Durability: Kafka ensures data durability and fault tolerance by replicating data across multiple nodes. If a node fails, data can still be retrieved from other nodes, ensuring that no data is lost.
- Decoupling of Data Producers and Consumers: Kafka acts as a buffer between data producers and consumers. Producers can push data to Kafka at their own pace, and consumers can pull data from Kafka when they are ready to process it. This decoupling allows for more flexible and robust system architectures.
- Stream Processing: Kafka Streams allows for the implementation of complex stream processing pipelines, enabling the transformation, aggregation, and enrichment of streaming data on the fly, while Kafka Connect integrates Kafka with external data sources and sinks.
- Distributed System Coordination: Kafka can be used as a distributed commit log, which is useful in scenarios where distributed systems need to be kept in sync. This feature is essential for maintaining consistency across distributed databases and services.
- Replayable Events: Kafka stores data in topics that can be replayed or consumed from a specific point in time. This feature is useful for recovering from failures, reprocessing data, or loading historical data for analysis (a short example follows this list).
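To make the replayability point concrete, here is a minimal sketch using Kafka’s built-in console consumer. The topic name my-topic, the broker address localhost:9092, and the offset 42 are placeholders for illustration:
# Replay everything the topic still retains, starting from the earliest offset
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning
# Or resume from a specific offset within a single partition
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --partition 0 --offset 42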
Deploying Kafka on Kubernetes Using Helm
Before diving into the details, let’s take a look at what we will need.
Prerequisites:
- Docker Desktop with Kubernetes enabled
- Helm installed (Helm is a package manager for Kubernetes, simplifying deployment and management of applications)
- Basic understanding of Kubernetes concepts like pods, services, and deployments
Step 1: Setting up the Environment
First, ensure Kubernetes is running in Docker Desktop:
kubectl cluster-info
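If you have more than one cluster configured, it’s worth double-checking that kubectl is pointing at the Docker Desktop cluster (the context is usually named docker-desktop):
kubectl config current-context
kubectl config use-context docker-desktop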
Step 2: Installing Helm
If you haven’t installed Helm yet, you can follow the instructions on the Helm site. Verify the installation:
helm version
Step 3: Adding Kafka Helm Chart Repository
Helm uses a packaging format called charts. A chart is a collection of files that describe a related set of Kubernetes resources. I’ll use the Bitnami Kafka chart, which is well maintained and widely used for deploying Kafka on Kubernetes.
Add the Bitnami chart repository:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
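You can confirm the chart is now visible locally before installing anything:
helm search repo bitnami/kafka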
Step 4: Deploying Kafka Using Helm
Now, let’s deploy Kafka. We’ll use the helm install command. Here, my-kafka is the release name, and bitnami/kafka specifies which chart to use.
helm install my-kafka bitnami/kafka
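The chart’s defaults are fine for local experimentation, but Helm also lets you override configuration at install time. The exact value keys change between chart versions, so inspect them first; values.yaml below is a hypothetical override file, not something the chart ships with:
# See every value the chart exposes for the version you're installing
helm show values bitnami/kafka
# Install with your own overrides
helm install my-kafka bitnami/kafka -f values.yaml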
Step 5: Verifying the Deployment
Let’s confirm that Kafka is up and running:
kubectl get pods
You should see the Kafka pods starting. It might take a few minutes for all of them to reach the Running state.
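You can also ask Helm about the release itself, or watch the pods until they settle:
helm status my-kafka
kubectl get pods -w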
Step 6: Accessing Kafka
Kafka deployed in Kubernetes isn’t immediately accessible from outside the cluster. To interact with Kafka, we’ll create a Kafka client pod:
kubectl run kafka-client --rm --tty -i --restart='Never' --namespace default --image docker.io/bitnami/kafka:latest -- bash
Inside this client pod, you can run Kafka commands. For instance, to create a topic:
kafka-topics.sh --create --topic test --bootstrap-server my-kafka:9092
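Still inside the client pod, you can confirm the topic exists and inspect its partitions (this assumes the chart’s defaults allow plaintext connections, as the command above does):
kafka-topics.sh --list --bootstrap-server my-kafka:9092
kafka-topics.sh --describe --topic test --bootstrap-server my-kafka:9092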
Step 7: Producing and Consuming Messages
You can now produce messages to this topic:
kafka-console-producer.sh --bootstrap-server my-kafka:9092 --topic test
Type a message and press Enter. To consume messages, open another terminal and run:
kubectl exec -it kafka-client -- kafka-console-consumer.sh --bootstrap-server my-kafka:9092 --topic test --from-beginning
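If you want to see how Kafka spreads work across consumers, you can also attach the consumer to a consumer group (my-group is just an arbitrary name for this example):
kubectl exec -it kafka-client -- kafka-console-consumer.sh --bootstrap-server my-kafka:9092 --topic test --group my-group
With a single-partition topic like test, only one consumer in the group will receive messages at a time; with more partitions, Kafka balances them across the group’s members.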
Understanding What’s Going On
- Kubernetes Pods: These are like rooms in a ship where each Kafka instance (broker) resides. Notice that in this example we only deployed a single broker. For production it is recommended to deploy multiple brokers, but securing and networking a multi-broker cluster is out of scope for this post.
- Helm Charts: These are the blueprints for building and deploying ships (to use the above analogy).
By following these steps, you should be able to successfully deploy Kafka on Kubernetes using Helm. This setup is ideal for development and testing. For production, it would be critical to consider factors like persistence, security, and high availability — maybe we can explore that in a future post.