Strimzi-Driven Kafka Orchestration and Kubernetes State Management

The deployment of Apache Kafka within a Kubernetes environment represents a fundamental shift in how distributed streaming platforms are managed, moving away from manual server configuration toward an operator-driven, declarative model. By utilizing the Strimzi operator, organizations can automate the complex lifecycle of Kafka clusters, including the deployment of brokers, the management of topics, and the orchestration of users, while leveraging Kubernetes' native scaling and self-healing capabilities. This architectural approach allows for the implementation of KRaft mode, which eliminates the dependency on ZooKeeper by consolidating metadata management within the Kafka controllers themselves, thereby reducing the operational overhead and potential points of failure associated with maintaining a separate coordination service.

Strimzi Operator Infrastructure and Deployment

The Strimzi operator serves as the central intelligence for Kafka on Kubernetes, translating high-level Custom Resource (CR) definitions into the actual operational state of the cluster. It manages the entire lifecycle, from initial installation to the scaling of node pools and the administration of the entity operator.

The operational flow of Strimzi is structured as a hierarchical dependency chain. The Strimzi Operator manages the Kafka Cluster CR, which in turn governs the KafkaNodePool CR and the Entity Operator. The Entity Operator further subdivides into the Topic Operator, which handles the creation and modification of Kafka topics, and the User Operator, which manages authentication credentials and access control. At the lowest level, the KafkaNodePool CR interacts with StrimziPodSet, ensuring that each broker is associated with a Persistent Volume Claim (PVC) to maintain data persistence across pod restarts.

To implement this infrastructure, several prerequisites must be met. The environment requires a Kubernetes cluster compatible with Strimzi 0.44.0, specifically versions v1.25 through v1.31. Additionally, kubectl must be configured for cluster communication, Helm v3 must be installed for package management, and a StorageClass capable of dynamic provisioning must be available to support the persistent storage requirements of the Kafka brokers.

The installation process begins with the addition of the Strimzi Helm repository and the creation of a dedicated namespace.

bash helm repo add strimzi https://strimzi.io/charts/ helm repo update

The operator is then installed into the kafka namespace with the following configuration:

bash kubectl create namespace kafka helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \ --namespace kafka \ --set watchNamespaces="{kafka}" \ --version 0.44.0

Verification of the operator's operational status is performed by checking the readiness of the operator pod:

bash kubectl get pods -n kafka -l name=strimzi-cluster-operator

Production Cluster Configuration and KRaft Implementation

A production-grade Kafka deployment requires a rigorous set of specifications to ensure fault tolerance, data integrity, and performance. The shift toward KRaft mode (Kafka Raft) allows the cluster to operate without ZooKeeper, integrating the controller role directly into the Kafka nodes.

The KafkaNodePool resource defines the physical characteristics of the nodes. For a production environment, a minimum of 3 replicas is required to maintain fault tolerance. Each node is assigned both the controller and broker roles to ensure high availability of the metadata and the data plane.

The storage configuration for these nodes utilizes persistent claims with a size of 100Gi and the standard storage class. Setting deleteClaim: false is critical, as it prevents the accidental deletion of the Persistent Volume Claims when the node pool is modified or deleted, ensuring that data is not lost during cluster maintenance.

Resource allocation is managed through requests and limits to prevent noisy neighbor issues and ensure stability.

  • Memory requests are set to 2Gi, with limits at 4Gi.
  • CPU requests are set to 500m, with limits at 2.

To optimize the Java Virtual Machine (JVM) for Kafka's memory-intensive operations, the following options are applied:

  • -Xms: 1024m (Initial heap size)
  • -Xmx: 2048m (Maximum heap size)

The Kafka Custom Resource further defines the cluster's behavioral properties. Version 3.8.0 is utilized with a metadata version of 3.8-IV0. Listeners are configured to handle both internal and secure communication. The plain listener operates on port 9092 for internal communication without TLS, utilizing scram-sha-512 for authentication. The tls listener operates on port 9093 and enables TLS encryption for secure data transmission.

The internal configuration settings for the cluster are designed for high availability and durability:

  • num.partitions: 3 (Default for new topics to allow parallel processing).
  • offsets.topic.replication.factor: 3 (Ensures the offset topic is highly available).
  • transaction.state.log.replication.factor: 3 (Protects transaction logs).
  • transaction.state.log.min.isr: 2 (Minimum in-sync replicas for transaction logs).
  • default.replication.factor: 3 (Standard replication for all topics).
  • min.insync.replicas: 2 (Guarantees that at least two replicas acknowledge a write before it is considered successful).
  • log.retention.hours: 168 (Retains data for 7 days).
  • log.segment.bytes: 1073741824 (Sets segment size to 1 GB).

The configuration is finalized by integrating the JMX Prometheus exporter for monitoring, which references a ConfigMap for its rules.

```yaml

kafka-cluster.yaml fragment

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: production-cluster
namespace: kafka
annotations:
strimzi.io/node-pools: enabled
strimzi.io/kraft: enabled
spec:
kafka:
version: 3.8.0
metadataVersion: 3.8-IV0
listeners:
- name: plain
port: 9092
type: internal
tls: false
authentication:
type: scram-sha-512
- name: tls
port: 9093
type: internal
tls: true
config:
num.partitions: 3
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
default.replication.factor: 3
min.insync.replicas: 2
log.retention.hours: 168
log.segment.bytes: 1073741824
authorization:
type: simple
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
```

Network Topology and Service Discovery

In a Kubernetes cluster, service discovery for Kafka is handled via the internal DNS system. This allows producers, consumers, and brokers to communicate using stable network identifiers rather than volatile Pod IP addresses.

When using a Headless Service, Kubernetes assigns a DNS entry for each pod in the StatefulSet. For example, the broker kafka-0 can be reached via kafka-0.kafka.production.svc.cluster.local. This granularity is essential for Kafka because clients need to connect directly to the broker that is the leader for a specific partition.

DNS lookups can be verified using a tool like nslookup within a busybox pod.

```bash

Testing individual broker DNS

nslookup kafka-0.kafka

Result: Address 1: 10.42.0.14 kafka-0.kafka.production.svc.cluster.local

Testing the service DNS

nslookup kafka

Results:

Address 1: 10.42.0.15 kafka-1.kafka.production.svc.cluster.local

Address 2: 10.42.0.16 kafka-2.kafka.production.svc.cluster.local

Address 3: 10.42.0.14 kafka-0.kafka.production.svc.cluster.local

```

This structure ensures that if a broker fails and is restarted by Kubernetes, the DNS entry remains consistent, and the client can re-establish a connection based on the bootstrap configuration. Producers and consumers use these replicas in their bootstrap configuration, and brokers use them to communicate with the KRaft controllers or ZooKeeper replicas.

Producer Configuration and Event Injection

Interacting with a Kafka cluster on Kubernetes requires a producer configuration that matches the security and reliability settings defined in the cluster CR.

The following configuration is used for a Python-based producer:

  • bootstrap.servers: production-cluster-kafka-bootstrap.kafka.svc.cluster.local:9092 (The internal bootstrap service).
  • client.id: k8s-producer (Identifies the client for logging and auditing).
  • security.protocol: SASL_PLAINTEXT (Ensures communication is authenticated via SASL).
  • sasl.mechanisms: SCRAM-SHA-512 (Matches the cluster's authentication mechanism).
  • sasl.username: app-producer (The authorized user).
  • sasl.password: Fetched from environment variables via os.environ["KAFKA_PASSWORD"] for security.

To prevent data loss and ensure consistency, the producer employs idempotency and strict acknowledgement settings:

  • enable.idempotence: True (Ensures that messages are delivered exactly once, preventing duplicates during retries).
  • acks: "all" (The producer waits for all in-sync replicas to acknowledge the write).

Example event injection code:

```python

Send a sample event

event = {"user_id": "u-123", "action": "login"}
producer.produce(
topic="user-events",
key="u-123",
value=json.dumps(event),
)
producer.flush()
print("Event sent successfully")
```

Monitoring and Metrics Integration

Observability in a Kafka cluster is achieved by exporting JMX (Java Management Extensions) metrics to Prometheus. Strimzi facilitates this through the jmxPrometheusExporter.

A ConfigMap is required to define how JMX metrics are mapped to Prometheus format. This allows the SRE team to track broker-level and partition-level health.

```yaml

kafka-metrics-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
name: kafka-metrics
namespace: kafka
data:
kafka-metrics-config.yml: |
lowercaseOutputName: true
rules:
# Broker-level metrics
- pattern: "kafka.server name: "kafkaserverbrokertopicmetrics$1total"
type: COUNTER
# Partition-level metrics
- pattern: "kafka.log name: "kafkalogsize"
labels:
topic: "$1"
partition: "$2"
```

The application of this ConfigMap is handled via kubectl apply -f kafka-metrics-configmap.yaml. Once active, the Prometheus server can scrape these metrics, allowing for real-time monitoring of message throughput and log sizes.

Kafka Streams on Kubernetes: State and Storage Analysis

Deploying Kafka Streams applications on Kubernetes introduces significant challenges regarding state management. Kafka Streams utilizes local state stores (typically RocksDB) to maintain windowed aggregates and joins, necessitating persistent storage.

Storage Strategy Comparison

The choice of storage for Kafka Streams involves a trade-off between stability and recovery speed.

Storage Type Pros Cons
Local Persistent Volumes High performance, low latency Difficult to migrate across nodes, operational headaches
Ephemeral Volumes Easy to manage, no locking issues Downtime during restarts, state must be rebuilt
Network File System (NFS) Shared access, centralized management Locking issues, high latency, risk of crashes

The NFS Locking Problem

A critical failure point when using NFS for Kafka Streams is the issue of file system locking. Kafka Streams relies on strict locks on the file system to ensure that only one instance of a stream task is writing to a state store. NFS does not provide the necessary locking guarantees, which leads to application crashes. Therefore, NFS is not recommended for persisting state stores.

Ephemeral Volume Trade-offs

Ephemeral volumes are often preferred for low-throughput, high-retention use cases. While they cause downtime during deployments or restarts because the state must be restored from the Kafka changelog topic, the restoration time is often acceptable. For instance, a 300GB state store can be restored in approximately 20 minutes, depending on the available network, CPU, and disk resources.

StatefulSet Operational Challenges

The recommended approach for persisting state is using StatefulSets with Local Persistent Volumes. However, this introduces several "operational headaches":

  • Single Replica Blockage: Issues with a single replica can potentially block the entire StatefulSet, hindering updates or scaling.
  • Node Dynamics: Adding or removing nodes from the Kubernetes cluster can become problematic, as pods are tied to specific local volumes.
  • Autoscaling Complexity: Scaling a Kafka Streams StatefulSet up or down is difficult due to the redistribution of state and the need to manage Persistent Volume Claims (PVCs).
  • Outage Recovery: Recovering from a full cluster outage can take significantly longer than expected due to the sequential nature of StatefulSet pod recovery.

To mitigate these issues, the following configuration combo is recommended:

  • Use StatefulSets for identity.
  • Utilize Local Persistent Volumes for RocksDB.
  • Implement Kafka Streams Static Membership by setting group.instance.id = ${HOSTNAME}.
  • Assign a high Pod Scheduling Priority to ensure critical stream processing pods are not preempted.

If a pod in a StatefulSet becomes Pending, the manual deletion of its corresponding Persistent Volume Claims is required to allow the pod to be rescheduled on a different Kubernetes node.

Production Readiness Checklist

To ensure a Kafka deployment is production-ready, the following recommendations must be implemented:

  • Broker Count: A minimum of 3 brokers must be deployed to ensure fault tolerance.
  • Replication Factor: Set the replication factor to 3.
  • Minimum In-Sync Replicas: Set min.insync.replicas=2 to ensure data is written to at least two brokers before being acknowledged.
  • Storage: Use Persistent Volumes to ensure data survives pod crashes and restarts.

Analysis of Kubernetes Statefulness for Streaming

The deployment of Kafka and Kafka Streams on Kubernetes highlights a fundamental tension between the cloud-native philosophy of ephemeral, stateless workloads and the reality of distributed data systems. While the Strimzi operator successfully abstracts the complexity of Kafka broker deployment through KRaft and the KafkaNodePool CR, the state management of Kafka Streams remains an area of operational friction.

The reliance on RocksDB for state stores means that the "local" nature of the data is paramount. When this local state is moved to a network-based system like NFS, the failure of the locking mechanism renders the system unstable. This proves that not all "persistent" storage is created equal; the specific requirements of the application (in this case, atomic file locking) dictate the infrastructure choice.

Furthermore, the transition from ZooKeeper to KRaft simplifies the architecture but does not eliminate the inherent challenges of the StatefulSet. The dependence on PVCs and the resulting "stickiness" of pods to specific nodes create a rigid infrastructure that resists the fluid autoscaling typical of Kubernetes. The use of Static Membership (group.instance.id) is a critical optimization here, as it reduces the frequency of rebalances during pod restarts, thereby increasing the overall stability of the streaming pipeline.

Ultimately, the success of Kafka on Kubernetes depends on matching the storage strategy to the throughput and retention needs of the application. For most, the trade-off of using ephemeral volumes—accepting a 20-minute recovery window for 300GB of data—is preferable to the operational fragility of NFS or the management complexity of Local Persistent Volumes.

Sources

  1. OneUptime
  2. Confluent Forum
  3. Redpanda

Related Posts