Elastic Cloud on Kubernetes Orchestration

Running Elasticsearch and Kibana on Kubernetes changes how these distributed search and analytics engines are managed. By relying on Kubernetes for container orchestration, organizations can move from manual, error-prone installations to a cloud-native operational model. The approach combines what Kubernetes provides (scaling, self-healing, and resource abstraction) with the indexing and search capabilities of the Elastic Stack. The integration is primarily handled by the Elastic Cloud on Kubernetes (ECK) operator, which turns the complex lifecycle of an Elasticsearch cluster into a set of declarative state definitions.

In a traditional deployment, an administrator would be responsible for the manual configuration of every node, the painstaking setup of TLS certificates, and the complex orchestration of cluster discovery. In contrast, the Kubernetes-native approach utilizes the Operator pattern. This pattern allows the system to watch for custom resources and automatically execute the necessary steps to reach the desired state. Whether deploying on vanilla Kubernetes or managed distributions such as Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Microsoft Azure Kubernetes Service (AKS), or Red Hat OpenShift, the objective remains the same: to decouple the operational complexity of the software from the underlying infrastructure.

The Elastic Cloud on Kubernetes Operator Framework

The Elastic Cloud on Kubernetes (ECK) offering is built upon the Kubernetes Operator pattern. An operator is essentially a method of packaging, deploying, and managing a Kubernetes application. For the Elastic Stack, this means that the orchestration capabilities of Kubernetes are extended to specifically support the nuances of Elasticsearch and Kibana.

The operator functions by watching for specific custom resources. When a user submits a manifest defining an Elasticsearch cluster, the operator detects this change and begins the process of creating the necessary pods, services, and persistent volume claims. This automation extends beyond the initial deployment to include the entire lifecycle of the cluster.

The specific automation capabilities provided by ECK include:

Cluster discovery: The operator manages how nodes find each other to form a cohesive cluster, removing the need for manual seed host configuration in many scenarios.
TLS and cert management: Security is handled automatically, ensuring that communication between nodes and between the client and the cluster is encrypted.
Upgrades: Version transitions are orchestrated by the operator to ensure minimal downtime and data integrity.
Node scaling: Adding or removing capacity is as simple as updating the count in the cluster specification.
Monitoring setup: When integrated with the rest of the Elastic Stack, monitoring for the cluster is automatically configured.

The result is a significant reduction in the operational burden on DevOps teams. Instead of executing a series of imperative commands, engineers define the end state in a YAML file, and the operator continuously reconciles the cluster toward that state. If a pod fails, Kubernetes reschedules it, the operator restores the expected topology, and Elasticsearch reallocates shards as needed.
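As a small illustration of the declarative model, scaling or upgrading amounts to editing the manifest and applying it again. The sketch below assumes a cluster named demo (a hypothetical name, not from the original text) that previously ran a single node; only the fields shown are relevant to the point.

```yaml
# Sketch only: an Elasticsearch resource that has been edited and re-applied.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: demo              # hypothetical cluster name
spec:
  version: 8.11.0         # raising this value asks the operator to roll out an upgrade
  nodeSets:
  - name: default
    count: 3              # previously 1; the operator adds nodes to match
```

No separate scale or upgrade command is involved: the operator compares the applied specification with the running cluster and works out the steps itself.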

Deployment Methodologies and Comparative Analysis

Depending on the level of control required and the existing expertise of the operations team, there are several ways to deploy Elasticsearch on Kubernetes. These methods range from low-level primitive resources to high-level automated operators.

The following table provides a detailed comparison of the available deployment strategies:

Method	Best For	Control Level	Complexity	Notes
Raw YAML	Custom setups, full control	High	Medium	Best for fine-tuned ops teams
Helm	Quick deployments, CI/CD integration	Medium	Low	Easy to install, flexible via values.yaml
ECK	Full Elastic Stack automation	Low	Low	Operator-managed, but opinionated

The Raw YAML approach is ideal for teams that want to tune every aspect of the deployment, from the exact layout of the nodes to the specific volume provisioning, without relying on an external package manager. This typically involves using StatefulSets to ensure that pods maintain a stable identity and persistent storage.

Helm provides a middle ground, offering a standardized way to package the application. It is highly effective for CI/CD pipelines where different environments (development, staging, production) require different configurations, which can be easily managed via a values.yaml file.
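A per-environment values file might look like the sketch below. The key names (replicas, esJavaOpts, resources, volumeClaimTemplate) follow the archived elastic/helm-charts Elasticsearch chart as best recalled; they are an assumption here and should be checked against the chart actually in use, since other charts name these settings differently. The file name is hypothetical.

```yaml
# values-production.yaml (hypothetical file name)
# Key names assume the archived elastic/helm-charts "elasticsearch" chart.
replicas: 3
esJavaOpts: "-Xms1g -Xmx1g"
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi
volumeClaimTemplate:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```

A development environment would typically keep the same chart and swap in a smaller values file (fewer replicas, less storage).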

The ECK operator is the most streamlined approach. It is designed for teams standardizing around the Elastic Stack who prioritize full lifecycle automation over granular, manual control. While it is more "opinionated" in how it manages the cluster, it significantly reduces the risk of human error during scaling and upgrades.

Technical Implementation via ECK

To implement an Elasticsearch cluster using the ECK operator, the operator must first be installed into the Kubernetes cluster. This process involves applying the Custom Resource Definitions (CRDs) and the operator deployment itself.

The installation is performed using the following commands:

kubectl apply -f https://download.elastic.co/downloads/eck/2.10.0/crds.yaml

kubectl apply -f https://download.elastic.co/downloads/eck/2.10.0/operator.yaml

Once the operator is active, a cluster can be defined. A basic quickstart configuration involves specifying the version, the number of nodes, and the basic configuration.

For a simple single-node cluster:

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.11.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
EOF

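In this configuration, node.store.allow_mmap: false disables memory-mapped index storage so that the quickstart can run without raising the vm.max_map_count kernel setting on the Kubernetes nodes. The setting has performance implications and is generally not recommended for production workloads; there, the usual approach is to raise vm.max_map_count (Elasticsearch expects at least 262144) and leave memory mapping enabled.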

For a more robust, production-ready 3-node cluster, the following specification is utilized:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  namespace: elastic-system
spec:
  version: 8.11.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 2Gi
              cpu: 1
            requests:
              memory: 2Gi
              cpu: 1
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi

This configuration introduces resource requests and limits. Because the memory request and limit are both 2GiB, each pod is guaranteed 2GiB and cannot exceed it. If no Kubernetes node has 2GiB of unreserved memory, the pod remains in a Pending state. The volumeClaimTemplates section requests 100GiB of storage from a fast-ssd storage class, in place of the default 1GiB allocation used by the quickstart. Note that the example still carries node.store.allow_mmap: false from the quickstart; for real production use, that line would normally be removed once virtual memory has been tuned on the nodes.
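One common way to handle the virtual memory tuning is a privileged init container that raises vm.max_map_count before Elasticsearch starts. The fragment below is a sketch of that pattern for a nodeSet; it assumes the cluster's security policy permits privileged containers, which many hardened clusters do not (in that case the setting is usually applied on the hosts themselves).

```yaml
# Fragment of spec.nodeSets[] in the Elasticsearch resource (sketch).
- name: default
  count: 3
  # node.store.allow_mmap is left at its default (true) in this variant.
  podTemplate:
    spec:
      initContainers:
      - name: sysctl
        securityContext:
          privileged: true   # needed to change a kernel setting on the host
          runAsUser: 0
        command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
```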

Cluster Lifecycle and Monitoring

Once the Elasticsearch manifest is applied, the ECK operator begins the process of resource creation. This process is not instantaneous; it may take several minutes for all resources to be provisioned and for the cluster to become ready.

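Monitoring the deployment progress can be achieved through several kubectl commands. To get an overview of the current Elasticsearch clusters, including their health and version, the following command is used (add -n elastic-system to this and the following commands if the cluster was created in that namespace, as in the 3-node example):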

kubectl get elasticsearch

During the initial startup, the PHASE column will be empty and there will be no HEALTH status. As the pods and services initialize, the PHASE will transition to Ready and the HEALTH status will become green. This health status is derived directly from Elasticsearch's internal cluster health API.

To monitor the status of the individual pods, the following command is used:

kubectl get pods --selector='elasticsearch.k8s.elastic.co/cluster-name=quickstart'

If a pod is still starting, it will report a status of Pending. To investigate the internal startup process or troubleshoot errors, the pod logs can be streamed:

kubectl logs -f quickstart-es-default-0

Additionally, the ECK operator automatically creates a ClusterIP service, which provides the necessary HTTP access to the cluster. Users can also inspect the Custom Resource Definition (CRD) to understand the full specification of the Elasticsearch resource:

kubectl describe crd elasticsearch
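The ClusterIP service mentioned above is only reachable from inside the Kubernetes cluster. If external access is needed, the service type can be overridden in the Elasticsearch resource itself. The fragment below is a minimal sketch; whether a LoadBalancer is appropriate depends on the environment, and exposing Elasticsearch publicly is rarely advisable.

```yaml
# Fragment of the Elasticsearch resource (sketch).
spec:
  http:
    service:
      spec:
        type: LoadBalancer   # default is ClusterIP
```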

Storage Infrastructure and Volume Provisioning

Storage is the most critical component of an Elasticsearch deployment on Kubernetes. Because Elasticsearch is a stateful application, the choice of storage class and volume type directly impacts search performance, indexing throughput, and data durability.

The standard approach is to use a dedicated StorageClass to provision volumes. For example, using AWS EBS gp3 volumes provides a balance of cost and performance.

The following configuration defines a fast-ssd StorageClass:

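apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
# gp3 with custom iops/throughput is provisioned through the EBS CSI driver,
# which must be installed in the cluster; the legacy in-tree
# kubernetes.io/aws-ebs provisioner has been removed from current Kubernetes.
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true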

When selecting storage, different options provide different trade-offs:

EBS gp3: Provides predictable performance with the ability to set custom baseline IOPS and throughput.
EBS io2: Best for large-scale indexing or search clusters with heavy disk activity, offering higher durability and guaranteed performance.
Instance store (ephemeral): Extremely fast but non-persistent. This option is only viable when data can be fully rebuilt from upstream systems, such as log ingestion pipelines.

The use of ReadWriteOnce access mode is standard for Elasticsearch data volumes, ensuring that a volume is mounted by a single node at a time, which prevents data corruption in a distributed environment.

Networking, Node Awareness, and Resource Optimization

Elasticsearch is characterized as a "chatty" service. This means that inter-node communication for replication, cluster state updates, and shard allocation happens frequently and requires low latency.

To optimize networking, use instance types that support enhanced networking through the Elastic Network Adapter (ENA) and offer high network throughput. Examples include the m5n, r5n, and c5n families, which can provide 10 Gbps or more. To make sure Elasticsearch pods land on that hardware, use nodeSelector or affinity rules to pin critical pods to the appropriate nodes.
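With ECK, pinning is expressed in the nodeSet's podTemplate. The sketch below assumes worker nodes of type r5n.2xlarge (an arbitrary size chosen for illustration) and a cluster named quickstart; it also adds a pod anti-affinity rule so that no two Elasticsearch pods of that cluster share a Kubernetes node.

```yaml
# Fragment of spec.nodeSets[] in the Elasticsearch resource (sketch).
- name: default
  count: 3
  podTemplate:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: r5n.2xlarge   # assumed instance type
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                elasticsearch.k8s.elastic.co/cluster-name: quickstart
            topologyKey: kubernetes.io/hostname
```

Using the required form means a pod stays Pending if no suitable node is free; the preferred form trades that strictness for easier scheduling.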

A critical risk in Kubernetes deployments is that a primary shard and its replica end up on Elasticsearch pods running on the same physical node. If that node fails, both copies become unavailable at once, and they are lost outright if the storage is local to the node. To mitigate this, shard allocation awareness should be configured.

The following settings ensure that replicas are distributed across distinct nodes:

cluster.routing.allocation.awareness.attributes: k8s_node_name
node.attr.k8s_node_name: ${NODE_NAME}

This configuration tells Elasticsearch to use the Kubernetes node name as the awareness attribute, so that its shard allocator places a primary and its replicas on different physical machines. NODE_NAME must be an environment variable that carries the name of the Kubernetes node (for example, injected from spec.nodeName); the pod's own ${HOSTNAME} is not suitable, because inside a pod it resolves to the pod name rather than the node name.
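One way to supply the actual Kubernetes node name to Elasticsearch is the downward API, which can expose spec.nodeName as an environment variable. The fragment below is a sketch for an ECK nodeSet; the variable name NODE_NAME is a choice made here, and recent ECK versions are understood to configure an equivalent default on their own, so this may only be needed when the defaults are overridden.

```yaml
# Fragment of spec.nodeSets[] in the Elasticsearch resource (sketch).
- name: default
  count: 3
  config:
    node.attr.k8s_node_name: ${NODE_NAME}
    cluster.routing.allocation.awareness.attributes: k8s_node_name
  podTemplate:
    spec:
      containers:
      - name: elasticsearch
        env:
        - name: NODE_NAME            # assumed variable name
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName   # name of the node the pod runs on
```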

Alternative Deployment via StatefulSets

For teams that prefer a manual approach over the ECK operator, a StatefulSet is the standard Kubernetes primitive. A StatefulSet provides guarantees about the ordering and uniqueness of pods, which is essential for Elasticsearch.

A basic 3-node cluster using a StatefulSet is defined as follows:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: elastic-system
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
        ports:
        - containerPort: 9200
          name: rest
        - containerPort: 9300
          name: inter-node
        env:
        - name: cluster.name
          value: k8s-logs
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: discovery.seed_hosts
          value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: ES_JAVA_OPTS
          value: "-Xms1g -Xmx1g"
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 100Gi

This configuration provides several key mechanisms:

Stable Hostnames: Each pod is given a predictable name (e.g., elasticsearch-0), which is critical for the discovery.seed_hosts configuration.
Inter-node Communication: Communication for cluster coordination is handled on port 9300.
Memory Management: The ES_JAVA_OPTS environment variable is used to set the JVM heap size, in this case, to 1GiB.
Data Persistence: The volumeClaimTemplates ensure that each pod gets its own 100GiB volume from the fast-ssd storage class.

To deploy this configuration, the following command is used:

kubectl apply -f elasticsearch-statefulset.yaml
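The StatefulSet refers to a governing service through serviceName: elasticsearch, and the seed host names (elasticsearch-0.elasticsearch and so on) only resolve if a headless Service with that name exists. The original manifest does not show one, so the following is a sketch of what it would plausibly look like, reusing the labels, namespace, and port names from the StatefulSet.

```yaml
# Headless Service backing the StatefulSet's stable DNS names (sketch).
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: elastic-system
spec:
  clusterIP: None                  # headless: DNS returns the pod addresses
  publishNotReadyAddresses: true   # lets nodes discover each other before they are Ready
  selector:
    app: elasticsearch
  ports:
  - name: rest
    port: 9200
  - name: inter-node
    port: 9300
```

Note also that 8.x images enable security by default, so a manually managed cluster generally needs transport TLS settings as well; those are outside the scope of this sketch and are among the things the ECK operator handles automatically.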

Integration with Kibana

Elasticsearch does not operate in a vacuum; it requires a visualization layer for data exploration and cluster management. Kibana serves as the primary UI for querying, visualizing, and debugging Elasticsearch data.

When deployed alongside Elasticsearch in a Kubernetes cluster, Kibana can be managed by the ECK operator, which simplifies the connection between the visualization layer and the data indexing layer. Administrators can then monitor the health of the cluster and visualize logs, metrics, and traces in near real time. For teams handling observability data, this visibility helps when diagnosing problems such as high-cardinality data and ingestion spikes.
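With ECK, a Kibana instance is declared as its own custom resource and linked to an Elasticsearch cluster by name. The manifest below is a minimal sketch that assumes an Elasticsearch resource named quickstart in the same namespace and a matching stack version.

```yaml
# Minimal Kibana resource managed by ECK (sketch).
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: quickstart
spec:
  version: 8.11.0          # should match the Elasticsearch version
  count: 1
  elasticsearchRef:
    name: quickstart       # name of the Elasticsearch resource to connect to
```

The elasticsearchRef is what lets the operator wire up the connection, including credentials and certificates, without manual configuration.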

Conclusion

The deployment of Elasticsearch on Kubernetes is a sophisticated operation that requires a balanced approach to storage, networking, and resource allocation. The transition from raw YAML and StatefulSets to the Elastic Cloud on Kubernetes (ECK) operator represents a move toward full lifecycle automation, reducing the risk of operational failure while increasing the agility of the deployment.

The success of such a deployment hinges on three primary factors: storage performance, network throughput, and high availability. Utilizing fast SSD storage via appropriate StorageClasses ensures that indexing and search operations do not become I/O bound. Ensuring high network throughput through ENA-enabled instances prevents the "chatty" nature of Elasticsearch from becoming a bottleneck. Finally, implementing shard allocation awareness prevents catastrophic data loss by ensuring that replicas are physically separated across Kubernetes worker nodes.

For teams starting out, the recommendation is to begin with a small deployment on EKS or a similar managed service to reduce the initial setup overhead. As the cluster grows and the data becomes more complex, particularly with observability logs and metrics, tuning the resource limits and moving toward a dedicated operator like ECK provides the necessary stability and scalability. When tuned correctly, Elasticsearch on Kubernetes provides a robust, self-healing infrastructure capable of handling demanding search and analytics workloads.