Kubernetes Pod Eviction

Pod eviction in Kubernetes represents a critical operational state where a Pod assigned to a Node is targeted for termination. This process is fundamentally a resource management mechanism, designed to ensure that the underlying infrastructure remains stable even when the demands of the applications exceed the available physical or virtual capacities of the hardware. When a Pod is evicted, it is terminated, often leaving it in a failed state and forcing the cluster to reconcile the desired state of the application.

Eviction is not a random failure but a structured response to specific triggers. These triggers fall primarily into two categories: preemption and node-pressure. Preemption occurs when the scheduler needs to place a high-priority Pod and no Node has room for it; to make room, the scheduler terminates lower-priority Pods. Node-pressure eviction, conversely, is a reactive process managed by the kubelet when the resources still available on a Node, such as memory, disk space, or process IDs, drop below a critical threshold. CPU is not one of these signals: it is a compressible resource, so a CPU-starved Node throttles containers rather than evicting them.

For the end user or the system administrator, an evicted Pod means an immediate disruption of service for that specific instance of the application. If a workload controller such as a Deployment manages the Pod, the system automatically attempts to schedule a replacement Pod, usually on a different Node, to maintain the required replica count. However, this "musical chairs" effect can lead to cascading failures if the entire cluster is under resource pressure, potentially leaving replacement Pods stuck in the Pending phase because no Node has room for them.

Preemption and Priority-Based Termination

Preemption is a proactive scheduling strategy that Kubernetes uses to let critical workloads be scheduled even when the cluster is congested. It is the process of terminating Pods with lower Priority so that Pods with higher Priority can be successfully scheduled on Nodes.

This mechanism relies heavily on Pod Priority Classes. By assigning a priority level to a Pod, an administrator informs the Kubernetes scheduler of the relative importance of that workload. When a high-priority Pod cannot be scheduled due to insufficient resources on all available Nodes, the scheduler looks for Nodes where terminating one or more lower-priority Pods would free up enough space to accommodate the high-priority candidate.
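The victim search described above can be illustrated with a deliberately simplified model. This is a sketch, not the scheduler's actual algorithm: it looks at a single Node, considers only CPU requests, and the names `Pod` and `pick_victims` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int      # value taken from the Pod's PriorityClass
    cpu_request: int   # millicores


def pick_victims(node_capacity: int, running: list[Pod], incoming: Pod) -> list[Pod] | None:
    """Return the lower-priority Pods to remove so `incoming` fits, or None."""
    free = node_capacity - sum(p.cpu_request for p in running)
    victims: list[Pod] = []
    # Only Pods with strictly lower priority are candidates; try the least important first.
    candidates = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    for pod in candidates:
        if free >= incoming.cpu_request:
            break
        victims.append(pod)
        free += pod.cpu_request
    # If even removing every candidate is not enough, this Node cannot help.
    return victims if free >= incoming.cpu_request else None
```

A Pod never preempts Pods of equal or higher priority in this model, which is why a low-priority workload can remain unschedulable on a full Node.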

The impact of preemption is an intentional disruption of less critical services to ensure the survival of the most vital ones. This creates a hierarchy of stability within the cluster, where the highest-priority Pods are the most resilient to eviction, while the lowest-priority Pods are the most volatile.

Node-Pressure Eviction and the Kubelet

Node-pressure eviction is a reactive safety mechanism executed by the kubelet. Unlike preemption, which is driven by the scheduler to make room for new Pods, node-pressure eviction is triggered when the Node itself is in danger of crashing or becoming unresponsive due to resource exhaustion.

The kubelet constantly monitors the resource usage of the Node. When the used resources exceed a predefined eviction threshold, the kubelet begins the eviction process to bring the Node back to a healthy state. This process involves failing Pods until the resource usage falls back below the threshold. Once the kubelet decides to evict, it terminates all containers within the Pod and sets the PodPhase to Failed.
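As a rough illustration of the threshold check, the sketch below compares observed availability against the hard eviction defaults that the Kubernetes documentation lists for Linux nodes at the time of writing. The function name and the input dictionaries are assumptions made for this example; the real kubelet gathers these figures itself.

```python
from __future__ import annotations

MI = 1024 * 1024

# Documented default hard thresholds on Linux nodes.
# Values below 1 are fractions of capacity; larger values are absolute bytes.
HARD_THRESHOLDS = {
    "memory.available": 100 * MI,
    "nodefs.available": 0.10,
    "imagefs.available": 0.15,
    "nodefs.inodesFree": 0.05,
}


def signals_under_pressure(available: dict[str, float], capacity: dict[str, float]) -> list[str]:
    """Return the signals whose available amount is below its threshold."""
    breached = []
    for signal, threshold in HARD_THRESHOLDS.items():
        if signal not in available:
            continue
        # Convert percentage thresholds into absolute amounts for this node.
        limit = threshold * capacity[signal] if threshold < 1 else threshold
        if available[signal] < limit:
            breached.append(signal)
    return breached
```

Note that the signals are expressed as what is still available, so pressure means a value falling below its threshold.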

In scenarios where a Deployment manages the evicted Pod, the Deployment controller will detect the failure and request the creation of a new Pod. This new Pod will then be passed to the scheduler to find a Node with sufficient resources, effectively shifting the load from a stressed Node to a healthier one.

The Hierarchy of Eviction: QoS Classes

When the kubelet initiates node-pressure eviction, it does not select Pods randomly. Instead, it follows a strict ranking system based on Quality of Service (QoS) classes. This ensures that the most "stable" Pods are the last to be removed.

The kubelet ranks Pods for eviction in the following order:

  • BestEffort Pods
  • Burstable Pods where usage exceeds requests
  • Burstable Pods where usage is below requests
  • Guaranteed Pods

The logic behind this hierarchy is as follows:

  • BestEffort Pods: These are Pods that have no requests or limits defined. Because they make no claims on resources, they are viewed as the most disposable and are the first to be evicted.
  • Burstable Pods (Usage > Requests): These Pods have defined requests but are currently consuming more than what they requested. They are seen as contributing to the node's pressure and are prioritized for eviction next.
  • Burstable Pods (Usage < Requests): These Pods are operating within their requests. They are evicted only after all BestEffort Pods and all Burstable Pods that exceed their requests have been cleared.
  • Guaranteed Pods: In these Pods, every container sets CPU and memory requests and limits, and each request is exactly equal to its limit. They are generally safe from eviction when the kubelet needs to relieve pressure caused by other Pods. However, if system services themselves require more resources to maintain Node stability, the kubelet will terminate Guaranteed Pods if necessary, starting with those that have the lowest priority.
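The class definitions above can be condensed into a small classifier. This is a minimal sketch: each container is represented as a plain dictionary with optional `requests` and `limits` maps (a hypothetical shape chosen for the example), and quantities are compared as strings, so equivalent values written differently, such as "500m" and "0.5", would not be treated as equal.

```python
from __future__ import annotations


def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' cpu/memory requests and limits."""
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for r in resources:
            if r in requests or r in limits:
                any_set = True
            # A missing request defaults to the limit, so only the limit is mandatory.
            if r not in limits or requests.get(r, limits[r]) != limits[r]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For instance, a single container that sets only a CPU request is classified as Burstable, because something is set but the Guaranteed conditions are not met.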

The real-world consequence for developers is twofold. Omitting requests and limits entirely places a Pod in group 1 (BestEffort), and setting requests far below real usage makes it likely to land in group 2 (Burstable exceeding requests). Either way, the Pod becomes a prime target for eviction during times of stress.

Resource Recovery and Disk Cleaning

Before the kubelet begins the aggressive process of terminating active Pods, it attempts "quick wins" to reclaim resources, particularly concerning disk space.

The first action the kubelet takes to free disk space is the deletion of non-running pods and their associated images. This is a non-disruptive way to clear cache and temporary files. If this cleaning process is insufficient to bring the Node below the eviction threshold, the kubelet then moves to the Pod eviction hierarchy mentioned previously.

For example, on a Node running short of memory, the kubelet compares each Pod's current memory consumption against its requested amount and starts failing Pods in the order of BestEffort, then Burstable (exceeding requests), and finally Burstable (below requests). CPU shortage does not trigger this process, because CPU is throttled rather than reclaimed through eviction.
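The ordering described above can be sketched as a sort. The Kubernetes documentation describes the kubelet's ranking in terms of three criteria rather than the QoS label itself: whether usage exceeds requests, then Pod priority, then usage relative to requests. The `PodUsage` type and `eviction_order` function below are hypothetical names for illustration, and only memory is modeled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PodUsage:
    name: str
    priority: int
    memory_request: int  # bytes; 0 for BestEffort Pods
    memory_usage: int    # bytes


def eviction_order(pods: list[PodUsage]) -> list[str]:
    """Order Pods from first-evicted to last-evicted under memory pressure."""
    def key(p: PodUsage):
        over = p.memory_usage - p.memory_request
        # 1) Pods above their request go first, 2) then lower priority,
        # 3) then the largest usage relative to the request.
        return (over <= 0, p.priority, -over)
    return [p.name for p in sorted(pods, key=key)]
```

Because a BestEffort Pod has no request, any usage counts as exceeding it, which is how the criteria reproduce the grouping listed earlier.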

Specialized Eviction Types

While preemption and node-pressure are the most common drivers, Kubernetes supports several other mechanisms for evicting Pods.

API-Initiated Eviction

Users and administrators can request an on-demand eviction of a Pod using the Kubernetes Eviction API. This is a controlled way to remove a Pod from a Node without simply deleting it, allowing the system to handle the transition according to defined policies.
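An eviction request is a POST of an Eviction object to the Pod's `eviction` subresource. The helper below only builds the path and body and sends nothing; the function name is an assumption for this sketch, and authentication and the HTTP client are left out.

```python
from __future__ import annotations

import json


def eviction_request(namespace: str, pod: str) -> tuple[str, str]:
    """Return the (path, JSON body) for a POST that asks to evict one Pod."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod}/eviction"
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }
    return path, json.dumps(body)
```

One of the policies the API server checks before accepting such a request is any PodDisruptionBudget covering the Pod; a request that would violate the budget is rejected rather than carried out.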

Taint-Based Eviction

Taints and Tolerations are used to guide Pod scheduling. A NoExecute taint, when applied to a Node, instructs Kubernetes to evict any existing Pods on that Node that do not have a corresponding toleration.

The process for taint-based eviction is often linked to Node health:

  • The kubelet reports a heartbeat every 10 seconds to the Kubernetes API server via a Lease resource.
  • The node-lifecycle controller monitors this Lease. If no heartbeat is received for 50 seconds (a configurable value), the controller sets the Node's Ready condition to Unknown.
  • The node-lifecycle controller then adds an unreachable taint to the Node with the effect=NoExecute.
  • By default, Kubernetes adds a toleration to every Pod to tolerate this NoExecute taint for 5 minutes. This allows a grace period for temporary network partitions.
  • Once this 5-minute window expires, if the Pod still does not tolerate the taint, it is evicted.
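The timeline above reduces to simple arithmetic. The sketch below uses the figures quoted in the steps (a 50-second heartbeat grace period and a 300-second default toleration); both are configurable, the function name is hypothetical, and controller processing delays are ignored.

```python
from __future__ import annotations

from datetime import datetime, timedelta

HEARTBEAT_GRACE = timedelta(seconds=50)      # figure quoted above; configurable
DEFAULT_TOLERATION = timedelta(seconds=300)  # default tolerationSeconds


def earliest_eviction(
    last_heartbeat: datetime,
    toleration: timedelta | None = DEFAULT_TOLERATION,
) -> datetime | None:
    """Estimate when a Pod on a silent Node becomes eligible for eviction.

    A toleration of None models a Pod that tolerates the taint indefinitely.
    """
    if toleration is None:
        return None
    tainted_at = last_heartbeat + HEARTBEAT_GRACE
    return tainted_at + toleration
```

With the defaults, a Pod on a Node that falls silent at 12:00:00 becomes eligible for eviction at roughly 12:05:50.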

Node Drain

Node draining is a voluntary disruption used during maintenance or when a Node is deemed unusable. The process typically involves two steps:

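  • kubectl cordon nodename: This marks the Node as unschedulable, preventing new Pods from being scheduled on it.
  • kubectl drain nodename: This command evicts the Pods currently running on the Node (it also cordons the Node if that has not been done already). DaemonSet-managed Pods are not evicted; the command refuses to proceed while they are present unless --ignore-daemonsets is passed. The process respects the graceful termination period of the Pods, ensuring they can shut down cleanly.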

Local Storage Eviction

The kubelet can also evict Pods based on their ephemeral storage usage. This covers container logs and writes to the container's scratch filesystem, as well as size limits configured on emptyDir volumes. In these cases, the kubelet terminates the Pod gracefully and moves it to a terminal phase based on the exit code of the process.

Eviction Thresholds: Hard vs. Soft

The kubelet distinguishes between different types of eviction thresholds, which impact how the termination is executed.

  • Hard Thresholds: These cause the kubelet to terminate Pods immediately. A hard eviction behaves like a forced deletion (equivalent to kubectl delete pod --force), which specifies gracePeriodSeconds=0. Consequently, hard evictions do not respect the Pod's termination grace period.
  • Soft Thresholds: These are less aggressive. Each soft threshold is paired with a grace period, and the kubelet evicts only if the threshold stays breached for that long. The Pod's own termination grace period may still be cut short, because it is capped at a preconfigured maximum value.

Both hard and soft evictions can be disabled within the kubelet configuration if the administrator prefers to manage resource pressure through other means.
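The difference between the two threshold types can be expressed in a few lines. This is a sketch with hypothetical function names; `max_pod_grace` stands for the preconfigured cap mentioned above and `soft_grace_seconds` for the period a soft threshold must stay breached before eviction starts.

```python
def soft_threshold_fires(breached_for_seconds: int, soft_grace_seconds: int) -> bool:
    """A soft threshold triggers eviction only after a sustained breach."""
    return breached_for_seconds >= soft_grace_seconds


def effective_grace_seconds(threshold: str, pod_grace: int, max_pod_grace: int) -> int:
    """Seconds a Pod gets to shut down once the kubelet decides to evict it."""
    if threshold == "hard":
        return 0  # hard thresholds skip graceful termination
    if threshold == "soft":
        # Soft evictions use the smaller of the Pod's own grace period and the cap.
        return min(pod_grace, max_pod_grace)
    raise ValueError(f"unknown threshold type: {threshold!r}")
```

A Pod that asks for 120 seconds to shut down therefore gets 60 seconds under a soft eviction with a 60-second cap, and none under a hard eviction.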

Monitoring Evictions with Prometheus

For operators, identifying that Pods are being evicted is the first step toward rightsizing a cluster. Prometheus can be used to monitor these events by querying the status of Pods.

The following query identifies all evicted Pods in a cluster:

kube_pod_status_reason{reason="Evicted"} > 0

To refine this monitoring, the query can be paired with the Pod phase metric, which lists Pods that ended in the Failed phase, the phase the kubelet assigns to evicted Pods:

kube_pod_status_phase{phase="Failed"}

By monitoring these metrics, technical teams can determine if they need to perform capacity planning or rightsize the resource requests and limits of their cluster to prevent the "musical chairs" effect of constant evictions.
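To run the eviction query from a script rather than the Prometheus UI, the expression can be sent to Prometheus's instant-query HTTP endpoint. The sketch below only builds the URL; the base address in the usage note is a placeholder, and the function name is an assumption for this example.

```python
from urllib.parse import urlencode

EVICTED_QUERY = 'kube_pod_status_reason{reason="Evicted"} > 0'


def instant_query_url(base_url: str, query: str = EVICTED_QUERY) -> str:
    """Build a URL for Prometheus's instant-query HTTP endpoint."""
    params = urlencode({"query": query})  # percent-encodes braces, quotes and '>'
    return base_url.rstrip("/") + "/api/v1/query?" + params
```

Calling `instant_query_url("http://prometheus.example:9090")` yields a URL that any HTTP client can fetch; the response is JSON containing one series per evicted Pod.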

Summary of Eviction Characteristics

Eviction Type  | Trigger                                | Mechanism                                | Grace Period
Preemption     | High-priority Pod scheduling           | Scheduler terminates lower-priority Pods | Respected
Node-Pressure  | Resource exhaustion (memory/disk/PIDs) | Kubelet terminates based on QoS          | Varies (Hard vs Soft)
API-Initiated  | Manual request via API                 | Eviction API                             | Respected
Taint-Based    | NoExecute taint / Node unreachable     | Node-lifecycle controller                | 5-minute default
Node Drain     | Maintenance / Node decommission        | kubectl drain                            | Respected
Local Storage  | Ephemeral storage / emptyDir limits    | Kubelet termination                      | Respected

Analysis of Eviction Dynamics

Eviction is not a failure of the Kubernetes system, but rather a manifestation of its primary goal: the maintenance of cluster stability. When viewed through the lens of resource management, eviction is the necessary counterweight to the dynamic nature of containerized workloads.

The tension between Guaranteed, Burstable, and BestEffort classes creates a predictable ecosystem. For the developer, this means that the PodSpec is not just a configuration file for the application, but a contract with the cluster. A developer who fails to define resource requests is essentially accepting a "BestEffort" contract, acknowledging that their workload is the most expendable in the event of a crisis.

The progression from low-impact resource cleaning (deleting unused images) to the tiered eviction of Pods shows a sophisticated approach to resource recovery. By prioritizing the removal of Pods that exceed their requests, Kubernetes penalizes "greedy" Pods before it touches those that are staying within their requests.

The distinction between voluntary and involuntary disruptions is critical for reliability engineering. Voluntary disruptions (draining, API evictions) allow for controlled migrations. Involuntary disruptions (node-pressure, preemption) are the system's emergency brakes. The ultimate goal for any production environment is to minimize involuntary disruptions by utilizing correct capacity planning and precise resource requests, thereby ensuring that the kubelet rarely needs to resort to the eviction hierarchy.

