Pod eviction in Kubernetes is a critical stability mechanism in which the kubelet terminates pods to protect the overall health of the node. When a node runs short of a resource such as memory, disk space, inodes, or process IDs, the system must decide which workloads to sacrifice to prevent a total node failure. (CPU is a compressible resource: contention leads to throttling rather than node-pressure eviction.) This process is not random; it is governed by a hierarchy of Quality of Service (QoS) classes, priority levels, and specific resource thresholds. Understanding these mechanics is essential for maintaining application availability and preventing unexpected downtime in production clusters.
The Architecture of Node-Pressure Eviction
Node-pressure eviction occurs when the kubelet detects that the node is running low on a critical resource. This is a proactive safety measure. If the node were to run out of memory entirely, the Linux kernel's Out-Of-Memory (OOM) killer would take over, which might terminate critical system processes or the kubelet itself, leading to a catastrophic node failure.
The kubelet monitors several resource signals. Two of the most prominent are nodefs (the primary filesystem of the node) and imagefs (the filesystem used for container images). When container writable layers live on a separate filesystem (containerfs), its eviction thresholds are not configured directly; they are set automatically to reflect the values set for either nodefs or imagefs, depending on the specific configuration.
In many scenarios, evicting a single pod reclaims only a small amount of the starved resource. This creates a risk where the kubelet repeatedly hits the configured eviction thresholds, triggering a cascade of evictions in rapid succession. To mitigate this "ping-pong" effect, Kubernetes provides the --eviction-minimum-reclaim flag and the equivalent evictionMinimumReclaim field in the kubelet configuration file. Either one lets administrators specify a minimum amount of each resource that must be reclaimed, beyond the threshold, before the kubelet stops evicting.
For instance, consider a kubelet configuration with the following parameters:
yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
  imagefs.available: "100Gi"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"
In this specific configuration, if the nodefs.available signal drops below the threshold, the kubelet will not stop just because it reached the 1GiB mark. It will continue to reclaim resources until the available nodefs storage reaches 1.5GiB (the 1GiB threshold plus the 500MiB minimum reclaim). Similarly, for imagefs, the kubelet will reclaim resources until the available storage reaches 102GiB (100GiB threshold plus 2GiB minimum reclaim).
Preemption and Pod Priority Classes
Preemption is a distinct form of eviction. While node-pressure eviction is about survival, preemption is about priority. Preemption occurs when a high-priority pod cannot be scheduled because no node has enough available resources. To make room for this high-priority workload, the scheduler may evict lower-priority pods from a node.
The likelihood of a pod being preempted is directly tied to its Priority Class. Priority classes are defined as objects in the cluster, and each is assigned a numerical value. Pods with higher values are classified as more important and are less likely to be evicted during preemption.
Current Priority Classes can be queried using the following commands:
kubectl get priorityclasses
kubectl get pc
Common system-level priority classes include:
- system-cluster-critical: 2000000000
- system-node-critical: 2000001000
To implement a custom priority structure, an administrator can define PriorityClass objects. Consider a scenario with two classes: trueberry and falseberry.
yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: trueberry
value: 1000000
globalDefault: false
description: "This fruit is a true berry"
yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: falseberry
value: 5000
globalDefault: false
description: "This fruit is a false berry"
If a pod named "blueberry" is assigned the trueberry class (value 1000000), and pods named "raspberry" and "strawberry" are assigned the falseberry class (value 5000), the scheduler will target raspberry and strawberry for eviction first if it needs to free up space for a high-priority pod.
To assign these priorities, the priorityClassName field must be set under spec in the Pod definition:
priorityClassName: trueberry
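For context, a minimal sketch of where the field sits in a complete Pod manifest might look like the following. The pod name comes from the example above; the container name and image are hypothetical placeholders, not values from any real deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: blueberry
spec:
  # Must match the name of an existing PriorityClass object
  priorityClassName: trueberry
  containers:
    - name: app                                  # hypothetical container name
      image: registry.example.com/blueberry:1.0  # hypothetical image
```

If the named PriorityClass does not exist when the pod is created, the API server rejects the pod, so the classes should be applied first.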
Quality of Service (QoS) and Eviction Hierarchy
When node-pressure eviction is triggered, the kubelet does not treat all pods equally. It uses Quality of Service (QoS) classes to rank pods for eviction. This ranking ensures that the most critical workloads are preserved as long as possible.
The kubelet ranks pods for eviction in the following order:
- BestEffort: Pods that have no requests and limits.
- Burstable: Pods where usage exceeds requests.
- Burstable: Pods where usage is below requests.
- Guaranteed: Pods where every container sets CPU and memory requests and limits, and each request equals its limit.
Kubernetes will attempt to evict pods from the first group in this list before moving to the second, and so on.
The impact of this hierarchy is significant. If a user provides very low requests for their containers, those pods are more likely to be categorized in the higher-risk groups (BestEffort or Burstable exceeding requests), making them primary targets for eviction.
Guaranteed pods are generally the safest. The kubelet will not evict them simply to make room for other pods. However, they are not invincible. If critical system services require more resources to maintain node stability, the kubelet will terminate Guaranteed pods if absolutely necessary, always starting with those that have the lowest priority.
A real-world example of this involves Prometheus server pods. A pod might be configured with the following requests:
Requests: cpu: 500m, memory: 2000Mi
If the pod consumes more than its memory request (for example, 2890108Ki, roughly 2822Mi) and the node comes under memory pressure, it becomes an early eviction candidate, right behind the BestEffort pods. This is why requests and limits must reflect real usage. Critical applications should be configured as Guaranteed, most standard applications as Burstable, and non-critical, fault-tolerant applications as BestEffort.
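The three classes differ only in how the resources stanza is filled in. The manifests below are an illustrative sketch: all pod names, container names, images, and resource values are hypothetical and would need to be sized from observed usage.

```yaml
# BestEffort: no requests or limits on any container
apiVersion: v1
kind: Pod
metadata:
  name: scratch-job            # hypothetical
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:1.0
---
# Burstable: requests are set, limits are higher than requests
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend           # hypothetical
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 512Mi
---
# Guaranteed: every container has CPU and memory requests equal to limits
apiVersion: v1
kind: Pod
metadata:
  name: metrics-server-critical   # hypothetical
spec:
  containers:
    - name: server
      image: registry.example.com/metrics:1.0
      resources:
        requests:
          cpu: 500m
          memory: 3Gi
        limits:
          cpu: 500m
          memory: 3Gi
```

The class Kubernetes actually assigned can be read back from the pod's status.qosClass field, for example with kubectl get pod web-frontend -o jsonpath='{.status.qosClass}'.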
Disk Pressure and Storage-Based Eviction
Disk pressure is a common cause of pod eviction. When the node's local storage (such as that used by emptyDir volumes) or the image filesystem crosses a configured threshold, the kubelet triggers evictions to reclaim space.
In environments such as VMware Integrated OpenStack 7.x, VMware Tanzu Kubernetes Grid Integrated Edition 1.20.0/1.21.0, and Tanzu Mission Control 1.4.0/1.3.1, these issues stem from upstream Kubernetes behavior and can affect any product version that uses Kubernetes for orchestration.
When pods are evicted due to disk pressure, it can lead to a cycle of failures. For example, if a pod is evicted and then rescheduled on the same node, it may immediately trigger disk pressure again if the underlying cause (such as a massive log file) is not addressed.
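One way to keep a single workload from driving the whole node into disk pressure is to bound its local storage in the Pod spec. The following is a sketch under assumptions: the pod name, image, mount path, and all sizes are hypothetical, and the right limits depend on the application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-writer                 # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/log-writer:1.0   # hypothetical image
      resources:
        requests:
          ephemeral-storage: 1Gi   # informs scheduling decisions
        limits:
          ephemeral-storage: 2Gi   # exceeding this makes the pod an eviction target
      volumeMounts:
        - name: scratch
          mountPath: /var/scratch  # hypothetical path
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 500Mi           # caps this volume's usage
```

With these bounds the kubelet evicts only the offending pod when it exceeds its own limits, rather than waiting for the node-level threshold to be crossed and ranking every pod on the node.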
To identify and remediate disk pressure, administrators can use the following commands:
df -h
find / -xdev -size +10M -print | xargs ls -lS
Once large files are identified, particularly log files or leftover files from previously removed pods that exceed 2GB, they should be deleted or moved to larger storage.
Furthermore, disk pressure can lead to the deletion of Docker images. There are two primary scenarios where images may go missing:
- The system executes docker system prune -a. This command deletes images not used by any running container. For instance, if the calico-kube-controller pod is not running on a specific node, the image vmware/calico/kube-controllers:v3.8.2 will be deleted.
- Bootstrap images (those starting with vmware/ rather than docker-registry.default.svc.cluster.local:5000/) cannot be recovered automatically if they are deleted.
To clean up evicted pods in bulk, the following command sequence can be used:
kubectl get pods --all-namespaces | grep Evicted | awk '{print $2 " --namespace=" $1}' | xargs -n 2 kubectl delete pod
Cluster Autoscaler and Safe-to-Evict Annotations
The Cluster Autoscaler (CA) manages the size of a node group based on the resource demands of the pods. To optimize costs, CA identifies "unneeded" nodes that can be scaled down. However, not all pods should be moved, even if the node is underutilized.
In scenarios involving the Kubernetes executor (such as with gitlab-runner on EKS, K8s version 1.3.1, and CA version 1.32), pods can be protected from CA-driven eviction. Certain properties typically stop a pod from being evicted by the Cluster Autoscaler:
- Pods not backed by a controller object (naked pods).
- Pods utilizing local storage (emptyDir).
- Pods with the annotation cluster-autoscaler.kubernetes.io/safe-to-evict set to false.
When these properties are present, the Cluster Autoscaler should ignore the node holding such pods when determining which nodes are unneeded, thereby preventing the eviction of critical long-running jobs.
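As an illustration, the annotation from the list above is set in the pod's metadata. The pod name, container name, and image below are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-running-job           # hypothetical
  annotations:
    # Annotation values are strings, so the value must be quoted
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: job
      image: registry.example.com/job:1.0   # hypothetical image
```

For pods created by a controller or an executor rather than by hand, the annotation has to be placed in the pod template that the tool generates, not on the controller object itself.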
Other Mechanisms of Pod Eviction
While node-pressure and preemption are the most frequent causes of eviction, there are other administrative and systemic triggers.
API-Initiated Eviction: This is an on-demand request for eviction. Administrators or external controllers can use the Kubernetes Eviction API to terminate a pod on a specific node.
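An API-initiated eviction is requested by creating an Eviction object on the target pod's eviction subresource. The sketch below shows the request body in YAML form (the upstream documentation shows the same structure as JSON); the pod name is borrowed from the earlier example and the namespace is an assumption.

```yaml
# Body for POST /api/v1/namespaces/default/pods/raspberry/eviction
apiVersion: policy/v1
kind: Eviction
metadata:
  name: raspberry        # name of the pod to evict
  namespace: default     # assumed namespace
```
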
Taint-Based Eviction: Kubernetes uses Taints and Tolerations to control pod placement. If a NoExecute taint is applied to an existing node, any pods currently running on that node that do not have a matching toleration will be immediately evicted.
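A pod can opt out of immediate taint-based eviction by declaring a matching toleration. The following sketch tolerates the built-in node.kubernetes.io/unreachable taint for a bounded time; the pod name, image, and the 300-second value are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: strawberry                 # hypothetical
spec:
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300       # stay bound for 5 minutes, then be evicted
  containers:
    - name: app
      image: registry.example.com/strawberry:1.0   # hypothetical image
```

If tolerationSeconds is omitted, a matching toleration lets the pod remain on the tainted node indefinitely.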
Node Drain: This is a management operation used when a node must be taken offline for maintenance or is becoming unusable. Draining a node evicts all pods from it so they can be rescheduled on other available nodes.
Analysis of Eviction Prevention and Recovery
The prevention of pod eviction requires a multi-layered approach focusing on resource definition and priority management. The most critical factor is the alignment of requests and limits. When a pod is configured as "Guaranteed," it signals to the kubelet that the workload is stable and essential, moving it to the bottom of the eviction priority list.
From an infrastructure perspective, the use of minimum reclaim values (--eviction-minimum-reclaim) is vital for preventing "eviction loops." Without a minimum reclaim, a node may evict a pod, reclaim 10MiB of memory, and immediately find itself back under the eviction threshold, leading to the termination of subsequent pods. By enforcing a larger reclaim buffer, the system ensures that once an eviction occurs, the node remains stable for a longer period.
Recovery from eviction is not simply about restarting the pod. If the eviction was caused by disk pressure, the root cause (such as unbounded log growth or temporary files accumulating in emptyDir volumes) must be addressed first. Running df -h to check filesystem usage and find to locate files over 10MB is the primary method for manual recovery.
In summary, pod eviction is a feature, not a bug. It is the mechanism that prevents a single resource-hungry pod from crashing an entire node. By utilizing Priority Classes for business-critical workloads, setting precise QoS classes (Guaranteed, Burstable, BestEffort), and configuring proper reclaim thresholds, operators can transform an unpredictable eviction environment into a stable, self-healing infrastructure.