Filebeat Kubernetes Log Harvesting Architecture

Deploying Filebeat in a Kubernetes cluster is a common way to implement centralized logging. Using the Filebeat Docker images, operators can retrieve container logs from every node and ship them to a central Elasticsearch instance. The goal is not only to move data but to capture logs from short-lived containers before they disappear and to turn them into structured, searchable events. Because pods are transient and nodes come and go, the log collection mechanism has to scale and recover along with the orchestration layer itself.

DaemonSet Deployment Logic

To ensure that log collection is comprehensive across the entire cluster, Filebeat is deployed as a DaemonSet. A DaemonSet is a Kubernetes workload object that ensures all (or some) nodes run a copy of a specific pod. In the context of Filebeat, this architectural choice is mandatory because container logs are stored locally on the node where the pod was executed. If Filebeat were deployed as a standard Deployment or a StatefulSet, it would not guarantee presence on every node, resulting in "blind spots" where logs from specific nodes are never harvested.

The practical effect is cluster-wide coverage: whenever a node joins the cluster, the DaemonSet controller schedules a Filebeat pod on it, provided the pod tolerates that node's taints and matches any node selector. The lifecycle of each Filebeat pod is thereby tied directly to the lifecycle of the node it runs on.
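As a rough illustration of the DaemonSet shape described above, the following is a minimal sketch rather than the upstream manifest. The label k8s-app: filebeat, the service account name, and the image tag are assumptions chosen for the example; volumes and the Filebeat configuration are omitted for brevity.

```yaml
# Minimal DaemonSet sketch. Labels, service account, and image tag are
# illustrative assumptions; volumes and configuration are left out.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat          # must match the pod template labels below
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      serviceAccountName: filebeat   # assumed to exist with read access to pod metadata
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:8.19.0   # assumed tag
          env:
            # Exposes the node name to Filebeat through the downward API.
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
```

Because the controller creates one pod per eligible node, there is no replicas field to manage; capacity follows the node count.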

Log Harvesting and Volume Mounting

The operational core of Filebeat on Kubernetes is the mounting of the host's container log directory. Specifically, the directory /var/log/containers is mounted into the Filebeat container. This mount allows Filebeat to reach outside its own isolated container boundary and access the logs generated by the container runtime on the underlying host.

Once the mount is established, Filebeat initializes an input for the files located in this directory. The agent begins harvesting these logs as soon as they appear in the folder. This real-time harvesting is critical for troubleshooting, as it minimizes the latency between a log event occurring in a production pod and that event becoming visible in a dashboard.

The relationship between log paths is complex. The path /var/log/containers/*.log typically functions as a symlink to /var/log/pods/*/*.log. This structure is a byproduct of how Kubernetes and the container runtime organize logs. Consequently, depending on the specific runtime configuration, these paths can be edited to point directly to the pod log directory to optimize access or meet specific security requirements.
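Because the entries under /var/log/containers are symlinks, their targets also have to be visible inside the Filebeat container. One way to arrange this, sketched below as an assumption rather than a requirement, is to mount the parent directory /var/log read-only so that both /var/log/containers and /var/log/pods are reachable. The volume name varlog is illustrative.

```yaml
# Fragment of the pod template (spec.template.spec); not a complete manifest.
containers:
  - name: filebeat
    volumeMounts:
      - name: varlog            # illustrative volume name
        mountPath: /var/log     # exposes /var/log/containers and /var/log/pods
        readOnly: true          # Filebeat only needs to read the log files
volumes:
  - name: varlog
    hostPath:
      path: /var/log
```

On some runtimes the symlink chain continues into a runtime-specific directory (for example under /var/lib/docker/containers with Docker), in which case that directory would need its own read-only mount.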

Namespace Configuration and Manifest Deployment

By default, the manifest deploys Filebeat into the kube-system namespace, which is conventionally reserved for objects that are essential to the operation of the cluster. Placing Filebeat there groups it with other system-level agents. The namespace by itself does not grant additional visibility or scheduling priority; those come from RBAC rules and priority classes. The manifest file is fully customizable, so administrators can change the namespace if organizational policy requires a dedicated logging namespace.

To initiate the deployment process, the official manifest file must be retrieved. This is achieved using the curl command:

curl -L -O https://raw.githubusercontent.com/elastic/beats/8.19/deploy/kubernetes/filebeat-kubernetes.yaml

Once the manifest is downloaded and any necessary configurations are applied, the deployment is executed via the kubectl utility:

kubectl create -f filebeat-kubernetes.yaml

To verify the operational status of the DaemonSet and ensure that the desired number of pods are running across the cluster, the following command is utilized:

kubectl --namespace=kube-system get ds/filebeat

The output of this command provides a structured view of the deployment:

Column          Description
NAME            The name of the DaemonSet (filebeat)
DESIRED         The number of pods that should be running
CURRENT         The number of pods currently running
READY           The number of pods that have passed readiness probes
UP-TO-DATE      The number of pods updated to the latest version
AVAILABLE       The number of pods available to the cluster
NODE SELECTOR   Any specific node selection constraints applied
AGE             The duration since the DaemonSet was created

Metadata Enrichment and JSON Parsing

One of the most powerful features of Filebeat on Kubernetes is the add_kubernetes_metadata processor. Raw logs from a container are often devoid of context; they tell you what happened, but not where it happened. The add_kubernetes_metadata processor intercepts the log event and annotates it with critical Kubernetes-specific metadata, such as the pod name, namespace, container name, and labels. This transforms a simple line of text into a rich, queryable event.
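A configuration along the following lines attaches the processor to a container input. It is a sketch modeled on the commonly used layout; it assumes NODE_NAME is injected into the container as an environment variable and that logs are read from /var/log/containers.

```yaml
# filebeat.yml fragment: container input enriched with Kubernetes metadata.
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}          # assumed to be set via the downward API
          matchers:
            # Derives the container ID from the log file path so the event
            # can be matched to the pod that produced it.
            - logs_path:
                logs_path: "/var/log/containers/"
```

With this in place, events carry fields such as kubernetes.pod.name, kubernetes.namespace, kubernetes.container.name, and kubernetes.labels.*, which can then be used for filtering in Elasticsearch.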

Furthermore, modern Kubernetes workloads frequently employ structured logging, typically in JSON format. When applications log in JSON, the logs are not just strings but structured data objects. Filebeat can be configured with special handling to parse these JSON logs properly. Instead of treating a JSON blob as a single message string, Filebeat decodes the JSON into individual fields. This enables high-precision filtering and aggregation in Elasticsearch, allowing users to query specific fields within the application logs rather than relying on expensive full-text searches.
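One way to get this behavior is the decode_json_fields processor, shown here as a hedged sketch. The target name app is a hypothetical choice made for the example; decoding into a sub-object avoids collisions with fields that Filebeat itself sets.

```yaml
# filebeat.yml fragment: decode JSON found in the "message" field.
processors:
  - decode_json_fields:
      fields: ["message"]     # field(s) containing the JSON string
      target: "app"           # hypothetical key; decoded fields land under app.*
      overwrite_keys: false   # keep existing fields if a name clashes
      add_error_key: true     # record a decoding error on the event instead of failing silently
```

For a log line such as {"level":"error","order_id":42}, the event would then expose app.level and app.order_id as separate fields, while lines that are not valid JSON are still shipped with their original message.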

Legacy Version Support and Data Persistence

Filebeat uses a hostPath volume to persist its internal data, such as the registry that records how far each log file has been read, across pod restarts. In the upstream manifest this data is located on the host under /var/lib/filebeat-data. The mechanism is the same on every Kubernetes version; what differs on legacy clusters, specifically version 1.7 or earlier, is how the host directory gets created.

A critical distinction exists regarding folder creation. The modern manifest utilizes DirectoryOrCreate for folder autocreation, a feature introduced in Kubernetes 1.8. Users running Kubernetes 1.7 or earlier must perform two manual steps to avoid deployment failure:

  1. Remove the type: DirectoryOrCreate line from the manifest file.
  2. Manually create the host folder on every node.

This manual step reflects how Kubernetes volume management evolved: from version 1.8 onward, the kubelet can create a missing hostPath directory itself when the volume type is DirectoryOrCreate.
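The relevant part of the pod template might look like the sketch below. The host path and the mount path are the ones I understand the upstream manifest to use; treat both as assumptions and substitute whatever your copy of the manifest specifies.

```yaml
# Fragment of the pod template (spec.template.spec); not a complete manifest.
containers:
  - name: filebeat
    volumeMounts:
      - name: data
        mountPath: /usr/share/filebeat/data   # assumed Filebeat data directory in the image
volumes:
  - name: data
    hostPath:
      path: /var/lib/filebeat-data            # assumed host path; check your manifest
      # Kubernetes 1.8+: the directory is created on the node if it is missing.
      # Kubernetes 1.7 or earlier: delete the next line and create the
      # directory by hand on every node before deploying.
      type: DirectoryOrCreate
```

Keeping this data on the host means a restarted Filebeat pod resumes from its recorded offsets instead of re-reading every log file on the node.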

Control Plane Integration and Tolerations

By default, Kubernetes control plane nodes are often protected by taints. Taints are properties applied to nodes that allow the node to repel a set of pods. This ensures that critical system workloads on the control plane are not interrupted by general application pods. Because of this, a standard Filebeat DaemonSet may not be scheduled on control plane nodes, meaning logs from system components (like the API server or scheduler) would not be collected.

To resolve this and enable log collection from the control plane, the DaemonSet specification must be updated to include proper tolerations. A toleration allows the pod to schedule onto a node with a matching taint. The required configuration is:

spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule

This configuration informs the Kubernetes scheduler that Filebeat is permitted to run on the control plane despite the NoSchedule effect, thereby closing the visibility gap for the most critical parts of the cluster.
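For orientation, the sketch below shows where the toleration sits inside a DaemonSet: under the pod template's spec, not directly under the DaemonSet's top-level spec. It is a partial manifest meant to show nesting only. The second entry, for the older node-role.kubernetes.io/master taint key, is an optional assumption for clusters that still carry that legacy taint.

```yaml
# Partial DaemonSet showing only the nesting of tolerations.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists      # match the taint regardless of its value
          effect: NoSchedule
        # Optional: older clusters may still taint control plane nodes with this key.
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
```

A toleration only permits scheduling on tainted nodes; it does not force it. The DaemonSet controller still decides placement, so the pod simply becomes eligible for the control plane nodes.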

Platform-Specific Configurations

Different Kubernetes distributions require specific adjustments to the Filebeat manifest. Red Hat OpenShift, for instance, implements stricter security contexts. To run Filebeat on OpenShift, additional settings must be specified in the manifest file, and the container must be explicitly enabled to run as privileged. This is necessary because Filebeat requires access to the host's log files and system information, which are restricted by default in the hardened OpenShift environment.
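A sketch of the container-level setting is shown below. The runAsUser value is an assumption based on Filebeat needing root access to read host log files; verify the exact settings against the documentation for your OpenShift and Filebeat versions.

```yaml
# Fragment of the pod template (spec.template.spec) for OpenShift.
containers:
  - name: filebeat
    securityContext:
      runAsUser: 0        # assumed: run as root to read host log files
      privileged: true    # explicitly enable privileged mode
```

On OpenShift the service account typically also has to be allowed to use a privileged security context constraint. Assuming the service account is named filebeat in kube-system, this is usually granted with a command of the form oc adm policy add-scc-to-user privileged system:serviceaccount:kube-system:filebeat.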

Additionally, depending on the container runtime used—such as CRI-O or containerd—the log paths may vary. It is imperative to configure the correct path for log harvesting. If autodiscovery is enabled, the configuration must be precisely tuned to ensure Filebeat can find the containers. The required configuration for autodiscovery is:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*.log

This configuration ensures that Filebeat dynamically discovers new pods as they are created and applies the correct log harvesting paths based on the node it is currently running on.
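With hints enabled, individual workloads can adjust how their logs are collected through pod annotations in the co.elastic.logs namespace. The pod below is a hypothetical example (its name and image are invented for illustration) that asks Filebeat to join multi-line entries that do not start with an opening bracket onto the preceding line.

```yaml
# Hypothetical application pod using hint annotations.
apiVersion: v1
kind: Pod
metadata:
  name: checkout                       # hypothetical name
  annotations:
    # Lines that do NOT start with "[" are appended to the previous line.
    co.elastic.logs/multiline.pattern: '^\['
    co.elastic.logs/multiline.negate: "true"   # annotation values must be strings
    co.elastic.logs/multiline.match: after
spec:
  containers:
    - name: app
      image: example.com/checkout:1.0  # hypothetical image
```

Pods without annotations fall back to hints.default_config, so the annotations are only needed where the default container input is not enough.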

Elasticsearch Integration and Destination Management

By default, Filebeat is designed to send events to an existing Elasticsearch deployment. However, in many enterprise environments, the Elasticsearch cluster is hosted externally or on a separate infrastructure. To specify a different destination, the following parameters in the manifest file must be modified:

  • name: ELASTICSEARCH_HOST
    Value: elasticsearch (Replace with the actual hostname or IP)
  • name: ELASTICSEARCH_PORT
    Value: "9200"
  • name: ELASTICSEARCH_USERNAME
    Value: elastic
  • name: ELASTICSEARCH_PASSWORD
    Value: changeme

These environment variables dictate exactly where the harvested logs are shipped. If these are incorrectly configured, the Filebeat pods may show as "Ready" in Kubernetes, but log events will fail to reach the destination, leading to a silent failure in the observability pipeline.
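The variables above live in the env section of the Filebeat container. The sketch below sets them for an external cluster and, as a suggested hardening step that goes beyond the default manifest, reads the password from a Kubernetes Secret instead of storing it in plain text. The hostname and the Secret name filebeat-es-credentials are hypothetical.

```yaml
# Fragment of the Filebeat container spec; hostname and Secret are hypothetical.
env:
  - name: ELASTICSEARCH_HOST
    value: es.example.internal
  - name: ELASTICSEARCH_PORT
    value: "9200"                        # quoted: env values must be strings
  - name: ELASTICSEARCH_USERNAME
    value: elastic
  - name: ELASTICSEARCH_PASSWORD
    valueFrom:
      secretKeyRef:
        name: filebeat-es-credentials    # hypothetical Secret in the same namespace
        key: password
```

Filebeat's configuration file can then reference these values with the ${VAR} or ${VAR:default} syntax in its output.elasticsearch section. Checking the Filebeat pod logs after a change is a quick way to catch connection or authentication errors that the pod's Ready status would not reveal.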

Ingest Pipelines and Kibana Visualization

The integration between Filebeat and the Elastic Stack involves more than just shipping logs; it involves the setup of ingest pipelines. By default, these ingest pipelines are set up automatically the first time Filebeat is run and connected to Elasticsearch. These pipelines are responsible for the server-side parsing and transformation of log lines before they are indexed.

To complement the data collection, Filebeat provides various pre-built Kibana dashboards. These dashboards are designed specifically for Kubernetes environments, providing a visual representation of cluster health and log trends. If these dashboards are not present in Kibana, they must be loaded manually. This requires installing Filebeat on any system that has connectivity to the Elastic Stack and executing the setup command:

filebeat setup

Note that the setup command does not load the ingest pipelines used to parse log lines. Those pipelines are installed when Filebeat first runs and connects to Elasticsearch, whereas the setup command is used here to load the Kibana dashboards.
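The machine used to run the setup command needs a configuration that tells Filebeat where Elasticsearch and Kibana are. A minimal sketch follows; the hostnames and the ES_PASSWORD environment variable are hypothetical placeholders.

```yaml
# filebeat.yml on the machine used to run the setup command.
# Hostnames and the ES_PASSWORD variable are hypothetical.
output.elasticsearch:
  hosts: ["https://es.example.internal:9200"]
  username: "elastic"
  password: "${ES_PASSWORD}"     # resolved from the environment at startup

setup.kibana:
  host: "https://kibana.example.internal:5601"   # where the dashboards are loaded
```

This machine does not need to harvest any logs; it only needs network access to both endpoints for the duration of the setup run.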

Analysis of Operational Impact

The deployment of Filebeat on Kubernetes transforms the logging paradigm from a fragmented, per-node search to a centralized, queryable data lake. The reliance on the DaemonSet architecture ensures that the scalability of the logging infrastructure matches the scalability of the application infrastructure. When a new node is added to the cluster, the logging capacity expands automatically without manual intervention.

The combination of add_kubernetes_metadata and JSON parsing represents the shift from "logging" to "observability." Without these features, logs are merely timestamps and messages. With them, logs become part of a structured dataset that can be correlated with Kubernetes events, pod lifecycles, and resource utilization.

However, the complexity of the deployment—ranging from the need for privileged containers in OpenShift to the requirement for tolerations on control plane nodes—highlights the necessity of a deep understanding of the Kubernetes security and scheduling model. The operational success of Filebeat depends not only on the agent itself but on the correct alignment of the manifest with the cluster's underlying runtime (CRI-O, containerd) and version (Kubernetes 1.7 vs 1.8+).

Ultimately, the use of Filebeat on Kubernetes creates a robust feedback loop. By harvesting logs from /var/log/containers and shipping them to Elasticsearch, developers and operators gain an authoritative view of their distributed systems, enabling faster root-cause analysis and more reliable system performance.

Sources

  1. Elastic Guide - Running Filebeat on Kubernetes
