Engineering Production-Grade Kubernetes Clusters with Ansible: From Bare Metal to GitOps Automation

The orchestration of Kubernetes clusters is one of the more complex challenges in modern infrastructure engineering. While managed services like Azure AKS or Google GKE simplify the control plane, a requirement for full sovereignty over the infrastructure (whether for cost reduction, security, or regulatory compliance) often leads engineers toward "The Hard Way." Manual installation, however, is unsustainable in production environments. Integrating Ansible into the Kubernetes lifecycle turns deployment from a fragile, manual process into a repeatable, idempotent, and version-controlled operation. By leveraging Ansible's configuration management capabilities, architects can bridge the gap between raw virtual machine provisioning and a fully functional, production-ready container orchestration platform.

Architectural Foundations and Hybrid Cloud Networking

Implementing a Kubernetes cluster requires a robust networking foundation, especially when the nodes are geographically dispersed or hosted across different providers. A sophisticated approach involves a hybrid cloud model, combining the reliability of professional data centers with the cost-efficiency of local hardware.

For instance, utilizing Hetzner Cloud provides the necessary fixed IP addresses and stable entry points required for mail servers and external cluster access. Simultaneously, integrating local machines into the cluster allows for a reduction in monthly operational expenditures. The critical technical layer that enables this disparate hardware to function as a single cohesive unit is the WireGuard VPN.

WireGuard serves as the secure communication tunnel between all virtual machines, effectively placing every node on a single virtual subnet regardless of its physical location. When nodes span several providers and home networks, such a tunnel is what lets Kubernetes components communicate securely between hosts. From a technical perspective, it avoids per-service NAT and port-forwarding rules and keeps the internal K8S API and etcd traffic reachable only by authorized peers. The real-world impact is a cluster that possesses the flexibility of the cloud but the cost profile of on-premises hardware.
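
As a rough illustration of how one node could be placed on the shared VPN subnet, the sketch below shows per-host variables for a WireGuard role. The variable names are an assumption based on the githubixx.ansible_role_wireguard role and should be checked against that role's README; the addresses, endpoint, and file name are hypothetical.

```yaml
# host_vars/k8s-010101.i.domain.tld  (hypothetical file; hostname pattern from this article)
# ASSUMPTION: variable names as used by githubixx.ansible_role_wireguard;
# verify against the role's documentation before relying on them.

# Address of this node inside the VPN subnet (hypothetical value).
wireguard_address: "10.8.0.101/24"

# Publicly reachable name or IP that peers use to reach this node (hypothetical).
wireguard_endpoint: "k8s-010101.domain.tld"

# UDP port WireGuard listens on; 51820 is the conventional default.
wireguard_port: 51820

# Periodic keepalive (seconds) helps nodes behind a home router keep
# their NAT mapping open so cloud peers can still reach them.
wireguard_persistent_keepalive: "30"
```

Each node would receive its own address in the same subnet, which is what makes the dispersed machines appear as one flat network to Kubernetes.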

While these configurations are often tailored for specific providers like Hetzner, the resulting Ansible roles are designed for portability. With minimal modifications, these roles can be deployed on other Infrastructure-as-a-Service (IaaS) providers such as Scaleway or DigitalOcean. The primary requirement is a systemd-based Linux operating system, with specific validation performed on Ubuntu 20.04 and 22.04.

Detailed Ansible Directory Structure and Best Practices

A production-grade Ansible deployment must follow a strict directory layout to ensure maintainability and scalability. A chaotic directory structure leads to configuration drift and makes troubleshooting nearly impossible during a cluster failure.

The following table outlines the critical components of an expertly structured Ansible environment:

| Component | Purpose | Technical Detail |
| --- | --- | --- |
| `group_vars` | Variable Management | Contains YAML files like `k8s_all.yml` and `k8s_worker.yml` to define cluster-wide and node-specific parameters. |
| `host_vars` | Individual Node Tuning | Stores specific configurations for unique VMs, such as `k8s-010101.i.domain.tld` for etcd nodes. |
| `roles` | Modular Logic | Encapsulates reusable logic for `containerd`, `etcd`, `cilium`, and `cert-manager`. |
| `playbooks` | Execution Logic | Contains the top-level YAML files that orchestrate the roles. |
| `certificates` | Security Assets | Stores the PKI (Public Key Infrastructure) assets generated via `cfssl`. |

The inclusion of a .envrc file and a fact cache directory (factscache) keeps environment variables scoped to the project and lets Ansible reuse previously gathered facts instead of re-gathering them from every node on every run, which noticeably shortens execution time.

The role-based architecture allows for granular control. For example, the githubixx.ansible_role_wireguard handles the network tunnel, while githubixx.etcd manages the distributed key-value store. This modularity ensures that if a specific component like the CNI (Container Network Interface) needs an upgrade, the operator can target the githubixx.cilium_kubernetes role without risking the stability of the core OS configuration.
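
A top-level playbook that wires these roles to inventory groups might look like the following minimal sketch. The role names come from the text above; the file name, the group names (`vpn`, `k8s_etcd`, `k8s_cni`), and the tags are hypothetical and would need to match the actual inventory.

```yaml
# playbooks/k8s.yml  (hypothetical file name)
---
- name: Establish the WireGuard tunnel on every node
  hosts: vpn                 # hypothetical inventory group
  become: true
  roles:
    - role: githubixx.ansible_role_wireguard
      tags: [wireguard]

- name: Install and configure etcd
  hosts: k8s_etcd            # hypothetical inventory group
  become: true
  roles:
    - role: githubixx.etcd
      tags: [etcd]

- name: Install or upgrade the Cilium CNI
  hosts: k8s_cni             # hypothetical group for the host that manages the CNI
  roles:
    - role: githubixx.cilium_kubernetes
      tags: [cilium]
```

With tags in place, a CNI upgrade can be limited to one component, for example `ansible-playbook playbooks/k8s.yml --tags cilium`, leaving the other plays untouched.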

Pre-Deployment Node Preparation and Dependency Management

Before a Kubernetes binary can be executed, the underlying Linux host must be hardened and configured to meet the specific requirements of the Kubernetes Container Runtime Interface (CRI). This is achieved through a comprehensive dependency playbook, such as kube_dependencies.yml.

The preparation process involves several critical technical steps:

System Updates and State Reset
The playbook begins by updating the apt cache and applying pending package upgrades so that security patches are current; refreshing the cache alone does not install anything. A reboot is then performed to ensure the system is running the latest kernel and that no pending updates interfere with the installation.
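
The update-and-reboot step could be expressed roughly as follows. This is a sketch rather than the contents of kube_dependencies.yml: rebooting only when Ubuntu signals that a reboot is required is a design choice made here, and the timeout value is an assumption.

```yaml
- name: Refresh the apt cache and apply pending upgrades
  ansible.builtin.apt:
    update_cache: true
    upgrade: dist
  become: true

- name: Check whether the upgrade requested a reboot
  ansible.builtin.stat:
    path: /var/run/reboot-required   # flag file written on Debian/Ubuntu
  register: reboot_required

- name: Reboot into the updated kernel when needed
  ansible.builtin.reboot:
    reboot_timeout: 600              # assumed upper bound in seconds
  become: true
  when: reboot_required.stat.exists
```

Dropping the `when` condition reproduces an unconditional reboot at the cost of extra downtime on every run.
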
Memory Management (SWAP Removal)
By default, the kubelet refuses to start on a node with swap enabled, because swapping prevents it from accurately enforcing resource limits and leads to unpredictable performance degradation. Swap is therefore disabled in two steps.

The command swapoff -a is executed to disable swap immediately.
The /etc/fstab file is modified using the replace module to comment out swap entries, ensuring that swap remains disabled after a system reboot.
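
The two swap steps above can be sketched as the tasks below. The regular expression is an assumption about how swap lines look in /etc/fstab (a whitespace-separated `swap` field on a line that is not already commented), and the tasks assume they run in a play with `become: true`.

```yaml
- name: Disable swap for the running system
  ansible.builtin.command: swapoff -a
  changed_when: false   # the command is safe to repeat; do not report a change

- name: Comment out swap entries so swap stays off after reboot
  ansible.builtin.replace:
    path: /etc/fstab
    # Match uncommented lines containing a standalone "swap" field.
    regexp: '^([^#\n].*[ \t]swap[ \t].*)$'
    replace: '# \1'
```

Because already commented lines no longer match the pattern, a second run leaves /etc/fstab unchanged.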

Container Runtime Configuration (containerd)
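The containerd runtime requires two kernel modules to be loaded: overlay, which provides the OverlayFS filesystem used for container image layers, and br_netfilter, which lets bridged traffic be filtered by iptables.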

An empty file is created at /etc/modules-load.d/containerd.conf.
The blockinfile module adds the overlay and br_netfilter modules to this configuration.
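
A minimal sketch of these two steps follows. The final task, which loads the modules immediately through `community.general.modprobe`, is an addition not described above; without it the modules would only be loaded at the next boot. The tasks assume a play with `become: true`.

```yaml
- name: Create the module-load configuration file if it is missing
  ansible.builtin.file:
    path: /etc/modules-load.d/containerd.conf
    state: touch
    mode: "0644"
    modification_time: preserve   # avoid reporting a change on every run
    access_time: preserve

- name: Declare the modules containerd needs at boot
  ansible.builtin.blockinfile:
    path: /etc/modules-load.d/containerd.conf
    block: |
      overlay
      br_netfilter

- name: Load the modules into the running kernel (added step)
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - overlay
    - br_netfilter
```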

System Kernel Parameters (sysctl)
For iptables rules to apply to bridged pod traffic, and for the node to forward packets between interfaces, specific sysctl parameters must be set. This is managed by creating the /etc/sysctl.d/99-kubernetes-cri.conf file and populating it with the necessary network configurations.
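
The text does not list the parameters, so the sketch below assumes the three settings commonly documented for Kubernetes container runtimes. The bridge settings only take effect once the br_netfilter module is loaded, and the tasks assume a play with `become: true`.

```yaml
- name: Write the Kubernetes networking sysctl settings
  ansible.builtin.copy:
    dest: /etc/sysctl.d/99-kubernetes-cri.conf
    mode: "0644"
    # ASSUMPTION: commonly documented values, not taken from the article.
    content: |
      net.bridge.bridge-nf-call-iptables = 1
      net.bridge.bridge-nf-call-ip6tables = 1
      net.ipv4.ip_forward = 1

- name: Apply all sysctl configuration files without rebooting
  ansible.builtin.command: sysctl --system
  changed_when: false
```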

The operational impact of these steps is a stable, predictable environment where the container runtime can operate without interference from the Linux kernel's default memory or networking behaviors.

Implementation of the Control Plane and Worker Nodes

The deployment of the Kubernetes cluster follows a structured progression from the etcd layer to the control plane and finally the worker nodes. This is typically modeled after "Kubernetes the Hard Way," but automated via Ansible for production viability.

Resource Requirements

For a functional, stable environment, the following minimum hardware specifications are required:

Master Node: 2 vCPUs & 4GB RAM
Worker Nodes: 2 vCPUs & 4GB RAM (per node)

Connectivity and Inventory Management

Secure access is established using SSH keys. The Ansible Control Node must copy its public key to each node using the ssh-copy-id command. The inventory file, located at ~/ansible/inventory/kube_inventory, organizes the nodes into functional groups:

```text
[master]
10.x.x.x

[workers]
10.x.x.x
10.x.x.x
```

This grouping allows the operator to target playbooks specifically to the master or the workers, which is essential when performing rolling updates or scaling the cluster.
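
To illustrate how the groups are used, the following sketch first checks connectivity to every node and then walks through the workers one at a time. The debug task is a placeholder for real maintenance work, and the playbook file name used in the usage note is hypothetical.

```yaml
---
- name: Verify that every node is reachable over SSH
  hosts: master:workers
  gather_facts: false
  tasks:
    - name: Ping each node through Ansible
      ansible.builtin.ping:

- name: Apply changes to the workers one node at a time
  hosts: workers
  serial: 1                  # rolling behaviour: finish one host before the next
  gather_facts: false
  tasks:
    - name: Placeholder for the real maintenance step
      ansible.builtin.debug:
        msg: "Updating {{ inventory_hostname }}"
```

It could be run with `ansible-playbook -i ~/ansible/inventory/kube_inventory rolling.yml`, where `rolling.yml` is whatever name the playbook is saved under.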

Advanced CI/CD Integration: Jenkins and GitOps

Once the cluster is established, the focus shifts to application delivery. Ansible serves as the bridge between the build phase and the deployment phase. There are two primary methodologies for integrating Ansible into a CI/CD pipeline.

The Jenkins Pipeline Approach

In this model, Ansible is used as a direct deployment tool within a Jenkins pipeline. The Jenkinsfile defines a series of stages: Checkout, Build, Test, and Deploy.

The Deploy stage triggers an Ansible playbook using the ansiblePlaybook step provided by the Jenkins Ansible plugin:

```groovy
stage('Deploy') {
    steps {
        echo 'Deploying application...'
        script {
            ansiblePlaybook(
                playbook: 'ansible/deploy-app.yml'
            )
        }
    }
}
```

The corresponding Ansible playbook utilizes the kubernetes.core.k8s module to manage resources idempotently:

```yaml
---
- hosts: proxyserver
  gather_facts: false
  tasks:
    # Create the namespace first so the Deployment has somewhere to land.
    - name: Set up K8S Namespace
      kubernetes.core.k8s:
        state: present
        api_version: v1
        kind: Namespace
        name: my-namespace

    # Load the manifest from disk and apply it; unchanged manifests report "ok".
    - name: Deploy Application
      kubernetes.core.k8s:
        state: present
        definition: "{{ lookup('file', 'kubernetes/deployment.yml') | from_yaml }}"
```

This approach provides a scriptable, hands-on method for managing deployments, where Ansible communicates directly with the Kubernetes API to ensure the desired state is reached.
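
The playbook reads its manifest from kubernetes/deployment.yml, which is not shown. A minimal file that would work with it might look like this sketch; the application name, replica count, and image are placeholders.

```yaml
# kubernetes/deployment.yml  (illustrative content; all names are hypothetical)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace      # matches the namespace created by the playbook
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.27    # placeholder image
          ports:
            - containerPort: 80
```

The file must contain a single YAML document, because the `from_yaml` filter parses one document at a time.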

The GitOps Approach with ArgoCD

For high-maturity organizations, Ansible is integrated with GitOps tools like ArgoCD or Flux. In this architecture, ArgoCD is responsible for syncing the state of the cluster with a Git repository. Ansible's role shifts from "deployer" to "manifest generator."

Ansible is used in the pre-processing phase to dynamically generate Kubernetes manifests using Jinja2 templates based on environment-specific variables. These manifests are then committed to Git, which ArgoCD detects and applies to the cluster.
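
The manifest-generation step could be sketched as a local playbook like the one below. The template path, output path, and variable names are hypothetical; the template itself would reference the variables with Jinja2 expressions such as `{{ replicas }}`.

```yaml
---
- name: Render environment-specific manifests for the GitOps repository
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical per-environment values, normally supplied via group_vars.
    app_name: k8s-app
    replicas: 3
    image_tag: "1.0.0"
  tasks:
    - name: Render the Deployment manifest from its Jinja2 template
      ansible.builtin.template:
        src: templates/deployment.yml.j2   # hypothetical template path
        dest: manifests/deployment.yml     # directory watched by ArgoCD
        mode: "0644"
```

Committing the rendered `manifests` directory is then an ordinary Git operation, which keeps cluster credentials out of the Ansible run entirely.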

To create an application in ArgoCD via the CLI, the following command is used:

```bash
argocd app create k8s-app-prod \
  --repo https://github.com/username/your-repo.git \
  --path manifests \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default
```

The application is then synchronized:

```bash
argocd app sync k8s-app-prod
```

This creates a virtuous cycle: Ansible ensures the manifest is correct for the environment, Git provides the audit trail, and ArgoCD ensures the cluster matches the Git state.
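
The same application can also be declared as a manifest instead of being created through the CLI, which keeps the ArgoCD configuration itself in Git. The sketch below mirrors the CLI arguments used above; the `argocd` namespace, the `default` project, the `HEAD` revision, and the automated sync policy are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8s-app-prod
  namespace: argocd            # assumed namespace of the ArgoCD installation
spec:
  project: default             # assumed ArgoCD project
  source:
    repoURL: https://github.com/username/your-repo.git
    path: manifests
    targetRevision: HEAD       # assumed: track the default branch
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:                 # optional: sync without a manual "app sync"
      prune: true
      selfHeal: true
```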

Multi-Cluster Management and Cloud Integration (Azure AKS)

In enterprise environments, a single cluster is rarely sufficient. Managing multi-cluster environments requires a strategy for handling multiple kubeconfig files and contexts.

When utilizing managed services like Azure AKS, the setup process involves configuring a proxy server to act as the gateway to the cloud cluster. The following sequence is executed on the proxy:

```bash
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login
az aks get-credentials --name name_of_aks_cluster --resource-group name_of_aks_rg
kubectl get nodes
kubectl get all -A
```

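To manage these clusters via Ansible, pass the kubeconfig parameter (and, where one file holds several clusters, the context parameter) to the kubernetes.core modules. This pins each task to an exact kubeconfig file and prevents conflicts between different cluster environments. The k8s_auth module serves a different purpose: it obtains an API token from clusters that require an explicit login step, such as OpenShift, and does not accept a kubeconfig path.

```yaml
- name: Query nodes through an explicit kubeconfig
  kubernetes.core.k8s_info:
    kubeconfig: /path/to/kubeconfig
    kind: Node
  register: kube_nodes
```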

This allows an Ansible Control Node to execute playbooks against a variety of clusters—ranging from bare-metal home labs to enterprise AKS instances—by simply switching the configuration context.
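
One way to apply the same task across several clusters is to loop over a list of kubeconfig files, as in the sketch below. The cluster names, file paths, context names, and namespace are hypothetical.

```yaml
---
- name: Ensure a shared namespace exists on every managed cluster
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical cluster list; absolute paths avoid relying on "~" expansion.
    clusters:
      - name: homelab
        kubeconfig: /home/ansible/.kube/homelab.yaml
        context: homelab-admin
      - name: aks-prod
        kubeconfig: /home/ansible/.kube/aks-prod.yaml
        context: name_of_aks_cluster
  tasks:
    - name: Create the namespace on each cluster
      kubernetes.core.k8s:
        kubeconfig: "{{ item.kubeconfig }}"
        context: "{{ item.context }}"
        state: present
        api_version: v1
        kind: Namespace
        name: platform-tools        # hypothetical namespace
      loop: "{{ clusters }}"
      loop_control:
        label: "{{ item.name }}"    # keep the task output readable
```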

Idempotency and Resource Lifecycle Management

The primary technical advantage of using Ansible for Kubernetes is idempotency. In the context of Kubernetes, this means that running a playbook multiple times will not create duplicate resources or cause unnecessary restarts if the current state matches the desired state.

Through the Kubernetes API, Ansible can drive operations such as:

Rolling Updates: Applying a changed Deployment so that Kubernetes replaces pods incrementally, within the limits set by the Deployment's maxSurge and maxUnavailable settings, to avoid downtime.
Canary Deployments: Introducing a new version of an application to a small subset of users before a full rollout, typically by applying a second, smaller Deployment alongside the stable one.
Namespace Management: Using the kubernetes.core.k8s module to ensure a namespace exists before attempting to deploy applications into it.
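
As an illustration of the rolling-update case, the task below applies a Deployment with an explicit update strategy and waits for the rollout to finish. The application name, namespace, image, replica count, and timeout are hypothetical; the rollout itself is carried out by Kubernetes, and Ansible only submits the desired state and waits.

```yaml
- name: Roll out a new image version without dropping capacity
  kubernetes.core.k8s:
    state: present
    wait: true               # block until the Deployment reports ready
    wait_timeout: 300        # assumed limit in seconds
    definition:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-app
        namespace: my-namespace
      spec:
        replicas: 3
        strategy:
          type: RollingUpdate
          rollingUpdate:
            maxUnavailable: 0   # never go below the desired replica count
            maxSurge: 1         # add one new pod at a time
        selector:
          matchLabels:
            app: my-app
        template:
          metadata:
            labels:
              app: my-app
          spec:
            containers:
              - name: my-app
                image: nginx:1.27   # placeholder; change the tag to trigger a rollout
```

Running the task again with the same image reports no change, which is the idempotent behaviour described in this section.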

The impact of this capability is a sharp reduction in "configuration drift," where manual changes to the cluster lead to discrepancies between the actual state and the documented state. Drift is corrected each time the playbooks are re-applied, so they should be run regularly rather than only at install time. By defining the cluster in YAML and applying it via Ansible, the infrastructure becomes a documented, reproducible asset.

Conclusion

The synergy between Ansible and Kubernetes transforms the process of cluster management from a series of manual, error-prone tasks into a sophisticated software engineering discipline. By implementing a structured directory layout, rigorous node preparation, and a hybrid networking strategy via WireGuard, operators can build production-ready clusters that are both flexible and resilient. Whether employing a direct Jenkins-based deployment pipeline or a sophisticated GitOps workflow with ArgoCD, the use of Ansible ensures that the Kubernetes lifecycle—from the initial swapoff command to the final canary deployment—is handled with precision. The transition from "The Hard Way" to "The Ansible Way" allows for the scale and reliability required by modern enterprise applications while maintaining the granular control that only a self-managed cluster can provide.