Mastering High Availability Orchestration with Ansible and Pacemaker

Keeping services available in enterprise environments requires both clustering software and configuration management. Pacemaker, a widely used cluster resource manager, decides where resources run across multiple nodes and fails services over when hardware or software malfunctions. Configuring Pacemaker by hand through the command-line interface (CLI), however, is error-prone and hard to scale across large environments. This is where Ansible comes in. With Ansible, engineers move from manual, imperative configuration to a declarative model in which the desired state of the cluster, including its primitives, constraints, and fencing policies, is defined in code and enforced across the infrastructure.

Integrating Ansible with Pacemaker allows High Availability (HA) stacks to be deployed programmatically, floating IP addresses to be controlled precisely, and STONITH (Shoot The Other Node In The Head) policies to be managed systematically. This matters most for complex workloads, such as SAP environments on Red Hat Enterprise Linux (RHEL), where data integrity and resource exclusivity are non-negotiable. With specialized Ansible modules and custom roles, administrators can automate the bootstrapping of Corosync, the definition of OCF (Open Cluster Framework) resources, and the implementation of group-based resource dependencies.

Architectural Requirements and Node Topology

Deploying a Pacemaker cluster requires attention to node count and software dependencies to keep the cluster stable and to avoid "split-brain", the scenario in which two nodes each believe they are the active node and write to the same shared storage at the same time.

The structural requirements for a robust cluster setup are defined as follows:

  • Minimum Node Count: 3 nodes are recommended. This allows for a quorum to be maintained even if one node fails.
  • Maximum Node Count: The architecture supports up to 16 nodes per cluster.
  • Two-Node Configuration: While a 2-node cluster can be configured, it is not recommended due to the difficulty in achieving a majority quorum without an external witness or specialized tie-breaker logic.
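The sizing rules above can be checked before any cluster software is installed. The following is a minimal sketch, assuming the cluster nodes sit in a hypothetical inventory group named pacemaker_cluster; it asserts the 3 to 16 node range and prints the majority quorum, computed as floor(n / 2) + 1.

```yaml
# Pre-flight check for cluster size. The inventory group name
# "pacemaker_cluster" is an assumption made for this illustration.
- name: validate cluster topology
  hosts: localhost
  gather_facts: false
  vars:
    node_count: "{{ groups['pacemaker_cluster'] | length }}"
  tasks:
    - name: require between 3 and 16 nodes
      ansible.builtin.assert:
        that:
          - (node_count | int) >= 3
          - (node_count | int) <= 16
        fail_msg: "Expected 3 to 16 cluster nodes, found {{ node_count }}"

    - name: show how many nodes must be online to keep quorum
      ansible.builtin.debug:
        msg: "Quorum requires {{ ((node_count | int) // 2) + 1 }} of {{ node_count }} nodes"
```

With three nodes the quorum is 2, so one node can fail. With two nodes the quorum is also 2, which is why a lone surviving node cannot continue without a tie-breaker.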

At the system level, these roles also depend on a Python library on the target hosts: passlib, which can be installed with the command "pip install passlib" and handles password hashing during the OS deployment phase.
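The same dependency can be installed from a playbook rather than by hand. This is a small sketch, assuming pip is already available on the managed hosts and that passlib is needed there as described above.

```yaml
# Installs passlib with pip on each targeted host.
# Assumes a working pip for the host's Python interpreter.
- name: install passlib for password hashing
  ansible.builtin.pip:
    name: passlib
    state: present
```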

The following table outlines the critical configuration variables typically found in an ansible-pacemaker defaults file:

| Variable | Purpose | Typical Value/Path |
|---|---|---|
| corosync_authkey_file | Path to the cluster authentication key | /etc/corosync/authkey |
| corosync_bindnet_interface | Network interface used for cluster communication | enp0s8 |
| corosync_config_file | Main configuration file for Corosync | /etc/corosync/corosync.conf |
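Written out as a defaults file, the variables listed above might look like the sketch below. The file location follows the usual Ansible role layout and is an assumption; the values are the typical ones given above.

```yaml
# defaults/main.yml (assumed location inside the role)
corosync_authkey_file: /etc/corosync/authkey
corosync_bindnet_interface: enp0s8
corosync_config_file: /etc/corosync/corosync.conf
```

Because role defaults have the lowest variable precedence in Ansible, a host or group variable with the same name (for example a different corosync_bindnet_interface on hosts with other NIC names) overrides these values without editing the role.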

Programmatic Resource Management via the Pacemaker Module

The pacemaker Ansible module provides a high-level abstraction for the crm command, allowing users to manage resources without manually invoking shell scripts. This module supports the creation, modification, and deletion of resources by defining their state as present or absent.

Managing Cluster Properties and STONITH

Two of the most critical aspects of cluster management are the quorum policy and fencing. STONITH ensures that a faulty node is powered off or otherwise fenced before another node takes over its resources, which prevents data corruption.

To disable STONITH and change the quorum policy with the pacemaker module, a playbook of the following shape is used:

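```yaml
- name: test
  hosts: controller
  become: yes
  serial: 1
  # Kept from the original example; ignore_target_role is not a standard
  # Ansible play keyword, so remove it if Ansible rejects the play.
  ignore_target_role: false
  tasks:
    - name: disable stonith
      pacemaker: >
        resource='property no-quorum-policy="ignore" stonith-enabled="false"'
        state=present
```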

In this example, the property keyword sets stonith-enabled to false and no-quorum-policy to ignore. This is common in development environments where hardware fencing is unavailable; in production RHEL environments, STONITH is mandatory to protect data integrity.
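For a production cluster the same properties would be set the other way round. The task below is a sketch only, assuming the pacemaker module accepts the same property string in its resource parameter as in the development example; it does not define a fencing device, which must exist before resources will start with STONITH enabled.

```yaml
# Production counterpart of the development settings above.
# Assumes the module passes the property string through to crm unchanged.
- name: enable stonith and stop resources on quorum loss
  pacemaker:
    resource: 'property no-quorum-policy="stop" stonith-enabled="true"'
    state: present
```

The value stop is Pacemaker's default no-quorum-policy: a partition that loses quorum stops its resources instead of continuing to run them.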

Floating IP Implementation and Modification

A Virtual IP (VIP) is a cornerstone of HA architectures: clients connect to a single address that floats between nodes and is always held by the active one. The ocf:heartbeat:IPaddr2 resource agent is the standard choice for this.

To define a floating IP, the module is configured as follows:

```yaml
- name: define floating IP
  pacemaker:
    resource: >
      primitive test_vip ocf:heartbeat:IPaddr2
      params ip="192.168.33.200" cidr_netmask="24" nic="port-ctl"
    state: present
```

If the IP address needs to be updated, the same resource name is referenced with the new parameters:

```yaml
- name: change floating IP
  pacemaker:
    resource: >
      primitive test_vip ocf:heartbeat:IPaddr2
      params ip="192.168.33.100" cidr_netmask="24" nic="port-ctl"
    state: present
```

To remove the resource entirely, the state is changed to absent:

```yaml
- name: remove floating IP
  pacemaker:
    resource: >
      primitive test_vip ocf:heartbeat:IPaddr2
      params ip="192.168.33.100" cidr_netmask="24" nic="port-ctl"
    state: absent
```
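Since defining, changing, and removing the address differ only in the parameter values and the state, the three tasks can be collapsed into one task driven by variables. This is a sketch under the assumption that the module behaves as in the examples above; vip_address, vip_cidr, vip_nic, and vip_state are hypothetical variable names.

```yaml
# One task for create, update, and delete. The vip_* names are
# hypothetical; in practice they would live in group_vars or an inventory.
- name: manage floating IP from variables
  pacemaker:
    resource: >
      primitive test_vip ocf:heartbeat:IPaddr2
      params ip="{{ vip_address }}" cidr_netmask="{{ vip_cidr }}" nic="{{ vip_nic }}"
    state: "{{ vip_state | default('present') }}"
  vars:
    vip_address: 192.168.33.200
    vip_cidr: 24
    vip_nic: port-ctl
```

Changing vip_address updates the resource on the next run, and setting vip_state to absent removes it.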

Advanced Transactional Changes with Shadow Copies

For complex configurations where several resources must change together without leaving the cluster in an unstable intermediate state, the "shadow" property is used. The process has three steps: prepare, resource modification, and commit. The prepare step creates a shadow copy of the running configuration, the changes are staged against that copy, and the commit step applies them to the cluster in one operation.
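The module's own shadow syntax is not shown here, so the sketch below illustrates the same prepare, modify, commit sequence with the crm shell's shadow CIB commands called through the command module. The shadow name staging and the staged resource are illustrative, and the crmsh subcommands (cib new, the -c option, cib commit, cib delete) are assumed to be available in the installed crmsh version.

```yaml
# Shadow CIB workflow with crmsh, run on a single cluster node.
# "staging" is an illustrative shadow name.
- name: prepare a shadow copy of the live configuration
  ansible.builtin.command: crm cib new staging
  run_once: true

- name: stage a change in the shadow copy only
  ansible.builtin.command: >
    crm -c staging configure primitive test_vip ocf:heartbeat:IPaddr2
    params ip=192.168.33.200 cidr_netmask=24 nic=port-ctl
  run_once: true

- name: commit the shadow copy to the live cluster
  ansible.builtin.command: crm cib commit staging
  run_once: true

- name: remove the shadow copy so the play can be rerun
  ansible.builtin.command: crm cib delete staging
  run_once: true
```

Until the commit task runs, the live cluster is unaffected, so several staging tasks can be placed between the first and third steps.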

Complex Service Orchestration: Nginx and HAProxy

Beyond simple IP addresses, Pacemaker is used to manage full application stacks. This involves defining primitives and grouping them to ensure they always start on the same node.

Nginx Cluster Integration

An automated Nginx deployment via Ansible involves specifying the OCF provider and monitoring intervals to ensure the service is healthy. The following configuration parameters are typically used:

  • Provider: ocf:heartbeat:nginx
  • Config File: /etc/nginx/nginx.conf
  • Monitor Interval: 5s
  • Monitor Timeout: 5s
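Put together, the parameters above correspond to a primitive definition along the following lines. This is a sketch that reuses the pacemaker module syntax from the floating IP examples; the resource name nginx is illustrative, and configfile is the parameter name used by the ocf:heartbeat:nginx agent.

```yaml
# Nginx primitive with a 5s monitor interval and timeout.
# The resource name "nginx" is chosen for this illustration.
- name: define nginx resource
  pacemaker:
    resource: >
      primitive nginx ocf:heartbeat:nginx
      params configfile="/etc/nginx/nginx.conf"
      op monitor interval="5s" timeout="5s"
    state: present
```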

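Cluster-wide properties are often tuned alongside the resource. The first setting below keeps a single failed start from permanently barring the resource from that node, and the remaining ones control how many scheduler input files are retained and how often the cluster rechecks its state: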

  • start-failure-is-fatal: set to false
  • pe-warn-series-max: 1000
  • pe-input-series-max: 1000
  • pe-error-series-max: 1000
  • cluster-recheck-interval: 5min
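These properties can be applied in one task. The sketch below assumes the pacemaker module accepts a multi-property string in the same way as the STONITH example earlier in this article.

```yaml
# Cluster-wide properties from the list above, set in a single call.
- name: tune cluster-wide properties
  pacemaker:
    resource: >
      property start-failure-is-fatal="false"
      pe-warn-series-max="1000"
      pe-input-series-max="1000"
      pe-error-series-max="1000"
      cluster-recheck-interval="5min"
    state: present
```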

HAProxy Stack Deployment via File Templating

For more advanced scenarios, such as HAProxy, users can utilize a hybrid approach: using Ansible to template a configuration file and the shell module to load it into the cluster. This is particularly useful for complex stacks involving multiple primitives and groups.

Example of a haproxy-stack configuration file:

```text
primitive haproxy ocf:heartbeat:haproxy \
    params conffile=/etc/haproxy/haproxy.cfg \
    op monitor interval=30s
primitive haproxy-ip ocf:heartbeat:IPaddr2 \
    params ip="10.162.20.107" cidr_netmask="24" \
    op monitor interval="2s"
group haproxy-stack haproxy-ip haproxy
```

Once this file is pushed to the cluster nodes via Ansible, the configuration is loaded using the following command:

```bash
crm configure load update haproxy-stack
```

This method ensures that haproxy-ip and the haproxy service are treated as a single unit (a group): they always run on the same node, and the members start in the order listed, so the IP address comes up before HAProxy.
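The template-then-load approach described above could be expressed as two tasks. This is a sketch: the template name haproxy-stack.j2 and the destination path /root/haproxy-stack are hypothetical, and the command module is used here in place of shell because no shell features are needed.

```yaml
# Render the crm definition file, then load it once for the whole cluster.
- name: render the haproxy-stack definition on the node
  ansible.builtin.template:
    src: haproxy-stack.j2        # hypothetical template in the role
    dest: /root/haproxy-stack    # hypothetical destination path
    mode: "0600"
  register: haproxy_stack_file

- name: load the definition into the cluster
  ansible.builtin.command: crm configure load update /root/haproxy-stack
  run_once: true
  when: haproxy_stack_file is changed
```

The run_once keyword limits the load to a single node, which is sufficient because the cluster configuration is shared by all members.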

Operational Maintenance and Resource Migration

A primary benefit of a Pacemaker-managed environment is the ability to perform maintenance with little or no downtime. This is achieved through the controlled migration of resources.

Patching and Node Rotation

When a node requires patching, its active resources must first move to another node so that the service stays up. The general workflow is:

  1. Identify the node that currently runs the resources.
  2. Put the node to be patched into standby (the "inactive" state).
  3. Confirm that its resources have migrated to another node.
  4. Patch the node that is now idle.
  5. Bring the node back online and repeat the process for the next node.
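The workflow above can be sketched as a rolling play that handles one node at a time. Several assumptions apply: the nodes are in a hypothetical inventory group named pacemaker_cluster, inventory hostnames match the cluster node names, the hosts use dnf, cluster services start at boot, and pcs is version 0.10 or later (older releases use "pcs cluster standby" instead).

```yaml
# Rolling patch: serial: 1 ensures only one node is out of service at a time.
- name: patch cluster nodes one by one
  hosts: pacemaker_cluster       # hypothetical inventory group
  become: true
  serial: 1
  tasks:
    - name: put this node in standby so its resources move away
      ansible.builtin.command: pcs node standby {{ inventory_hostname }}

    - name: apply package updates
      ansible.builtin.dnf:
        name: "*"
        state: latest

    - name: reboot to pick up the updates
      ansible.builtin.reboot:

    - name: return the node to service once the cluster stack is up
      ansible.builtin.command: pcs node unstandby {{ inventory_hostname }}
      register: unstandby
      retries: 10
      delay: 15
      until: unstandby.rc == 0
```

The retry loop on the last task allows for the short period after a reboot in which the cluster services are not yet accepting commands.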

To verify the current distribution of resources and the status of the nodes, the following command is used:

```bash
pcs status
```
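Inside a playbook, the same check can be captured and printed so that the resource distribution is visible in the run output. This is a minimal sketch; cluster_status is an arbitrary variable name.

```yaml
# Read-only status check; changed_when: false keeps the task from
# being reported as a change.
- name: capture cluster status
  ansible.builtin.command: pcs status
  register: cluster_status
  changed_when: false

- name: show cluster status
  ansible.builtin.debug:
    var: cluster_status.stdout_lines
```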

Conclusion: Analysis of Automated HA Frameworks

The transition from manual cluster management to Ansible-driven orchestration represents a critical evolution in infrastructure stability. By moving the definition of the cluster into YAML-based playbooks, organizations eliminate the "snowflake" server problem, where cluster configurations diverge over time due to manual tweaks.

The use of the pacemaker module and the crm command integration allows for a granular level of control over the cluster's behavior. The ability to define OCF primitives, set specific monitoring intervals, and manage STONITH policies programmatically ensures that the environment remains resilient against both hardware failure and human error. Furthermore, the implementation of resource grouping (as seen in the HAProxy example) prevents the "split-service" problem, where a database might move to one node while the application server remains on another, which would otherwise cause significant latency or connectivity failures.

Ultimately, the synergy between Ansible and Pacemaker provides a scalable framework for High Availability. Whether deploying a small 3-node cluster or a massive 16-node enterprise environment, the ability to treat cluster configuration as code allows for rapid recovery, consistent deployments, and the rigorous enforcement of data integrity through automated fencing and quorum management.

