Orchestrating High Performance Computing with Ansible and Slurm

Deploying a High Performance Computing (HPC) cluster is among the more complex tasks in systems administration: workload managers, shared filesystems, and specialized networking all have to be configured consistently. Slurm (originally an acronym for Simple Linux Utility for Resource Management) is one of the most widely used workload managers for cluster scheduling, but installing it by hand across dozens or thousands of nodes invites human error and configuration drift. Ansible, an agentless automation engine, turns the process into a repeatable, scalable workflow defined in code. With specialized roles and appliances, administrators can move from a basic cluster setup to a production-ready environment that includes monitoring, accounting, and automated image management.

The Architecture of Ansible-Driven Slurm Deployments

The fundamental objective of using Ansible for Slurm is to ensure that every node in the cluster—whether it is a head node, a compute node, or a login node—maintains a consistent state. This is achieved through the definition of host groups and variable sets that determine the role of each machine.

In a typical deployment, the cluster is segmented into functional groups:

Slurm Service Nodes: These hosts are responsible for the control plane. They run the slurmctld (Slurm Controller Daemon) and the slurmdbd (Slurm Database Daemon). The controller manages the queue and resource allocation, while the database daemon handles accounting and user limits.
Slurm Compute Nodes: These are the workhorses of the cluster. They run the slurmd (Slurm Daemon), which executes the actual workloads and reports resource usage back to the controller.
Submit Hosts: These are typically login nodes where users land via SSH to compile code and submit jobs. They do not run the control or compute daemons but must have the Slurm client tools installed to communicate with the controller.

The technical implementation of this segmentation is handled through Ansible's inventory and group variables. For instance, if a node is placed in the slurm_compute group, the automation logic triggers the installation and configuration of slurmd. If it is in the slurm_service group, it triggers slurmctld and slurmdbd.
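As a rough illustration of this grouping, a YAML inventory might look like the sketch below. The group names slurm_service and slurm_compute come from the text above; the slurm_submit group name and all hostnames are assumptions made for the example.

```yaml
# inventory.yml (illustrative sketch; hostnames are hypothetical)
all:
  children:
    slurm_service:            # control plane: slurmctld and slurmdbd
      hosts:
        head01.example.org:
    slurm_compute:            # workers: slurmd
      hosts:
        node[001:004].example.org:   # Ansible host range, expands to node001..node004
    slurm_submit:             # login nodes (group name assumed for this example)
      hosts:
        login01.example.org:
```

Running ansible-inventory -i inventory.yml --graph is a quick way to confirm that each host landed in the intended group before any role is applied.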

Comprehensive Analysis of the ansible-role-slurm Framework

The ansible-role-slurm (provided by CSCfi) is designed for flexibility and backwards compatibility, ensuring that the cluster can be deployed across various Linux distributions.

Distribution Support and Compatibility

The role has been rigorously tested on the following operating systems:

CentOS 7: A long-standing standard for HPC environments.
Ubuntu: Providing a Debian-based alternative for clusters.
Ubuntu 18.04: Specifically supported for client-side installations.

To maintain stability across different versions of Slurm, the role employs a version-specific parameter system. This is implemented via the slurm_conf_version_specific_params_list located in vars/slurm_version.yml. This mechanism allows the administrator to inject specific configuration parameters into slurm.conf only when certain versions of Slurm are detected, which avoids deploying settings that the installed release does not recognize and that could prevent the controller from starting.
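The general idea of version gating can be sketched with Ansible's built-in version test. This is not the role's actual data structure: the variable names example_slurm_version and example_version_params, and the shape of the list, are hypothetical and exist only to show the pattern.

```yaml
# Illustrative sketch only; variable names and list layout are assumptions,
# not the contents of vars/slurm_version.yml.
- name: Version-gated slurm.conf parameters (sketch)
  hosts: localhost
  gather_facts: false
  vars:
    example_slurm_version: "19.05.8"
    example_version_params:
      - param: "SelectType=select/cons_tres"               # plugin added in Slurm 19.05
        min_version: "19.05"
      - param: "SlurmctldParameters=enable_configless"     # option added in Slurm 20.02
        min_version: "20.02"
  tasks:
    - name: Report parameters that apply to the installed version
      ansible.builtin.debug:
        msg: "would add to slurm.conf: {{ item.param }}"
      loop: "{{ example_version_params }}"
      # Skip any entry whose minimum version is newer than the installed one
      when: example_slurm_version is version(item.min_version, '>=')
```

The same comparison could equally drive a template condition; the point is that the parameter list carries its own version constraint rather than relying on the administrator to remember it.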

Critical Dependencies and Security Layers

A Slurm cluster is not merely a collection of binaries; it requires a secure and healthy underlying environment. The ansible-role-slurm integrates two critical dependencies:

ansible-role-pam: This role is used to configure the Pluggable Authentication Modules (PAM). This is a vital security layer used to limit access to compute nodes, ensuring that only authorized users and processes can execute on the hardware.
ansible-role-nhc: This integrates the Node Health Checker. In an HPC environment, a "zombie" node that is technically online but malfunctioning (e.g., a failed memory DIMM or a hung CPU) can ruin a massive parallel job. NHC proactively monitors the health of the hardware and communicates with Slurm to drain problematic nodes.
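On the Slurm side, a health checker is hooked in through a few slurm.conf settings. The fragment below expresses them in the slurm_config dictionary style used in the galaxyproject.slurm examples later in this article, purely for illustration; the CSCfi role wires NHC up through ansible-role-nhc, and the path and interval shown here are assumed example values.

```yaml
# Sketch only: how slurm.conf can point slurmd at a health check program.
slurm_config:
  HealthCheckProgram: /usr/sbin/nhc   # assumed install path of the nhc script
  HealthCheckInterval: 300            # seconds between runs (example value)
  HealthCheckNodeState: ANY           # run the check regardless of node state
```

With settings like these, slurmd runs the check periodically on each node, and a failing check lets NHC mark the node as drained so that no new jobs are scheduled on it.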

Configuration and Variable Management

The operational logic of the role is driven by variables defined in defaults/main.yml. Key configuration elements include:

Munge: Every node in the cluster must run MUNGE (MUNGE Uid 'N' Gid Emporium), the authentication service Slurm uses to verify the identity of users and services across the network. All nodes must share the same MUNGE key for credentials to validate.
MySQL Integration: For accounting, Slurm requires a database. The role allows for the deployment of a MySQL server for convenience, typically on the same node as the slurmctld and slurmdbd. A critical security requirement is the mysql_slurm_password, which should be managed via Ansible Vault to ensure the database credentials are not stored in plain text.
Node and Partition Lists: The slurm_nodelist and slurm_partitionlist variables are used to define the physical hardware and the logical queues (partitions) that users will submit jobs to.
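A group variables file covering the items above might look like the following sketch. The variable names mysql_slurm_password, slurm_nodelist, and slurm_partitionlist come from the text; the file path, the vault_mysql_slurm_password indirection, the hardware figures, and the assumption that both lists hold raw slurm.conf lines are illustrative and should be checked against defaults/main.yml.

```yaml
# group_vars/all/slurm.yml (hypothetical path; values are examples)

# Keep the real secret in a vault-encrypted file and reference it here,
# so the plain-text vars file never contains the password.
mysql_slurm_password: "{{ vault_mysql_slurm_password }}"

# Physical hardware (assumed here to be a list of slurm.conf node lines)
slurm_nodelist:
  - "NodeName=node[001-004] CPUs=32 RealMemory=128000 State=UNKNOWN"

# Logical queues (assumed here to be a list of slurm.conf partition lines)
slurm_partitionlist:
  - "PartitionName=batch Nodes=node[001-004] Default=YES MaxTime=24:00:00 State=UP"
```

The file that defines vault_mysql_slurm_password would then be encrypted with ansible-vault encrypt, and the playbook run with a vault password supplied.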

High Availability and Advanced Topology

For production environments where downtime is unacceptable, the role supports High Availability (HA) configurations. By defining a slurm_backup_controller variable, administrators can establish a standby controller. However, the technical implementation of HA requires a shared directory—typically via NFS—accessible to both the primary slurm_service_node and the backup controller to synchronize state. It is important to note that the actual setup of the NFS server is considered out of scope for the Slurm role and must be handled by a separate filesystem role.
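Because the NFS side is left to the administrator, a separate play has to make the shared directory available on both controllers. The sketch below uses the ansible.posix.mount module for that; the slurm_service group name is taken from earlier in this article, while the NFS server name, export path, and mount point are hypothetical.

```yaml
# Sketch only: mount a shared state directory on the primary and backup
# controllers. Server, export, and mount point are assumed example values.
- name: Mount shared Slurm controller state
  hosts: slurm_service
  become: true
  tasks:
    - name: Mount the NFS export that holds controller state
      ansible.posix.mount:
        path: /var/spool/slurmctld                   # hypothetical mount point
        src: nfs01.example.org:/export/slurm_state   # hypothetical export
        fstype: nfs
        opts: rw,hard
        state: mounted
```

In slurm.conf, StateSaveLocation is the setting that must point at this shared directory so the backup controller can pick up the primary's state on failover.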

The Slurm Appliance Approach for Production Environments

While a standard role provides the building blocks, the "Slurm Appliance" (developed by StackHPC) represents a shift toward an "Infrastructure as Code" (IaC) philosophy. It is designed as a plug-and-play software appliance that is hardware-agnostic, meaning it can function on bare-metal nodes or cloud VMs.

Technical Stack and Core Components

The appliance creates a CentOS 8 / OpenHPC v2-based environment. This is a comprehensive ecosystem that integrates several high-level components:

Infrastructure Management: It utilizes OpenTofu (an open-source fork of Terraform) for defining the cluster's infrastructure.
Shared Filesystems: The appliance supports multiple NFS filesystems, which can be hosted internally within the cluster or on external servers. It also supports CephFS via OpenStack Manila for high-performance shared storage.
OpenStack Integration: The appliance is optimized for OpenStack clouds, utilizing OpenStack volumes for persistent state and the creation of instances from pre-built images.

The Monitoring and Observability Pipeline

One of the most advanced features of the Slurm appliance is its integrated monitoring stack, which provides deep visibility into job performance and hardware health.

The pipeline consists of:

Prometheus Node-Exporters: These gather OS-level metrics like CPU load, memory saturation, and network throughput.
Prometheus Server: This acts as the time-series database that scrapes and stores the metrics.
ElasticSearch and Kibana: The appliance uses containerized OpenDistro ElasticSearch for log archiving and Kibana for visualization.
Filebeat: This containerized agent parses log files and ships them to ElasticSearch.
Podman: This is the container engine used to run the containerized parts of the monitoring stack (ElasticSearch, Kibana, Filebeat).
Grafana: This serves as the primary visualization layer. It provides dashboards for individual nodes and, crucially, job-specific dashboards that aggregate metrics from all nodes assigned to a particular Slurm job.
Slurm Accounting: A MySQL backend is used by slurmdbd to track resource usage, user limits, and historical job data.
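To make the scrape relationship between the Prometheus server and the node-exporters concrete, a minimal prometheus.yml fragment is sketched below. The appliance generates its own configuration; the hostnames, the label, and the interval here are assumptions, and 9100 is the node-exporter default port.

```yaml
# prometheus.yml fragment (illustrative sketch; targets are hypothetical)
global:
  scrape_interval: 30s          # how often metrics are pulled (example value)

scrape_configs:
  - job_name: node              # one scrape job covering all node-exporters
    static_configs:
      - targets:
          - compute-0.example.org:9100
          - compute-1.example.org:9100
        labels:
          group: compute        # example label, usable for filtering in Grafana
```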

Deployment Workflow and Prerequisites

Deploying the appliance requires a "deploy host" (typically Rocky Linux 8 or 9) with root access and an SSH keypair configured in OpenStack. The process follows these steps:

Install prerequisites: sudo yum install -y git python38.
Clone the repository: git clone https://github.com/stackhpc/ansible-slurm-appliance.
Initialize the environment: run ./dev/setup-env.sh from the repository root to prepare the Python virtual environment.
Provision the infrastructure: use OpenTofu to create the required instances.

The networking requirement is strict, necessitating three specific security groups:

Default: For intra-cluster communication.
SSH: For external administrative access.
HTTPS: For accessing the Open OnDemand web portal.

Implementation Details and Technical Configuration

For those using the galaxyproject.slurm or CSCfi roles, the implementation varies from a minimal "All-in-One" setup to a complex distributed architecture.

Minimal Setup Example

For development or testing, all services can be hosted on a single node. This is achieved by defining the roles as follows:

```yaml
- name: Slurm all in One
  hosts: all
  vars:
    slurm_roles: ['controller', 'exec', 'dbd']
  roles:
    - role: galaxyproject.slurm
      become: True
```

Production-Grade Configuration

In a production environment, the slurm_cgroup_config is critical for ensuring that jobs do not interfere with one another or crash the node by consuming all available memory.

A typical high-performance configuration looks like this:

```yaml
- name: Slurm execution hosts
  hosts: all
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_cgroup_config:
      CgroupMountpoint: "/sys/fs/cgroup"
      CgroupAutomount: yes
      ConstrainCores: yes
      TaskAffinity: no
      ConstrainRAMSpace: yes
      ConstrainSwapSpace: no
      ConstrainDevices: no
      AllowedRamSpace: 100
      AllowedSwapSpace: 0
      MaxRAMPercent: 100
      MaxSwapPercent: 100
      MinRAMSpace: 30
    slurm_config:
      AccountingStorageType: "accounting_storage/none"
      ClusterName: cluster
      GresTypes: gpu
      JobAcctGatherType: "jobacct_gather/none"
      MpiDefault: none
      ProctrackType: "proctrack/cgroup"
      ReturnToService: 1
      SchedulerType: "sched/backfill"
      SelectType: select/linear
```

This configuration ensures that the cgroup (Control Groups) mechanism is used to constrain CPU and RAM, preventing a single rogue job from impacting the entire node.

Validation, Testing, and Post-Deployment

The complexity of MPI (Message Passing Interface) environments means that a successful Ansible run does not guarantee a working cluster. The StackHPC appliance addresses this by implementing post-deploy tests.

These tests include:

Floating Point Performance: Validating the raw compute power of the nodes.
Bandwidth and Latency: Testing the interconnect speed to ensure no network bottlenecks exist.
Intel MPI Benchmarks: Ensuring the MPI library, launcher, and scheduler are integrated correctly.
High Performance Linpack (HPL): Running the standard benchmark used to rank the world's fastest supercomputers.
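The appliance's own tests are considerably more thorough, but the principle of verifying the cluster after the Ansible run can be sketched with a much smaller smoke test. The play below is not part of the appliance: the slurm_submit group name is an assumption, and it checks only that the controller responds and that a trivial job can span two nodes.

```yaml
# Sketch only: a minimal post-deploy smoke test, not the appliance's test suite.
- name: Post-deploy Slurm smoke test
  hosts: slurm_submit[0]        # first login node (group name assumed)
  gather_facts: false
  tasks:
    - name: Check that the controller answers
      ansible.builtin.command: sinfo --noheader
      changed_when: false

    - name: Run a trivial job across two nodes
      ansible.builtin.command: srun --nodes=2 --ntasks=2 hostname
      register: srun_result
      changed_when: false

    - name: Verify that two distinct nodes took part
      ansible.builtin.assert:
        that:
          - srun_result.stdout_lines | unique | length == 2
        fail_msg: "srun did not return output from two distinct nodes"
```

A test like this catches scheduler and authentication problems early; interconnect and MPI issues still require real benchmarks such as those listed above.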

Furthermore, the appliance introduces a Packer-based build pipeline. Instead of configuring every node from scratch, Packer is used to create a golden image of a compute node. Slurm is then used to drive the reimaging of nodes, allowing the cluster to be updated or reset to a clean state rapidly.

Comparison of Slurm Deployment Methods

The following table summarizes the differences between the standard Ansible roles and the full appliance approach.

Feature	Ansible-Role-Slurm (CSCfi)	Slurm Appliance (StackHPC)
Primary Goal	Modular installation of Slurm	Production-ready HPC environment
OS Support	CentOS 7, Ubuntu	CentOS 8, Rocky Linux 8/9
Infrastructure	Manual/External	OpenTofu / OpenStack
Monitoring	Basic/None	Prometheus, ElasticSearch, Grafana
Image Management	Manual	Packer-based pipelines
Shared Storage	External requirement	Integrated NFS/CephFS
Validation	Travis CI for role	Post-deploy MPI/HPL benchmarks

Conclusion

The transition from manual cluster setup to Ansible-driven orchestration is essential for any modern HPC facility. The ansible-role-slurm provides a flexible, backwards-compatible framework for those who need a lightweight deployment across diverse Linux distributions. It emphasizes security through PAM integration and health monitoring via NHC.

Conversely, the Slurm Appliance by StackHPC transforms the cluster into a managed product. By integrating OpenTofu for provisioning, Podman for a sophisticated monitoring stack, and Packer for image lifecycle management, it removes the "guesswork" from HPC administration. The integration of a full observability pipeline—from Prometheus node-exporters to Grafana dashboards—allows administrators to move from reactive troubleshooting to proactive performance tuning.

Ultimately, the choice between these methods depends on the scale and goals of the project. For specialized, highly customized clusters, the modular roles offer the necessary granularity. For teams that want a rapid, standardized, production-ready environment with built-in observability and performance validation, the appliance model is the better fit.