Engineering High-Performance Computing Clusters with Slurm and Ansible Automation

Building High-Performance Computing (HPC) environments has historically been laborious, with much of the configuration done by hand and therefore prone to error. Pairing Slurm, a widely used workload manager, with Ansible, an automation engine, changes how these environments are deployed and maintained. Moving from hand-crafted clusters to an Infrastructure-as-Code (IaC) approach makes deployments repeatable and easier to scale, because the same definitions can be reapplied to every host. Critical components, such as the Slurm controller, the database daemon, and the compute nodes, are then configured consistently across the whole cluster.

At the core of this evolution is the concept of the "Slurm appliance," a modular software framework designed to be plug-and-play. Unlike traditional hardware appliances, this software-defined approach is hardware-agnostic, meaning it can be deployed on general-purpose cloud virtual machines (VMs) or dedicated bare-metal HPC nodes. The use of Ansible roles, such as those provided by StackHPC and the Galaxy project, allows administrators to define the desired state of the cluster in YAML, which Ansible then enforces across all target hosts. This methodology eliminates the "world of pain" associated with manual source builds and complex dependency management, providing a structured path from a blank server to a fully operational HPC cluster.

The StackHPC Slurm Appliance Architecture

The Slurm appliance developed by StackHPC is designed to provide a comprehensive, production-ready workload management environment. It is built upon the OpenHPC project, utilizing a CentOS 8 / OpenHPC v2 foundation to ensure stability and compatibility with scientific computing libraries.

The technical architecture is designed for modularity, allowing administrators to specify exactly which services run on which hosts. This flexibility is critical for scaling, as it allows the separation of the controller and database roles from the execution nodes.

The appliance integrates several critical components to ensure a complete HPC lifecycle:

Slurm Workload Manager: The primary engine for job scheduling and resource allocation.
MySQL Backend: Utilized for Slurm accounting, providing a persistent record of job history, user utilization, and resource consumption.
Network File System (NFS): Supports multiple filesystems, which can be hosted internally within the appliance-managed cluster or provided by external enterprise storage servers.
Open OnDemand: A web-based portal that provides a graphical interface for users to interact with the cluster, submitting jobs and managing files without requiring deep CLI knowledge.
Packer Integration: A build pipeline based on Packer is used to create standardized compute node images, ensuring that every node boots from a known-good state.
Slurm-driven Reimaging: The ability to trigger the reimaging of compute nodes directly through Slurm, which is essential for maintaining node health and updating software stacks across thousands of cores.
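
As a rough illustration of placing services on specific hosts, the following Ansible inventory in YAML form keeps the controller and database on one host and leaves job execution to separate compute nodes. The group and host names (control, login, compute, ctl-0, and so on) are hypothetical and are not taken from the appliance; its own inventory layout should be consulted for the real group names.

```yaml
# Hypothetical inventory sketch: group and host names are illustrative only.
all:
  children:
    control:          # Slurm controller, accounting database, monitoring
      hosts:
        ctl-0:
    login:            # user-facing access, e.g. the Open OnDemand portal
      hosts:
        login-0:
    compute:          # execution nodes booted from the Packer-built image
      hosts:
        compute-[0:3]:    # Ansible range syntax: compute-0 .. compute-3
```

Scaling out under this kind of layout means widening the compute range; the control and login groups stay unchanged.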

Infrastructure Deployment and Cloud Integration

The deployment of a Slurm cluster via Ansible requires a robust underlying infrastructure. StackHPC has focused heavily on OpenStack integration, although the system is designed to be portable to any cloud provider through the use of OpenTofu.

The technical requirements for the deployment host and the target cloud environment are stringent to ensure the stability of the resulting cluster.

Deployment Host Requirements

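The deploy host serves as the orchestration engine. It must run a supported operating system, specifically Rocky Linux 8 or Rocky Linux 9. To prepare the environment, run the following commands in order, replacing <latest-release-tag> with the tag of the release to deploy:

# Install git and Python (python38 is the Rocky Linux 8 package name)
sudo yum install -y git python38
# Fetch the appliance and switch to a tagged release
git clone https://github.com/stackhpc/ansible-slurm-appliance
cd ansible-slurm-appliance
git checkout <latest-release-tag>
# Run the appliance's environment setup script
./dev/setup-env.sh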

Following the environment setup, OpenTofu must be installed to manage the infrastructure-as-code components.

Cloud and Network Configuration

For a successful deployment on OpenStack, the following technical parameters must be met:

Volume Storage: Persistent state must be backed by an OpenStack volume, and the shared NFS filesystem must be backed by a separate OpenStack volume to ensure data durability.
Access Control: The deploy host must possess root access and a valid SSH keypair defined in OpenStack.
Time Synchronization: Instances must have accurate, synchronized time. While hypervisors typically provide this for VMs, bare-metal instances may require the appliance to configure a dedicated time service.
Connectivity: All instances must have internet access, although the appliance can be configured to handle proxy settings if the environment is behind a corporate firewall.

Security Group Specifications

Network security is managed through three distinct security groups to isolate traffic and protect the cluster:

default: This group allows all intra-cluster communication, ensuring that the controller can communicate with the compute nodes.
SSH: This group allows external access via SSH and HTTPS for administrative purposes.
Open OnDemand: This group specifically allows traffic for the web-based portal.

Monitoring and Observability Stack

A production HPC environment requires deep visibility into both hardware performance and job efficiency. The Slurm appliance deploys a complex but integrated monitoring stack that leverages a combination of Prometheus, ElasticSearch, and Grafana.

The monitoring pipeline is implemented as follows:

Prometheus Node-Exporters: These are deployed on every node to gather hardware and OS-level metrics, including CPU utilization, memory pressure, and network throughput.
Prometheus Server: This server scrapes data from the node-exporters and stores it in a time-series database.
Slurm Database Daemon: Working with a MySQL server, this provides the accounting data necessary to link hardware metrics to specific Slurm jobs.
ElasticSearch: Deployed as an OpenDistro container, this serves as the archival and retrieval system for all cluster log files.
Filebeat: A containerized agent that parses local log files and ships them to the ElasticSearch cluster.
Kibana: A containerized visualization tool used for searching and analyzing the logs stored in ElasticSearch.
Podman: The underlying container engine used to manage and orchestrate the ElasticSearch, Kibana, and Filebeat containers.
Grafana: The primary dashboarding tool that consumes data from both Prometheus and the Slurm database.

This integrated stack allows for job-specific dashboards. When a user clicks on a specific job in Grafana, the system aggregates metrics from all nodes involved in that job, providing a real-time view of CPU and network usage. This is invaluable for debugging performance bottlenecks in MPI-based applications.
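
The scrape step in this pipeline can be pictured with a small Prometheus configuration fragment. This is a hand-written sketch, not the configuration the appliance generates: the host names and the cluster label are assumptions, while 9100 is the default node-exporter port.

```yaml
# prometheus.yml fragment (sketch). Host names and the label value are hypothetical.
global:
  scrape_interval: 30s        # how often every target is polled
scrape_configs:
  - job_name: node            # one scrape job covering all node-exporters
    static_configs:
      - targets:
          - "ctl-0:9100"      # 9100 is the default node-exporter port
          - "compute-0:9100"
          - "compute-1:9100"
        labels:
          cluster: demo       # extra label that dashboards can filter on
```

A job-specific dashboard then needs only the list of nodes a job ran on and its start and end times, both of which come from the accounting database, to select the matching time series.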

Advanced Slurm Configuration and Customization

The configuration of Slurm is often a complex task involving numerous parameters. Using Ansible roles, such as galaxyproject.slurm, allows these configurations to be version-controlled and applied programmatically.

Core Configuration Parameters

The Slurm configuration is typically handled via a set of variables passed to the Ansible role. The slurm_config hash can define critical cluster behaviors:

ClusterName: The unique identifier for the cluster.
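SelectType: Selects the resource allocation plugin, for example select/cons_tres to allocate individual cores, memory, and generic resources rather than whole nodes.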
SchedulerType: Defines the scheduling algorithm (e.g., sched/backfill).
ProctrackType: Specifies how processes are tracked, with proctrack/cgroup being a common production choice.
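AccountingStorageType: Determines where accounting data is sent. In current Slurm releases slurm.conf accepts accounting_storage/slurmdbd or accounting_storage/none; the MySQL backend is configured on the database daemon side, in slurmdbd.conf, through StorageType=accounting_storage/mysql.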
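
A minimal sketch of such a hash for the galaxyproject.slurm role follows. The parameter names are standard slurm.conf options, but every value, including the cluster name and the accounting host, is an illustrative assumption. Accounting is routed through slurmdbd here, since that daemon is where the MySQL backend is attached.

```yaml
# Variables for the galaxyproject.slurm role; all values are illustrative.
slurm_config:
  ClusterName: demo
  SelectType: "select/cons_tres"          # allocate cores, memory and GRES individually
  SelectTypeParameters: CR_Core_Memory    # treat cores and memory as consumable resources
  SchedulerType: "sched/backfill"
  ProctrackType: "proctrack/cgroup"
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageHost: ctl-0            # hypothetical host running slurmdbd
```

Because the hash lives in a YAML file, a change to any of these parameters is reviewed and versioned like any other code change.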

Resource Management and Cgroups

To prevent a single job from consuming all system resources and crashing a node, the appliance leverages Cgroups (Control Groups). The slurm_cgroup_config allows for fine-grained control:

CgroupMountpoint: Typically set to /sys/fs/cgroup.
ConstrainCores: When enabled, this ensures that jobs are restricted to the cores allocated to them.
ConstrainRAMSpace: Ensures that jobs cannot exceed their requested memory allocation.
MaxRAMPercent: Sets the maximum percentage of RAM a job can utilize.
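MinRAMSpace: Sets a lower bound, in megabytes, on the memory limits that Slurm computes for a job, so that a job with a very small memory request is not given an unusably low limit.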

Specialized Configuration Files

Beyond the main slurm.conf, the Ansible role supports the deployment of auxiliary configuration files:

acct_gather.conf: Configured via the slurm_acct_gather_config hash.
cgroup.conf: Configured via the slurm_cgroup_config hash.
gres.conf: Generic Resource (GRES) configuration, handled by the slurm_gres_config list of hashes. This is essential for clusters utilizing GPUs.
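
The variables for these auxiliary files might look like the sketch below. The keys are standard gres.conf, acct_gather.conf, and slurm.conf parameters, but the node names, GPU type, device paths, and profile directory are hypothetical. How the role orders the keys on each rendered gres.conf line is an assumption to check against its template.

```yaml
# Sketch for the galaxyproject.slurm role; node names and paths are hypothetical.
slurm_gres_config:                # one hash per gres.conf line
  - NodeName: "gpu-[1-2]"
    Name: gpu
    Type: k80
    File: "/dev/nvidia[0-1]"      # two GPU device files per node
slurm_acct_gather_config:
  ProfileHDF5Dir: /var/spool/slurm/profile   # where HDF5 job profiles are written
slurm_config:                     # merge these keys into the main slurm_config hash
  GresTypes: gpu                  # must name every GRES type used in gres.conf
  AcctGatherProfileType: "acct_gather_profile/hdf5"
```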

Deployment Strategies: From Minimal to Extensive

Depending on the goals of the administrator, the Slurm deployment can range from a single-node testbed to a massive multi-node production cluster.

Minimal All-in-One Setup

For development or testing, all Slurm services can be colocated on a single node. This is achieved by assigning all roles to the host in the Ansible playbook:

- name: Slurm all in One
  hosts: all
  vars:
    slurm_roles: ['controller', 'exec', 'dbd']
  roles:
    - role: galaxyproject.slurm
      become: True

Production-Scale Deployment

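In a production environment, roles are split across multiple hosts so that the controller and database daemons do not compete with user jobs for resources, and so that a failing compute node does not take the scheduler down with it. The galaxyproject.slurm role requires root access, so the become: True directive must be used.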

A more extensive configuration for execution hosts would include detailed Cgroup and resource settings to ensure stability:

- name: Slurm execution hosts
  hosts: all
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_cgroup_config:
      CgroupMountpoint: "/sys/fs/cgroup"
      CgroupAutomount: yes
      ConstrainCores: yes
      TaskAffinity: no
      ConstrainRAMSpace: yes
      ConstrainSwapSpace: no
      ConstrainDevices: no
      AllowedRamSpace: 100
      AllowedSwapSpace: 0
      MaxRAMPercent: 100
      MaxSwapPercent: 100
      MinRAMSpace: 30
    slurm_config:
      AccountingStorageType: "accounting_storage/none"
      ClusterName: cluster
      GresTypes: gpu
      JobAcctGatherType: "jobacct_gather/none"
      MpiDefault: none
      ProctrackType: "proctrack/cgroup"
      ReturnToService: 1
      SchedulerType: "sched/backfill"
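
The role split described above can be sketched as two plays that reuse the slurm_roles variable from the all-in-one example. The inventory group names slurm_control and slurm_compute are hypothetical; any grouping that separates the controller from the execution nodes would serve.

```yaml
# Sketch: the group names "slurm_control" and "slurm_compute" are hypothetical.
- name: Slurm controller and accounting database
  hosts: slurm_control
  vars:
    slurm_roles: ['controller', 'dbd']
  roles:
    - role: galaxyproject.slurm
      become: True

- name: Slurm execution hosts
  hosts: slurm_compute
  vars:
    slurm_roles: ['exec']
  roles:
    - role: galaxyproject.slurm
      become: True
```

Shared settings such as slurm_config would normally live in group variables that apply to both groups, so that every host renders the same slurm.conf.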

Manual Build Considerations and Automation

While the appliance provides a streamlined path, some users may choose to build Slurm from source. This process is notoriously difficult and prone to errors.

When building from source, the options passed to ./configure determine where Slurm is installed and how it integrates with the operating system. Key flags include:

--prefix: This sets the installation directory. It is highly recommended to use a versioned path such as /opt/slurm_version_build_version during the testing phase. This allows the administrator to build multiple versions with different options and easily delete failed attempts before finalizing the production path at /opt/slurm_version.
--with-systemdsystemunitdir=DIR: This specifies the directory for the systemd service files for all Slurm daemons, ensuring that the OS can manage the Slurm processes correctly.

Manual builds are best avoided in favor of Ansible or other automation tools. Automating the build makes the installation repeatable and reduces the risk of configuration drift between nodes.
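
One way to make a source build repeatable is to wrap the same configure flags in Ansible tasks. The play below is a sketch under stated assumptions: the build_host group, the source directory, and the versioned prefix are all hypothetical, and the systemd unit directory is kept under the trial prefix so that a failed attempt can be removed by deleting a single directory.

```yaml
# Sketch only: group name, source directory and install prefix are hypothetical.
- name: Build Slurm from an unpacked source tree
  hosts: build_host
  become: true
  vars:
    slurm_src: /usr/local/src/slurm          # assumed location of the unpacked source
    slurm_prefix: /opt/slurm_test_build_01   # versioned prefix for a trial build
  tasks:
    - name: Run configure with a versioned prefix and systemd unit directory
      ansible.builtin.command:
        cmd: >-
          ./configure
          --prefix={{ slurm_prefix }}
          --with-systemdsystemunitdir={{ slurm_prefix }}/systemd
        chdir: "{{ slurm_src }}"
        creates: "{{ slurm_src }}/config.status"      # skip if already configured

    - name: Compile and install
      ansible.builtin.command:
        cmd: make install
        chdir: "{{ slurm_src }}"
        creates: "{{ slurm_prefix }}/sbin/slurmctld"  # skip if already installed
```

The creates guards make a second run of the play a no-op, which is the property a hand-run build lacks.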

Post-Deployment Validation and Testing

Deploying the software is only the first step; ensuring that the MPI (Message Passing Interface) environment is functioning correctly is a separate, often difficult challenge. Incompatibilities between the compiler, MPI library, MPI launcher, and the Slurm scheduler can lead to silent failures or degraded performance.

The Slurm appliance addresses this by incorporating post-deploy MPI-based tests. These tests utilize:

Intel MPI Benchmarks: Used to measure the actual bandwidth and latency of the network interconnect.
High Performance Linpack (HPL) Suites: Used to measure sustained floating-point performance, which can then be compared against the cluster's theoretical peak.

By combining these tests with the Prometheus and Grafana monitoring stack, administrators can identify hardware malfunctions or network bottlenecks immediately after deployment.
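
A post-deploy check of this kind could be expressed as a short play. This is not the appliance's own test implementation. It assumes a hypothetical login group with the Slurm client commands installed, an IMB-MPI1 binary reachable on the compute nodes, and an MPI library that can be launched directly by srun.

```yaml
# Sketch only: group name and benchmark availability are assumptions.
- name: Post-deploy MPI smoke test
  hosts: login
  tasks:
    - name: Run the IMB PingPong benchmark across two nodes
      ansible.builtin.command:
        cmd: srun --nodes=2 --ntasks-per-node=1 IMB-MPI1 PingPong
      register: pingpong
      changed_when: false        # a benchmark run does not change cluster state

    - name: Fail if the benchmark produced no results table
      ansible.builtin.assert:
        that:
          - "'#bytes' in pingpong.stdout"   # column header of the IMB results table
```

A fuller test would parse the latency and bandwidth columns and compare them with the values expected for the interconnect, rather than only checking that a table was printed.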

Comparative Summary of Slurm Automation Components

Component	Purpose	Key Technical Detail
Ansible Roles	Configuration Management	Uses YAML hashes for `slurm_config` and `slurm_cgroup_config`
OpenTofu	Infrastructure Provisioning	Manages OpenStack volumes and instances as code
Packer	Image Creation	Builds standardized compute node images
Prometheus	Metrics Collection	Scrapes hardware and OS-level data via node-exporters
ElasticSearch	Log Management	Containerized storage for Slurm and system logs
Grafana	Visualization	Provides job-specific dashboards and node metrics
Open OnDemand	User Interface	Web-based portal for job submission and management

Conclusion

The integration of Slurm with Ansible turns the deployment of High-Performance Computing clusters from a manual craft into a disciplined engineering process. With a modular "appliance" approach, administrators can deploy an integrated stack that contains not only the workload manager but also the accounting, monitoring, and user-interface components around it. In this software-defined model, infrastructure is managed with OpenTofu, images are built with Packer, and configuration is applied with Ansible, which makes it possible to move between cloud and bare-metal environments without redefining the core operational logic. Keeping the configuration in code means that settings such as cgroup memory limits are applied identically on every node, and the automated post-deployment MPI tests check that the cluster performs close to what its hardware should deliver. For organizations that need scalable and reliable HPC resources, this automated, modular approach is a practical foundation.